Making Pulgarcito machine readable: How to start with basic NLP analysis from a digital image

Note: The work described in this blog post was undertaken thanks to the collaboration with the IMPACT Centre of Competence (www.digitisation.eu).

Digitizing archival and historical material can be problematic for researchers for a number of reasons. One such issue is the presence of gaps and empty spaces around and between text. This became apparent in my recent analysis of the most frequent words in Pulgarcito, a Cuban illustrated literary journal for children written between 1919 and 1920. Pulgarcito has been digitized and is available online: http://imagenes.sld.cu/download/pulgarcito/volumen-2.pdf.

The journal contains rich material for children: drawings, photographs, fairy tales, comic strips, legends, poems, fables, anecdotes and paintings. It is particularly interesting because the text throughout the publication is variously typed, handwritten and drawn, which makes the digitized PDF quite challenging to process.

The aim of the task was to apply NLP tools to the text of an image-based digitization of a book whose texts also include handwriting.

A commercial OCR product gave very poor results, so I decided to approach the IMPACT Centre of Competence (www.digitisation.eu) for support. I needed an OCR system that would deliver good-quality text recognition in a machine-readable format (e.g. XML, TXT). They answered my enquiry promptly, and just over a week later I received the journal in plain text and in XML (both with some OCR errors).

The example below illustrates the conversion of handwritten and typed text from the PDF into plain text. Have a look at how many different “A”s can be found on this page:

CUANDO UN N1NO
«5Í POEAA?
M$&ECE, UMBETRAfO
conminas y cía

Most of the sentences were recognized with some errors. For example, the word “NIÑO” (child) was identified as “N1NO”, and the words “UN RETRATO” (a picture) were not split and resulted in “UMBETRAfO”. Finally, the last line was not detected at all.
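One way to put a number on errors like these is the character error rate: the edit distance between the OCR output and the expected text, divided by the length of the expected text. The following is a minimal Python sketch of that idea. It is not a measure reported by IMPACT, and the function names are my own.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_error_rate(ocr: str, truth: str) -> float:
    """Edit distance divided by the length of the expected text."""
    # Normalize so that "Ñ" counts as one character in both strings.
    ocr = unicodedata.normalize("NFC", ocr)
    truth = unicodedata.normalize("NFC", truth)
    if not truth:
        raise ValueError("expected text must not be empty")
    return levenshtein(ocr, truth) / len(truth)


if __name__ == "__main__":
    print(char_error_rate("N1NO", "NIÑO"))
```

In the “N1NO” example, two of the four expected characters are wrong (“1” for “I” and “N” for “Ñ”).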

As expected, the results were better where the text was typed, as the example below illustrates.

NoBin_pul-074.txt

vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
cuidados que tiene el hombre con sus hijos.
Sienten a su modo lo mismo que vuestros padres sienten por us-
tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual-
quier pájaro en una jaula que por ser muy dorada, no dejará de ser
una prisión para él, nacido para cantar libremente ccmo un poeta
del ensueño que volase entre el cielo y la tierra. Al contrario. Fa-
bricad vosotros mismos nidos, e instalad pequeñas fuentes en vues-
tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-
do llegue la época de las crías, regad motitas de algodón, como ha-
cen en los grandes parques los niños de otras ciudades. No olvidéis
que estos amigos alados tienen, como vosotros, su hogar, sus hijos,
la dulce encantadora libertad por la cual han venido luchando to-
dos los hombres desde que la tierra recibió; allá, en la noche de los
tiempos, el primer beso del sol.

Bin_pul-074.txt

vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
cuidados que tiene el hombre con sus hijos.
Sienten a su modo lo mismo que vuestros padres sienten por us-
tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual-
quier pájaro en una jaula que por ser muy dorada, no dejará de ser
una prisión para él, nacido para cantar libremente ccmo un poeta
del ensueño que volase entre el cielo y la tierra. Al contrario. Fa-
bricad vosotros mismos nidos, e instalad pequeñas fuentes en. vues-
tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-
do llegue la época de las crías, regad motitas de algodón, como ha-
cen en los grandes parques los niños de otras ciudades. No olvidéis
que estos amigos alados tienen, como vosotros, su hogar, sus hijos,
la dulce encantadora libertad por la ‘cual han venido luchando to-
dos los hombres desde que la tierra recibió; allá, en la noche de los
tiempos, el primer beso del sol.

…a veces discuten acaloradamente entre sí…

O-O-O-O’O’O-O – $-0.0-0.0-0-0 -0*0
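The two outputs labelled NoBin_pul-074.txt and Bin_pul-074.txt are almost identical, so their differences are easiest to find programmatically. The following is a minimal Python sketch, not part of the IMPACT workflow, that uses the standard difflib module to list character-level differences between two lines. The sample lines are copied from the outputs above, and the function name char_differences is my own.

```python
from difflib import SequenceMatcher


def char_differences(line_a: str, line_b: str):
    """Return (operation, text_in_a, text_in_b) for every span that differs."""
    matcher = SequenceMatcher(None, line_a, line_b, autojunk=False)
    return [
        (tag, line_a[i1:i2], line_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    nobin = "bricad vosotros mismos nidos, e instalad pequeñas fuentes en vues-"
    binarized = "bricad vosotros mismos nidos, e instalad pequeñas fuentes en. vues-"
    for difference in char_differences(nobin, binarized):
        print(difference)
```

Running the same comparison over every pair of lines shows where one output picked up stray punctuation that the other did not.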

IMPACT obtained the results above using the following steps:

All images were extracted from the PDF using the pdfimages tool on Linux.
The OCR was done with the FineReader 11 SDK.
The FineReader 11 SDK was run with the Spanish language and different types of letters (normal and handprinted), with output in ALTO XML and Text Unicode Defaults.
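The image extraction step can be scripted. The following is a minimal Python sketch that calls pdfimages (part of Poppler's command-line utilities) through subprocess. The exact options IMPACT used are not known to me: the -png and -tiff flags, the output directory and the "page" file prefix are assumptions chosen for illustration, and the function names are my own.

```python
import subprocess
from pathlib import Path

# Output-format flags accepted by Poppler's pdfimages.
FORMAT_FLAGS = {"png": "-png", "tiff": "-tiff"}


def build_pdfimages_command(pdf_path, output_prefix, image_format="png"):
    """Build the argument list: pdfimages <flag> <PDF-file> <image-root>."""
    return ["pdfimages", FORMAT_FLAGS[image_format], str(pdf_path), str(output_prefix)]


def extract_images(pdf_path, output_dir, image_format="png"):
    """Run pdfimages and return the image files it wrote (requires Poppler)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    command = build_pdfimages_command(pdf_path, out / "page", image_format)
    subprocess.run(command, check=True)
    return sorted(out.glob("page-*"))


if __name__ == "__main__":
    # Assumes volumen-2.pdf has been downloaded to the current directory.
    for image in extract_images("volumen-2.pdf", "pulgarcito_images"):
        print(image)
```

Each extracted image can then be passed to whichever OCR engine is available.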

Once we had the digitized publication in TXT format, we used ANALHITZA (Otegi et al. 2017), a tool created in collaboration with the Spanish CLARIN K-centre, to extract words and their frequencies, identify proper nouns (NERC) and extract word sequences (n-grams), among other things.

The text analysis results were as follows:

Freq.	Nouns	Freq.	Adjectives
255	niño	160	bueno
194	año	124	gran
159	hombre	99	grande
154	día	75	nuevo
148	padre	62	viejo
148	rey	57	blanco
134	hijo	51	pobre
131	vez	48	mayor
114	libro	48	largo
106	casa	45	azul
103	tiempo	44	mejor

A sample of the NERC results (LOC means “location”, PER stands for “person”):

Freq.	W1	Type
8	alemania	LOC
2	dinamarca	LOC
2	alejandro	PER
1	16 de mayo de 1703	DATE
1	cataluña	LOC

We then extracted the most frequent bigrams (“P” for preposition, “D” for determiner, “C” for connector, “V” for verb, “N” for noun):

Freq.	w1	Cat	w2	Cat
846	de	P	el	D
692	en	P	el	D
565	a	P	el	D
388	y	C	el	D
245	de	P	su	D
229	por	P	el	D
226	el	D	que	Q
224	todo	D	el	D
206	con	P	el	D
204	a	P	su	D
202	que	C	el	D
201	de	P	uno	D
165	ser	V	el	D
151	el	D	niño	N
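Word and bigram counts of the kind shown in the tables above can be approximated with the Python standard library. The following is a minimal sketch and not how ANALHITZA works: it counts lowercased surface forms, whereas the tables above appear to count lemmas with part-of-speech tags (for example “ser” and “uno”), so the numbers will not match. The function names and the sample sentence are my own.

```python
import re
from collections import Counter

# One or more Unicode letters: no digits, underscores or punctuation.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def top_words(text, n=10):
    """Return the n most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(n)


def top_bigrams(text, n=10):
    """Return the n most frequent pairs of adjacent words with their counts."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:])).most_common(n)


if __name__ == "__main__":
    sample = "El rey y el hijo y el rey"
    print(top_words(sample, 3))
    print(top_bigrams(sample, 3))
```

Because OCR errors such as “ccmo” for “como” are counted as separate words, frequencies computed on uncorrected text will be slightly too low for the affected words.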

We then used Voyant Tools (Sinclair and Rockwell, 2016) to visualize the data in a more user-friendly way. The result was a word cloud of the entire book.

A Key Word in Context (KWIC) view of the word «niña», extracted with Voyant Tools, can be used to show how girls were characterized in 1920, or to study the agreement between the (feminine) gender of the article and the noun:

Left	Term	Right
tenía, a su vez, una	niña	, que era dulce y bon
las excelentes cualidades de aquella	niña	. La encomendó las tareas más
pies a cabeza. La pobre	niña	todo lo sufría con paciencia
g n ■w- canzaria. La	niña	perdió uno de sus zapatos
meses regalaremos al niño o	niña	que mayor número de ellas
ha pensado mucho en la	niña	! El dice que siempre que
y escribe mejor- Y la	niña	se va, se va despacio
tropieza con todo! Pero la	niña	no se ha des- pertado
de olor: y es una	niña	de sombrero colorado, que trae
hoy en casa por mi	niña	”, le dijo su padre, “y
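A keyword-in-context listing like the one above is simple to reproduce. The following is a minimal Python sketch of the idea, not Voyant Tools' implementation: it splits the text into words and punctuation marks and returns a fixed number of tokens on either side of each match. The function name, the window width and the sample sentence are my own.

```python
import re


def kwic(text, term, width=5):
    """Return (left context, keyword, right context) for each match of term."""
    # Words and single punctuation marks are both kept as tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


if __name__ == "__main__":
    sample = "La casa es azul y la casa es grande"
    for left, keyword, right in kwic(sample, "casa", width=2):
        print(f"{left:>10} | {keyword} | {right}")
```

Matching is exact after lowercasing, so OCR variants of a word will be missed unless the text has been corrected first.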

The analysis described above shows that there are still many errors: the extracted text should be checked carefully and corrected to obtain more reliable data. The overall task was nevertheless fast and efficient, and it raised interesting research questions. The next step is to follow the Programming Historian lesson and see whether the text can be cleaned of all OCR errors using regular expressions (Turner-O’Hara 2013):
https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
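As a first idea of what such cleaning could look like, here is a minimal Python sketch with two regular expressions aimed at errors seen in the examples above: words hyphenated across line breaks (“us-” / “tedes”) and the digit “1” read in place of “I” inside capitalized words (“N1NO”). These rules are my own assumptions and are not taken from the lesson. They will not restore accents or “Ñ”, and the first rule would also join genuine hyphenated compounds that happen to break at a line end.

```python
import re


def rejoin_hyphenated(text):
    """Join words split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def fix_digit_one_in_caps(text):
    """Replace a '1' that sits between two uppercase letters with 'I'."""
    return re.sub(r"(?<=[A-ZÑ])1(?=[A-ZÑ])", "I", text)


def clean(text):
    """Apply both rules in sequence."""
    return fix_digit_one_in_caps(rejoin_hyphenated(text))


if __name__ == "__main__":
    print(clean("CUANDO UN N1NO"))
    print(clean("vuestros padres sienten por us-\n\ntedes; por eso"))
```

Any rule of this kind should be checked against the page images before being applied to the whole text.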

REFERENCES

Otegi, A., Imaz, O., Díaz de Ilarraza, A., Iruskieta, M., Uria, L. 2017. ANALHITZA: a tool to extract linguistic information from large corpora in Humanities research. Procesamiento del Lenguaje Natural 58: 77-84.

Pulgarcito Volumen No 2 – No 1 – 1920. URL: http://iiif.sld.cu/coleccion/07/06/2017/pulgarcito-volumen-no-2-no-1-1920 [January 10, 2019]

Sinclair, S., Rockwell, G. 2016. Voyant Tools. URL: http://voyant-tools.org/ [September 5, 2016]

Turner-O’Hara, L. 2013. Cleaning OCR’d text with Regular Expressions. URL: https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions [January 10, 2019]

04/02/2019
