Making Pulgarcito machine readable: How to start with basic NLP analysis from a digital image

Note: The work described in this blog post was undertaken thanks to the collaboration with the IMPACT Centre of Competence (www.digitisation.eu).

Digitizing archival and historical material can be problematic for researchers for a number of reasons. One such issue is the presence of gaps and empty spaces around and between blocks of text. This became apparent in my recent analysis of the most frequent words in Pulgarcito, a Cuban illustrated literary journal for children written between 1919 and 1920. Pulgarcito has been digitized and is available online: http://imagenes.sld.cu/download/pulgarcito/volumen-2.pdf.

The journal contains rich material for children: drawings, photographs, fairy tales, comic strips, legends, poems, fables, anecdotes and paintings. It is particularly interesting because the text throughout the publication is variously typed, handwritten and drawn, which makes the digitized PDF quite challenging to process.

The aim of the task was to apply NLP tools to the text of an image-based digitized book that also includes handwritten text.

I first tried a commercial OCR product, but the results were very poor, so I decided to approach the IMPACT Centre of Competence (www.digitisation.eu) for support. I needed an OCR system that would deliver good-quality text recognition in a machine-readable format (e.g. XML, TXT). They answered my enquiry promptly, and just over a week later I received the journal in plain text and in XML (both with some OCR errors).

The example below illustrates the conversion of handwritten and typed text from the PDF into plain text. Note how many different forms of the letter “A” the OCR had to deal with on this page:

CUANDO UN N1NO
«5Í POEAA?
M$&ECE, UMBETRAfO
conminas y cía

Most of the lines were recognized with some errors. For example, the word “NIÑO” (child) was read as “N1NO”, and “UN RETRATO” (a portrait) was not split into two words and came out as “UMBETRAfO”. Finally, the last line was not detected at all.

As expected, the results were better where the text was typed, as the example below illustrates.

NoBin_pul-074.txt:

vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
cuidados que tiene el hombre con sus hijos.

Sienten a su modo lo mismo que vuestros padres sienten por us-
tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual-
quier pájaro en una jaula que por ser muy dorada, no dejará de ser
una prisión para él, nacido para cantar libremente ccmo un poeta
del ensueño que volase entre el cielo y la tierra. Al contrario. Fa-
bricad vosotros mismos nidos, e instalad pequeñas fuentes en vues-
tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-
do llegue la época de las crías, regad motitas de algodón, como ha-
cen en los grandes parques los niños de otras ciudades. No olvidéis
que estos amigos alados tienen, como vosotros, su hogar, sus hijos,
la dulce encantadora libertad por la cual han venido luchando to-
dos los hombres desde que la tierra recibió; allá, en la noche de los
tiempos, el primer beso del sol.

Bin_pul-074.txt:

vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
4
cuidados que tiene el hombre con sus hijos.

Sienten a su modo lo mismo que vuestros padres sienten por us-
tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual-
quier pájaro en una jaula que por ser muy dorada, no dejará de ser
una prisión para él, nacido para cantar libremente ccmo un poeta
del ensueño que volase entre el cielo y la tierra. Al contrario. Fa-
bricad vosotros mismos nidos, e instalad pequeñas fuentes en. vues-
tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan-
do llegue la época de las crías, regad motitas de algodón, como ha-
cen en los grandes parques los niños de otras ciudades. No olvidéis
que estos amigos alados tienen, como vosotros, su hogar, sus hijos,
la dulce encantadora libertad por la ‘cual han venido luchando to-
dos los hombres desde que la tierra recibió; allá, en la noche de los
tiempos, el primer beso del sol.

…a veces discuten acaloradamente entre sí…

O-O-O-O’O’O-O – $-0.0-0.0-0-0 -0*0

IMPACT obtained the results above with the following methods:

  • All images were extracted from the PDF with the Linux tool pdfimages.
  • The digitization (OCR) was done with the FineReader 11 SDK.
  • The FineReader 11 SDK was run with Spanish as the language and with different letter types (normal and handprinted), producing output in ALTO XML and in Unicode text with default settings.
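
The image extraction step can be sketched as follows. This is only a minimal sketch, assuming the Poppler version of pdfimages is installed; the output prefix "pul" and the function names are hypothetical, and the exact options IMPACT used are not stated in this post. The FineReader SDK step is not shown.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_prefix):
    """Build the command line for Poppler's pdfimages.

    The -png option asks pdfimages to write the extracted images as
    PNG files named <out_prefix>-000.png, <out_prefix>-001.png, ...
    """
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def extract_images(pdf_path, out_dir, prefix="pul"):
    """Extract all embedded images of a PDF into out_dir (hypothetical helper)."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) is not installed")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    command = build_pdfimages_command(pdf_path, out_dir / prefix)
    # check=True raises CalledProcessError if pdfimages fails.
    subprocess.run(command, check=True)
    return sorted(out_dir.glob(f"{prefix}-*.png"))
```

Calling extract_images("volumen-2.pdf", "images") would then leave one image file per embedded image in the images folder, ready to be passed to an OCR engine.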

Once we had the text of the image-based digitized book in TXT format, we used ANALHITZA (Otegi et al. 2017). It is a tool created in collaboration with the Spanish CLARIN K-centre to extract words and their frequencies, identify proper nouns (NERC) and extract word sequences (n-grams), among other things.

The text analysis results were as follows:

Freq.  Nouns    Freq.  Adjectives
255    niño     160    bueno
194    año      124    gran
159    hombre   99     grande
154    día      75     nuevo
148    padre    62     viejo
148    rey      57     blanco
134    hijo     51     pobre
131    vez      48     mayor
114    libro    48     largo
106    casa     45     azul
103    tiempo   44     mejor
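
To give an idea of how such a frequency list is built, here is a minimal sketch in Python. It is not ANALHITZA's implementation: it only counts surface word forms and does not tag parts of speech, so it cannot separate nouns from adjectives. The sample sentence and the small stopword set are made up for illustration.

```python
import re
from collections import Counter

# [^\W\d_]+ matches runs of letters only; in Python 3 this includes
# accented letters and "ñ", so "niño" stays a single token.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(text, stopwords=frozenset()):
    """Count lowercased word forms, skipping the given stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(token for token in tokens if token not in stopwords)


# Made-up sample text and stopword set, for illustration only.
sample = "El niño y el hombre. El niño lee un libro."
freqs = word_frequencies(sample, stopwords={"el", "y", "un"})
for word, count in freqs.most_common(3):
    print(count, word)
```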

A sample of the NERC results (LOC stands for “location”, PER for “person”):

Freq.  W1                  Type
8      alemania            LOC
2      dinamarca           LOC
2      alejandro           PER
1      16 de mayo de 1703  DATE
1      cataluña            LOC

Next, we extracted the most frequent bigrams (“P” for preposition, “D” determiner, “C” connector, “V” verb, “N” noun):

Freq.  w1    Cat  w2    Cat
846    de    P    el    D
692    en    P    el    D
565    a     P    el    D
388    y     C    el    D
245    de    P    su    D
229    por   P    el    D
226    el    D    que   Q
224    todo  D    el    D
206    con   P    el    D
204    a     P    su    D
202    que   C    el    D
201    de    P    uno   D
165    ser   V    el    D
151    el    D    niño  N
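
Counting bigrams can be sketched in a few lines of Python. Again, this is an illustrative sketch and not how ANALHITZA works internally: it pairs adjacent word forms as they appear, without assigning categories, and it lets bigrams run across sentence boundaries. The sample string is made up.

```python
import re
from collections import Counter

# Letters only, including accented letters and "ñ".
TOKEN_RE = re.compile(r"[^\W\d_]+")


def bigram_frequencies(text):
    """Count pairs of adjacent lowercased word forms."""
    tokens = TOKEN_RE.findall(text.lower())
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return Counter(zip(tokens, tokens[1:]))


# Made-up sample, for illustration only.
sample = "de el niño y de el hombre"
bigrams = bigram_frequencies(sample)
for (w1, w2), count in bigrams.most_common(2):
    print(count, w1, w2)
```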

We then used Voyant Tools (Sinclair and Rockwell, 2016) to visualize the data in a more user-friendly way. The result was a word cloud of the entire book.

A further analysis of the word «niña» in context (Key Word in Context, or KWIC), extracted with Voyant Tools, can be used to show how girls were characterized in 1920, or to study the agreement in gender (feminine) between the article and the noun:

                                Left  Term  Right
                tenía, a su vez, una  niña  , que era dulce y bon
las excelentes cualidades de aquella  niña  . La encomendó las tareas más
             pies a cabeza. La pobre  niña  todo lo sufría con paciencia
                g n ■w- canzaria. La  niña  perdió uno de sus zapatos
         meses regalaremos al niño o  niña  que mayor número de ellas
              ha pensado mucho en la  niña  ! El dice que siempre que
               y escribe mejor- Y la  niña  se va, se va despacio
          tropieza con todo! Pero la  niña  no se ha des- pertado
                   de olor: y es una  niña  de sombrero colorado, que trae
                  hoy en casa por mi  niña  ”, le dijo su padre, “y
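
A KWIC listing like the one above can be approximated with a short Python function. This is a sketch under simple assumptions and not Voyant Tools' implementation: the context window is measured in characters rather than words, and the function name and sample text are made up for illustration.

```python
import re


def kwic(text, term, width=30):
    """Return (left, term, right) rows for each whole-word match of term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left.strip(), match.group(0), right.strip()))
    return rows


# Made-up sample built from two of the contexts listed above.
sample = ("La pobre niña todo lo sufría con paciencia. "
          "La niña perdió uno de sus zapatos.")
for left, term, right in kwic(sample, "niña", width=20):
    print(f"{left:>20}  {term}  {right}")
```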

The analysis described above shows that the extracted text still contains many errors, and that it should be checked and corrected carefully to obtain more reliable data. Even so, the overall task was fast and efficient, and it raised interesting research questions. The next step is to follow the Programming Historian lesson on cleaning OCR'd text and see whether the text can be cleaned of all OCR errors using regular expressions (Turner-O’Hara 2013):
https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
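
As a first idea of what such a cleanup could look like, here is a minimal Python sketch for two of the error types seen above: words split by a hyphen at the end of a line, and the digit 1 read in place of the letter I. It is not taken from the Programming Historian lesson, and both rules are assumptions that would need checking against the real text. For instance, the first rule would also join genuinely hyphenated words that happen to break at a line end, and no regular expression can restore the missing tilde in “NIÑO”.

```python
import re

# A hyphen at the end of a line, followed by the rest of the word on a
# later line: "us-\n\ntedes" -> "ustedes".
HYPHEN_BREAK_RE = re.compile(r"(?<=[^\W\d_])-[ \t]*\n\s*(?=[^\W\d_])")

# A digit 1 between two uppercase letters is assumed to be a misread "I":
# "N1NO" -> "NINO". Dates such as "1703" are left untouched.
DIGIT_ONE_RE = re.compile(r"(?<=[A-ZÁÉÍÓÚÑ])1(?=[A-ZÁÉÍÓÚÑ])")


def clean_ocr(text):
    """Apply the two illustrative cleanup rules to OCR output."""
    text = HYPHEN_BREAK_RE.sub("", text)
    text = DIGIT_ONE_RE.sub("I", text)
    return text


sample = "sienten por us-\n\ntedes; CUANDO UN N1NO"
print(clean_ocr(sample))
```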

REFERENCES

Otegi, A., Imaz, O., Díaz de Ilarraza, A., Iruskieta, M., Uria, L. 2017. ANALHITZA: a tool to extract linguistic information from large corpora in Humanities research. Procesamiento del Lenguaje Natural 58: 77-84.

Pulgarcito Volumen No 2 – No 1 – 1920. URL: http://iiif.sld.cu/coleccion/07/06/2017/pulgarcito-volumen-no-2-no-1-1920 [January 10, 2019]

Sinclair, S., Rockwell, G. 2016. Voyant Tools. URL: http://voyant-tools.org/ [September 5, 2016]

Turner-O’Hara, L. 2013. Cleaning OCR’d text with Regular Expressions. URL: https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions [January 10, 2019]

 

04/02/2019