Making Pulgarcito machine readable: How to start with basic NLP analysis from a digital image
Note: The work described in this blog post was undertaken thanks to the collaboration with the IMPACT Centre of Competence (www.digitisation.eu).
Digitized archival and historical material can be problematic for researchers for a number of reasons. One such issue is the presence of gaps and empty spaces around and between blocks of text. This became apparent in my recent analysis of the most frequent words in Pulgarcito, a Cuban illustrated literary journal for children written between 1919 and 1920. Pulgarcito has been digitized and is available online: http://imagenes.sld.cu/download/pulgarcito/volumen-2.pdf.
The journal consists of rich material for children: drawings, photographs, fairy tales, comic strips, legends, poems, fables, anecdotes and paintings. It is particularly interesting because the text throughout the publication is variously typed, handwritten and drawn, which makes the digitized PDF quite challenging to process.
The aim of the task was to apply NLP tools to the text of an image-based digitized book that also contains handwritten text.
A commercial OCR product gave very poor results, so I decided to approach the IMPACT Centre of Competence (www.digitisation.eu) for support. I needed an OCR system that would deliver good-quality text recognition in a machine-readable format (e.g. XML, TXT). They promptly answered my enquiry, and just over a week later I received the journal in plain text and in XML (both with some OCR errors).
The example below illustrates the conversion of handwritten and typed text from the PDF into plain text. Have a look at how many different “A”s can be found on this page:
CUANDO UN N1NO
«5Í POEAA?
M$&ECE, UMBETRAfO
conminas y cía
Most of the sentences were recognized with some errors. For example, the word “NIÑO” (child) was identified as “N1NO”, and the phrase “UN RETRATO” (a picture) was not split into two words and came out as “UMBETRAfO”. Finally, the last line was not detected at all.
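One way to put a number on errors like these is the character error rate: the edit distance between the OCR output and the expected text, divided by the length of the expected text. The sketch below is only an illustration and was not part of the workflow described in this post. The function names are hypothetical, and the expected strings are the two corrections mentioned above.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    # Normalize so that "Ñ" is always one code point, not "N" + combining tilde.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr: str, expected: str) -> float:
    """Edit distance divided by the length of the expected text."""
    expected = unicodedata.normalize("NFC", expected)
    return edit_distance(ocr, expected) / len(expected)


print(edit_distance("N1NO", "NIÑO"))
print(character_error_rate("UMBETRAfO", "UN RETRATO"))
```

A measure like this needs a manually corrected reference text, so in practice it is usually computed on a small sample of pages.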
As expected, the results were better where the text was typed, as the example below illustrates.
NoBin_pul-074.txt:
vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y cuidados que tiene el hombre con sus hijos.
Sienten a su modo lo mismo que vuestros padres sienten por us- tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual- quier pájaro en una jaula que por ser muy dorada, no dejará de ser una prisión para él, nacido para cantar libremente ccmo un poeta del ensueño que volase entre el cielo y la tierra. Al contrario. Fa- bricad vosotros mismos nidos, e instalad pequeñas fuentes en vues- tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan- do llegue la época de las crías, regad motitas de algodón, como ha- cen en los grandes parques los niños de otras ciudades. No olvidéis que estos amigos alados tienen, como vosotros, su hogar, sus hijos, la dulce encantadora libertad por la cual han venido luchando to- dos los hombres desde que la tierra recibió; allá, en la noche de los tiempos, el primer beso del sol.

Bin_pul-074.txt:
vos pajaritos! Tienen, a su modo, las mismas atenciones, cariños y
4 cuidados que tiene el hombre con sus hijos. Sienten a su modo lo mismo que vuestros padres sienten por us- tedes; por eso es tan inhumano destruir esos nidos o encerrar a cual- quier pájaro en una jaula que por ser muy dorada, no dejará de ser una prisión para él, nacido para cantar libremente ccmo un poeta del ensueño que volase entre el cielo y la tierra. Al contrario. Fa- bricad vosotros mismos nidos, e instalad pequeñas fuentes en. vues- tro jardín. Tendréis así todos los pájaros y todos los cantos. Y cuan- do llegue la época de las crías, regad motitas de algodón, como ha- cen en los grandes parques los niños de otras ciudades. No olvidéis que estos amigos alados tienen, como vosotros, su hogar, sus hijos, la dulce encantadora libertad por la ‘cual han venido luchando to- dos los hombres desde que la tierra recibió; allá, en la noche de los tiempos, el primer beso del sol. …a veces discuten acaloradamente entre sí… O-O-O-O’O’O-O – $-0.0-0.0-0-0 -0*0
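The two versions of this page differ in only a few places, which are easy to miss by eye. As a rough illustration (not something done in the original workflow), Python's standard `difflib` module can list the tokens that differ between two OCR outputs. The helper name `differing_tokens` is hypothetical, and the sample strings are a short fragment of the two texts above.

```python
import difflib


def differing_tokens(text_a: str, text_b: str):
    """Return (tag, tokens_a, tokens_b) for every region where the
    whitespace-separated tokens of the two texts differ."""
    tokens_a, tokens_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


no_bin = "e instalad pequeñas fuentes en vues- tro jardín"
with_bin = "e instalad pequeñas fuentes en. vues- tro jardín"

for tag, left, right in differing_tokens(no_bin, with_bin):
    print(tag, left, right)
```

Run on the full text files, the same function would also report the extra trailing material that appears only in the second version.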
IMPACT obtained the results above using the following methods:
- All images were extracted from the PDF using the pdfimages tool on Linux.
- The digitization was done with the FineReader 11 SDK.
- The FineReader 11 SDK OCR was run with the Spanish language and different types of letters (normal and handprinted), with output in ALTO XML and Text Unicode Defaults.
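The image extraction step can be sketched with a small Python wrapper around `pdfimages`, the command-line tool from the Poppler utilities. The exact options IMPACT used are not stated, so the `-png` flag, the function names, the output directory and the `page` prefix below are all assumptions made for illustration. The FineReader SDK step is not shown because it is a commercial product.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path: str, output_prefix: str) -> list[str]:
    """Build the pdfimages call; -png asks for PNG output (an assumption,
    the source does not say which image format was used)."""
    return ["pdfimages", "-png", pdf_path, output_prefix]


def extract_images(pdf_path: str, output_dir: str, prefix: str = "page") -> list[Path]:
    """Run pdfimages and return the image files it wrote.
    Requires the Poppler utilities to be installed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, str(out / prefix)), check=True)
    # pdfimages names its output <prefix>-NNN.<ext>
    return sorted(out.glob(f"{prefix}-*.png"))


print(" ".join(pdfimages_command("volumen-2.pdf", "images/page")))
```

Calling `extract_images("volumen-2.pdf", "images")` would then produce one image file per embedded image, ready to be passed to an OCR engine.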
Once we had the image-based digitized publication in TXT format, we used ANALHITZA (Otegi et al. 2017), a tool created in collaboration with the Spanish CLARIN K-centre to extract words and frequencies, identify proper nouns (NERC) and extract word sequences (n-grams), among other things.
The text analysis results were as follows:
Freq. | Nouns | Freq. | Adjectives |
255 | niño | 160 | bueno |
194 | año | 124 | gran |
159 | hombre | 99 | grande |
154 | día | 75 | nuevo |
148 | padre | 62 | viejo |
148 | rey | 57 | blanco |
134 | hijo | 51 | pobre |
131 | vez | 48 | mayor |
114 | libro | 48 | largo |
106 | casa | 45 | azul |
103 | tiempo | 44 | mejor |
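For readers who want a quick approximation of such a frequency list without ANALHITZA, a minimal sketch in Python follows. It counts surface word forms only: unlike the table above, it does not lemmatize or separate nouns from adjectives. The function name and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count lowercase word forms; the pattern matches runs of letters,
    including accented ones, and skips digits and punctuation."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


sample = "El niño y la niña. El niño lee un libro."
frequencies = word_frequencies(sample)

for word, count in frequencies.most_common(3):
    print(count, word)
```

On OCR output, forms such as "N1NO" would be split or miscounted by this pattern, which is one more reason to clean the text first.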
A sample of the NERC output (LOC stands for “location”, PER for “person”):
Freq. | W1 | Type |
8 | alemania | LOC |
2 | dinamarca | LOC |
2 | alejandro | PER |
1 | 16 de mayo de 1703 | DATE |
1 | cataluña | LOC |
Next, we extracted the most frequent bigrams (“P” for preposition, “D” determiner, “C” connector, “V” verb, “N” noun):
Freq. | w1 | Cat | w2 | Cat |
846 | de | P | el | D |
692 | en | P | el | D |
565 | a | P | el | D |
388 | y | C | el | D |
245 | de | P | su | D |
229 | por | P | el | D |
226 | el | D | que | Q |
224 | todo | D | el | D |
206 | con | P | el | D |
204 | a | P | su | D |
202 | que | C | el | D |
201 | de | P | uno | D |
165 | ser | V | el | D |
151 | el | D | niño | N |
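Bigram counting itself is simple to sketch: pair each token with the one that follows it and count the pairs. The version below is an assumption-laden illustration, not ANALHITZA's method. It works on surface forms, whereas the table above appears to count lemmas (for instance `ser`), and it assigns no grammatical categories.

```python
import re
from collections import Counter


def bigram_frequencies(text: str) -> Counter:
    """Count pairs of adjacent lowercase word forms."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # zip pairs token i with token i + 1
    return Counter(zip(tokens, tokens[1:]))


sample = "el niño de la casa y el niño del rey"
bigrams = bigram_frequencies(sample)

for (first, second), count in bigrams.most_common(2):
    print(count, first, second)
```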
We then used Voyant Tools (Sinclair and Rockwell, 2016) to visualize the data in a more user-friendly way. The result was a word cloud of the entire book:
A further analysis of the word «niña» as a keyword in context (KWIC), extracted with Voyant Tools, can be used to show how girls were characterized in 1920 or to study the agreement between the (feminine) gender of the article and the noun:
Left | Term | Right |
tenía, a su vez, una | niña | , que era dulce y bon |
las excelentes cualidades de aquella | niña | . La encomendó las tareas más |
pies a cabeza. La pobre | niña | todo lo sufría con paciencia |
g n ■w- canzaria. La | niña | perdió uno de sus zapatos |
meses regalaremos al niño o | niña | que mayor número de ellas |
ha pensado mucho en la | niña | ! El dice que siempre que |
y escribe mejor- Y la | niña | se va, se va despacio |
tropieza con todo! Pero la | niña | no se ha des- pertado |
de olor: y es una | niña | de sombrero colorado, que trae |
hoy en casa por mi | niña | ”, le dijo su padre, “y |
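A keyword-in-context listing like the one above can be approximated in a few lines of Python. This is a sketch under simple assumptions, not a reproduction of Voyant Tools: tokens are split on whitespace, punctuation is stripped only for the comparison, and the window size is a free parameter. The function name `kwic` and the `PUNCTUATION` constant are hypothetical.

```python
import string

# ASCII punctuation plus marks common in Spanish text and in this corpus.
PUNCTUATION = string.punctuation + "¡¿«»“”‘’…"


def kwic(text: str, term: str, window: int = 5):
    """Return (left context, matched token, right context) for each
    occurrence of term, with up to `window` tokens on each side."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if token.strip(PUNCTUATION).lower() == term.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


sample = "La pobre niña todo lo sufría con paciencia"
for left, token, right in kwic(sample, "niña", window=2):
    print(f"{left} | {token} | {right}")
```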
The analysis described above shows that the extracted text still contains many errors, so it should be checked carefully and corrected to obtain more reliable data. Overall, the task was fast and efficient, and it raised interesting research questions. The next step is to follow the Programming Historian lesson and see whether the text can be cleaned of all OCR errors using regular expressions (Turner-O’Hara 2013):
https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
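As a small taste of what such cleaning could look like, the sketch below rejoins words that the OCR left split at a line-end hyphen, such as "us- tedes" in the sample page above. The pattern and the function name are my own assumptions and are not taken from the Programming Historian lesson. The rule is deliberately conservative: it only joins when a lowercase letter stands on both sides of the hyphen and space.

```python
import re

# A hyphen followed by a space, with a lowercase (Spanish) letter on each side.
HYPHEN_BREAK = re.compile(r"(?<=[a-záéíóúüñ])- (?=[a-záéíóúüñ])")


def rejoin_hyphenated(text: str) -> str:
    """Remove line-break hyphenation left behind by the OCR."""
    return HYPHEN_BREAK.sub("", text)


sample = "sienten por us- tedes; encerrar a cual- quier pájaro"
print(rejoin_hyphenated(sample))
```

A rule like this leaves cases such as "mejor- Y la" untouched because the next letter is uppercase. It cannot repair character confusions such as "N1NO", which need a dictionary or manual correction.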
Otegi, A., Imaz, O., Díaz de Ilarraza, A., Iruskieta, M., Uria, L. 2017. ANALHITZA: a tool to extract linguistic information from large corpora in Humanities research. Procesamiento del Lenguaje Natural 58: 77-84.
Pulgarcito Volumen No 2 – No 1 – 1920. URL: http://iiif.sld.cu/coleccion/07/06/2017/pulgarcito-volumen-no-2-no-1-1920 [January 10, 2019]
Sinclair, S., Rockwell, G. 2016. Voyant Tools. URL: http://voyant-tools.org/ [September 5, 2016]
Turner-O’Hara, L. 2013. Cleaning OCR’d text with Regular Expressions. URL: https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions [January 10, 2019]