Product – Ixa Group. Language Technology. https://www.ehu.eus/ehusfera/ixa News from the Ixa Group in the University of the Basque Country Fri, 02 Oct 2020 12:16:39 +0000 en-US

IXAmBERT: Good news for languages with few resources! https://www.ehu.eus/ehusfera/ixa/2020/09/30/ixambert-good-news-for-languages-with-few-resources/ Wed, 30 Sep 2020 18:10:32 +0000

Good news for languages with few resources!
Pre-trained Basque monolingual and multilingual language models have proven to be very useful in NLP tasks for Basque, even though they were created with a corpus 500 times smaller than the English one and with a Wikipedia 80 times smaller.

 

An example of Conversational Question Answering and its transcription into English.

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque.

Last April we showed that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora (crawled news articles from online newspapers) produced much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work was presented in the paper entitled “Give your Text Representation Models some Love: the Case for Basque”. The composition of the Basque Media Corpus (BMC) used in that experiment was as follows:

Source              Text type      Tokens
Basque Wikipedia    Encyclopedia   35M
Berria newspaper    News           81M
EiTB                News           28M
Argia magazine      News           16M
Local news sites    News           224.6M

Bear in mind that the original BERT language model for English was trained using the Google Books corpus, which contains 155 billion words in American English and 34 billion words in British English. The English corpus is almost 500 times bigger than the Basque one.
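The "almost 500 times" figure can be checked with a few lines of arithmetic. The sketch below simply adds up the token counts as they are listed in the table above and divides the English word counts quoted in this post by that total. The variable names are illustrative, and the comparison mixes words and tokens, so the ratio is only approximate.

```python
# Token counts (in millions) as listed in the BMC table of this post.
bmc_million_tokens = {
    "Basque Wikipedia": 35.0,
    "Berria newspaper": 81.0,
    "EiTB": 28.0,
    "Argia magazine": 16.0,
    "Local news sites": 224.6,
}

# English word counts (in billions) as quoted in this post.
english_billion_words = {"American English": 155.0, "British English": 34.0}

# Bring both totals to the same unit (millions) before dividing.
basque_total_millions = sum(bmc_million_tokens.values())
english_total_millions = sum(english_billion_words.values()) * 1000.0

ratio = english_total_millions / basque_total_millions

print(f"Basque total: {basque_total_millions:.1f}M tokens")
print(f"English total: {english_total_millions / 1000.0:.0f}B words")
print(f"English / Basque ratio: {ratio:.0f}")
```

With the figures as listed, the ratio comes out a little under 500, which is consistent with the statement above.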

 

Agerri, San Vicente, Campos, Otegi, Barrena, Saralegi, Soroa, E. Agirre

An example of a dialogue where there are many references in the questions to previous answers in the dialogue.

Now, in September, we have published IXAmBERT, a multilingual language model pretrained for English, Spanish and Basque, and we have successfully used it in a Basque Conversational Question Answering system. These transfer experiments could already be performed with Google’s official mBERT model, but because it covers so many languages, Basque is not very well represented in it. To create this new multilingual model, which contains just English, Spanish and Basque, we followed the same configuration as in the BERTeus model presented in April. We re-used the corpus of the monolingual Basque model and added the English and Spanish Wikipedias, with 2.5G and 650M tokens respectively. These Wikipedias are roughly 80 and 20 times bigger than the Basque one.

The good news is that this model has been successfully used to transfer knowledge from English to Basque in a conversational Question/Answering system, as reported in the paper Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque. In the paper, the new language model called IXAmBERT performed better than mBERT when transferring knowledge from English to Basque, as shown in the following table:

Model                 Zero-shot   Transfer learning
Baseline              28.7        28.7
mBERT                 31.5        37.4
IXAmBERT              38.9        41.2
mBERT + history       33.3        28.7
IXAmBERT + history    40.7        40.0

This table shows the results on a Basque Conversational Question Answering (CQA) dataset. Zero-shot means that the model is fine-tuned using only QuAC, an English CQA dataset. In the Transfer Learning setting the model is first fine-tuned on QuAC, and then on a Basque CQA dataset.
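To make the comparison concrete, the following sketch computes the difference between IXAmBERT and mBERT in each setting and picks the best model per column. The scores are copied from the table above; the dictionary layout and the helper name gain are illustrative choices, not part of the paper.

```python
# Scores copied from the results table in this post:
# (zero-shot, transfer learning) for each model.
scores = {
    "Baseline": (28.7, 28.7),
    "mBERT": (31.5, 37.4),
    "IXAmBERT": (38.9, 41.2),
    "mBERT + history": (33.3, 28.7),
    "IXAmBERT + history": (40.7, 40.0),
}


def gain(model_a, model_b):
    """Return the per-setting difference model_a - model_b, rounded to one decimal."""
    return tuple(round(a - b, 1) for a, b in zip(scores[model_a], scores[model_b]))


zero_shot_gain, transfer_gain = gain("IXAmBERT", "mBERT")
print(f"IXAmBERT vs mBERT, zero-shot: +{zero_shot_gain}")
print(f"IXAmBERT vs mBERT, transfer learning: +{transfer_gain}")

# Best-scoring model in each column of the table.
best_zero_shot = max(scores, key=lambda name: scores[name][0])
best_transfer = max(scores, key=lambda name: scores[name][1])
print(f"Best zero-shot: {best_zero_shot}")
print(f"Best transfer learning: {best_transfer}")
```

On these numbers the margin over mBERT is larger in the zero-shot setting than in the transfer learning setting.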

Together, these works set a new state of the art in those tasks for Basque.
All benchmarks and models used in this work are publicly available: https://huggingface.co/ixa-ehu/ixambert-base-cased

Eloína Miyares, the mother of the Cuban Basic School Dictionary https://www.ehu.eus/ehusfera/ixa/2015/08/07/eloina-miyares-the-mother-of-the-cuban-basic-school-dictionary/ Fri, 07 Aug 2015 16:36:55 +0000

Eloína Miyares Bermúdez died on July 26 in Santiago de Cuba. She was the linguist who led the creation of the Diccionario Básico Escolar (Cuban Basic School Dictionary), which can also be consulted in an electronic version.

Together with Vitelio Ruiz Hernández, she created a new method to teach orthography and pronunciation at school. They also took part in the creation of the Centro de Lingüística Aplicada (CLA) in Santiago de Cuba, and in the organization of the 14 editions of the Simposium de Comunicación Social, an international congress on Linguistics, Lexicography and Computational Linguistics.

Here is the video Eloína sent us in 2014 for the 25th anniversary of the IXA Group.

https://www.youtube.com/watch?v=WvB1fp1Uis0

Thanks for your work, Eloína!


Eloína Miyares Bermúdez in Wikipedia
Eloína Miyares Bermúdez in Ecured.

10.000 downloads for Mitzuli translator app https://www.ehu.eus/ehusfera/ixa/2015/06/18/10-000-downloads-for-mitzuli-translator-app/ Thu, 18 Jun 2015 19:10:23 +0000

Do you have the Mitzuli app on your Android phone?
This app allows you to translate text, audio and images between 50 language pairs, it’s free and… it was created by Mikel Artetxe, a member of the IXA Group and a student in our HAP-LAP master’s programme!

And now it has more than 10,000 downloads in one month!
Thanks and congratulations, Mikel!

 

The Basque WordNet semantic dictionary is a “public resource” now https://www.ehu.eus/ehusfera/ixa/2014/06/13/the-basque-wordnet-semantic-dictionary-is-a-public-resource-now/ Fri, 13 Jun 2014 13:13:04 +0000

Machines need computing tools that are more powerful than conventional dictionaries for tasks like information extraction, disambiguation of word meanings, etc. This is the function of the Euskal WordNet application, developed by the IXA Group (UPV/EHU), which can already be consulted and downloaded free of charge.

This is the first Lexical Knowledge Base (LKB) developed for the Basque language: a “semantic dictionary” or “store” that compiles and organises lexical and semantic information. “It’s like a database, but the difference is that it not only gathers the usual information of a dictionary —the meanings of words and their corresponding definitions and examples—, it also links the concepts with each other,” pointed out Eneko Agirre, an IXA Group computer programmer.

If we look up the entry hatz (“finger”, “digit” or “toe” in Basque), the result is as follows: “Each of the five appendages at the end of human hands and feet.” That is what the term means. But apart from this information, we can get much more: the finger/toe is an appendage of the body; the thumb is a finger; fingers are part of the hand; hands, in turn, are part of the arm and fingers are used to touch objects, etc. In short: all the concepts are interrelated hierarchically. Every concept is also related to its equivalents in other languages: digit, hatz, dedo, dixito and dit.


Consulting the word hatz in Basque WordNet.
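The idea of concepts linked to each other can be illustrated with a toy data structure. The sketch below is not the Basque WordNet format or API; the relation names is_a and part_of, the dictionary layout and the helper follow are assumptions made only for this illustration, using the relations mentioned in the hatz example.

```python
# Toy lexical knowledge base: concept -> relation -> related concepts.
# The relation names ("is_a", "part_of") are illustrative labels only.
relations = {
    "thumb": {"is_a": ["finger"]},
    "finger": {"is_a": ["appendage"], "part_of": ["hand"]},
    "hand": {"part_of": ["arm"]},
}

# Equivalents of the same concept in other languages, as listed in the post.
translations = {"finger": ["digit", "hatz", "dedo", "dixito", "dit"]}


def follow(concept, relation):
    """Follow one relation transitively and return the chain of concepts reached."""
    chain = []
    current = concept
    while True:
        targets = relations.get(current, {}).get(relation, [])
        if not targets:
            return chain
        current = targets[0]  # the toy data has a single target per relation
        chain.append(current)


print(follow("thumb", "is_a"))      # ['finger', 'appendage']
print(follow("finger", "part_of"))  # ['hand', 'arm']
print(translations["finger"])
```

A plain dictionary lookup would return only the definition; following the links is what lets a program conclude, for instance, that a thumb is an appendage or that a finger belongs to the arm.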

This database is tremendously useful in various fields, like machine translation, information extraction, disambiguation of word meanings and for question-answer systems. In machine translation, for example, the system has to understand which word it is translating, a task for which it needs a “semantic dictionary” of this type. “For a quality translation, it is necessary to be able to distinguish the most appropriate meaning from among the various ones,” stressed Agirre.

“Our aim (within the framework of QTLeap European project) is to improve the quality of machine translations by using WordNet,” he pointed out.

Over the 2014-2015 academic year, the university Master’s degree in Language Analysis and Processing (LAP) that the IXA Group will be running at the UPV/EHU will be studying the Basque WordNet and other language technologies used to develop similar applications.

Master’s in Language Analysis and Processing (LAP)

The aim of the University Master’s in Language Analysis and Processing is to analyse language and to learn about the techniques and applications available for processing it with the help of the computer.

This Master’s has been organised by the UPV/EHU’s IXA Group and is geared towards anybody who combines linguistics and computing: philologists and linguistics experts, computing and telecommunications engineers, mathematicians, translators, etc. To apply for it, it is enough to be in possession of a University degree, have some experience and display some interest in the subject.

The Master’s will take one year and a half and the classes will be held at the Computing Faculty of the UPV/EHU-University of the Basque Country. It will be possible to spread it over two or three academic years (to cater for professionals who are working).

The pre-registration period is already open, and applications will be accepted until June 30. For further information on the Master’s, please check out http://ixa.si.ehu.es/master/.

The Basque Language in the digital age (META-NET White Paper) https://www.ehu.eus/ehusfera/ixa/2013/03/25/basque-white-paper/ Mon, 25 Mar 2013 14:06:36 +0000

META-NET is a Network of Excellence (consisting of 60 research centres from 34 countries) dedicated to building the technological foundations of a multilingual European information society.

The META-NET Language White Paper Series “Languages in the European Information Society” reports on the state of each European language with respect to Language Technology and explains the most urgent risks and opportunities. The Basque White Paper has recently been published. It was created as a collaboration between the University of the Basque Country (Aholab and IXA Group), Elhuyar Foundation, the Basque Government’s Department of Language Policy, and the Bayonne Research Institute. Inmaculada Hernáez from Aholab coordinated the edition of the book.

Inmaculada Hernáez, Eva Navas, Igor Odriozola, Kepa Sarasola, Arantza Diaz de Ilarraza, Igor Leturia, Araceli Diaz de Lezana, Beñat Oihartzabal, Jasone Salaberria  2012
The Basque language in the digital age / Euskara aro digitalean
METANET White Paper Series.
Georg Rehm, Hans Uszkoreit (editors). Springer.

METANET_White_Papers_Basque.pdf

This is a paragraph extracted from the executive summary of the book:

“In the field of language technology, the Basque language shows a number of products, technologies and resources. There are application tools for speech synthesis, speech recognition, spelling correction, and grammar checking. There are also some applications for automatic translation, mainly between Spanish and Basque. […] As this series of white papers demonstrate, there is a dramatic difference between Europe’s member states in terms of both the maturity of the research and in the state of readiness with respect to language solutions. One of the major conclusions is that Basque is one of the EU languages that still needs further research before truly effective language technology solutions are ready for everyday use. At the same time, there are good prospects for achieving an outstanding position in this important technology area. This development of high-quality language technology for Basque is urgent and of utmost importance for the preservation for a minority language as Basque.”

The White Paper Series covers 29 other European languages: Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian (bokmål), Norwegian (nynorsk), Polish, Portuguese, Romanian, Serbian, Slovak, Slovene, Spanish, Swedish.

Besides those 30 volumes, META-NET has also created a web page with the Key Results and Cross-Language Comparison. Four tables illustrate the state of technology for the languages discussed. Those tables show that, although Basque is a lower-resourced language and not an official European language, its position in three of them is better than that of several official European languages. This promising position is the result of more than twenty years of work in language technology for Basque and of the coordinated collaboration between universities, research centres, industry and institutions. There is still a lot of work to be done in language technology for Basque, but its present situation is not the worst.

Arbel digitala: a tool for writing verses in Basque is on line https://www.ehu.eus/ehusfera/ixa/2013/02/18/bertso-arbel-digitala/ Mon, 18 Feb 2013 12:07:31 +0000

Three members of the IXA Group (Manex Agirrezabal, Bertol Arrieta and Iñaki Alegria), in collaboration with the Association of Friends of Bertsolaritza (AFB, Bertsozale Elkartea), have developed Arbel digitala, a new product that combines language technology tools and verse-making to help train verse-makers. It was presented last January at the Koldo Mitxelena Library by Manex, Bertol, and AFB members Aritz Zerain and Ixiar Eizagirre.

The tool has different capabilities:

  • different stanzas and melodies accessible from a database,
  • rhyme and synonym search engine,
  • syllable counter…
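To give a rough idea of what a syllable counter and a rhyme search involve, here is a deliberately naive sketch. It is not the algorithm used by Arbel digitala: it assumes that every vowel or falling diphthong (ai, ei, oi, ui, au, eu) forms one syllable and that two words rhyme when their last letters match. All function names are hypothetical.

```python
import re

# Naive assumption: each vowel, or each falling diphthong, is one syllable nucleus.
_NUCLEUS = re.compile(r"ai|ei|oi|ui|au|eu|[aeiou]")


def count_syllables(word):
    """Approximate the syllable count of a word by counting vowel nuclei."""
    return len(_NUCLEUS.findall(word.lower()))


def line_syllables(line):
    """Sum the approximate syllable counts of all words in a verse line."""
    return sum(count_syllables(w) for w in re.findall(r"[^\W\d_]+", line))


def rhymes(word_a, word_b, ending_length=2):
    """Treat two different words as rhyming if their final letters match."""
    a, b = word_a.lower(), word_b.lower()
    return a != b and a[-ending_length:] == b[-ending_length:]


def find_rhymes(word, vocabulary, ending_length=2):
    """Return the words in vocabulary that rhyme with word under the naive rule."""
    return [c for c in vocabulary if rhymes(word, c, ending_length)]


print(count_syllables("bertsolari"))  # 4 (ber-tso-la-ri)
print(line_syllables("gaur euskara"))  # 1 + 3 = 4
print(find_rhymes("mendia", ["handia", "etxea", "mendia", "zubia"], ending_length=3))
```

A real tool would need proper phonological rules and a large lexicon; the sketch only shows how the two checks can be expressed as simple string operations.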

This application is more powerful than Bertsolarixa, a previous system created some years ago. These capabilities were also used some months ago to create a robot verse-maker.

If you want to know more about the Arbel digitala tool, you can go to it directly by following this link. Try it, and maybe you’ll write an incredible verse with this artificial inspiration!

This news in several media: Berria, bertso-eskolak.com, Diario Vasco, Hamaika TV
