General – Ixa Group. Language Technology.

IXAmBERT: Good news for languages with few resources!

Kepa Sarasola — Wed, 30 Sep 2020 18:10:32 +0000

Good news for languages with few resources!
Pre-trained Basque monolingual and multilingual language models have proven to be very useful in NLP tasks for Basque,
even though they were built from a corpus 500 times smaller than the English one and a Wikipedia 80 times smaller.

An example of Conversational Question Answering, and its transcription to English.

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque.

Last April we showed that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora (crawled news articles from online newspapers) produced much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work was presented in the paper entitled “Give your Text Representation Models some Love: the Case for Basque”. The composition of the Basque Media Corpus (BMC) used in that experiment was as follows:

Source	Text type	Tokens
Basque Wikipedia	Encyclopedia	35M
Berria newspaper	News	81M
EiTB	News	28M
Argia magazine	News	16M
Local news sites	News	224.6M

Bear in mind that the original BERT language model for English was trained on the Google Books corpus, which contains 155 billion words in American English and 34 billion words in British English. The English corpus is thus almost 500 times bigger than the Basque one.
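The "almost 500 times" figure can be checked with a few lines of arithmetic. The sketch below takes the numbers quoted in this post at face value (the token counts from the table and the English word counts from the paragraph above) and treats words and tokens as comparable units, which is only an approximation. The variable names are ours, chosen for illustration.

```python
# Token counts of the Basque Media Corpus, in millions, as listed in the table above.
bmc_million_tokens = {
    "Basque Wikipedia": 35.0,
    "Berria newspaper": 81.0,
    "EiTB": 28.0,
    "Argia magazine": 16.0,
    "Local news sites": 224.6,
}

# English figures quoted in the text, in billions of words.
english_billion_words = {"American English": 155.0, "British English": 34.0}

basque_total_m = sum(bmc_million_tokens.values())
english_total_m = sum(english_billion_words.values()) * 1000  # billions -> millions

# How many times bigger the English corpus is than the Basque one.
ratio = english_total_m / basque_total_m

print(f"Basque corpus:  {basque_total_m:.1f}M tokens")
print(f"English corpus: {english_total_m / 1000:.0f}B words")
print(f"English / Basque: {ratio:.0f}x")
```

With these figures the Basque corpus adds up to 384.6 million tokens and the ratio comes out a little over 490.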

Agerri, San Vicente, Campos, Otegi, Barrena, Saralegi, Soroa, E. Agirre

An example of a dialogue where there are many references in the questions to previous answers in the dialogue.

Now, in September, we have published IXAmBERT, a multilingual language model pretrained for English, Spanish and Basque, and we have successfully experimented with it in a Basque Conversational Question Answering system. Such transfer experiments could already be performed with Google’s official mBERT model, but because it covers so many languages, Basque is not very well represented in it. To create this new multilingual model, which contains just English, Spanish and Basque, we followed the same configuration as in the BERTeus model presented in April. We re-use the corpus of the monolingual Basque model and add the English and Spanish Wikipedias, with 2.5G and 650M tokens respectively. These Wikipedias are about 80 and 20 times bigger than the Basque one.

The good news is that this model has been successfully used to transfer knowledge from English to Basque in a Conversational Question Answering system, as reported in the paper “Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque”. In the paper, the new language model, IXAmBERT, performed better than mBERT when transferring knowledge from English to Basque, as shown in the following table:

Model	Zero-shot	Transfer learning
Baseline	28.7	28.7
mBERT	31.5	37.4
IXAmBERT	38.9	41.2
mBERT + history	33.3	28.7
IXAmBERT + history	40.7	40.0

This table shows the results on a Basque Conversational Question Answering (CQA) dataset. Zero-shot means that the model is fine-tuned only on QuAC, an English CQA dataset. In the Transfer Learning setting the model is first fine-tuned on QuAC, and then on a Basque CQA dataset.
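To read the table as differences rather than absolute scores, the rows can be subtracted pairwise. This is a small illustrative sketch: the scores are copied from the table above, while the `gain` helper and its name are our own.

```python
from typing import Tuple

# Scores copied from the results table above: (zero-shot, transfer learning).
scores = {
    "Baseline": (28.7, 28.7),
    "mBERT": (31.5, 37.4),
    "IXAmBERT": (38.9, 41.2),
    "mBERT + history": (33.3, 28.7),
    "IXAmBERT + history": (40.7, 40.0),
}


def gain(model: str, reference: str) -> Tuple[float, float]:
    """Return the (zero-shot, transfer) difference between two table rows, in points."""
    zero_a, transfer_a = scores[model]
    zero_b, transfer_b = scores[reference]
    # Round to one decimal, the precision reported in the table.
    return round(zero_a - zero_b, 1), round(transfer_a - transfer_b, 1)


for model, reference in [
    ("IXAmBERT", "mBERT"),
    ("IXAmBERT + history", "mBERT + history"),
]:
    zero_gain, transfer_gain = gain(model, reference)
    print(f"{model} vs {reference}: {zero_gain:+.1f} zero-shot, {transfer_gain:+.1f} transfer")
```

Without dialogue history, IXAmBERT is 7.4 points above mBERT in the zero-shot setting and 3.8 points above it after transfer learning.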

These works set a new state-of-the-art in those tasks for Basque.
All benchmarks and models used in this work are publicly available: https://huggingface.co/ixa-ehu/ixambert-base-cased
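As a starting point, the model can probably be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, assuming that the checkpoint at the URL above loads through the generic `AutoTokenizer` and `AutoModel` classes; the helper name `load_ixambert` is ours, and running it requires `transformers`, PyTorch and network access.

```python
MODEL_ID = "ixa-ehu/ixambert-base-cased"  # identifier taken from the URL above


def load_ixambert(model_id: str = MODEL_ID):
    """Download the tokenizer and encoder weights from the Hugging Face Hub.

    Assumes the third-party `transformers` package (with PyTorch) is installed
    and that the machine has network access.
    """
    # Imported lazily so that defining this helper does not require the package.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


# Example use (downloads the model on first call):
#   tokenizer, model = load_ixambert()
#   print(tokenizer.tokenize("Kaixo, zer moduz zaude?"))
```

Tokenizing a Basque sentence this way is a quick check of how the shared English, Spanish and Basque vocabulary splits Basque words.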

PhD Thesis: Unsupervised Machine Translation (Mikel Artetxe, 2020/07/29)

Kepa Sarasola — Tue, 28 Jul 2020 13:37:29 +0000

Title: Unsupervised Machine Translation / Itzulpen automatiko gainbegiratu gabea

Where: Teleconference: https://eu.bbcollab.com/guest/b22b606d9ae74bc5b3e067821c897617
Faculty of Informatics (UPV/EHU), Ada Lovelace room
Date: July 29, 2020, Wednesday, 11:00
Author: Mikel Artetxe Zurutuza
Supervisors: Eneko Agirre & Gorka Labaka
Languages: Basque (motivation, state of the art) and English (second half, papers, conclusions, ~11:30…)

https://github.com/artetxem

Abstract:

The advent of neural sequence-to-sequence models has led to impressive progress in machine translation, with large improvements in standard benchmarks and the first solid claims of human parity in certain settings. Nevertheless, existing systems require strong supervision in the form of parallel corpora, typically consisting of several million sentence pairs. Such a requirement greatly departs from the way in which humans acquire language, and poses a major practical problem for the vast majority of low-resource language pairs.

The goal of this thesis is to remove the dependency on parallel data altogether, relying on nothing but monolingual corpora to train unsupervised machine translation systems. For that purpose, our approach first aligns separately trained word representations in different languages based on their structural similarity, and uses them to initialize either a neural or a statistical machine translation system, which is further trained through back-translation.

Mikel Artetxe publications related to his PhD work:

Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre (2020)
A Call for More Rigor in Unsupervised Cross-lingual Learning
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2019)
Bilingual Lexicon Induction through Unsupervised Machine Translation
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5002-5007.

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2019)
An Effective Approach to Unsupervised Machine Translation
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 194-203.

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2018)
Unsupervised Statistical Machine Translation
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3632–3642, Brussels, Belgium, October-November. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2018)
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho (2018)
Unsupervised Neural Machine Translation
Sixth International Conference on Learning Representations (ICLR 2018)

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2018)
Generalizing and Improving Bilingual Word Embedding Mappings with a Multi-Step Framework of Linear Transformations
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5012-5019.

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2017)
Learning bilingual word embeddings with (almost) no bilingual data
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics

Mikel Artetxe, Gorka Labaka, Eneko Agirre (2016)
Learning principled bilingual mappings of word embeddings while preserving monolingual invariance
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294. Austin, Texas. ISBN: 978-1-945626-25-8

The Ixa research group wins an award in the COVID-19 artificial intelligence competition promoted by the US government

ITZIAR GONZALEZ DIOS — Thu, 07 May 2020 12:37:36 +0000

The CORD-19 competition (COVID-19 Open Research Dataset Challenge) has been organized by several institutions, including the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University, Microsoft Research, the National Institutes of Health and the White House Office of Science and Technology Policy. The organizers have made more than 50,000 scientific articles on COVID-19, SARS-CoV-2 and other coronaviruses available to the global research community. At the same time, they have issued a call to action to artificial intelligence researchers to apply recent advances in natural language processing to help scientists fighting COVID-19 find the information they need in the scientific literature.

In the first phase of the competition there were 10 awards, and the system developed in the Ixa group of the HiTZ centre received one of them. Arantxa Otegi and Jon Ander Campos, researchers at the University of the Basque Country, and professors Eneko Agirre and Aitor Soroa took part in developing the system. By analyzing the aforementioned scientific articles, the system finds answers to high-priority questions posed by experts about COVID-19 and the SARS-CoV-2 virus. It is thus useful for answering questions on topics such as the history of coronaviruses, the transmission and diagnosis of the virus, prevention measures for contact between humans and animals, and the lessons of previous epidemiological studies. The results of the system were evaluated by a group of experts from the NIH of the United States, and it was selected as the system that best answered a set of questions on the topic “What do we know about diagnostics and surveillance?”. The answers given by the system can be seen here.

See some examples here

Five papers accepted at the 58th Annual Meeting of the Association for Computational Linguistics

ITZIAR GONZALEZ DIOS — Tue, 05 May 2020 07:21:01 +0000

The members of the Ixa group and their collaborators will present five papers at the 58th Annual Meeting of the Association for Computational Linguistics (ACL). ACL is one of the most important conferences on Natural Language Processing. It was to be held in Seattle in July, but this year it will take place online.

The accepted papers are the following:

– Selecting Backtranslated Data from Multiple Sources for improved Neural Machine Translation (Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way): We analyse the impact that data backtranslated with diverse systems has on eu-es and de-en clinical domain NMT, and employ data selection (DS) to optimise the synthetic corpus. We further rescore the output of DS by considering the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora.

– On the Cross-lingual Transferability of Monolingual Representations (Mikel Artetxe, Sebastian Ruder, Dani Yogatama): We challenge common beliefs of why multilingual BERT works by showing that a monolingual BERT model can also be transferred to new languages at the lexical level.

– A Call for More Rigor in Unsupervised Cross-lingual Learning (Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre): In this position paper, we review motivations, definition, approaches and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them.

– DoQA – Accessing Domain-Specific FAQs via Conversational QA (Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre): We present DoQA, a dataset for accessing FAQs via conversational Question Answering, showing that it is possible to build high quality conversational QA systems for accessing FAQs without in-domain training data.

– A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation (Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak): We introduce a novel methodology to efficiently construct a corpus for question answering over structured data, with threefold manual annotation speed gains compared to previous schemes such as Spider. Our method also produces fine-grained alignment of query tokens to parsing operations. We train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.

Congratulations to all the authors!

Eneko Agirre wins the Google prize for the third consecutive year

Kepa Sarasola — Wed, 01 Apr 2020 11:31:36 +0000

Eneko Agirre won a Google prize again last March. He is one of the few researchers who have obtained the Google Faculty Research Award on three occasions. The $62,000 prize will fund the project ‘Conversational Question Answering agents that learn after deployment’, which aims to develop user dialogue systems, chatbots and artificial intelligence.

Eneko Agirre, a member of the Ixa Group and professor at the Faculty of Computer Science of the UPV/EHU, is the director of the newly created HiTZ Research Center. The other 6 colleagues in the project are professors Aitor Soroa and Gorka Azkune, researcher Arantxa Otegi, doctoral student Jon Ander Campos, Aitor Agirre, a student in the Master in Language Analysis and Processing, and Eduardo Vallejo, a student in the Degree in Computer Science.

Although the project focuses mainly on English dialogues (questions about cooking and food), they are also working with Basque dialogues. For this purpose, last year the Ixa Group launched a campaign to recruit volunteers for the collection of interviews in Basque. The campaign was successful and many personal interviews were collected in Basque (http://ixa.eus/lagundu).

Artificial Intelligence is not science fiction
– eitb (Ganbara) –

Google awards Eneko Agirre, computer science professor at the UPV
– Estrategia empresarial –

Google awards Spanish research on language processing
– España buenas noticias –

An EHU professor has won one of Google’s research awards
– Europapress –

An EHU professor has won one of Google’s research awards
– El Correo –

UPV/EHU professor Eneko Agirre has won a Google award worth 62,000 euros
– naiz –

UPV/EHU professor Eneko Agirre has won one of Google’s research awards
– Campusa –

An EHU computer science professor has won one of Google’s research awards
– Bilbao24h –

“Itzulbide” project: a tool for normalizing the use of Basque in clinical histories

ITZIAR GONZALEZ DIOS — Tue, 01 Oct 2019 13:41:29 +0000

The use of machine translation tools is common and widespread in today’s society. Our Ixa group at the University of the Basque Country (UPV/EHU) has extensive experience in Natural Language Processing for Basque. In this context, in 2019 the UPV/EHU and Osakidetza (the official organization for health in the Basque Country) saw the opportunity to develop a tool adapted to the clinical field, making use of the new technological conditions (the successful neural network paradigm in machine translation) and taking advantage of the new professional conditions (a growing number of bilingual staff who want to work in Basque and a significant number of young doctors trained in Basque at the university).

Neither translation nor even the development of automatic translators is the final objective of the Basque Country’s official plans; they are potentially useful tools for reaching it. The objective of those plans, as well as that of Osakidetza, is to increase the presence and use of the Basque language in everyday clinical histories, and it remains to be demonstrated whether this tool will contribute to that goal. In fact, Itzulbide has been launched as a research project based on the hypothesis that if a general-domain MT system is taught to translate in the clinical field, we will in the future have a fast and reliable translation tool. Within a few years it will be seen whether this hypothesis holds.

The project began in June 2019, and its promoters (the Ixa Group of the UPV/EHU and the Osakidetza Itzulbide working group) have begun presenting it openly, centre by centre, to clear up professionals’ doubts and to collect their opinions and contributions. At the time of writing, 68 professionals from different specialties and categories are collaborating in the project, creating bilingual clinical texts. Encouragement and thanks to all the participants!

The “Itzulbide” Automatic Translator project does not prevent or condition the other complementary language objectives and normalization measures currently included in Osakidetza’s Basque Language Plan.

If its usefulness is demonstrated, the tool will be integrated into Osakidetza’s information system. In addition, its development could extend to the entire healthcare community (professionals of public and private companies, pharmacists, university students and professors, and non-university health, residents, professional associations) and to the whole geographical scope of the Basque language. It could also help professionals who are learning Basque. In summary, the possible use of Itzulbide could go beyond clinical histories.

A project of this type can raise doubts, but we will test and measure whether this tool brings us closer to the objectives of the Basque Country’s Language Plan. Give Itzulbide a chance!

EHU-Ixa and Itzulbide-Osakidetza

Meeting of LINGUATEC project in Donostia (2019-02-21)

Kepa Sarasola — Tue, 26 Feb 2019 18:29:28 +0000

LINGUATEC project: Development of cross-border cooperation and knowledge transfer in language technologies.

LINGUATEC is a European project funded by FEDER via POCTEFA (Programa INTERREG V-A España-Francia-Andorra). The partners are the following:

Elhuyar Fundazioa
Lo Congrès Permanent de la Lenga Occitana
Universidad Del País Vasco / Euskal Herriko Unibertsitatea (Ixa Taldea)
CNRS (CENTRE National de la Recherche Scientifique) – Delegation Regionale Midi-Pyrenees
Euskaltzaindia – Real Academia de la Lengua Vasca
Sociedad De Promoción y Gestión del Turismo Aragonés

The main objective of Linguatec is to develop, test and disseminate innovative linguistic resources, tools and solutions to raise the digitalization level of the Aragonese, Basque and Occitan languages. As a result, we will obtain, among others, (1) a road map for the digitalization of Aragonese, (2) new monolingual and bilingual lexicons and morphosyntactic and syntactic analysers for Occitan, (3) a Northern Basque speech recognition system and several linguistic tools, as well as (4) new innovative solutions for Aragonese, Basque and Occitan.

This cross-border cooperation will enable the transfer of knowledge and the development of linguistic solutions with potential market uptake, benefiting language professionals, easing access to multilingual content, and fostering the development of a cross-border language tech cluster.

After one year of work, last Wednesday we held a project meeting in Donostia organized by Euskaltzaindia. The Ixa Group presented its progress in creating an improved Neural Machine Translation system for the Spanish-Basque pair.

Best Thesis Award in SEPLN (Aitor Gonzalez, 2018-09-13)

Kepa Sarasola — Mon, 05 Nov 2018 16:25:04 +0000

Last September Aitor Gonzalez Agirre received the SEPLN association’s 2018 award for the best PhD thesis. Congratulations to Aitor and to his supervisors Eneko Agirre and German Rigau.

Aitor is now working at the Barcelona Supercomputing Center.

The abstract of his thesis entitled “Computational Models for Semantic Textual Similarity” is the following:

Measuring semantic similarity between textual items (words, sentences, paragraphs or even documents) is a very important research area in Natural Language Processing (NLP). It has many practical applications in other NLP tasks such as Word Sense Disambiguation, Textual Entailment, Paraphrase detection, Machine Translation, Summarization, Information Retrieval or Question Answering.

The overarching goal of this thesis is to advance on computational models of meaning and their evaluation. To achieve this goal we define two tasks and develop state-of-the-art systems that tackle both tasks: Semantic Textual Similarity (STS) and Typed Similarity.

STS aims to measure the degree of semantic equivalence between two sentences by assigning graded similarity values. This graded similarity captures the notion of intermediate shades of similarity, ranging from pairs of text that differ only in minor nuanced aspects of meaning, through pairs with relatively important differences, down to pairs that share only some details or that only have in common being about the same topic. In the scope of this research, we have collected pairs of sentences to construct datasets for STS, a total of 15,436 pairs of sentences, being by far the largest collection of data for STS.

Using these new datasets for STS we have designed, constructed and evaluated a new approach to combine knowledge-based and corpus-based methods using a cube. This new system for STS is on par with state-of-the-art approaches that make use of Machine Learning (ML) without using any of it, but ML can be used on this system, improving the results.

Typed Similarity tries to identify the type of relation that holds between a pair of similar items in a digital library. Being able to provide a reason why items are similar has applications in recommendation, personalization, and search. We investigate the problem within the context of Europeana, a large digital library containing items related to cultural heritage. A range of types of similarity in this collection were identified and a set of 1,500 pairs of items from the collection were annotated using crowdsourcing.

Finally, we present three systems capable of resolving the Typed Similarity task: a baseline approach, a knowledge-based approach and a ML system. The high results obtained by our systems suggest that this technology is close to practical applications. In fact, the system based on ML resulted in a real-world application to recommend similar items to users in an online digital library.

Twelve high-quality works were submitted to the 17th edition of the SEPLN awards for the best doctoral thesis in Natural Language Processing. Each monograph was evaluated by three reviewers. Their high scientific and technical level is worth highlighting, and all of them deserved the prize. In the end, the best rated, and therefore awarded and proposed for electronic publication as the seventeenth issue of this series of publications, was “Computational Models for Semantic Textual Similarity” by Aitor González Agirre.

Talk: Karelian dialects, how to study variation between closely related languages? (I. Moshnikov, 2018-06-19)

Kepa Sarasola — Mon, 18 Jun 2018 22:39:03 +0000

Speaker: Ilia Moshnikov, Karelian Institute (Joensuu)
Date: Tuesday, June 19, 2018
Time: 15:00-16:00
Place: UPV/EHUko Informatika Fakultatea, Manuel de Lardizabal 1, 20018 Donostia
Title: Variants of the active past participle in the Border Karelian dialects: how to study variation between closely related languages?

Karelian languages (Wikipedia)

Abstract:
During my visit I would like to present my research interests. I will speak about my home university in general. I will say a few words about the current situation of the Karelian language and its use on the Internet. During my work in the Kiännä research project I investigated, from a virtual linguistic landscape point of view, which websites use Karelian as a full interface language. I will also talk about my doctoral dissertation. The topic of my presentation is “Variants of the active past participle in the Border Karelian dialects: how to study variation between closely related languages?” I use some statistical methods. Theoretically, my background is in language contact and language variation research.

Short bio:
My name is Ilia Moshnikov and I am a visiting researcher from the University of Eastern Finland (Joensuu, Finland). I will stay in San Sebastian for one month. I am a linguist and my doctoral dissertation is about language contacts between the Finnish and Karelian languages in Border Karelian dialects. Moreover, some of my interests are language revitalization and modern language usage. For example, I am involved in the Karelian Wikipedia. I am originally from Russian Karelia. I speak Karelian, Finnish, English and Russian (and a bit of Spanish as well). I work as a researcher at the Karelian Institute (Joensuu) and teach some Karelian (and Russian) courses.

Be a friend of the Minority SafePack! We need your signature!

Kepa Sarasola — Thu, 15 Mar 2018 00:26:32 +0000

We call upon the EU to adopt a set of legal acts to improve the protection of persons belonging to national and linguistic minorities and strengthen cultural and linguistic diversity in the Union. It shall include policy actions in the areas of regional and minority languages, education and culture, regional policy, participation, equality, audiovisual and other media content, and also regional (state) support.

A European citizens’ initiative is an invitation to the European Commission to propose legislation on matters where the EU has competence to legislate. A citizens’ initiative has to be backed by at least one million EU citizens, coming from at least 7 out of the 28 member states. A minimum number of signatories is required in each of those 7 member states.

The ‘Minority SafePack’ initiative has gathered 849,888 signatures. About 150,000 more are needed within two weeks.

You can sign here

(minority-safepack.eu)

In the European Union there are about 50 million people who belong to a national minority or a minority language community.

STOP LANGUAGES IN EUROPE FROM BECOMING EXTINCT!
CULTURES ARE EQUAL
LANGUAGE EQUALITY
LIKE AT HOME IN ANOTHER REGION
FREE PASSAGE OF AUDIOVISUAL CONTENT
http://www.minority-safepack.eu/main/index