Publications – Ixa Group. Language Technology
News from the Ixa Group at the University of the Basque Country

IXAmBERT: Good news for languages with few resources!
https://www.ehu.eus/ehusfera/ixa/2020/09/30/ixambert-good-news-for-languages-with-few-resources/
Wed, 30 Sep 2020

 

Good news for languages with few resources!
Pre-trained Basque monolingual and multilingual language models have proven to be very useful in NLP tasks for Basque, even though they were trained on a corpus 500 times smaller than the English one and on a Wikipedia 80 times smaller.

 

An example of Conversational Question Answering, and its translation into English.

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across
most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal because, for many languages, those models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares its quota of subwords and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque.

Last April we showed that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora (crawled news articles from online newspapers) produced much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work was presented in the paper entitled “Give your Text Representation Models some Love: the Case for Basque”. The composition of the Basque Media Corpus (BMC) used in that experiment was as follows:

Source             Text type     Million tokens
Basque Wikipedia   Encyclopedia  35M
Berria newspaper   News          81M
EiTB               News          28M
Argia magazine     News          16M
Local news sites   News          224.6M

Take into account that the original BERT language model for English was trained using the Google Books corpus, which contains 155 billion words in American English and 34 billion words in British English. That English corpus is almost 500 times bigger than the Basque one.
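That ratio can be checked with a quick back-of-the-envelope computation over the token counts quoted above:

```python
# Token counts quoted in the post (approximate).
english_tokens = 155e9 + 34e9  # Google Books: American + British English
basque_tokens = (35 + 81 + 28 + 16 + 224.6) * 1e6  # BMC sources above

ratio = english_tokens / basque_tokens
print(round(ratio))  # prints 491, i.e. "almost 500 times bigger"
```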

 

Authors: Agerri, San Vicente, Campos, Otegi, Barrena, Saralegi, Soroa, E. Agirre.

An example of a dialogue where the questions contain many references to previous answers in the dialogue.

Now, in September, we have published IXAmBERT, a multilingual language model pretrained for English, Spanish and Basque, and we have successfully experimented with it in a Basque Conversational Question Answering system. These transfer experiments could already be performed with Google’s official mBERT model, but as it covers so many languages, Basque is not very well represented in it. In order to create this new multilingual model covering just English, Spanish and Basque, we followed the same configuration as for the BERTeus model presented in April. We re-used the same corpus as for the monolingual Basque model and added the English and Spanish Wikipedias, with 2.5G and 650M tokens respectively. These Wikipedias are 80 and 20 times bigger than the Basque one.

The good news is that this model has been successfully used to transfer knowledge from English to Basque in a Conversational Question Answering system, as reported in the paper “Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque”. In the paper, the new language model, IXAmBERT, performed better than mBERT when transferring knowledge from English to Basque, as shown in the following table:

Model                Zero-shot   Transfer learning
Baseline             28.7        28.7
mBERT                31.5        37.4
IXAmBERT             38.9        41.2
mBERT + history      33.3        28.7
IXAmBERT + history   40.7        40.0

This table shows the results on a Basque Conversational Question Answering (CQA) dataset. Zero-shot means that the model is fine-tuned on QuAC, an English CQA dataset, and evaluated directly on Basque. In the transfer learning setting, the model is first fine-tuned on QuAC and then on the Basque CQA dataset.
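The two settings differ only in the fine-tuning sequence. A minimal schematic (pure illustration; `fine_tune` is a stand-in for a real extractive-QA fine-tuning loop, and the lists merely record the order of training stages):

```python
def fine_tune(model, dataset):
    # Stand-in for a real fine-tuning loop: we only record the stages.
    return model + [dataset]

pretrained = []  # represents the pretrained IXAmBERT/mBERT weights

# Zero-shot: fine-tune on English QuAC only, then evaluate on Basque.
zero_shot = fine_tune(pretrained, "QuAC (en)")

# Transfer learning: fine-tune on QuAC first, then on the Basque CQA data.
transfer = fine_tune(zero_shot, "Basque CQA (eu)")

print(zero_shot)  # ['QuAC (en)']
print(transfer)   # ['QuAC (en)', 'Basque CQA (eu)']
```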

These works set a new state of the art for those tasks in Basque.
All benchmarks and models used in this work are publicly available: https://huggingface.co/ixa-ehu/ixambert-base-cased
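The model at the page above can be loaded with the Hugging Face `transformers` library. A minimal sketch, assuming `transformers` and PyTorch are installed and that the hidden size is the usual 768 of a base-sized model:

```python
# Minimal sketch: load IXAmBERT from the Hugging Face hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ixa-ehu/ixambert-base-cased")
model = AutoModel.from_pretrained("ixa-ehu/ixambert-base-cased")

# Encode a Basque sentence and get contextual representations.
inputs = tokenizer("Kaixo mundua!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```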

Five papers accepted at 58th annual meeting of the Association for Computational Linguistics
https://www.ehu.eus/ehusfera/ixa/2020/05/05/five-papers-accepted-at-58th-annual-meeting-of-the-association-for-computational-linguistics/
Tue, 05 May 2020

The members of the Ixa group and their collaborators will present five papers at the 58th annual meeting of the Association for Computational Linguistics (ACL). ACL is one of the most important conferences on Natural Language Processing. It was to be held in July in Seattle, but this year it will be online.

The accepted papers are the following:

Selecting Backtranslated Data from Multiple Sources for improved Neural Machine Translation (Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way): We analyse the impact that data backtranslated with diverse systems has on eu-es and de-en clinical domain NMT, and employ data selection (DS) to optimise the synthetic corpus. We further rescore the output of DS by considering the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora.

On the Cross-lingual Transferability of Monolingual Representations (Mikel Artetxe, Sebastian Ruder, Dani Yogatama): We challenge common beliefs of why multilingual BERT works by showing that a monolingual BERT model can also be transferred to new languages at the lexical level.

A Call for More Rigor in Unsupervised Cross-lingual Learning (Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre): In this position paper, we review motivations, definition, approaches and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them.

DoQA – Accessing Domain-Specific FAQs via Conversational QA (Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre): We present DoQA, a dataset for accessing FAQs via conversational Question Answering, showing that it is possible to build high quality conversational QA systems for accessing FAQs without in-domain training data.

A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation (Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak): We introduce a novel methodology to efficiently construct a corpus for question answering over structured data, with threefold manual annotation speed gains compared to previous schemes such as Spider. Our method also produces fine-grained alignment of query tokens to parsing operations. We train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.

Congratulations to all the authors!

One of the best three papers on Clinical NLP in 2017 was published by Ixa Group
https://www.ehu.eus/ehusfera/ixa/2019/06/28/one-of-the-best-three-papers-on-clinical-nlp-in-2017-was-published-by-ixa-group/
Fri, 28 Jun 2019

A paper written by IXA members Arantza Casillas, Koldo Gojenola, Maite Oronoz and Alicia Pérez is among the three best papers published in 2017 in the field of clinical Natural Language Processing.

The paper entitled “Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora”, by Pérez A, Weegar R, Casillas A, Gojenola K, Oronoz M, Dalianis H., published in the Journal of Biomedical Informatics, was considered one of the three best papers in the field of clinical Natural Language Processing in 2017.

A survey of the literature was performed in bibliographic databases. PubMed and Association of Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to come to the three best clinical NLP papers for 2017.

The paper addresses “medical named entity recognition in clinical text in Spanish and Swedish; furthermore, they emphasize methods’ contribution in a context where little training data is available, which is often the case for languages other than English or when a new medical specialty is explored”.

The selection process is described in “Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook”, by Aurélie Névéol and Pierre Zweigenbaum, published in the Yearbook of Medical Informatics.

Best Paper Award at CoNLL 2018
https://www.ehu.eus/ehusfera/ixa/2018/11/08/best-paper-award-on-conll2018/
Thu, 08 Nov 2018

Last week our colleagues Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre received the Best Paper Award at the 22nd Conference on Computational Natural Language Learning (CoNLL 2018) for the paper “Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation”.

Congratulations!

Abstract:
Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations in downstream tasks is higher for unsupervised systems than for supervised ones.
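The linear transformation mentioned in the abstract can be sketched as follows. This is only one plausible instantiation, not necessarily the paper's exact formulation: we post-process the embedding matrix X with a fractional power of its d-by-d similarity-structure matrix, X' = X (XᵀX)^α, so that α = 0 leaves the embeddings unchanged and other values re-weight the principal directions of the space.

```python
import numpy as np

def linear_transform(X, alpha):
    """Post-process embeddings X (n x d) with the linear map (X^T X)^alpha.

    alpha = 0 leaves the embeddings unchanged; other values adjust the
    similarity order of the space. Sketch only: the exact transformation
    used in the paper may differ.
    """
    M = X.T @ X                       # d x d symmetric PSD matrix
    w, Q = np.linalg.eigh(M)          # eigendecomposition M = Q diag(w) Q^T
    w = np.clip(w, 1e-12, None)       # guard against tiny negative eigenvalues
    return X @ (Q * w**alpha) @ Q.T   # X Q diag(w^alpha) Q^T

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
X0 = linear_transform(X, 0.0)         # alpha = 0 recovers the original X
```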

UncoVec:
UncoVec is an open-source implementation on GitHub of our word-embedding post-processing and evaluation framework, described in the paper.

Science journal: ‘Ixa opens a new research avenue: Machine Translation without a dictionary?’
https://www.ehu.eus/ehusfera/ixa/2017/11/29/science-journal-ixa-opens-a-new-research-avenue-machine-translation-without-a-dictionary/
Wed, 29 Nov 2017

Science reported this week on the work recently published by our colleagues Mikel Artetxe, Eneko Agirre and Gorka Labaka: “Artificial intelligence goes bilingual—without a dictionary”
On October 30th, our three colleagues published a preprint entitled “Unsupervised Neural Machine Translation”, written in collaboration with Kyunghyun Cho.
One day later, G. Lample and colleagues published another paper with similar content, entitled “Unsupervised Machine Translation Using Monolingual Corpora Only”. Both papers are under consideration at ICLR 2018.
These are some sentences written by Matthew Hutson, a freelance writer covering technology for Science:

[…] two new papers show that neural networks can learn to translate with no parallel texts—a surprising advance that could make documents in many languages more accessible.

[…] “Imagine that you give one person lots of Chinese books and lots of Arabic books—none of them overlapping—and the person has to learn to translate Chinese to Arabic. That seems impossible, right?” says the first author of one study, Mikel Artetxe, a computer scientist at the University of the Basque Country (UPV) in San Sebastián, Spain. “But we show that a computer can do that.”

[…]  “This is in infancy,” Artetxe’s co-author Eneko Agirre cautions. “We just opened a new research avenue, so we don’t know where it’s heading.”

[…] Artetxe says the fact that his method and Lample’s—uploaded to arXiv within a day of each other—are so similar is surprising. “But at the same time, it’s great. It means the approach is really in the right direction.”

Congratulations Mikel, Eneko, Gorka and Kyunghyun!

Best paper award at SEPLN 2017
https://www.ehu.eus/ehusfera/ixa/2017/09/26/best-paper-award-in-sepln2017/
Tue, 26 Sep 2017

Last week, our colleagues Begoña Altuna, María Jesús Aranzabe, and Arantza Diaz de Ilarraza received the best paper award at the 33rd International Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), held in Murcia.

CONGRATULATIONS!

The paper is available here: EusHeidelTime: Time Expression Extraction and Normalisation for Basque

Temporal information helps to organise the information in texts, as it places actions and states in time. It is therefore very important to identify the time points and intervals in a text, as well as the times they refer to. We developed EusHeidelTime for Basque time expression extraction and normalisation. To this end, we analysed time expressions in Basque, created the rules and resources for the tool, and built corpora for development and testing. We finally ran an experiment to evaluate EusHeidelTime’s performance. We achieved satisfactory results and proved the adaptability of the tool to morphologically rich languages.
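As a toy illustration of what rule-based extraction and normalisation involves (this is not EusHeidelTime's actual rule format; it is a single hypothetical English rule mapping date expressions to their ISO-8601/TIMEX3 value):

```python
import re

# One toy rule in the spirit of HeidelTime-style extraction + normalisation
# (illustration only; the real tool uses its own rule and resource format).
MONTHS = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
          "June": 6, "July": 7, "August": 8, "September": 9,
          "October": 10, "November": 11, "December": 12}

PATTERN = re.compile(r"(January|February|March|April|May|June|July|August|"
                     r"September|October|November|December) (\d{1,2}), (\d{4})")

def normalise(text):
    """Extract date expressions and map each to its ISO-8601 value."""
    return [f"{int(year):04d}-{MONTHS[month]:02d}-{int(day):02d}"
            for month, day, year in PATTERN.findall(text)]

print(normalise("The award was given on September 26, 2017."))
# ['2017-09-26']
```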
Our papers in Japan (COLING 2016)
https://www.ehu.eus/ehusfera/ixa/2016/12/12/our-papers-in-japan-coling-2016/
Mon, 12 Dec 2016

These are our six papers at COLING 2016, taking place in Osaka, Japan, on Dec 11 2016:

Machine Learning for Metrical Analysis of English Poetry (Manex Agirrezabal, Inaki Alegria and Mans Hulden)
Using Linguistic Data for English and Spanish Verb-Noun Combination Identification (Uxoa Iñurrieta, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola, Itziar [...])