Collaboration – Ixa Group. Language Technology.

Workshop: Resources and tools for the automatic processing of the languages of the Pyrenees (2021-05-12, Online, Free)

Kepa Sarasola — Fri, 30 Apr 2021 17:52:56 +0000

The European project EFA 227/16/LINGUATEC “Development of cross-border cooperation and knowledge transfer in language technologies” is organizing a workshop open to all researchers, with the aim of disseminating the work carried out within the project and presenting some of the advances made for Basque and Occitan.
This project is co-financed by the European Regional Development Fund (ERDF).

Registration is free. Please use this registration form.

12 May, 2021
Online, with presentations in English, Spanish and French, with simultaneous translation into English, Spanish and French.

10h – Opening

10h15 Invited talks: Catalan processing

Lluis Padró (Universitat Politècnica de Catalunya)
Morphological and Syntactic Resources in FreeLing
Presentation in English – Simultaneous translation in Spanish and French

Mariona Taulé (Universitat de Barcelona)
AnCora: un corpus anotado a diferentes niveles lingüísticos
AnCora: a corpus annotated at different linguistic levels
Presentation in Spanish – Simultaneous translation in English and French

11h15 — Break

11h30 Presentations: Corpora for Occitan, Basque and other under-resourced languages

Assaf Urieli, Joliciel
Talismane, Jochre: automatic syntax analysis and OCR for under-resourced languages
Presentation in English – Simultaneous translation in Spanish and French

Aleksandra Miletic and Dejan Stosic, CLLE
Mutualisation des ressources pour la création de treebanks : le cas du serbe et de l’occitan
Pooling resources for the creation of syntactic tree banks: the case of Serbian and Occitan
Presentation in French – Simultaneous translation in English and Spanish

Ainara Estarrona (IXA, HiTZ, UPV/EHU)
Construcción del corpus histórico en euskera
Construction of a historical corpus in Basque
Presentation in Spanish – Simultaneous translation in English and French

13h — Break

14h30 Invited talk: Use of Neural Networks

Mans Hulden (University of Colorado)
Neural Networks in Linguistic Research
Presentation in English – Simultaneous translation in Spanish and French

15h30 Presentation: Language processing

Rodrigo Agerri (IXA, HiTZ, UPV/EHU)
Contextual lemmatization for inflected languages: statistical and deep-learning approaches
Presentation in English – Simultaneous translation in Spanish and French

16h – Break

16h15 – Presentations: Results of the LINGUATEC project

Myriam Bras, Aleksandra Miletic, Marianne Vergez-Couret, Clamença Poujade, Jean Sibille, Louise Esher, CLLE
Automatic processing of Occitan: construction of the first annotated corpora
Video in Occitan with accessible subtitles in English, Spanish and French

Elhuyar
Creation and improvement of Basque resources within the framework of Linguatec
Video in Occitan with accessible subtitles in English, Spanish and French

16h45 – Conclusions

Presentation in Spanish and French – No simultaneous translation

17h – Closing

IXAmBERT: Good news for languages with few resources!

Kepa Sarasola — Wed, 30 Sep 2020 18:10:32 +0000

Good news for languages with few resources!
Pre-trained Basque monolingual and multilingual language models have proven to be very useful in NLP tasks for Basque!
This holds even though they were built with a corpus 500 times smaller than the English one and a Wikipedia 80 times smaller.

An example of Conversational Question Answering, and its transcription to English.

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque.

Last April we showed that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora (crawled news articles from online newspapers) produced much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work was presented in the paper entitled “Give your Text Representation Models some Love: the Case for Basque”. The composition of the Basque Media Corpus (BMC) used in that experiment was as follows:

Source	Text type	Tokens
Basque Wikipedia	Encyclopedia	35M
Berria newspaper	News	81M
EiTB	News	28M
Argia magazine	News	16M
Local news sites	News	224.6M

Bear in mind that the original BERT language model for English was trained using the Google Books corpus, which contains 155 billion words of American English and 34 billion words of British English. The English corpus is almost 500 times bigger than the Basque one.
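The "almost 500 times" figure can be checked with a few lines of arithmetic. The sketch below simply adds up the token counts from the table and the English word counts quoted in this post; it assumes that words and tokens are roughly comparable units, which is a simplification.

```python
# Token counts (in millions) for the Basque Media Corpus, copied from the table above.
bmc_million_tokens = {
    "Basque Wikipedia": 35.0,
    "Berria newspaper": 81.0,
    "EiTB": 28.0,
    "Argia magazine": 16.0,
    "Local news sites": 224.6,
}

# English figures quoted in this post (in billions of words).
english_billion_words = {"American English": 155.0, "British English": 34.0}

basque_total = sum(bmc_million_tokens.values()) * 1e6
english_total = sum(english_billion_words.values()) * 1e9
ratio = english_total / basque_total

print(f"Basque corpus:  {basque_total / 1e6:.1f}M tokens")
print(f"English corpus: {english_total / 1e9:.0f}B words")
print(f"English / Basque ratio: about {ratio:.0f}x")
```

With these inputs the ratio comes out slightly above 490, in line with the rounded figure given in the text.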

Agerri, San Vicente, Campos, Otegi, Barrena, Saralegi, Soroa, E. Agirre

An example of a dialogue where there are many references in the questions to previous answers in the dialogue.

Now, in September, we have published IXAmBERT, a multilingual language model pretrained for English, Spanish and Basque, and we have successfully used it in a Basque Conversational Question Answering system. These transfer experiments could already have been performed with Google’s official mBERT model, but because it covers so many languages, Basque is not well represented in it. To create this new multilingual model, which contains just English, Spanish and Basque, we followed the same configuration as in the BERTeus model presented in April. We re-used the corpus of the monolingual Basque model and added the English and Spanish Wikipedias, with 2.5G and 650M tokens respectively. These Wikipedias are about 80 and 20 times bigger than the Basque one.

The good news is that this model has been successfully used to transfer knowledge from English to Basque in a conversational Question/Answering system, as reported in the paper Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque. In the paper, the new language model called IXAmBERT performed better than mBERT when transferring knowledge from English to Basque, as shown in the following table:

Model	Zero-shot	Transfer learning
Baseline	28.7	28.7
mBERT	31.5	37.4
IXAmBERT	38.9	41.2
mBERT + history	33.3	28.7
IXAmBERT + history	40.7	40.0

This table shows the results on a Basque Conversational Question Answering (CQA) dataset. Zero-shot means that the model is fine-tuned only on QuAC, an English CQA dataset. In the Transfer Learning setting the model is first fine-tuned on QuAC and then on a Basque CQA dataset.
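To make the comparison in the table easier to read, the following sketch computes the difference between two rows in each setting. The scores are copied from the table; the helper name `gain` is just an illustrative choice.

```python
# Scores copied from the table above: (zero-shot, transfer learning).
scores = {
    "Baseline": (28.7, 28.7),
    "mBERT": (31.5, 37.4),
    "IXAmBERT": (38.9, 41.2),
    "mBERT + history": (33.3, 28.7),
    "IXAmBERT + history": (40.7, 40.0),
}


def gain(model, reference):
    """Return the score difference (model - reference) for both settings."""
    zero_shot = scores[model][0] - scores[reference][0]
    transfer = scores[model][1] - scores[reference][1]
    # Round to one decimal to hide floating-point noise.
    return round(zero_shot, 1), round(transfer, 1)


for model, reference in [("IXAmBERT", "mBERT"),
                         ("IXAmBERT + history", "mBERT + history")]:
    zs, tl = gain(model, reference)
    print(f"{model} vs {reference}: zero-shot {zs:+.1f}, transfer learning {tl:+.1f}")
```

In the plain setting IXAmBERT is ahead of mBERT by 7.4 points zero-shot and by 3.8 points with transfer learning.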

These works set a new state-of-the-art in those tasks for Basque.
All benchmarks and models used in this work are publicly available: https://huggingface.co/ixa-ehu/ixambert-base-cased

Five papers accepted at the 58th Annual Meeting of the Association for Computational Linguistics

ITZIAR GONZALEZ DIOS — Tue, 05 May 2020 07:21:01 +0000

Members of the Ixa group and their collaborators will present five papers at the 58th Annual Meeting of the Association for Computational Linguistics (ACL). ACL is one of the most important conferences on Natural Language Processing. It was to be held in July in Seattle, but this year it will be online.

The accepted papers are the following:

– Selecting Backtranslated Data from Multiple Sources for improved Neural Machine Translation (Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way): We analyse the impact that data backtranslated with diverse systems has on eu-es and de-en clinical domain NMT, and employ data selection (DS) to optimise the synthetic corpus. We further rescore the output of DS by considering the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora.

– On the Cross-lingual Transferability of Monolingual Representations (Mikel Artetxe, Sebastian Ruder, Dani Yogatama): We challenge common beliefs of why multilingual BERT works by showing that a monolingual BERT model can also be transferred to new languages at the lexical level.

– A Call for More Rigor in Unsupervised Cross-lingual Learning (Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre): In this position paper, we review motivations, definition, approaches and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them.

– DoQA – Accessing Domain-Specific FAQs via Conversational QA (Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre): We present DoQA, a dataset for accessing FAQs via conversational Question Answering, showing that it is possible to build high quality conversational QA systems for accessing FAQs without in-domain training data.

– A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation (Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak): We introduce a novel methodology to efficiently construct a corpus for question answering over structured data, with threefold manual annotation speed gains compared to previous schemes such as Spider. Our method also produces fine-grained alignment of query tokens to parsing operations. We train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.

Congratulations to all the authors!

Visitor: Andrea Horbach, Automatic scoring

Kepa Sarasola — Fri, 20 Sep 2019 06:45:05 +0000

Andrea Horbach is visiting San Sebastian within the enetCollect network on crowdsourcing for language learning, as part of an ongoing collaboration with Itziar Aldabe, Oier Lopez de Lacalle and Montse Maritxalar on evaluating both manually and automatically generated reading comprehension questions.

Andrea Horbach is a researcher at the Language Technology Lab headed by Prof. Torsten Zesch at the University of Duisburg-Essen, Germany. Last year, she defended her PhD thesis in computational linguistics, titled “Analyzing Short-Answer Questions and their Automatic Scoring: Studies on Semantic Relations in Reading Comprehension and the Reduction of Human Annotation Effort”, at Saarland University. Her main research interests include educational NLP, such as automatic scoring and exercise generation, as well as the processing of non-standard language.

Last Tuesday (2019-09-17) she gave us a talk entitled “Automated and Assisted Content Scoring in Mono- and Cross-Lingual Educational Settings”.
Summary:

Automatic content scoring of free-text answers aims to reduce the scoring workload of teachers and to provide consistency in scoring. In high-stakes tests, fully automatic scoring is often not an option. Nevertheless, teachers can benefit from assisted scoring, where they are supported by NLP but are still in control of the scoring process. This talk presents ongoing work from two research projects related to educational scoring. First, we investigate content scoring in a cross-lingual setup, where a model trained on data in one language is applied to new data in a different language in order to foster educational equality as well as to overcome data sparseness. We present our cross-lingual data collection, as well as machine learning experiments using machine translation to bridge the language gap.

In the second part of the talk we present work on assisted scoring of listening comprehension data from language proficiency testing. We show assisted scoring studies where teachers are supported in scoring answers by the use of clustering techniques.

One of the best three papers on Clinical NLP in 2017 was published by Ixa Group

Kepa Sarasola — Fri, 28 Jun 2019 19:51:40 +0000

A paper written by IXA members Arantza Casillas, Koldo Gojenola, Maite Oronoz and Alicia Perez has been selected as one of the three best papers published in 2017 in the field of clinical Natural Language Processing.

The paper entitled “Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora”, by Pérez A, Weegar R, Casillas A, Gojenola K, Oronoz M, Dalianis H., published in the Journal of Biomedical Informatics, was considered one of the best three papers in the field of clinical Natural Language Processing in 2017.

A survey of the literature was performed in bibliographic databases. PubMed and the Association for Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to arrive at the three best clinical NLP papers for 2017.

The paper addresses “medical named entity recognition in clinical text in Spanish and Swedish; furthermore, they emphasize methods’ contribution in a context where little training data is available, which is often the case for languages other than English or when a new medical specialty is explored”.

The selection process is described and published in “Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook”, by Aurélie Névéol and Pierre Zweigenbaum, in the Yearbook of Medical Informatics.

European Parliament endorses the report on Language Equality in the Digital Age (2018-09-11)

Kepa Sarasola — Thu, 13 Sep 2018 16:12:26 +0000

This week, the report on Language Equality in the Digital Age that was presented by Jill Evans MEP of Wales was overwhelmingly endorsed by the European Parliament with 592 MEPs voting in favour, and with only 45 against and 44 abstentions. Find here a link to a press release about the vote: https://www.greens-efa.eu/en/article/press/victory-for-language-equality-in-the-european-parliament/

CONGRATULATIONS!

https://youtu.be/MqRBloVr5N4

The report endorsed by EuroParl

This report is not a law, but it is a declaration by the European Parliament that can serve as a guiding reference for all countries. Until now there have been no laws or declarations of the European Parliament protecting low-resourced languages, so everything has remained in the hands of each country’s local legislation, which could ignore these languages without difficulty. So although it is not a law, this Europarl report is a step forward.

Jill Evans MEP said:

“I am pleased that the European Parliament agrees with my view that action needs to be taken to address the digital gap between European languages.
“European citizens must be able to access and use the digital world in their own languages, including in minority languages. This will require investment and leadership at the EU level.
“This is a huge opportunity for the EU to demonstrate a real commitment to language equality, for the speakers of all of Europe’s languages, including Welsh.”

The report calls on the EU to:

improve the institutional frameworks for language technology policies,
create new research policies to increase the use of language technology in Europe,
use education policies in order to secure the future of language equality in the digital age,
increase the support for both private companies and public bodies to make better use of language technologies.

Last January, Maite Melero representing Catalan, Delyth Prys representing Welsh, and Iñaki Irazabalbeitia and Kepa Sarasola representing Basque participated in the creation of the first draft of the report.

A conference with the same title, Language equality in the digital age, will be held on September 27th in the European Parliament to show MEPs the opportunities this technology offers to European languages. Jill Evans, Maite Melero, Delyth Prys and our colleague Montse Maritxalar from Ixa Group are going to participate. (See the schedule here.)

Be a friend of the Minority SafePack! We need your signature!

Kepa Sarasola — Thu, 15 Mar 2018 00:26:32 +0000

We call upon the EU to adopt a set of legal acts to improve the protection of persons belonging to national and linguistic minorities and strengthen cultural and linguistic diversity in the Union. It shall include policy actions in the areas of regional and minority languages, education and culture, regional policy, participation, equality, audiovisual and other media content, and also regional (state) support.

A European citizens’ initiative is an invitation to the European Commission to propose legislation on matters where the EU has competence to legislate. A citizens’ initiative has to be backed by at least one million EU citizens, coming from at least 7 out of the 28 member states. A minimum number of signatories is required in each of those 7 member states.

The ‘Minority SafePack’ initiative has gathered 849,888 signatures. Another 150,000 are needed within two weeks.

You can sign here

(minority-safepack.eu)

In the European Union there are about 50 million people who belong to a national minority or a minority language community.

STOP LANGUAGES IN EUROPE FROM BECOMING EXTINCT!
CULTURES ARE EQUAL
LANGUAGE EQUALITY
LIKE AT HOME IN ANOTHER REGION
FREE PASSAGE OF AUDIOVISUAL CONTENT
http://www.minority-safepack.eu/main/index

Presentation: Research groups in the Faculty of Informatics (2017-10-10, 10:00-11:10)

Kepa Sarasola — Mon, 09 Oct 2017 12:49:57 +0000

Tomorrow morning the research groups in the Faculty of Informatics will present their work to the students.

Date: Tuesday, October 10
Time: 10:05-11:10
Where: Ada-Lovelace room
Audience: 3rd- and 4th-year students
Subject: Presentation of research subjects and groups in the Faculty.
IXA Group’s collaboration with students: job opportunities for undergraduate students, scholarships…

Research groups in the Faculty of Informatics

Nora Aranberri: Machine Translation for Translators (Innsbruck, 2016-07-20)

Kepa Sarasola — Sun, 24 Jul 2016 16:58:30 +0000

Our colleague Nora Aranberri taught the workshop on “Machine Translation for Translators: Taking Advantage of the New Technology” at SummerTrans 2016.

The International Translation Summer School SummerTrans was founded in Innsbruck in 2004. From 11 to 20 July 2016 the University of Innsbruck hosted the 7th International Translation Summer School, “SummerTrans VII: Quality and Competence in Translation”. Addressing trainee translators, professional translators and translation researchers alike, its varied programme featured cutting-edge courses and workshops aiming to advance participants’ theoretical knowledge of and practical skills in translation and interpreting, including state-of-the-art translation technology and human-machine interaction in translation.
SummerTrans VII welcomed more than 60 participants from 16 countries, ranging from Tunisia across half of Europe to India and China.
Michael Ustaszewski, one of our students in the Erasmus Mundus LCT master's programme (2014-2016), is now a lecturer at the University of Innsbruck and one of the organizers of SummerTrans 2016.
Michael told us that the workshop participants are now familiar with state-of-the-art translation technology and human-machine interaction in translation.

Nice results in Codefestdss2016 projects

Kepa Sarasola — Sat, 09 Jul 2016 08:11:05 +0000

This is a list of the aims of the projects in the CODEFEST 2016 summer school and the results achieved by each of them. Further information can be found on the Codefest_dss2016 website.

Quiz Bowl: Multilingual question-answering for trivia games with Wikipedia

The Quiz Bowl team was the winner of our codefest competition. Congratulations!

Aims: The question-answering trivia quiz project is in progress. To build the first game prototype, the team is using some of the questions translated into Basque on Monday. This prototype matches Basque Wikipedia articles against the questions or hints from the quiz, so that the answer to a hint pops out as an article.
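The matching idea described above can be sketched in a few lines. This is only a minimal illustration, not the team's actual code: it assumes a simple TF-IDF word-overlap score, and the article texts and function names (`tokenize`, `best_article`) are made up for the example and written in English for readability.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def best_article(hint, articles):
    """Return the title of the article that best matches the hint (TF-IDF overlap)."""
    counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    # Document frequency: in how many articles does each word occur?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    n_docs = len(counts)

    def score(title):
        total = 0.0
        for word in set(tokenize(hint)):
            tf = counts[title][word]  # 0 if the word does not occur in the article
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
            total += tf * idf
        return total

    return max(counts, key=score)


# Hypothetical toy "encyclopedia" standing in for Wikipedia articles.
toy_articles = {
    "Donostia": "Donostia is a coastal city in the Basque Country, known for its bay.",
    "Bilbao": "Bilbao is the largest city in the Basque Country, home to the Guggenheim Museum.",
    "Pyrenees": "The Pyrenees are a mountain range between France and Spain.",
}

print(best_article("This mountain range separates France from Spain.", toy_articles))
print(best_article("Which city hosts the Guggenheim Museum?", toy_articles))
```

A real system would need language-specific processing for Basque (for example lemmatization, since Basque is highly inflected) and a proper search index over the full Wikipedia.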

Results: We had the chance to play a quiz based on Wikipedia trivia: Human vs. Computer. This time the humans won, but only by a very small margin.

The code is available here: github.com/dss2016eu/codefest/tree/master/quizbowl
References to all the code generated in #codefestdss2016 will also be posted there!

Create a morphological analyzer for your minority language

Aims: To develop a morphological analyzer for the Hungarian language, Ixa group members Iñaki Alegria and Montse Maritxalar have joined in to help with the programming tasks. After creating a list of the lexical roots of Hungarian, they made a selection based on verbs and adjectives, among other criteria. Next, they want to encode that selection as a program in lexc format.

Results: They explained several projects they have been developing over these days, all of them related to machine translation devices: for Hungarian, Buryat (a variety of Mongolian), Rif Berber (a language spoken mostly in Morocco) and Uyghur (a Turkic language spoken in Western China), among others.

NLP for Literature Analysis and Creation

Aims: Members of the group have chosen the name Story buffet for their tools for the analysis and creation of literary texts. The team is made up of linguists, programmers and other experts who consider themselves to be “hybrids” of the two.

On the second day, we had a break so that people from Ixa group (the ones in charge of this project) could explain their work to us. Manex Agirrezabal is an expert on metrical analysis in poetry; therefore, along with his knowledge in programming/coding, he thinks this is a great chance to semantically alter short stories. Originally, Itziar Gonzalez-Dios’ field of study was linguistics, but she has joined the world of programming in the last few years; she is interested in the analysis of the complexity and synthesis of texts.

Results: They showed their webpage (Story buffet) for literature creation and analysis, in quite a humorous way.

Behagunea

Aims: The team has continued developing the Behagunea project, making use of their different abilities. Victor (programmer) has visualized the results of the Ixa-pipes, and he is working on designing an attractive interface. Dani (IT expert) is trying to translate the Ixa-pipes resources into Catalan. Sabrina (linguist), with the help of Iñaki (programmer), is starting an app based on tweets to study what countries think about each other. Finally, due to some problems, Kassandra has decided to put aside one of the projects: the one that aims to include social media in the website DSS2016EU Iritzien Behagunea (Opinion Observatory). Instead, she has chosen to examine the tweets about the DonostiCup football competition.

Results: They have accomplished their goals. Apart from adding new languages (Catalan, Italian) to the Behagunea project, they have managed to combine social media and geolocation.

Enriching ZureTTS platform with new languages

Working hard on #ZureTTS #codefestdss2016

Aims: Several aspects of the ZureTTS project have been addressed. On the one hand, the members of Aholab have focused on developing the platform to include the dialect from Iparralde (the northern side of the Basque Country), and they have started both writing the questions for the voice donors and designing the new interface. Concerning the app for Android, they have spent the day identifying errors and preparing everything required to install the new platform. Finally, the “Ireland team” has translated the webpage interface into Gaelic and contacted some Irish experts within their university to get hold of a good, reliable database.

Results: At the end of the week, apart from adding the Lapurtera (Basque dialect) version to the web, they have made huge progress on Gaelic, thanks especially to the help of the Irish people.

SRL and Dockers

Aims: Members of the SRL project have been structuring a database to add and handle information later on. As Suhail Sarwan says, developments in SRL mean a direct benefit in the field of semantics, particularly if we want to promote and improve the e-learning model. Aided by Rodrigo Agerri, among others, they have worked on the SRL, and Eleanor Dutton intends to develop a tool for linguistic analysis and to apply it to Moroccan Arabic.

Results: They showed us a tool they have developed to identify the participants of the events described by the predicates within a sentence, by sequence tagging methods.
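To illustrate what "identifying participants by sequence tagging" can look like, here is a minimal sketch of the decoding step, assuming that SRL here stands for semantic role labelling and that the tagger emits one BIO tag per token. The tag names (ARG0, ARG1, V, ARGM-TMP) follow a common PropBank-style convention and are an assumption, as is the helper name `decode_bio`; the tagger itself is not shown.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans.

    "B-X" starts a span with label X, "I-X" continues the current X span,
    and anything else (for example "O" or a mismatched "I-" tag) closes it.
    """
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical tagger output for one sentence and one predicate ("built").
tokens = ["The", "team", "built", "a", "tagger", "yesterday"]
tags = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "B-ARGM-TMP"]

for role, text in decode_bio(tokens, tags):
    print(f"{role}: {text}")
```

Each predicate in a sentence would get its own tag sequence, so the decoding is run once per predicate.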

Machine Translation for minority languages

Aims: Each member of the group is focusing on the pair of languages in which he/she is fluent. Based on the program called Apertium, for example, they have started working on a translator for the language combination French-Occitan, so that they can later develop a linguistic analyzer for Occitan. They have also been working on a Tetum-Portuguese translator (the two official languages spoken on the island of Timor) with the same program. Others have started preparing lexical transfers (they will try to do the same with dependency transfers) for the English-Spanish combination using Matxin. This same program also allows the creation of an English-Welsh translator, as well as a translator for English-Basque (one such translator already exists, but some errors must be identified and corrected). The latter will be applied in the field of medicine.
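As a rough illustration of what a lexical transfer step does in a rule-based translator, the sketch below replaces source lemmas with target lemmas from a bilingual dictionary. It does not use the actual dictionary formats or APIs of Apertium or Matxin; the tiny English-Spanish dictionary and the function name `lexical_transfer` are invented for the example.

```python
# Hypothetical toy English -> Spanish lemma dictionary (not a real Matxin resource).
BILINGUAL_DICT = {
    "the": "el",
    "cat": "gato",
    "sleep": "dormir",
}


def lexical_transfer(lemmas, dictionary):
    """Map each source lemma to a target lemma.

    Lemmas missing from the dictionary are kept and prefixed with "*"
    so that gaps in coverage are easy to spot.
    """
    return [dictionary.get(lemma, "*" + lemma) for lemma in lemmas]


print(lexical_transfer(["the", "cat", "sleep"], BILINGUAL_DICT))
print(lexical_transfer(["the", "dog", "sleep"], BILINGUAL_DICT))
```

A full system would follow this with structural (dependency) transfer and morphological generation to produce correctly inflected, correctly ordered target text.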