Ixa Group. Language Technology.

Neural Machine Translation. Open workshop with Kyunghyun Cho (2017-05-29)

Neural Machine Translation
Open workshop with Kyunghyun Cho
Donostia, 2017-05-29

Trends in Neural Machine Translation (Olof Mogren, 2016)

The third generation of machine translation systems is currently under active development. After initially dominating the field, rule-based machine translation (RBMT) systems have been gradually replaced by data-driven approaches in the last two decades, with statistical machine translation (SMT) systems prevailing as the main paradigm. In the last two years, deep learning approaches have significantly impacted the field, with the rise of neural machine translation (NMT) as the new state-of-the-art in automated translation. This event presents advanced results in the field, in particular for machine translation of Basque.
Context:

The MODELA project was created to advance research and development in deep-learning approaches to machine translation and to address the many challenges of Basque machine translation. The project is financed by the Basque Government and is being carried out by the following entities: Ametzagaiña, Elhuyar, ISEA, UPV/EHU (IXA group) and Vicomtech-IK4.

Guest:

The main speaker will be Kyunghyun Cho (Center for Data Science, New York University), an eminent researcher in the area and the most cited researcher on NMT, a field in which he has obtained a Google prize. He is also a brilliant speaker.

Date: May 29, 2017, 11:00
Place: UPV/EHUko Informatika Fakultatea, Manuel de Lardizabal 1, 20018 Donostia
Schedule:

11.00-11.15: Introduction and presentation of the project
11.15-12.30: Neural Machine Translation (Kyunghyun Cho)
12.30-13.15: First results in the Modela project

Sponsor: Modela project and University of the Basque Country

The next day, Tuesday May 30, at 15:00, he will meet the students of our Master in Language Technology.

PhD position in Innsbruck with Michael Ustaszewski

After finishing our Erasmus Mundus LCT Master in 2016, Michael Ustaszewski is now a postdoc assistant at the University of Innsbruck and Unit Manager (liaison with the Department of Translation Studies) at the Innsbruck Translation Centre. His group works on Corpus-Based Translation and asked us to publish this Call for PhD Position Candidates:

The Department of Translation Studies at the University of Innsbruck invites applications for a PhD position in the framework of the two-year research project “TransBank: A Meta-Corpus for Translation Research” funded by the Austrian Academy of Sciences.

The goal of the project is to build a large, open and expandable bank of translated texts and their original texts. Its main innovative feature is the ability to exploit a rich set of metadata labels characterising each text and text pair for the compilation and download of sub-corpora, tailored to the requirements of specific translation-related research questions.

The PhD student will be involved in all stages of the corpus building process, thus having the opportunity to gather translation data relevant to his/her specific research interest. The student will work autonomously on the development of the metadata labelset and on collecting translation data, on the basis of which he or she will conduct quantitative and/or qualitative analyses for his/her thesis. Work will be carried out in close collaboration with the project’s two principal investigators and two MA students.

The following requirements are looked for in the successful candidate:

Master’s degree in Translation Studies, Corpus Linguistics, Computational Linguistics or a related field
proven familiarity with translation theory
strong interest in data-driven research methodologies and linguistic annotation
excellent teamwork skills
proficiency in English on a level suitable for written and spoken scientific communication
solid programming skills in a scripting language (e.g. Python) will be an asset, as will knowledge of German or any other language(s)

The two-year position with a weekly working time of 20 hours (50%) commences in September 2017 and offers an annual stipend of € 19,117 plus allowances for conference attendance. The position involves enrolment in the PhD programme in Linguistics and Media Studies at the University of Innsbruck.

Applications should include:

1. A cover letter (1 page maximum) that relates the candidate’s experience and interest in the TransBank project
2. A two-page thesis proposal describing the research question and methodology underlying the candidate’s envisaged analyses using TransBank data
3. A CV listing any publications
4. Copies of relevant diplomas and certificates
5. A recommendation letter by the candidate’s MA thesis supervisor or a university professor
6. A copy of the MA thesis or the latest draft

To apply, please submit the documents in two PDF files (one containing documents 1 to 5, one containing document 6) by 10 April 2017 via the upload form at http://transbank.info/jobs

Shortlisted applicants will be interviewed in person or via Skype towards the end of April.

Further information:

Details on the research project can be found on the project website http://www.transbank.info
For enquiries about the position and the application process, please contact mail[at]transbank.info
Information about the Department of Translation Studies at the University of Innsbruck: http://translation.uibk.ac.at
For information on the PhD programme in Linguistics and Media Studies at the University of Innsbruck and the enrolment process, please refer to
https://www.uibk.ac.at/studium/angebot/phd-sprach-und-medienwissenschaft/index.html.en

Mikel Artetxe awarded in Hackathon on Language Technologies organized by Red.es

Yesterday in Barcelona, Mikel Artetxe was awarded the second prize in the First Hackathon on Language Technologies, organized by Red.es in collaboration with the Spanish Plan to promote Language Technology managed by the Spanish Government’s SESIAD agency.

This hackathon was organized in the context of “4 Years From Now” (4YFN), the business platform created by Mobile World Capital Barcelona to promote technological startups. Several IXA members participated as organizers (German Rigau, Iñaki Alegria and Rodrigo Agerri).

Eight projects participated in the final session yesterday in Barcelona. Mikel developed a free alternative that allows the automatic creation of bilingual dictionaries offering examples with real uses of words (an application similar to Linguee).

German Rigau keynote speaker in the JRC Conference TEXT MINING IN POLICY MAKING

IXA Group member German Rigau participated as keynote speaker last Monday in the JRC Conference “TEXT MINING IN POLICY MAKING”, organised by the European Commission in Brussels to present the new JRC competence centre on text mining. The conference was organized around a showcase of various success stories of text mining solutions applied by the JRC. German Rigau addressed challenges related to textual data.

“This conference was an opportunity for policy makers from EU institutions to understand better the benefits of text mining in policy making processes, and pave the way forward for a better use of these solutions in policy making.

Information needed by policy makers is increasingly embedded in large amounts of textual data available on the Internet, e.g. traditional or social media, or in large public or proprietary document sets.

Text mining, the automatic extraction of information from text, offers policy makers timely access to important information which would otherwise be inaccessible. Indeed, the sheer volume of data makes it nearly impossible to extract the available information manually.”

Our papers in Japan (COLING 2016)

These are our six papers at COLING 2016, taking place in Osaka, Japan, on Dec 11, 2016.

Machine Learning for Metrical Analysis of English Poetry
Manex Agirrezabal, Inaki Alegria and Mans Hulden
Using Linguistic Data for English and Spanish Verb-Noun Combination Identification
Uxoa Iñurrieta, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola, Itziar Aduriz and John Carroll
Improving Translation Selection with Supersenses
Haiqing Tang, Deyi Xiong, Oier Lopez de Lacalle and Eneko Agirre
The impact of simple feature engineering in multilingual medical NER
Rebecka Weegar, Arantza Casillas, Arantza Diaz de Ilarraza, Maite Oronoz, Alicia Pérez and Koldo Gojenola
Clinical Natural Language Processing Workshop
A Preliminary Study of Statistically Predictive Syntactic Complexity Features and Manual Simplifications in Basque
Itziar Gonzalez-Dios, María Jesús Aranzabe and Arantza Díaz de Ilarraza
Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Comparing two Basic Methods for Discriminating Between Similar Languages and Varieties
Pablo Gamallo, José Ramom Pichel, Iñaki Alegria and Manex Agirrezabal
Third Workshop on NLP for Similar Languages, Varieties and Dialects

HAP/LAP master theses (2016-09-27)

Master HAP/LAP
EMLCT master
Master thesis defences

Date: September 27th
Place: Ada Lovelace room

15:30
Universal Dependencies for Buryat.
Author: Elena Badmaeva
Supervisors: Koldo Gojenola and Gosse Bouma

16:15
LexSynSimpleText, a lexical and syntactic simplifier: first steps.
Author: Maria Eguimendia
Supervisors: Arantza Diaz de Ilarraza and Gosse Bouma

17:00
Data Sparsity in Highly Inflected Languages: The Case of Morphosyntactic Tagging in Polish.
Author: Michael Ustaszewski
Supervisors: Rodrigo Agerri and German Rigau

17:45
Multilingual Central Repository version 3.0: improving a very large lexical knowledge base.
Author: Daniel Parera Perez
Supervisor: German Rigau Claramunt

Book: Microparameters in the Grammar of Basque

Edited by Beatriz Fernández (UPV/EHU) and Jon Ortiz de Urbina (Deusto University), this book is an endeavor to present and analyze some standard topics in the grammar of Basque from a micro-comparative perspective. From case and agreement to word order and the left periphery, and including an incursion into determiners, the book combines fine-grained theoretical analyses with empirically detailed descriptions. Working from a micro-parametric perspective, the contributions to the volume address in depth some of the exuberant variation attested in the different dialects and subdialects of Basque. At the same time, although the contributions focus mainly on Basque data, cross-linguistic evidence is also presented and discussed.
After all, the goal pursued in this book is to attempt to explain variation in Basque as a particular instantiation of variation in human language at large. The volume presents and analyzes a wide range of empirical phenomena, many typologically marked among European languages, and will therefore be a welcome resource to linguists looking for detailed description and/or theoretical discussion.

Nora Aranberri: Machine Translation for Translators (Innsbruck, 2016-07-20)

Our colleague Nora Aranberri has been the lecturer in the workshop on “Machine Translation for Translators: Taking Advantage of the New Technology” at SummerTrans 2016.

The International Translation Summer School SummerTrans was founded in Innsbruck in 2004. From 11 to 20 July 2016 the University of Innsbruck hosted the 7th International Translation Summer School “SummerTrans VII: Quality and Competence in Translation”. Addressing trainee translators, professional translators and translation researchers alike, its varied programme featured cutting-edge courses and workshops aiming to advance participants’ theoretical knowledge of and practical skills in translation and interpreting, including state-of-the-art translation technology and human-machine interaction in translation.
SummerTrans VII welcomed more than 60 participants from 16 countries, ranging from Tunisia, across half of Europe, to India and China.
Michael Ustaszewski, one of our students in the Erasmus Mundus LCT master (2014-2016), is now a lecturer at the University of Innsbruck and one of the organizers of SummerTrans 2016 🙂
Michael told us that the workshop participants are now familiar with state-of-the-art translation technology and human-machine interaction in translation.

Nice results in Codefestdss2016 projects

This is a list of the aims of the projects in the CODEFEST 2016 summer school and the results achieved by each of them. Further information can be found on the Codefest_dss2016 website.

Quiz Bowl: Multilingual question-answering for trivia games with Wikipedia

The QUIZ Bowl team was the winner in our codefest competition. Congratulations!

Aims: The question-answering trivia quiz project is in progress. To build the first game prototype, the team is using some of the questions translated into Basque on Monday. This prototype matches the Basque Wikipedia articles with the questions or hints from the quiz, so that the answer to the hint pops out as an article.

Results: We had the chance to play a quiz based on Wikipedia trivia: Human vs. Computer. This time humans have been the winners, but only by a very small margin.

The code is available here: github.com/dss2016eu/codefest/tree/master/quizbowl
References to all the code generated in #codefestdss2016 will also be posted there!
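The matching step described in the aims (a hint goes in, the best-matching article title comes out) can be illustrated with a small retrieval sketch. This is not the team's code, which lives in the repository above. It assumes a plain TF-IDF bag-of-words model with cosine similarity, and the three short English texts in `ARTICLES` are toy stand-ins for Basque Wikipedia articles; all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stand-ins for Wikipedia articles (assumption: title -> plain text).
ARTICLES = {
    "Donostia": "Donostia is a coastal city on the Bay of Biscay known for its beaches and film festival",
    "Bilbo": "Bilbo is the largest city in Biscay and home to the Guggenheim museum",
    "Txakoli": "Txakoli is a slightly sparkling dry white wine produced in the Basque Country",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately crude)."""
    return re.findall(r"\w+", text.lower())


def build_index(articles):
    """Return (TF-IDF vector per title, IDF table) for a dict of articles."""
    counts_by_title = {t: Counter(tokenize(body)) for t, body in articles.items()}
    n_docs = len(counts_by_title)
    doc_freq = Counter()
    for counts in counts_by_title.values():
        doc_freq.update(counts.keys())
    # Smoothed IDF so that terms present in every article keep a small weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        title: {term: c * idf[term] for term, c in counts.items()}
        for title, counts in counts_by_title.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer(hint, vectors, idf):
    """Return the title of the article most similar to the hint."""
    counts = Counter(tokenize(hint))
    # Words never seen in any article carry no information and are dropped.
    query = {term: c * idf[term] for term, c in counts.items() if term in idf}
    return max(vectors, key=lambda title: cosine(query, vectors[title]))


vectors, idf = build_index(ARTICLES)
print(answer("Which city hosts a famous film festival by the beaches?", vectors, idf))
```

If no word of the hint appears in any article, every similarity is zero and the first title is returned, so a real game would need a fallback for that case.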

Create a morphological analyzer for your minority language

Aims: To develop a morphological analyzer for the Hungarian language, Ixa group members Iñaki Alegria and Montse Maritxalar have joined the team to help with the programming tasks. After creating a list of the lexical roots of Hungarian, they have made a selection based on verbs and adjectives, among other criteria. Next, they want to encode that selection in lexc format.

Results: They have explained several projects they have been developing during these days, all of them related to machine translation tools: for Hungarian, Buryat (a variety of Mongolian), Rif Berber (a language spoken mostly in Morocco) and Uyghur (a Turkic language spoken in Western China), among others.
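As an illustration of what encoding a root selection in lexc format can look like, here is a minimal sketch that turns a small dictionary of roots into a lexc source file. It is not the team's program. The lexicon names `Verbs` and `Adjectives`, the function name and the two Hungarian roots are assumptions chosen for the example, and every root continues directly to `#` (end of word), so no inflection is modelled.

```python
def make_lexc(roots_by_category):
    """Build lexc source text from a dict mapping lexicon name -> list of roots."""
    # The Root lexicon simply points to one sub-lexicon per category.
    lines = ["LEXICON Root"]
    for category in roots_by_category:
        lines.append(f"{category} ;")
    # One sub-lexicon per category; '#' marks the end of the word, so each
    # root is accepted as a complete form (no suffixes in this sketch).
    for category, roots in roots_by_category.items():
        lines.append("")
        lines.append(f"LEXICON {category}")
        for root in sorted(roots):
            lines.append(f"{root} # ;")
    return "\n".join(lines) + "\n"


# Hypothetical selection: 'olvas' (to read) and 'szép' (beautiful).
selection = {"Verbs": ["olvas"], "Adjectives": ["szép"]}
print(make_lexc(selection))
```

A real analyzer would replace `#` with continuation lexicons that describe the suffixes each class of root can take.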

NLP for Literature Analysis and Creation

Aims: Members of the group have chosen the name Story buffet for their tools for analysis and creation of literary texts. The team is made up of linguists, programmers and other experts who consider themselves to be “hybrids” of the two.

On the second day, we had a break so that people from Ixa group (the ones in charge of this project) could explain their work to us. Manex Agirrezabal is an expert on metrical analysis in poetry; therefore, along with his knowledge in programming/coding, he thinks this is a great chance to semantically alter short stories. Originally, Itziar Gonzalez-Dios’ field of study was linguistics, but she has joined the world of programming in the last few years; she is interested in the analysis of the complexity and synthesis of texts.

Results: They have shown their webpage (Story buffet) for literature creation and analysis, in a quite humorous way.

Behagunea

Aims: The team has continued developing the Behagunea project making use of their different abilities. Victor (programmer) has visualized the results of the Ixa-pipes, and he is working on designing an attractive interface. Also, Dani (IT expert) is trying to translate Ixa-pipes resources into Catalan. Sabrina (linguist), with the help of Iñaki (programmer), is starting an app based on tweets to study what countries think about each other. Finally, due to some problems, Kassandra has decided to put aside one of the projects: the one that aims to include social media in the website DSS2016EU Iritzien Behagunea (Opinion Observatory). Instead, she has chosen to examine the tweets about the DonostiCup football competition.

Results: They have accomplished their goals. Apart from adding new languages (Catalan, Italian) to the Behagunea project, they have managed to combine social media and geolocation.

Enriching ZureTTS platform with new languages

Working hard on #ZureTTS #codefestdss2016

Aims: Several aspects of the ZureTTS project have been addressed. On the one hand, the members of Aholab have focused on developing the platform to include the dialect from Iparralde (the northern side of the Basque Country), and they have started both writing the questions for the voice donors and designing the new interface. Concerning the app for Android, they have spent the day identifying errors and preparing everything required to install the new platform. Finally, the “Ireland team” has translated the webpage interface into Gaelic and contacted some Irish experts within their university to get hold of a good, reliable database.

Results: At the end of the week, apart from adding the Lapurtera (Basque dialect) version to the web, they have made great progress on Gaelic, thanks especially to the help of the Irish people.

SRL and Dockers

Aims: Members of the SRL project have been structuring a database to add and handle information later on. As Suhail Sarwan says, developments in SRL mean a direct benefit in the field of semantics, particularly if we want to promote and improve the e-learning model. Aided by Rodrigo Agerri, among others, they have worked on the SRL, and Eleanor Dutton intends to develop a tool for linguistic analysis and to apply it to Moroccan Arabic.

Results: They showed us a tool they have developed to identify the participants of the events described by the predicates within a sentence, by sequence tagging methods.
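In sequence tagging approaches to SRL, each token of the sentence receives a label relative to a given predicate, and the participants are then read off the label sequence. The sketch below shows only that last decoding step, under stated assumptions: BIO-style tags with PropBank-like role names (`A0`, `A1`, `A2`, `V`), written by hand for one example sentence rather than produced by a tagger. It is not the tool the team presented.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (role, text) spans.

    'B-X' opens a span with role X, 'I-X' continues an open span with the
    same role, and anything else (including 'O' or a mismatched 'I-' tag)
    closes the current span.
    """
    spans = []
    role, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if role is not None:
                spans.append((role, " ".join(words)))
            role, words = tag[2:], [token]
        elif tag.startswith("I-") and role == tag[2:]:
            words.append(token)
        else:
            if role is not None:
                spans.append((role, " ".join(words)))
            role, words = None, []
    # Flush the last open span, if any.
    if role is not None:
        spans.append((role, " ".join(words)))
    return spans


# Hand-written tags for the predicate "translated" (illustrative only).
tokens = ["The", "team", "translated", "the", "interface", "into", "Gaelic"]
tags = ["B-A0", "I-A0", "B-V", "B-A1", "I-A1", "B-A2", "I-A2"]
print(decode_bio(tokens, tags))
# [('A0', 'The team'), ('V', 'translated'), ('A1', 'the interface'), ('A2', 'into Gaelic')]
```

In a full system the tags would come from a trained sequence model, run once per predicate in the sentence.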

Machine Translation for minority languages

Aims: Each member of the group is focusing on the pair of languages in which he/she is fluent. Based on the program called Apertium, for example, they have started working on a translator for the language combination French-Occitan, so that they can later develop a linguistic analyzer for Occitan. They have also been working on a Tetum-Portuguese translator (the two official languages of East Timor) with the same program. Others have started preparing lexical transfers (they will try to do the same with dependency transfers) for the English-Spanish combination using Matxin. This same program also allows the creation of an English-Welsh translator, as well as a translator for English-Basque (one such translator already exists, but some errors must be identified and corrected). The latter will be applied in the field of medicine.

Erasmus Mundus LCT master. Annual Meeting 2016 in Donostia (June 09 – 10)

European Masters Program
Language & Communication Technologies

Annual Meeting 2016 in Donostia (June 09 – 10)

Day 1, Thursday, June 9, 2016

08:45 : Welcome.
09:00 : Introduction, Dr. Ivana Kruijff-Korbayová.
09:30 : Invited talk: Eneko Agirre (UPV/EHU) – Natural Language Understanding using Knowledge Bases and Random Walks.
10:30 : Coffee break.
11:00 : In parallel:
- Student meeting.
- Consortium partner meeting.
12:30 : Lunch.
13:30-14:30 : Poster session Group 1.
14:30-15:30 : Poster session Group 2.
15:30 : Coffee break.
16:00 : Invited talk: Dr. Francis Tyers (Higher School of Economics) – Apertium: Rule-based machine translation is still something people do.
17:00 : Graduation ceremony.
18:00 : End of official programme.

Day 2, Friday, June 10, 2016

09:30 : Plenary meeting: students + coordinators.
11:00 : Coffee break.
11:30 : Invited talk: Tim Baldwin (University of Melbourne) – Multiword Expressions: From Theory to Practicum.
12:30 : Lunch.
13:30 : Workshop: Maria Saiz (UPV/EHU) – Entrepreneurial University: How to create a Spin-off.
15:30 : Consortium partner meeting.
15:30 : Social event.