PhD Thesis – Ixa Group. Language Technology. News from the Ixa Group in the University of the Basque Country
Thu, 29 Oct 2020 15:49:02 +0000

PhD Thesis: Corpus based metrics for measuring distances between languages (José Ramom Pichel, 2020-10-29)
Tue, 27 Oct 2020 18:42:20 +0000

Title:  Medidas de distância entre línguas baseadas em corpus. Aplicação à linguística histórica do galego, português, espanhol e inglês
           / Corpus based metrics for measuring distances between languages. Application to historical linguistics of Galician, Portuguese, Spanish and English.

Where: Teleconference; Faculty of Informatics (UPV/EHU), Ada Lovelace room
Date: October 29, 2020, Thursday, 18:30
Author: José Ramom Pichel
Supervisors: Iñaki Alegria & Pablo Gamallo
Language: mainly Portuguese


Doubts about the historical and current phylogenetic classification of Galician, together with the hesitations in automatic language identification and in the design and construction of machine translation systems, suggest that automatically computing the distance between Galician, Portuguese and Spanish from real written texts is an interesting challenge.

1. Can the distance between languages be measured automatically from corpora?
2. What role does orthography play in the distance between languages?
3. Can this distance be expressed as a single robust metric?
4. Does the distance computed with this metric confirm linguists' hypotheses? Does it add new evidence on minority or controversial hypotheses?
5. Does the distance between historical periods of the same language change? How?
6. Does the distance between languages change over time, or is it always the same? And if it changes, is the change linear?
7. Does the historical distance between recognised variants of the same language change?
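
To make the first question concrete, here is a minimal sketch of one way a corpus-based distance could be computed: train a character n-gram model on text in one language, measure the perplexity of text in the other, and average the two directions. This is an illustration only, not the metric defined in the thesis. The use of perplexity, the character-trigram model, the add-one smoothing, the function names and the toy sentences are all assumptions. Texts are assumed to be non-empty.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the character n-grams of a lowercased, left-padded text."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(text, n=3):
    """Count n-grams and their (n-1)-character contexts in the training text."""
    grams = ngrams(text, n)
    contexts = Counter(g[:-1] for g in grams)
    vocab = set(text.lower()) | {" "}
    return Counter(grams), contexts, vocab


def perplexity(model, text, n=3):
    """Perplexity of `text` under an add-one smoothed character n-gram model."""
    counts, contexts, vocab = model
    grams = ngrams(text, n)
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    log_prob = 0.0
    for g in grams:
        p = (counts[g] + 1) / (contexts[g[:-1]] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(grams))


def distance(text_a, text_b, n=3):
    """Symmetric distance: mean of the two cross-perplexities."""
    return (perplexity(train(text_a, n), text_b, n)
            + perplexity(train(text_b, n), text_a, n)) / 2


if __name__ == "__main__":
    # Toy sentences, invented for illustration (not taken from the thesis corpora).
    gl = "a lingua galega é falada en moitos lugares do mundo"
    pt = "a língua portuguesa é falada em muitos países do mundo"
    en = "the english language is spoken in many countries of the world"
    print("gl-pt:", distance(gl, pt))
    print("gl-en:", distance(gl, en))
```

With real corpora one would expect the Galician-Portuguese value to be lower than the Galician-English one; on sentences this short the numbers are only indicative. Because the model works on characters, spelling conventions directly affect the score, which is one reason the role of orthography (question 2) matters.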

Publications related to his PhD work:

HAP/LAP Master Theses (2020-09-21 and 22)
Wed, 16 Sep 2020 09:14:06 +0000

The following master theses will be presented next week.
Date: September 21 and 22
Place: Ada Lovelace room

Sep. 21, 15:00
Student: Garcia Montero, Eneritz
Supervisors: Arantza del Pozo-Echezarreta, Itziar Gonzalez-Dios
Title: Coping with Data Scarcity: First Steps towards Word Expansion for a Chatbot in the Urban transportation Domain

Sep. 21, 15:45 [...]
PhD Thesis: Unsupervised Machine Translation (Mikel Artetxe, 2020/07/29)
Tue, 28 Jul 2020 13:37:29 +0000

Title:  Unsupervised Machine Translation
           / Itzulpen automatiko gainbegiratu gabea

Where: Teleconference; Faculty of Informatics (UPV/EHU), Ada Lovelace room
Date: July 29, 2020, Wednesday,  11:00
Author: Mikel Artetxe Zurutuza 
Supervisors: Eneko Agirre & Gorka Labaka
Languages:  Basque (motivation, state of the art)  and English (second half, papers, conclusions, ~11:30…)


The advent of neural sequence-to-sequence models has led to impressive progress in machine translation, with large improvements in standard benchmarks and the first solid claims of human parity in certain settings. Nevertheless, existing systems require strong supervision in the form of parallel corpora, typically consisting of several million sentence pairs. Such a requirement greatly departs from the way in which humans acquire language, and poses a major practical problem for the vast majority of low-resource language pairs.

The goal of this thesis is to remove the dependency on parallel data altogether, relying on nothing but monolingual corpora to train unsupervised machine translation systems. For that purpose, our approach first aligns separately trained word representations in different languages based on their structural similarity, and uses them to initialize either a neural or a statistical machine translation system, which is further trained through back-translation.

Mikel Artetxe's publications related to his PhD work:

Mitxelena Award for PhD theses 2018 to Olatz Perez-de-Viñaspre: Automatic medical term generation
Fri, 24 May 2019 12:04:04 +0000

Last week our colleague Olatz Perez de Viñaspre won the VI Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country. Congratulations, Olatz!

Her thesis addressed the creation of computational tools to promote the use of Basque in health services.

The winners of Mitxelena Awards 2018

Title: Automatic medical term generation for a low-resource language: translation of SNOMED CT into Basque
Supervisors: Arantza Diaz de Ilarraza and Maite Oronoz
Publications in English:

  • Design of EuSnomed:
    • Perez-de-Viñaspre O., and Oronoz M. Translating SNOMED CT Terminology into a Minor Language. Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), 38–45. Association for Computational Linguistics. Gothenburg, Sweden, 2014.
    • Perez-de-Viñaspre O., and Oronoz M. An XML Based TBX Framework to Represent Multilingual SNOMED CT for Translation. 12th Mexican International Conference on Artificial Intelligence, MICAI 2013. Lecture Notes in Artificial Intelligence, vol. 8265, 419–429. Springer, ISBN 978-3-642-45113-3. Mexico DF, Mexico, 2013.
  • Simple terms: lexical resources and neoclassical terms:
    • Perez-de-Viñaspre O., Oronoz M., Agirrezabal M., and Lersundi M. A finite state approach to translate SNOMED CT terms into Basque using medical prefixes and suffixes. Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, 99–103. St Andrews, Scotland, 2013.
    • Perez-de-Viñaspre O., and Oronoz M. SNOMED CT in a language isolate: an algorithm for a semiautomatic translation. BMC Medical Informatics and Decision Making, volume 15, number 2, S5. BioMed Central, 2015.
  • Complex terms: nested terms and automatic translator:
    • Perez-de-Viñaspre O., and Oronoz M. Osasun-zientzietako terminologiaren euskaratze automatikoaren ebaluazioa, osasungintzako euskal komunitatea inplikatuz. II. IkerGazte, Nazioarteko Ikerketa Euskaraz. Udako Euskal Unibertsitatea. Iruñea, Basque Country, 2017.
  • Other papers:
    • Perez-de-Viñaspre O., Oronoz M., and Patrick J. Osasun-txosten elebidunak posible ote? I. IkerGazte, Nazioarteko Ikerketa Euskaraz, 730–738. Udako Euskal Unibertsitatea, ISBN 978-84-8438-539-4. Durango, Basque Country, 2015. IkerGazte Special Award.
    • Perez-de-Viñaspre O., and Labaka G. IXA Biomedical Translation System at WMT16 Biomedical Translation Task. Proceedings of the First Conference on Machine Translation (WMT16), 477–482. Association for Computational Linguistics. Berlin, Germany, 2016.
PhD Thesis: Multilingual Sentiment Analysis in Social Media (I. San Vicente, 2019/03/11)
Tue, 12 Mar 2019 19:41:56 +0000

Title: Multilingual Sentiment Analysis in Social Media
Author: Iñaki San Vicente
Supervisors: German Rigau / Rodrigo Agerri (Ixa Group)
Date: March 11, 2019, Monday


The main goal of this thesis was to investigate multilingual sentiment analysis (SA) in order to develop a social media monitor for specific topics. The most relevant contributions are listed below:

  • Improved the state of the art in Spanish polarity classification, ranking first in the TASS shared task twice.
  • Contributed to the state of the art in aspect-based SA for English, with notable results in the SemEval 2015 aspect-based SA shared task.
  • Pioneering work for Basque in the SA field, specifically:
    • The first sentiment lexicons for Basque.
    • The first polarity-annotated datasets for Basque.
    • The first resources for Basque microtext normalization.
  • EliXa, the first multilingual SA system including Basque.
  • Talaia, a real social media monitoring platform applying all the previous research.
  • A set of robust, open-domain tools and resources that are freely available.
2 master-theses today (2019-02-25, 16:00)
Mon, 25 Feb 2019 08:55:46 +0000

Date: February 25th
Place: Ada Lovelace room

16:00
Izenburua / Title: Clinical report multi-label classification using the international classification of diseases (ICD-10)
Egilea / Author: Jorge Pérez
Tutoreak / Supervisors: Alicia Pérez & Arantza Casillas

16:45
Izenburua / Title: Building a dialogue system for question-answer forum websites (defence in Basque)
Egilea / Author: [...]
“Language Analysis and Processing” master theses (2018-06-26)
Tue, 17 Jul 2018 09:07:30 +0000

Four master theses were presented in June:

15:00 Noisy Speech Recognition using Kaldi and Neural Architectures
Ikaslea/Student: Ander González Docasal
Zuzendariak/Supervisors: Vassilis Tsiaras, George P. Kafentzis, Yannis Stylianou

15:45 Unsupervised Methods to Predict Example Difficulty in Word Sense Annotation
Ikaslea/Student: Cristina Aceta Moreno
Zuzendariak/Supervisors: Oier Lopez de Lacalle, Eneko Agirre, Izaskun Aldezabal

To post‐edit or to translate… That is the question.
A case study of a recommender system for Quality Estimation of Machine Translation based on linguistic features
Ikaslea/Student: Ona de Gilbert Bonet
Zuzendaria/Supervisor: Nora Aranberri

Basque‐to‐Spanish and Spanish‐to‐Basque Machine Translation for the health domain
Ikaslea/Student: Xabier Soto García
Zuzendariak/Supervisors: Gorka Labaka, Olatz Perez de Viñaspre
Zuzendarikidea/Co‐advisor: Maite Oronoz

HAP/LAP master theses (2017-09-26)
Tue, 26 Sep 2017 11:47:08 +0000

Master HAP/LAP  —  EMLCT master
Master thesis defences


Izenburua / Title: Automatic Generation of Named Entity Taggers Leveraging Parallel Corpora
Egilea / Author: Yi-Ling Chung (EMLCT)
Tutoreak / Supervisors: Rodrigo Agerri and German Rigau


Izenburua / Title: Dialect normalisation with deep learning-based automatic speech recognition
Egilea / Author: Mahsa Vafaie (EMLCT)
Tutoreak / Supervisors: Inma Hernaez, Josef Van Genabith


Izenburua / Title: Mapping of Electronic Health Records in Spanish to the Unified Medical Language System Metathesaurus
Egilea / Author: Naiara Perez (HAP/LAP)
Tutoreak / Supervisors: Montse Cuadros and German Rigau
PhD Thesis: Computational Model for Semantic Textual Similarity (A. Gonzalez, 2017/07/07)
Thu, 06 Jul 2017 17:43:30 +0000

Title: Computational Model for Semantic Textual Similarity
Author: Aitor Gonzalez-Agirre
Supervisors: German Rigau i Claramunt  / Eneko Agirre Bengoa (Ixa Group)
Date: July 7, 2017, Friday
Time: 11:00
Where:  Faculty of Informatics, Ada Lovelace Room (UPV/EHU)


The goal of this thesis is to advance computational models of meaning and their evaluation. We define two tasks: Semantic Textual Similarity (STS) and Typed Similarity.

STS aims to measure the degree of semantic equivalence between two sentences. We have collected 15,436 sentence pairs to construct STS datasets, by far the largest collection of data for STS. We have designed, constructed and evaluated a new approach to combine knowledge-based and corpus-based methods using a cube.

Typed Similarity tries to identify the type of relation that holds between a pair of similar items in a digital library. Providing a reason why items are similar has applications in recommendation, personalization, and search. A range of similarity types in this collection was identified, and a set of 1,500 pairs of items from the collection was annotated using crowdsourcing.

We present systems that address the Typed Similarity task.

HAP/LAP master thesis (Noelia Migueles, 2017-06-27)
Tue, 27 Jun 2017 11:47:24 +0000

This afternoon, June 27th, Noelia Migueles will defend her master thesis.

Date: June 27th, 15:00
Place: Ada Lovelace room

Izenburua / Title: A Study Towards Spanish Abstract Meaning Representation

Egilea / Author: Noelia Migueles-Abraira

Tutoreak / Supervisors: Arantza Diaz de Ilarraza and Rodri Agerri
