PhD Thesis – Ixa Group. Language Technology. News from the Ixa Group at the University of the Basque Country (https://www.ehu.eus/ehusfera/ixa).

Koldo Mitxelena award for PhD theses to Arantxa Otegi (7 February 2013)

Last January our colleague Arantxa Otegi won the III. Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

CONGRATULATIONS Arantxa!

Congratulations also to her supervisors, Xabier Arregi and Eneko Agirre.

The title of this thesis is ‘Expansion for information retrieval: contribution of word sense disambiguation and semantic relatedness’.

The full text is available here. This is the abstract:

Information retrieval (IR) aims at finding documents which satisfy the information need of a user. In that way, an IR system points the user to relevant documents, that is, documents that contain the information they need as formulated in the query. Well-known search engines like Google and Yahoo are prime examples of IR systems.
A perfect IR system would retrieve only, and all, the relevant documents, rejecting the non-relevant ones. However, perfect retrieval systems do not exist. One of the main problems is the so-called vocabulary mismatch between query and documents: some documents might be relevant to the query even if the specific terms used differ substantially, and some documents might not be relevant to the query even if they have some terms in common with it. The former happens because several words or phrases can be used to express the same idea or item (synonymy). The latter is caused by ambiguity, where one word can have more than one interpretation depending on the context. Owing to these facts, if an IR system relies only on terms occurring in both the query and the document when deciding whether a document is relevant, it might be difficult to find some of the interesting documents, and also to reject non-relevant documents. It seems fair to think that there will be more chances of successful retrieval if the meaning of the text is also taken into account.
Even though the vocabulary mismatch problem has been widely discussed in the literature since the early days of IR, it remains unsolved, and most search engines simply ignore it. This PhD dissertation explores whether natural language processing (NLP) can be used to alleviate this problem.
In a nutshell, we expand queries and documents making use of two NLP techniques, word sense disambiguation and semantic relatedness. For each of these techniques we propose an expansion strategy, in which we obtain synonyms and other related words for the words in the query and the documents. We also present, for each case, a method to combine the expansions and the original words effectively in an IR system. Furthermore, as the proposed expansion technique is also useful for translating queries and documents, we show how a cross-lingual information retrieval system could be improved using it.

Our extensive experiments on three datasets show that the expansion methods explored in this dissertation help overcome the mismatch problem, consequently improving the effectiveness of an IR system.
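To make the expansion idea more concrete, here is a minimal sketch of synonym-based query expansion using WordNet through NLTK. It is only an illustration of the general idea: the thesis applies word sense disambiguation and semantic relatedness to both queries and documents and weights the expanded terms, none of which is reproduced here, and the expand_query helper is invented for the example.

```python
# Minimal illustration of synonym-based query expansion using WordNet via NLTK.
# This is a sketch of the general idea only, not the expansion strategy of the thesis,
# and the weighting of the expanded terms is omitted entirely.
from nltk.corpus import wordnet as wn  # requires nltk and its 'wordnet' data package

def expand_query(query, max_synonyms_per_term=3):
    """Return the original query terms plus a few WordNet synonyms for each term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        synonyms = []
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                candidate = lemma.replace("_", " ").lower()
                if candidate != term and candidate not in synonyms:
                    synonyms.append(candidate)
        expanded.extend(synonyms[:max_synonyms_per_term])
    return expanded

print(expand_query("car accident"))
# e.g. ['car', 'auto', 'automobile', 'machine', 'accident', 'fortuity', 'chance event']
```

In a real IR system the expanded terms would normally be down-weighted with respect to the user's original terms, so that the expansion does not overwhelm the query.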

News from OPENMT-2 project (10 April 2011)

Three pieces of news related to the OPENMT-2 project (2010-2012):

Gorka Labaka’s PhD thesis

In his PhD thesis (“EUSMT: Incorporating Linguistic Information to Statistical Machine Translation for Basque”), Labaka studied how Statistical Machine Translation (SMT) can handle the translation of Spanish into Basque, a morphologically rich and less-resourced language. He found two ways to enhance translation quality by using linguistic tools:

  • The use of morphological tools allowed him to perform translation at the word-segment level, thus avoiding sparseness problems in the corpora (a toy sketch of this segmentation step follows the list).
  • Complementarily, syntactic tools enabled the Spanish word segments to be rearranged into their corresponding order in Basque. This reordering helped the SMT decoder find correct translations.
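The segmentation step mentioned in the first point can be pictured with the toy sketch below. The suffix list and the splitting heuristic are invented for this illustration; EUSMT relies on proper morphological analysis of Basque rather than on a hand-written suffix list.

```python
# Toy illustration of segmenting words into stem + suffix before SMT training,
# so that the statistical model sees smaller, more frequent units.
# The suffix list and the splitting heuristic are invented for this example;
# they are NOT the morphological analysis actually used in EUSMT.
SUFFIXES = ["ari", "ak", "an", "a"]  # a tiny, hand-picked sample of endings

def segment(word):
    """Split a word into stem and suffix tokens when a known ending matches."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[:-len(suffix)], "+" + suffix]
    return [word]

sentence = "etxean egon naiz".split()
print([token for word in sentence for token in segment(word)])
# -> ['etxe', '+an', 'egon', 'naiz']
```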

Recent research tends to focus on statistical systems and to ignore rule-based approaches. However, according to Gorka Labaka's evaluation, the RBMT system and a state-of-the-art basic SMT system achieve similar quality when translating into Basque. His improved SMT system, based on segmentation and reordering, outperforms both the RBMT system and the basic SMT system by more than 10% on the HTER metric. Besides, he calculated that a hypothetical oracle system would yield a result a further 10% better; this oracle would select the improved SMT output for 55% of the sentences, the RBMT output for another 41% of them, and the EBMT output for the remaining 4%. So he concluded that, at least in the case of morphologically rich languages with few resources, and hence few parallel corpora, the SMT approach is limited and the RBMT approach should not be ignored. Currently, we are experimenting with hybrid architectures combining the Matxin (rule-based) and EUSMT (statistical) translation engines.
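The oracle figure corresponds to an ideal per-sentence selection: for every sentence, choose the engine whose output has the lowest (best) HTER. A minimal sketch of that selection, with invented scores, could look like this:

```python
# Sketch of the "oracle" combination idea: for every sentence, pick the system
# output with the lowest (best) HTER score. The scores below are made up.
from collections import Counter

def oracle_select(hter_scores):
    """hter_scores: {system_name: [per-sentence HTER]} -> chosen system per sentence."""
    systems = list(hter_scores)
    n_sentences = len(next(iter(hter_scores.values())))
    choices = []
    for i in range(n_sentences):
        best = min(systems, key=lambda s: hter_scores[s][i])
        choices.append(best)
    return choices

scores = {                       # invented example scores for 5 sentences
    "improved_SMT": [0.31, 0.45, 0.28, 0.52, 0.40],
    "RBMT":         [0.35, 0.39, 0.33, 0.48, 0.61],
    "EBMT":         [0.50, 0.55, 0.27, 0.70, 0.66],
}
choices = oracle_select(scores)
print(choices)                   # which engine the oracle would pick per sentence
print(Counter(choices))          # share of sentences per engine
```

Counting the chosen engines over a real test set is what gives the 55% / 41% / 4% split reported above.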


Visiting researcher Lluís Màrquez (NLPRG, Technical University of Catalonia, UPC)

With the aim of collaborating on this research line, Lluís Màrquez, the lead researcher of the UPC team within the OPENMT-2 project, will be visiting the Ixa group in Donostia until the summer. He is an expert in integrating machine learning techniques into language technology. The first experiments on combining MT engines, carried out by Gorka Labaka, confirmed that there is room for improvement. Now we want to find the most suitable ways to do it.


Collaboration on Post-Editing with Basque Wikipedia (eu.wikipedia)

Within this project, a set of 60 long articles from the Spanish Wikipedia (adding up to more than 100,000 words) has been selected and translated into Basque using Matxin-Opentrad, our open-source rule-based machine translation system. Soon, in spring 2011, a group of Basque Wikipedia users will review them using a special interface we have adapted from OmegaT. They will correct the errors they find; this process is known as post-editing. The changes made by these users will be logged. The corrected articles will be included in the Basque Wikipedia, and the resulting post-editing logs will also be used to enhance the machine translation process, either by manually improving the different modules of the MT system or by implementing an automated statistical post-editing step that is expected to improve translation accuracy. (paper at Wikimania 2010)
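One way to picture what logging those changes amounts to is the small sketch below, which records the raw MT output, the post-edited version and the word-level edits between them. The record layout and the log_post_edit helper are assumptions made for this illustration; they are not the actual OmegaT-based interface or its log format.

```python
# Sketch of what a single post-editing record could contain: the raw MT output,
# the human-corrected version and the word-level edits between them.
# The log format and the helper are invented for illustration.
import difflib
import json

def log_post_edit(article_id, mt_output, post_edited):
    mt_tokens, pe_tokens = mt_output.split(), post_edited.split()
    matcher = difflib.SequenceMatcher(None, mt_tokens, pe_tokens)
    edits = [(tag, " ".join(mt_tokens[i1:i2]), " ".join(pe_tokens[j1:j2]))
             for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]
    return json.dumps({"article": article_id, "mt": mt_output,
                       "post_edited": post_edited, "edits": edits}, ensure_ascii=False)

print(log_post_edit("Adibidea",
                    "sentence as produced by the MT engine",
                    "the sentence as corrected by a wikipedia user"))
```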

Mitxelena Award for PhD theses: Maite Oronoz and Larraitz Uria (6 April 2011)

Last Monday our colleague Maite Oronoz won the II. Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

CONGRATULATIONS Maite!

Our colleague Larraitz Uria's PhD thesis was also nominated for this award.

Both theses address language error detection: Maite's thesis approaches it from a computational point of view, while Larraitz's work does so from a linguistic perspective.

Title of Maite's thesis: Euskarazko errore sintaktikoak detektatzeko eta zuzentzeko baliabideen garapena: datak, postposizio-lokuzioak eta komunztadura.
(Development of resources for detecting and correcting syntactic errors in Basque: dates, complex postpositions, and agreement. The resulting system is called Saroi.)
Maite's supervisors: Arantza Diaz de Ilarraza and Koldo Gojenola

Title of Larraitz's thesis: Euskarazko erroreen eta desbideratzeen analisirako lan-ingurunea. Determinatzaile-erroreen azterketa eta prozesamendua.
(A framework for the analysis of errors and deviations in Basque texts. Analysis and processing of errors on the use of determiners.)
Larraitz's supervisors: Igone Zabala and Montse Maritxalar

Larraitz Uria's PhD Thesis: A framework for the analysis of errors and deviations in Basque texts. Analysis and processing of errors on the use of determiners (16 April 2010)

We analyse the errors and deviations made in Basque in order to contribute to two research fields: in the field of automatic error treatment, our aim is to develop spelling, grammar and style checkers for Basque; in the field of ICALL (intelligent computer-assisted language learning), we aim to create resources for the study of the language learning process. In both fields, in order to develop useful tools that address the needs of real users, it is necessary to carry out a thorough analysis of the errors and deviations made in Basque, also taking into account the sociolinguistic situation of our language community.

To that end, in this dissertation we have developed a whole environment composed of the resources which are essential for error analysis: corpora (learner and native-speaker corpora), an error editor tool called EtikErro, an error classification, and two databases (named Errors and Learners) to store information related to the tagged examples. We have also defined the basic criteria for the analysis and processing of errors. As it is not possible to study all the error types at the same time, in this dissertation we have dealt with the analysis and automatic treatment of determiner errors in Basque. We have written some rules that will be integrated into XUXENg, the grammar checker we are developing for Basque.
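As a rough picture of what such a rule can look like once the error examples have been analysed, the sketch below applies a simple pattern over POS-tagged tokens. The tag names and the pattern itself are invented for this illustration; they are not the rules actually written for XUXENg.

```python
# Illustrative sketch of applying a pattern rule over POS-tagged tokens to flag a
# candidate determiner error. The tag names and the pattern are invented for this
# example; they are not XUXENg's actual rules.
def flag_determiner_errors(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs; return suspect (start, end, message) spans."""
    flags = []
    for i in range(len(tagged_tokens) - 1):
        tag = tagged_tokens[i][1]
        next_word = tagged_tokens[i + 1][0]
        # toy pattern: a noun already carrying the article followed by the determiner "bat"
        if tag == "NOUN+ART" and next_word == "bat":
            flags.append((i, i + 1, "suspected redundant determiner"))
    return flags

example = [("liburua", "NOUN+ART"), ("bat", "DET"), ("erosi", "VERB"), ("dut", "AUX")]
print(flag_determiner_errors(example))  # -> [(0, 1, 'suspected redundant determiner')]
```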


Full text: ixa.si.ehu.es/Ixa/Argitalpenak/Tesiak/1260962506/publikoak/TESIA-Larraitz_2009

Maite Oronoz develops new system to detect and correct syntactic mistakes (21 September 2009)

The new system, called Saroi, is a general tool which, apart from dealing with errors, can be used to query the syntactic structures in analysis trees and to search for linguistic structures in such trees.

Maite Oronoz has analysed the existing tools for the detection and correction of syntactic mistakes. To detect contextual errors such as agreement (concordance) errors, it is necessary to analyse the tree structure of the sentences. The researcher did not find a suitable tool for this purpose, so she created Saroi, which not only deals with syntactic mistakes but can also be used for consulting tree-structure analyses and for carrying out searches for linguistic structures in such trees.
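To give a concrete flavour of what querying an analysis tree for an agreement error can mean, here is a minimal sketch over an invented tree representation; Saroi's actual query language and tree format are not reproduced here.

```python
# Sketch of querying an analysis tree for agreement mismatches, in the spirit of
# what Saroi does. The tree representation and the feature names are invented here.
def find_agreement_errors(tree):
    """tree: dict node -> {'head': parent, 'rel': relation, 'feats': {...}}"""
    errors = []
    for node, info in tree.items():
        if info["rel"] == "subj":
            verb = info["head"]
            if tree[verb]["feats"].get("num") != info["feats"].get("num"):
                errors.append((node, verb, "number disagreement"))
    return errors

# toy sentence: a plural subject attached to a singular-marked verb
tree = {
    "verb_1": {"head": None,     "rel": "root", "feats": {"num": "sg"}},
    "noun_2": {"head": "verb_1", "rel": "subj", "feats": {"num": "pl"}},
}
print(find_agreement_errors(tree))  # -> [('noun_2', 'verb_1', 'number disagreement')]
```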
