Basque – Ixa Group. Language Technology.
https://www.ehu.eus/ehusfera/ixa
News from the Ixa Group in the University of the Basque Country
Wed, 03 Dec 2014 15:17:48 +0000

The Basque WordNet semantic dictionary is a “public resource” now
https://www.ehu.eus/ehusfera/ixa/2014/06/13/the-basque-wordnet-semantic-dictionary-is-a-public-resource-now/
Fri, 13 Jun 2014 13:13:04 +0000

Machines need computing tools that are more powerful than conventional dictionaries for tasks such as information extraction and word sense disambiguation. This is precisely the function of the Euskal WordNet application, developed by the IXA Group (UPV/EHU), which can now be consulted and downloaded free of charge.

This is the first Lexical Knowledge Base (LKB) developed for the Basque language: a “semantic dictionary” or “store” that compiles and organises lexical and semantic information. “It’s like a database, but the difference is that it not only gathers the usual information of a dictionary (the meanings of words and their corresponding definitions and examples), it also links the concepts with each other,” pointed out Eneko Agirre, an IXA Group computer programmer.

If we look up the entry hatz (“finger”, “digit” or “toe” in Basque), the result is as follows: “Each of the five appendages at the end of human hands and feet.” That is what the term means. But we can get much more than this: the finger/toe is an appendage of the body; the thumb is a finger; fingers are part of the hand; hands, in turn, are part of the arm; fingers are used to touch objects; and so on. In short, all the concepts are interrelated hierarchically. Every concept is also linked to its equivalents in other languages: digit, hatz, dedo, dixito and dit (English, Basque, Spanish, Galician and Catalan, respectively).

Consulting the word hatz in Basque WordNet.
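
The relations described for hatz can be pictured as a small graph of concepts. The following is only a minimal sketch under stated assumptions: the Synset class, the SYNSETS table, the language codes and the chain helper are hypothetical names made up for illustration, and the data is hand-typed from the example above, not read from the Basque WordNet itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept: a gloss, its words per language, and typed links to other concepts."""
    gloss: str
    lemmas: dict[str, str]                                # language code -> word
    hypernyms: list[str] = field(default_factory=list)    # "is a kind of"
    part_of: list[str] = field(default_factory=list)      # "is part of"


# Hypothetical toy data typed in by hand from the "hatz" example.
SYNSETS = {
    "appendage": Synset("A part that projects from the body.", {"en": "appendage"}),
    "arm": Synset("An upper limb of the human body.", {"en": "arm"}),
    "hand": Synset("The end part of the arm.", {"en": "hand"}, part_of=["arm"]),
    "finger": Synset(
        "Each of the five appendages at the end of human hands and feet.",
        {"en": "digit", "eu": "hatz", "es": "dedo", "gl": "dixito", "ca": "dit"},
        hypernyms=["appendage"],
        part_of=["hand"],
    ),
    "thumb": Synset("The thick first digit of the hand.", {"en": "thumb"},
                    hypernyms=["finger"]),
}


def chain(start: str, relation: str) -> list[str]:
    """Follow one relation transitively, always taking the first link found."""
    path, current = [], start
    while getattr(SYNSETS[current], relation):
        current = getattr(SYNSETS[current], relation)[0]
        path.append(current)
    return path


print(chain("thumb", "hypernyms"))   # thumb -> finger -> appendage
print(chain("finger", "part_of"))    # finger -> hand -> arm
print(SYNSETS["finger"].lemmas)      # the same concept in five languages
```

A conventional dictionary would stop at the gloss; the typed links and the per-language words are what let a program move from one concept to a related one.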

This database is tremendously useful in various fields, such as machine translation, information extraction, word sense disambiguation and question-answering systems. In machine translation, for example, the system has to understand which word it is translating, a task for which it needs a “semantic dictionary” of this type. “For a quality translation, it is necessary to be able to distinguish the most appropriate meaning from among the various ones,” stressed Agirre.

“Our aim (within the framework of the European QTLeap project) is to improve the quality of machine translations by using WordNet,” he pointed out.

Over the 2014-2015 academic year, the university Master’s degree in Language Analysis and Processing (LAP) that the IXA Group will be running at the UPV/EHU will cover the Basque WordNet and other language technologies used to develop similar applications.

Master’s in Language Analysis and Processing (LAP)

The aim of the University Master’s in Language Analysis and Processing is to analyse language and to learn about the techniques and applications available for processing it with the help of the computer.

This Master’s has been organised by the UPV/EHU’s IXA Group and is geared towards anybody who combines linguistics and computing: philologists and linguistics experts, computing and telecommunications engineers, mathematicians, translators, etc. To apply for it, it is enough to be in possession of a University degree, have some experience and display some interest in the subject.

The Master’s will take a year and a half, and the classes will be held at the Computing Faculty of the UPV/EHU-University of the Basque Country. It will be possible to spread it over two or three academic years (to cater for working professionals).

The pre-registration period is already open, and applications will be accepted until June 30. For further information on the Master’s, please check out http://ixa.si.ehu.es/master/.

News from OPENMT-2 project
https://www.ehu.eus/ehusfera/ixa/2011/04/10/openmt-2/
Sun, 10 Apr 2011 16:28:00 +0000

 

Three pieces of news related to the OPENMT-2 project (2010-2012):

Gorka Labaka’s PhD thesis

In his PhD thesis (“EUSMT: Incorporating Linguistic Information to Statistical Machine Translation for Basque”) Labaka studied how Statistical Machine Translation (SMT) can handle the translation of Spanish into Basque, a morphologically rich and less-resourced language. He found two ways to enhance the quality of the translation by using linguistic tools:

  • The use of morphological tools allowed him to perform translation at the level of word segments, thus avoiding sparseness problems in corpora.
  • In addition, syntactic tools enabled the Spanish word segments to be rearranged into their corresponding order in Basque. This reordering helped the SMT decoder to find correct translations.
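
To see why working with word segments eases sparseness, consider this minimal sketch. It is an illustration under stated assumptions, not the method used in the thesis: the suffix list and the segment function are hypothetical, the segmentation is deliberately simplified, and a real system would use a morphological analyser instead of string matching.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical suffix list, longest first. A real system would rely on a
# morphological analyser, not on plain string matching.
SUFFIXES = ("tik", "ra", "an", "a")


def segment(word: str) -> list[str]:
    """Split a word into stem and suffix if it ends in a known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


# Inflected forms of etxe ("house") and mendi ("mountain").
WORDS = ["etxea", "etxean", "etxera", "etxetik",
         "mendia", "mendian", "mendira", "menditik"]

word_counts = Counter(WORDS)
segment_counts = Counter(piece for word in WORDS for piece in segment(word))

print(len(word_counts), "word types, each seen", max(word_counts.values()), "time(s)")
print(len(segment_counts), "segment types, each seen at least",
      min(segment_counts.values()), "times")
```

In this toy corpus every full word form occurs only once, whereas every stem and every suffix occurs at least twice, so a statistical model has more evidence for each unit it has to translate.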

Recent research tends to focus on statistical systems and to ignore rule-based approaches. However, according to Gorka Labaka’s evaluation, the rule-based (RBMT) system and the state-of-the-art basic SMT system perform with similar quality when translating into Basque. His improved SMT system, based on segmentation and reordering, outperforms both the RBMT system and the basic SMT system by more than 10% on the HTER metric. He also calculated that a hypothetical oracle system would yield a result a further 10% better; this oracle would select the improved SMT output for 55% of the sentences, the RBMT output for another 41%, and the example-based (EBMT) output for 4%. He therefore concluded that, at least for morphologically rich languages with few resources, and hence few parallel corpora, the SMT approach is limited and the RBMT approach should not be ignored. We are currently experimenting with hybrid architectures that combine the Matxin (rule-based) and EUSMT (statistical) translation engines.
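
The oracle idea can be sketched in a few lines: for each sentence, pick whichever system produced the output with the lowest error score. This is only an illustration under stated assumptions. The scores below are invented, the system labels and the oracle function are hypothetical, and averaging sentence-level scores is a simplification of how HTER is computed over a whole corpus.

```python
from collections import Counter

# Hypothetical sentence-level HTER scores (lower is better), one list per system.
# These numbers are invented for illustration and are not results from the thesis.
SCORES = {
    "improved SMT": [0.30, 0.50, 0.20, 0.60],
    "RBMT": [0.40, 0.35, 0.45, 0.40],
    "EBMT": [0.70, 0.80, 0.65, 0.90],
}


def oracle(scores):
    """Pick, for every sentence, the system whose output has the lowest score."""
    n_sentences = len(next(iter(scores.values())))
    picks = [min(scores, key=lambda name: scores[name][i]) for i in range(n_sentences)]
    best = [scores[name][i] for i, name in enumerate(picks)]
    return picks, sum(best) / n_sentences


picks, oracle_mean = oracle(SCORES)
# Share of sentences for which each system is selected by the oracle.
shares = {name: count / len(picks) for name, count in Counter(picks).items()}

print(picks)
print(shares)
print(round(oracle_mean, 4))
```

By construction the oracle can never score worse than the best single system, which is why it serves as an upper bound on what a hybrid architecture could gain by choosing well between engines.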

Visiting researcher Lluís Màrquez (NLPRG, Technical University of Catalonia, UPC)

To collaborate on this line of research, Lluís Màrquez, the lead researcher of the UPC team within the OPENMT-2 project, will be in Donostia visiting the Ixa group until the summer. He is an expert in integrating Machine Learning techniques into Language Technology. Gorka Labaka’s first experiments on combining MT engines confirmed that there is room for improvement. Now we want to find the most suitable ways to achieve it.

Collaboration on Post-Editing with Basque Wikipedia (eu.wikipedia)

Within this project, a set of 60 long articles from the Spanish Wikipedia (adding up to more than 100,000 words) has been selected and then translated into Basque using Matxin-Opentrad, our open-source rule-based machine translation system. Soon, in spring 2011, a group of Basque Wikipedia users will review them using a special interface that we have adapted from OmegaT. They will correct the errors they find, a process known as post-editing, and the changes they make will be logged. The corrected articles will be added to the Basque Wikipedia. In addition, the resulting post-editing logs will be used to enhance the machine translation process, either by manually improving the different modules of the MT system or by implementing an automated statistical post-editing step that is expected to improve translation accuracy. (See the paper presented at Wikimania 2010.)
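
As a rough idea of what a post-editing log can contain, the sketch below compares a machine-translated sentence with its corrected version and records only the changed spans. It is an assumption-laden illustration, not the logging mechanism of the adapted OmegaT interface: the edit_log function is hypothetical, the English sentences are placeholders, and the comparison uses Python's standard difflib module.

```python
from __future__ import annotations

import difflib


def edit_log(mt_output: str, post_edited: str) -> list[tuple[str, str, str]]:
    """Return (operation, original words, corrected words) for every changed span."""
    mt, pe = mt_output.split(), post_edited.split()
    matcher = difflib.SequenceMatcher(a=mt, b=pe, autojunk=False)
    return [
        (tag, " ".join(mt[i1:i2]), " ".join(pe[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"          # keep only insertions, deletions and replacements
    ]


# Placeholder sentences standing in for MT output and its human correction.
for entry in edit_log("the cat sat in mat", "the cat sat on the mat"):
    print(entry)
```

Collected over many sentences, pairs of original and corrected spans like these are the kind of data that could feed either manual rule fixes or a statistical post-editing model.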

Mitxelena Award for PhD theses: Maite Oronoz and Larraitz Uria
https://www.ehu.eus/ehusfera/ixa/2011/04/06/mitxelena-award_oronoz-uria/
Wed, 06 Apr 2011 11:46:22 +0000

Last Monday our colleague Maite Oronoz won the II Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

CONGRATULATIONS Maite!

Our colleague Larraitz Uria’s PhD thesis was also nominated for this award.

Both theses address language error detection. Maite’s thesis approaches it from a computational point of view, while Larraitz’s work does so from a linguistic perspective.

Title of Maite’s thesis: Euskarazko errore sintaktikoak detektatzeko eta zuzentzeko baliabideen garapena: datak, postposizio-lokuzioak eta komunztadura.
(Saroi, a system to detect and correct syntactic mistakes: dates, complex postpositions, and agreement.)
Maite’s supervisors: Arantza Diaz de Ilarraza and Koldo Gojenola
Title of Larraitz’s thesis: Euskarazko erroreen eta desbideratzeen analisirako lan-ingurunea. Determinatzaile-erroreen azterketa eta prozesamendua.
(A framework for the analysis of errors and deviations in Basque texts. Analysis and processing of errors on the use of determiners.)
Larraitz’s supervisors: Igone Zabala and Montse Maritxalar
Publications:

Collaborating on language processing for Basque and Sami (Laponian)
https://www.ehu.eus/ehusfera/ixa/2010/06/28/collaborating-on-language-processing-for-basque-and-sami-laponian/
Mon, 28 Jun 2010 22:16:17 +0000

Researchers working on Basque and Sami (Laponian) are collaborating on Automatic Language Processing.

Linda Wiechetek, a researcher from the University of Tromsø (Norway), is visiting the Ixa Group in Donostia from April to July 2010. Her visit is funded by the NILS mobility project.

Why Sami and Basque? Why do we work with this unusual language pair?

Some of the reasons are:
1) Both are small languages.
2) Both have limited resources for language technology. (Nowadays Sami is even less resourced than Basque.)
3) Sami and Basque morphologies are very rich and demand adequate tools, such as our morphological transducers and our syntactic disambiguation and analysis modules. Many better-resourced languages with highly developed language technology, such as English, Spanish and French, do not need such complex modules to build their basic tools.
4) There are clear syntactic parallels between Basque and Sami, including grammatical cases and postpositions that cause morpho-syntactic ambiguity.

In this context we are collaborating in the following ways:
a) Use of semantic prototype features in Constraint Grammar for syntactic disambiguation.
b) Use of semantic features in Constraint Grammar for lexical/syntactic transfer in Machine Translation.
c) Use of information on verb-subcategorization for syntactic disambiguation.
d) Use of verb-subcategorization information for lexical and syntactic transfer in Machine Translation.

The parser for Basque is not yet very accurate, and not as accurate as English parsers. The Sami parser, on the other hand, achieves good accuracy, but valency information is needed for other tasks such as MT and QA.
With this collaboration between Basque and Sami researchers we aim to improve our NLP tools.

Besides that, Linda can now speak some Basque, and we are learning some words in Sami.
That’s another way of collaborating 😉

giellatekno.uit.no/background/giellatekno3.pdf

CLARIN Meeting in Donostia. May 2010
https://www.ehu.eus/ehusfera/ixa/2010/06/06/clarin-meeting-in-donostia-may-2010/
Sun, 06 Jun 2010 22:15:40 +0000

CLARIN meeting
10:00: Steven Krauwer. CLARIN project Coordinator.
10:30: Nuria Bel (Pompeu Fabra University). Coordinator of CLARIN in Spain.
11:00: Coffee break
11:30-13:00: Presentation of Basque groups (I)

* Miriam Urkia. Euskaltzaindia
* Miren Azkarate. Euskara institutua. UPV/EHU
* Mikel Santesteban. Gogo Elebiduna. UPV/EHU
* Antton Gurrutxaga and Iñaki San Vicente. Elhuyar I+G

13:00: Lunch
14:00-15:00: Presentation of Basque groups (II)

* Igone Zabala. Euskal Filologia saila. UPV/EHU
* Ibon Aizpurua. Eleka.
* Jon Sánchez. Aholab. UPV/EHU.
* Kepa Sarasola. IXA Group. UPV/EHU

15:00-15:30 Conclusions

Larraitz Uria’s PhD Thesis: A framework for the analysis of errors and deviations in Basque texts. Analysis and processing of errors on the use of determiners.
https://www.ehu.eus/ehusfera/ixa/2010/04/16/larraitz-urias-phd-thesis-a-framework-for-the-analysis-of-errors-and-deviations-in-basque-texts-analysis-and-processing-of-errors-on-the-use-of-determiners/
Fri, 16 Apr 2010 18:01:18 +0000

We analyse errors and deviations made in Basque in order to contribute to two research fields. In the field of automatic error treatment, our aim is to develop spelling, grammar and style checkers for Basque; in the field of Intelligent Computer-Assisted Language Learning (ICALL), we aim to create resources for the study of the language learning process. In both fields, developing useful tools that consider the needs of real users requires a complete analysis of the errors and deviations made in Basque, one that also takes into account the sociolinguistic situation of our language community.

To that end, in this dissertation we have developed a complete environment composed of the resources that are essential for error analysis: corpora (learner and native-speaker corpora), an error-editing tool called EtikErro, an error classification, and two databases (named Errors and Learners) that store information about the tagged examples. We have also defined the basic criterion for the analysis and processing of errors. Since it is not possible to study all error types at once, this dissertation focuses on the analysis and automatic treatment of determiner errors in Basque. We have written rules that will be integrated into XUXENg, the grammar checker we are developing for Basque.


ixa.si.ehu.es/Ixa/Argitalpenak/Tesiak/1260962506/publikoak/TESIA-Larraitz_2009
