Lexical Representations – Ixa Group. Language Technology.

The Basque WordNet semantic dictionary is a “public resource” now

Kepa Sarasola — Fri, 13 Jun 2014 13:13:04 +0000

Machines need computing tools that are more powerful than conventional dictionaries for tasks like information extraction, disambiguation of word meanings, etc. This is in fact the function of the Euskal WordNet application —developed by the IXA Group (UPV/EHU)— which can already be consulted and downloaded free of charge.

This is the first Lexical Knowledge Base (LKB) developed for the Basque language: a “semantic dictionary” or “store” that compiles and organises lexical and semantic information. “It’s like a database, but the difference is that it not only gathers the usual information of a dictionary (the meanings of words and their corresponding definitions and examples), it also links the concepts with each other,” pointed out Eneko Agirre, an IXA Group computer programmer.

If we look up the entry hatz (“finger”, “digit” or “toe” in Basque), the result is as follows: “Each of the five appendages at the end of human hands and feet.” That is what the term means. But apart from this information, we can get much more: the finger/toe is an appendage of the body; the thumb is a finger; fingers are part of the hand; hands, in turn, are part of the arm and fingers are used to touch objects, etc. In short: all the concepts are interrelated hierarchically. Every concept is also related to its equivalents in other languages: digit, hatz, dedo, dixito and dit.

Consulting the word hatz in Basque WordNet.
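
The relations described for hatz can be pictured as a small graph of concepts. The Python sketch below is only an illustration under stated assumptions: the names Synset, LKB, chain and lookup, the relation labels and the language codes are invented for this example and do not reflect the actual data format or interface of the Basque WordNet. Only the facts quoted in the text above are encoded.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept: a gloss, its words per language, and typed links to other concepts."""
    gloss: str = ""
    lemmas: dict[str, list[str]] = field(default_factory=dict)
    relations: dict[str, list[str]] = field(default_factory=dict)


# Toy knowledge base. Concept ids, relation names and language codes are
# hypothetical; the language assigned to each word is an assumption.
LKB: dict[str, Synset] = {
    "finger": Synset(
        gloss="Each of the five appendages at the end of human hands and feet.",
        lemmas={"eu": ["hatz"], "en": ["digit"], "es": ["dedo"],
                "gl": ["dixito"], "ca": ["dit"]},
        relations={"is_a": ["appendage"], "part_of": ["hand"],
                   "used_for": ["touch"]},
    ),
    "thumb": Synset(relations={"is_a": ["finger"]}),
    "hand": Synset(relations={"part_of": ["arm"]}),
}


def lookup(word: str, lang: str) -> list[str]:
    """Return the ids of all concepts that list `word` as a lemma in `lang`."""
    return [cid for cid, s in LKB.items() if word in s.lemmas.get(lang, [])]


def chain(start: str, relation: str) -> list[str]:
    """Follow one relation type transitively (breadth-first) from `start`."""
    seen, frontier, found = {start}, [start], []
    while frontier:
        synset = LKB.get(frontier.pop(0))
        if synset is None:  # target mentioned but not described in the toy data
            continue
        for target in synset.relations.get(relation, []):
            if target not in seen:
                seen.add(target)
                found.append(target)
                frontier.append(target)
    return found


if __name__ == "__main__":
    for concept in lookup("hatz", "eu"):
        print(concept, "is part of:", chain(concept, "part_of"))
        print(concept, "in other languages:", LKB[concept].lemmas)
    print("thumb is a:", chain("thumb", "is_a"))
```

Keeping the links typed (is_a, part_of, used_for) is what separates such a store from a conventional dictionary: the same lookup that returns a definition can also walk up the hierarchy or across languages.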

This database is tremendously useful in various fields, such as machine translation, information extraction, word sense disambiguation and question answering systems. In machine translation, for example, the system has to understand which word it is translating, a task for which it needs a “semantic dictionary” of this type. “For a quality translation, it is necessary to be able to distinguish the most appropriate meaning from among the various ones,” stressed Agirre.

“Our aim (within the framework of the European QTLeap project) is to improve the quality of machine translations by using WordNet,” he pointed out.

Over the 2014-2015 academic year, the university Master’s degree in Language Analysis and Processing (LAP), which the IXA Group will be running at the UPV/EHU, will cover the Basque WordNet and other language technologies used to develop similar applications.

Master’s in Language Analysis and Processing (LAP)

The aim of the University Master’s in Language Analysis and Processing is to analyse language and to learn about the techniques and applications available for processing it with the help of the computer.

This Master’s has been organised by the UPV/EHU’s IXA Group and is geared towards anybody who combines linguistics and computing: philologists and linguists, computing and telecommunications engineers, mathematicians, translators, etc. To apply, it is enough to hold a university degree, have some experience and show some interest in the subject.

The Master’s will take a year and a half, and the classes will be held at the Computing Faculty of the UPV/EHU-University of the Basque Country. It will be possible to spread it over two or three academic years (to cater for working professionals).

The pre-registration period is already open, and applications will be accepted until June 30. For further information on the Master’s, please check out http://ixa.si.ehu.es/master/.

Seminar: ‘The Lexikoaren Behatokia project’ & ‘Enriching EDBL with Hiztegi Batua’ (12/11/2013)

Kepa Sarasola — Fri, 13 Dec 2013 09:01:11 +0000

Topics: The Lexikoaren Behatokia project (X. Artola) + Enriching EDBL with Hiztegi Batua (Gorka Labaka – Xabier Artola)
Speakers: Xabier Artola and Gorka Labaka
Day: December 11th 2013, Wednesday

In 2008 the Basque language academy, Euskaltzaindia, launched the Lexikoaren Behatokia project (“The Lexicon Observatory”), led by Andoni Sagarna. The objective of the project was to create a labelled and linguistically annotated corpus for research. To build the corpus, the plan was to draw on a variety of sources, mostly media, and especially general-interest media. In late 2012 the corpus consisted of 26,565,924 words, and it has been expanded year after year. Euskaltzaindia, the IXA research group, the Elhuyar foundation and UZEI collaborate on the project.

The Lexikoaren Behatokia corpus is available here.

In addition, the latest version of the prescriptive dictionary Hiztegi Batua has provided new entries for the Basque lexical database EDBL. This enrichment process was explained in the seminar.

Seminar. First steps towards Quechua’s processing. (2012/11/15)

Kepa Sarasola — Mon, 12 Nov 2012 20:00:50 +0000

Hugo and Richard visiting Aholab in Bilbao.

Speakers: Hugo Quispe and Richard Castro (Universidad UNSAAC of Cusco, Peru);
Olatz Arregi, Xabier Artola and Kepa Sarasola (Ixa Group)
Title: Primera aproximación al procesamiento automático del Quechua
(First steps towards Quechua’s processing.)
Date: November 15, 2012, Thursday
Time: 16:00–17:00
Where: Computer Science Faculty, Room 3.2

Abstract

Quechua (Runa Simi), the native language of Inca culture in Peru, is a family of languages in Latin America. The current situation of the language, owing to westernisation among other factors, has made Quechua a vulnerable language in danger of extinction.

We, a group of lecturers and researchers from the IXA Group of the UPV/EHU together with UNSAAC in Cusco, Peru, are working to lay the foundations of what is intended to become the language engineering centre of Cusco. The aim is to develop the first basic resources and tools for the automatic processing of Quechua. The topics we are working on are: the collection of a text corpus, a lexical database for the Quechua language (BDLQ) and future tools derived from it, the use of the FOMA tool for morphological analysis, and the creation of a TTS system, as basic tools for processing the language.

In this way, the foundations for support and teamwork between the two universities have been consolidated, for the benefit of a language in a critical situation.

Hugo and Richard visiting Ixa Group in Donostia.

Quechua (Runa Simi) is a native South American language family and dialect cluster spoken primarily in the Andes. It is the most widely spoken language family of the indigenous peoples of the Americas, with probably some 8 to 10 million speakers in total. Like Basque, Quechua remains alive, but over the last centuries it has suffered continuous regression, and the region in which it is spoken keeps shrinking. As happened with Basque, Quechua was not an official language and was kept out of the educational system, the media and industry. Today Quechua holds co-official language status in Peru and Bolivia, although it is not regulated. Although there have been several changes in recent years, Quechua is still stigmatised as uneducated and rural, and associated with low economic resources and little power, as Basque was some years ago. Language technology may help the Quechua-speaking community and scholars to build a standard, opening a door to Quechua’s future in the digital world. Corpus tools, lexical databases and spelling checkers have proven useful in this respect for other languages such as Basque.
The group created by Prof. Juan Cruz at UNSAAC University in Cusco (Peru) has been collaborating with the Ixa Group and Aholab since the beginning of 2012. In this seminar Hugo Quispe and Richard Castro will present the work they are doing on the definition of a lexical database and a TTS (text-to-speech) system for Quechua.

The group of Cusco (January 2012)

40th anniversary of the Centro de Lingüística Aplicada in Santiago de Cuba.

Kepa Sarasola — Fri, 25 Feb 2011 12:36:49 +0000

In January Iñaki Alegria participated in the XII Simposium de Comunicación Social organized by the Centro de Lingüística Aplicada (CLA) in Santiago de Cuba.

The title of his course was:

“Computational Morphology: trends, finite-states and open-source”
(Evolución de la morfología computacional: nuevas posibilidades)

Foma, the application developed by Mans Hulden (University of Helsinki), was the main tool used in this tutorial.

As the CLA Centre is celebrating its 40th anniversary this year, they have sent the Ixa Group a sculpture (see the picture) to commemorate our co-operation.

THANK YOU VERY MUCH!

And CONGRATULATIONS to Eloina, Julio Vitelio, Leonel, and all those compañeros that created this research centre and have been promoting it!

The IXA Group has been collaborating with the CLA for 10 years. One of the fruits of this collaboration is the third edition of the Diccionario Básico Escolar (DBE). This dictionary is encoded in XML and has been implemented using leXkit, an application developed by the Ixa Group for dictionary management.

Basque version of this news item / Berri hau euskaraz

Invited talk: Computational Semantics and Pragmatics (Rodolfo Delmonte, 2011/01/17,18)

Kepa Sarasola — Fri, 14 Jan 2011 22:20:34 +0000

Speaker: Rodolfo Delmonte (Università Ca’ Foscari, Venice, Italy).
Date: January 17 and 18, 2011
Time: 16:00 – 19:30
Where: Computer Science Faculty

ABSTRACT
These two sessions cover some of the most important aspects of Computational Semantics and Pragmatics including:
* Lexical Representations and Argument Structure
* Parsing with constituency or dependency structure
* Co-reference resolution
* Underspecified arguments
* Argumentative structure, subjectivity, factuality and sentiment analysis
* Textual Entailment
The talks follow a linguistically motivated approach with the use of ontologies and similar resources to deal with co-reference or textual entailment tasks. The talks are accompanied by several applications and demonstrations.

SHORT BIO
Rodolfo Delmonte is Associate Professor of Computational Linguistics at the University of Venice, where he is in charge of the corresponding course at BA, MA and Ph.D. level. A specialist in experimental phonetics and computational linguistics, he presents his research work at major international conferences and publishes articles in international journals. Every year he acts as a referee for, and publishes in, Speech Communication, International Journal of Speech Technologies, Journal of Natural Language Engineering and international conferences. He has been an invited speaker at a number of conferences, a teacher at international schools and, in the last five years, an invited professor in Boulder, Colorado at the CLSR, in Besançon at the Centre Tesnière, and in Dallas at UTD. Hot topics of his latest research include implicit entities and antecedents of omitted and underspecified arguments, as well as argumentative analysis, subjectivity, factuality and sentiment analysis.

project.cgm.unive.it/delmonte.html