Information Access – Ixa Group. Language Technology.

Koldo Mitxelena award for PhD theses to Arantxa Otegi

Kepa Sarasola — Thu, 07 Feb 2013 17:39:59 +0000

Last January our colleague Arantxa Otegi won the 3rd Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.

CONGRATULATIONS Arantxa!

Congratulations to her supervisors (Xabier Arregi and Eneko Agirre).

The title of this thesis is ‘Expansion for information retrieval: contribution of word sense disambiguation and semantic relatedness’.

The whole text is available here. This is the abstract:

Information retrieval (IR) aims at finding documents which satisfy the information need of a user. In that way, an IR system informs the user about relevant documents, that is, those documents that contain the information they need as formulated in the query. Well-known search engines like Google and Yahoo are prime examples of IR systems.
A perfect IR system should retrieve only, and all, the relevant documents, rejecting the non-relevant ones. However, perfect retrieval systems do not exist. One of the main problems is the so-called vocabulary mismatch problem between query and documents: some documents might be relevant to the query even if the specific terms used differ substantially, and some documents might not be relevant to the query even if they have some terms in common. The former is because several words or phrases can be used to express the same idea or item (synonymy). The latter is caused by ambiguity, where one word can have more than one interpretation depending on the context. Owing to these facts, if an IR system relies only on terms occurring in both the query and the document when it comes to deciding whether a document is relevant, it might be difficult to find some of the interesting documents, and also to reject non-relevant documents. It seems fair to think that there will be more chances of successful retrieval if the meaning of the text is also taken into account.
Even though the vocabulary mismatch problem has been widely discussed in the literature from the early days of IR, it remains unsolved, and most search engines just ignore it. This PhD dissertation explores whether natural language processing (NLP) can be used to alleviate this problem.
In a nutshell, we expand queries and documents making use of two NLP techniques: word sense disambiguation and semantic relatedness. For each of these techniques we propose an expansion strategy, in which we obtain synonyms and other related words for the words in the query and documents. We also present, for each case, a method to combine the expansions and original words effectively in an IR system. Furthermore, as the expansion technique we propose is useful for translating queries and documents, we show how a cross-lingual information retrieval system could be improved using such an expansion technique.

Our extensive experiments on three datasets show that the expansion methods explored in this dissertation help overcome the mismatch problem, consequently improving the effectiveness of an IR system.
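
As a rough illustration of the expansion idea described in the abstract, here is a minimal Python sketch. It is not the method of the thesis: the `RELATED` table is a hand-written, hypothetical stand-in for the related words that word sense disambiguation and semantic relatedness would provide, and the `expand` and `score` functions, the term-counting score and the 0.5 weight for expanded terms are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy table of related words. A real system would obtain these
# from a lexical resource, after disambiguating each word in its context.
RELATED = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}


def expand(terms, weight=0.5):
    """Return {term: weight}: original terms weigh 1.0, expansions `weight`."""
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for related in RELATED.get(t, []):
            # Never overwrite the weight of an original query term.
            weighted.setdefault(related, weight)
    return weighted


def score(query_terms, document, weight=0.5):
    """Sum the weights of the (expanded) query terms found in the document."""
    counts = Counter(document.lower().split())
    return sum(w * counts[t] for t, w in expand(query_terms, weight).items())


docs = [
    "cheap automobile for sale",
    "car wash opening hours",
    "weather forecast for monday",
]
query = ["buy", "car"]

# With weight 0.0 the expansions contribute nothing (plain term matching).
plain = [score(query, d, weight=0.0) for d in docs]
expanded = [score(query, d) for d in docs]
```

With plain matching the first document scores zero although it is about cars; with expansion it gets a non-zero score, while the unrelated third document still scores zero. The sketch only touches the synonymy side of the mismatch problem: the second document ("car wash") still matches the query, which is the ambiguity side that disambiguation is meant to address.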

Ixa Group at the kick-off meeting of the NewsReader Project

Kepa Sarasola — Tue, 05 Feb 2013 15:50:53 +0000

Ixa Group is one of the five partners in the consortium of the NewsReader Project (EU FP7 programme, grant 316404, Jan. 2013 – Dec. 2015), which was presented on Wednesday 23 January at VU University Amsterdam. These are the five partners in the consortium:

The volume of news data is enormous and expanding, covering billions of archived documents and millions of documents as daily streams, while at the same time getting more and more interconnected with knowledge provided elsewhere. Professional decision-makers who need to respond quickly to new developments are faced with the problem that current solutions for consulting these archives and streams no longer work. Consequently, it becomes almost impossible to make well-informed decisions, and professionals risk being held liable for decisions based on incomplete, inaccurate and out-of-date information.

NewsReader will develop a decision-support tool that allows professional decision-makers to explore these story lines using visual interfaces and interactions to exploit their explanatory power and their systematic structural implications. The goal is to extract what happened to whom, when, and where; to align this information while storing its provenance and without discarding anything; to distinguish unfolding story lines; and to assist financial decision support by explaining current events. Likewise, NewsReader can make predictions from the past on future events or explain new events and developments through the past. The tool will be tested by professional decision-makers in the financial and economic area.

Talk. Dan Jurafsky. Extracting many kinds of meaning from text and speech. (2011/09/13)

Kepa Sarasola — Wed, 07 Sep 2011 12:25:07 +0000

Speaker: Professor Dan Jurafsky (Stanford University).
Date: September 13, 2011
Time: 16:00
Where: Computer Science Faculty

Title: Extracting many kinds of meaning from text and speech.
Abstract:
Understanding natural language, while one of the oldest goals of artificial intelligence, is immensely difficult because language expresses so many kinds of meanings, embedded as it is in the rich social world of humans. In this talk I discuss work in our lab on extracting three kinds of meaning that link to the human world. We show how to learn world knowledge about events and their participants, ‘narrative schemes’ about how the world works, in a purely unsupervised way from large bodies of text. We show a new algorithm for the task of ‘coreference’: deciding when two mentions in a text refer to the same person or organization. Finally, we show how to automatically detect human interpersonal stances from speech and text cues in spoken conversation, detecting whether a speaker is friendly, awkward, or flirtatious. This talk describes joint work with Nate Chambers, Angel Chang, Heeyoung Lee, Chris Manning, Dan McFarland, Yves Peirsman, Karthik Raghunathan, Rajesh Ranganath, and Mihai Surdeanu.
BIO:
Dan Jurafsky is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received a B.A. in Linguistics in 1983 and a Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder. His research focuses on natural language understanding as well as the application of natural language processing to the behavioral and social sciences. Other research interests include the linguistics of Chinese and the linguistics of food. He is the recipient of a MacArthur Fellowship, and is the co-author with Jim Martin of the widely used textbook “Speech and Language Processing”. It was the first book that included deep descriptions of both text and speech technology. As teachers and students of Language Technology, we know this fine book very well.

A new European Project: PATHS

Kepa Sarasola — Fri, 04 Feb 2011 13:42:28 +0000

IXA Group is participating with five other partners in a new European project: PATHS (2010-2012).
The PATHS project (Personalised Access To cultural Heritage Spaces) primarily addresses objective ICT-2009.4.1: Digital Libraries and Digital Preservation. It relates to target outcome (d), adaptive cultural experiences, by creating personalised views of various forms of cultural expression, adapting these views to the background and cognitive context of the user and offering meaningful guidance about the interpretation of cultural works. PATHS will make important progress in this direction.

Europeana: Significant amounts of cultural heritage material are now available through online digital library portals. However, this vast amount of cultural heritage material can also be overwhelming for many users who are provided with little or no guidance on how to find and interpret this information.

The PATHS project will create a system that acts as an interactive personalised tour guide through existing digital library collections. The system will offer suggestions about items to look at and assist in their interpretation. Navigation will be based around the metaphor of a path through the collection. A path can be based around any theme, for example artist and media (“paintings by Picasso”), historic periods (“the Cold War”), places (“Venice”) and famous people (“Muhammad Ali”). Users will be able to construct their own paths or follow pre-defined ones.

The PATHS project will provide users with innovative ways to access and utilise the contents of digital libraries that enrich their experiences of these resources. This will be achieved by extending the state-of-the-art in user-driven information access and by applying language technologies to analyse and enrich online content. The project will take a user-centred approach to development to accommodate the needs, interests and preferences of different types of users.

These goals shall be realised through the following objectives:

- Analysis of users’ requirements for access to Cultural Heritage collections
- Organisation and enrichment of Cultural Heritage content for use within a navigation system
- Implementation of a system for navigating Cultural Heritage resources
- Techniques for providing personalised access to Cultural Heritage content
- Porting the navigation system for use on mobile devices and Facebook
- Evaluation with user groups and in field trials

Therefore, the project will conduct research in the following areas:

- Information Access: The project will develop user-driven navigation through collections of information, gathering the users’ requirements and modelling them.
- Educational Informatics: Adapting to individual learners in relation to being directed and being allowed the freedom to explore autonomously.
- Content interpretation and enrichment: Representation and sharing of information about items, and identifying background information related to the items in cultural heritage collections.

IXA Group will work mainly on content processing and enrichment. This means that content from Cultural Heritage sources will be processed into a multi-layered network and augmented with additional information that will enrich the user’s experience. The additional information will include links between items in the collection and to external sources like Wikipedia or other relevant collections. The resulting multi-layered network will form the basis for the paths used to navigate the collection.

The PATHS consortium contains six partners.

Two academic institutions:
- University of Sheffield (USFD), UK (Coordinator).
- Universidad del País Vasco / Euskal Herriko Unibertsitatea (UPV/EHU), Basque Country.
Two SMEs:
- i-sieve Technologies Ltd., Greece.
- Asplan Viak Internet (Avinet), Norway.
Two cultural heritage enterprises:
- MDR Partners, UK (Project Management).
- Alinari 24 Ore Spa, Italy.