Last January our colleague Arantxa Otegi won the III Koldo Mitxelena Award for PhD Theses, organized by Euskaltzaindia (the Academy of the Basque Language) and the University of the Basque Country.
CONGRATULATIONS Arantxa!
Congratulations to her supervisors (Xabier Arregi and Eneko Agirre).
The title of this thesis is ‘Expansion for information retrieval: contribution of word sense disambiguation and semantic relatedness’.
The whole text is available here. This is the abstract:
Information retrieval (IR) aims at finding documents which satisfy the information need of a user. To that end, an IR system informs the user about relevant documents, that is, those documents that contain the information they need as formulated in the query. Well-known search engines like Google and Yahoo are prime examples of IR systems.
A perfect IR system should retrieve only, and all, the relevant documents, rejecting the non-relevant ones. However, perfect retrieval systems do not exist. One of the main problems is the so-called vocabulary mismatch problem between query and documents: some documents might be relevant to the query even if the specific terms used differ substantially, and some documents might not be relevant to the query even if they have some terms in common. The former is because several words or phrases can be used to express the same idea or item (synonymy). The latter is caused by ambiguity, where one word can have more than one interpretation depending on the context. Owing to these facts, if an IR system relies only on terms occurring in both the query and the document when deciding whether a document is relevant, it might be difficult to find some of the interesting documents, and also to reject non-relevant documents. It seems fair to think that there will be more chances of successful retrieval if the meaning of the text is also taken into account.
Even though the vocabulary mismatch problem has been widely discussed in the literature from the early days of IR, it remains unsolved, and most search engines just ignore it. This PhD dissertation explores whether natural language processing (NLP) can be used to alleviate this problem.
In a nutshell, we expand queries and documents making use of two NLP techniques: word sense disambiguation and semantic relatedness. For each of these techniques we propose an expansion strategy, in which we obtain synonyms and other related words for the words in the query and documents. We also present, for each case, a method to combine the expansions and original words effectively in an IR system. Furthermore, as the expansion technique we propose is useful for translating queries and documents, we show how a cross-lingual information retrieval system could be improved using such an expansion technique.
Our extensive experiments on three datasets show that the expansion methods explored in this dissertation help overcome the mismatch problem, consequently improving the effectiveness of an IR system.
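The expansion idea described in the abstract can be illustrated with a minimal sketch. It is not the method of the thesis: the `RELATED` dictionary is a hypothetical hand-written stand-in for the related words that the thesis obtains through word sense disambiguation and semantic relatedness, and the function names, the 0.5 weight and the simple overlap score are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical toy thesaurus (an assumption for this sketch); the thesis
# derives related words with word sense disambiguation and semantic
# relatedness instead of a fixed dictionary.
RELATED = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand(terms, related=RELATED, weight=0.5):
    """Return {term: weight}; original terms get 1.0, expansions get `weight`."""
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for r in related.get(t, []):
            # Never lower the weight of a term that is already present.
            weighted[r] = max(weighted.get(r, 0.0), weight)
    return weighted


def score(query_terms, doc_terms):
    """Weighted term overlap between the expanded query and a document."""
    expanded_query = expand(query_terms)
    counts = Counter(doc_terms)
    return sum(w * counts[t] for t, w in expanded_query.items())


docs = {
    "d1": "cheap automobile for sale".split(),
    "d2": "film festival schedule".split(),
}

# Rank documents for the query "car"; d1 shares no literal term with the
# query and is matched only through the expansion term "automobile".
ranking = sorted(docs, key=lambda d: score(["car"], docs[d]), reverse=True)
```

Giving expansion terms a lower weight than the original query terms is one simple way to combine both kinds of evidence; the weighting scheme here is an assumption and says nothing about the combination methods actually proposed in the thesis.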
The volume of news data is enormous and expanding, covering billions of archived documents and millions of documents as daily streams, while at the same time getting more and more interconnected with knowledge provided elsewhere. Professional decision-makers who need to respond quickly to new developments are faced with the problem that current solutions for consulting these archives and streams no longer work. Consequently, it becomes almost impossible to make well-informed decisions, and professionals risk being held liable for decisions based on incomplete, inaccurate and out-of-date information.
NewsReader will develop a decision-support tool that allows professional decision-makers to explore these story lines using visual interfaces and interactions to exploit their explanatory power and their systematic structural implications. The goals are to extract what happened to whom, when, and where; to align the extracted information while storing its provenance and not discarding any of it; to distinguish unfolding story lines; and to assist financial decision support by explaining current events. Likewise, NewsReader can make predictions from the past on future events or explain new events and developments through the past. The tool will be tested by professional decision-makers in the financial and economic area.
IXA Group is participating with five other partners in a new European project: PATHS (2010-2012).
The PATHS project (Personalised Access To cultural Heritage Spaces) primarily addresses objective ICT-2009.4.1: Digital Libraries and Digital Preservation. It relates to target outcome (d), adaptive cultural experiences, by creating personalised views of various forms of cultural expression, adapting these views to the background and cognitive context of the user and offering meaningful guidance about the interpretation of cultural works. PATHS will make important progress in this direction.
Europeana: Significant amounts of cultural heritage material are now available through online digital library portals. However, this vast amount of cultural heritage material can also be overwhelming for many users who are provided with little or no guidance on how to find and interpret this information.
The PATHS project will create a system that acts as an interactive personalised tour guide through existing digital library collections. The system will offer suggestions about items to look at and assist in their interpretation. Navigation will be based around the metaphor of a path through the collection. A path can be based around any theme, for example artist and media (“paintings by Picasso”), historic periods (“the Cold War”), places (“Venice”) and famous people (“Muhammad Ali”). Users will be able to construct their own paths or follow pre-defined ones.
The PATHS project will provide users with innovative ways to access and utilise the contents of digital libraries that enrich their experiences of these resources. This will be achieved by extending the state-of-the-art in user-driven information access and by applying language technologies to analyse and enrich online content. The project will take a user-centred approach to development to accommodate the needs, interests and preferences of different types of users.
These goals shall be realised through the following objectives:
Therefore, the project will conduct research in the following areas:
Information Access: The project will develop user-driven navigation through collections of information, gathering the users’ requirements and modelling them.
Content interpretation and enrichment: Representation and sharing of information about items, and identifying background information related to the items in cultural heritage collections.
IXA Group will work mainly on content processing and enrichment. This means that content from Cultural Heritage sources will be processed into a multi-layered network and augmented with additional information that will enrich the user’s experience. The additional information will include links between items in the collection and links to external sources like Wikipedia or other relevant collections. The resulting multi-layered network will form the basis for the paths used to navigate the collection.
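As a rough illustration of how a multi-layered network of items could support paths, here is a minimal sketch. It does not describe the PATHS system itself: the `ItemNetwork` class, the layer names and the item identifiers are all hypothetical, and links are treated as directed purely as a simplifying assumption.

```python
from collections import defaultdict


class ItemNetwork:
    """Toy multi-layered network: each layer holds its own directed links."""

    def __init__(self):
        # layer name -> source item id -> set of target ids
        self.layers = defaultdict(lambda: defaultdict(set))

    def link(self, layer, source, target):
        """Add a directed link from source to target in the given layer."""
        self.layers[layer][source].add(target)

    def neighbours(self, item):
        """All targets reachable from `item` in one step, across every layer."""
        found = set()
        for links in self.layers.values():
            found |= links.get(item, set())
        return found

    def is_path(self, items):
        """True if every consecutive pair of items is joined by some link."""
        return all(b in self.neighbours(a) for a, b in zip(items, items[1:]))


# Hypothetical identifiers and layer names, used only for this example.
net = ItemNetwork()
net.link("similar_items", "item:guernica", "item:weeping_woman")
net.link("similar_items", "item:weeping_woman", "item:dora_maar_portrait")
net.link("external", "item:guernica", "wikipedia:Guernica_(Picasso)")

picasso_path = ["item:guernica", "item:weeping_woman", "item:dora_maar_portrait"]
valid = net.is_path(picasso_path)
```

Keeping each kind of link in its own layer makes it possible to follow item-to-item links when building a path while still exposing external links as background information for each item.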
The PATHS consortium contains six partners.