Talk. Dan Jurafsky. Extracting many kinds of meaning from text and speech. (2011/09/13)

Kepa Sarasola — Wed, 07 Sep 2011 12:25:07 +0000

Speaker: Professor Dan Jurafsky (Stanford University).
Date: September 13, 2011
Time: 16:00
Where: Computer Science Faculty

Title: Extracting many kinds of meaning from text and speech.
Abstract:
Understanding natural language, while one of the oldest goals of artificial intelligence, is immensely difficult because language expresses so many kinds of meanings, embedded as it is in the rich social world of humans. In this talk I discuss work in our lab on extracting three kinds of meaning that link to the human world. We show how to learn world knowledge about events and their participants, ‘narrative schemes’ about how the world works, in a purely unsupervised way from large bodies of text. We show a new algorithm for the task of ‘coreference’: deciding when two mentions in a text refer to the same person or organization. Finally, we show how to automatically detect human interpersonal stances from speech and text cues in spoken conversation, detecting whether a speaker is friendly, awkward, or flirtatious. This talk describes joint work with Nate Chambers, Angel Chang, Heeyoung Lee, Chris Manning, Dan McFarland, Yves Peirsman, Karthik Raghunathan, Rajesh Ranganath, and Mihai Surdeanu.
BIO:
Dan Jurafsky is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received a B.A. in Linguistics in 1983 and a Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder. His research focuses on natural language understanding as well as the application of natural language processing to the behavioral and social sciences. Other research interests include the linguistics of Chinese and the linguistics of food. He is the recipient of a MacArthur Fellowship, and is the co-author with Jim Martin of the widely used textbook “Speech and Language Processing”. It was the first book to include in-depth descriptions of both text and speech technology. As teachers and students of Language Technology, we know this fine book very well.

News from OPENMT-2 project

Kepa Sarasola — Sun, 10 Apr 2011 16:28:00 +0000

Three pieces of news related to the OPENMT-2 project (2010-2012):

Gorka Labaka’s PhD thesis

In his PhD thesis (“EUSMT: Incorporating Linguistic Information to Statistical Machine Translation for Basque”), Labaka studied how Statistical Machine Translation (SMT) can handle translation from Spanish into Basque, a morphologically rich and less-resourced language. He found two ways to improve translation quality by using linguistic tools:

Morphological tools allowed him to perform translation at the level of word segments, thus avoiding sparseness problems in the corpora.
In addition, syntactic tools enabled the Spanish word segments to be rearranged into their corresponding order in Basque. This reordering helped the SMT decoder to find correct translations.
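To illustrate why translating at the segment level reduces sparseness, here is a toy sketch. It is not the method used in the thesis: the suffix list and the `segment` function are hypothetical, and a real system would rely on a morphological analyzer rather than on suffix stripping. The example inflects two Basque nouns (etxe, "house"; mendi, "mountain") and compares how many distinct units a model has to learn with and without segmentation.

```python
from collections import Counter

# Hypothetical suffix list, for illustration only (longest suffixes first).
# A real system would use a morphological analyzer instead.
SUFFIXES = ("tik", "an", "ra", "a")


def segment(word):
    """Split a word into [stem, +suffix] if it ends in a known suffix."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return [stem, "+" + suffix]
    return [word]


# Inflected forms of "etxe" (house) and "mendi" (mountain).
words = [
    "etxea", "etxean", "etxera", "etxetik",
    "mendia", "mendian", "mendira", "menditik",
]

word_counts = Counter(words)
segment_counts = Counter(seg for w in words for seg in segment(w))

print(len(word_counts))        # 8 distinct word forms, each seen once
print(len(segment_counts))     # 6 distinct segments
print(segment_counts["etxe"])  # the stem is now seen 4 times
```

With whole words, every form occurs only once in this tiny corpus; after segmentation, each stem and each suffix is observed several times, which gives a statistical model more evidence per unit.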

Recent research tends to focus on statistical systems and to ignore rule-based approaches. However, according to Gorka Labaka’s evaluation, the rule-based machine translation (RBMT) system and the state-of-the-art basic SMT system perform with similar quality when translating into Basque. His improved SMT system, based on segmentation and reordering, outperforms both the RBMT system and the basic SMT system by more than 10% on the HTER metric. In addition, he calculated that a hypothetical oracle system would yield a result a further 10% better; this oracle would select the improved SMT output for 55% of the sentences, the RBMT output for another 41%, and the EBMT output for 4%. He therefore concluded that, at least for morphologically rich languages with few resources, and hence few parallel corpora, the SMT approach is limited and the RBMT approach should not be ignored. We are currently experimenting with hybrid architectures that combine the Matxin (rule-based) and EUSMT (statistical) translation engines.
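The idea of an oracle system can be sketched in a few lines: for every sentence, pick the output of whichever engine obtained the best (lowest) HTER score, then see how often each engine is chosen. The code below is a minimal illustration under stated assumptions, not the evaluation code from the thesis; the function name `oracle_select`, the system labels, and all scores are made up for the example.

```python
from collections import Counter


def oracle_select(scores_by_system):
    """For each sentence, return the system with the lowest HTER score.

    scores_by_system maps a system name to its list of per-sentence
    HTER scores (lower is better); all lists must have the same length.
    """
    systems = list(scores_by_system)
    n_sentences = len(scores_by_system[systems[0]])
    return [
        min(systems, key=lambda s: scores_by_system[s][i])
        for i in range(n_sentences)
    ]


# Hypothetical per-sentence HTER scores for four sentences.
scores = {
    "improved-SMT": [0.30, 0.50, 0.20, 0.40],
    "RBMT": [0.40, 0.35, 0.25, 0.45],
    "EBMT": [0.60, 0.55, 0.70, 0.30],
}

choices = oracle_select(scores)
share = Counter(choices)

# Average HTER of the oracle versus the best single system.
oracle_mean = sum(scores[s][i] for i, s in enumerate(choices)) / len(choices)
best_single_mean = min(sum(v) / len(v) for v in scores.values())

print(choices)
print(dict(share))
print(oracle_mean, best_single_mean)
```

Because the oracle needs the reference scores to make its choice, it is only an upper bound: it shows how much a hybrid system could gain if it learned to choose between engines well.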

Visiting researcher Lluís Màrquez (NLPRG, Technical University of Catalonia, UPC)

With the aim of collaborating on this line of research, Lluís Màrquez, the lead researcher of the UPC team within the OPENMT-2 project, will be in Donostia visiting the Ixa group until the summer. He is an expert in integrating Machine Learning techniques into Language Technology. The first experiments on combining MT engines, carried out by Gorka Labaka, confirmed that there is room for improvement. Now we want to find the most suitable ways to achieve it.

Collaboration on Post-Editing with Basque Wikipedia (eu.wikipedia)

Within this project, a set of 60 long articles from the Spanish Wikipedia (adding up to more than 100,000 words) has been selected and translated into Basque using Matxin-Opentrad, our open-source rule-based machine translation system. Soon, in spring 2011, a group of Basque Wikipedia users will review them using a special interface that we have adapted from OmegaT. They will correct the errors they find, a process known as post-editing, and the changes they make will be logged. The corrected articles will be added to the Basque Wikipedia. In addition, the resulting post-editing logs will be used to enhance the machine translation process, either by manually improving the different modules of the MT system or by implementing an automated statistical post-editing process that is expected to improve translation accuracy. (paper in Wikimania 2010)

Machine Learning – Ixa Group. Language Technology.
