discourse – Language technologies, the Ixa Group's blog

Talk: Discourse structure in machine translation evaluation (L. Màrquez, 2015/06/15).

ixa — Tue, 23 Jun 2015 14:30:36 +0000

Speaker: Lluis Màrquez
Arabic Language Technologies group from the Qatar Computing Research
Date: June 15, Thursday
Time: 12:00
Room: Room 3.2, Faculty of Informatics (UPV/EHU)

Talk title: “Discourse Structure in Machine Translation Evaluation”

Talk: Rhetorical Structure Theory (T. Pardo, 2014/02/27)

ixa — Tue, 25 Feb 2014 08:41:13 +0000

Speaker: Thiago Pardo

He is a professor and researcher at the Instituto de Ciências Matemáticas e de Computação (ICMC) of the Universidade de São Paulo (USP), Brazil.
His research lines are automatic summarization, automatic discourse analysis, automatic simplification and machine translation. Particularly noteworthy are the enriched corpora and the tools for analyzing discourse structure that his research has made available to the scientific community.

Date: Thursday, February 27, 2014
Time: 3:30 p.m.
Where: Faculty of Informatics, Room 3.1
Title:

“Rhetorical Structure Theory: relational discourse structure annotation”

Content:

Thiago A.S. Pardo has taken part in building several tools that analyze discourse, and he will tell us about that experience. Among others, these are some of those tools:

CSTNews interface – access to 50 clusters of news texts and their multidocument summaries, with texts annotated according to the Cross-document Structure Theory

CSTTool – a semi-automatic edition tool for annotating texts according to the Cross-document Structure Theory

DiZer 2.0 – an on-line version of DiZer, which is easily adaptable and portable to different text types/genres and languages

RSTeval – tool for discourse parsing evaluation, following Marcu (2000) evaluation method – the tool is able to compare RST trees (automatically or manually produced), producing precision and recall numbers

CorpusTCC – corpus of 100 Brazilian Portuguese scientific texts (from Computer Science domain – introduction sections of theses), marked by Marcu’s RSTTool (using this relation set), used for developing DiZer

RhetDB – Rhetorical Database – an edition environment for handling the rhetorical analyses produced by Daniel Marcu’s RSTTool; it offers several computational facilities for both computational and linguistic purposes. (this is an old version of the software; for better and more advanced features, use RST Toolkit above)
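
As a rough illustration of the kind of comparison the RSTeval entry above describes (precision and recall between two RST trees), here is a minimal Python sketch. It assumes each tree has already been flattened into a set of (start, end, relation) tuples; that representation and the function name `precision_recall` are assumptions made for illustration, not RSTeval's actual interface.

```python
def precision_recall(predicted, gold):
    """Compare two collections of (start, end, relation) tuples.

    Hypothetical helper: each tuple stands for one labelled span of an RST tree.
    """
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)  # spans present in both trees
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Toy trees over four elementary discourse units (invented data).
gold_tree = {(1, 1, "span"), (1, 2, "elaboration"), (3, 4, "contrast"), (1, 4, "background")}
parsed_tree = {(1, 2, "elaboration"), (3, 4, "contrast"), (1, 4, "cause")}

precision, recall = precision_recall(parsed_tree, gold_tree)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Dropping the relation label from each tuple would score span boundaries only, which is a common way to report a more lenient figure alongside the labelled one.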

Talk: Automatic summarization using discourse knowledge, text simplification and co-reference (T. Pardo, 2014/02/28)

ixa — Tue, 25 Feb 2014 08:22:06 +0000

Speaker: Thiago Pardo

He is a professor and researcher at the Instituto de Ciências Matemáticas e de Computação (ICMC) of the Universidade de São Paulo (USP), Brazil. His research lines are automatic summarization, automatic discourse analysis, automatic simplification and machine translation. Particularly noteworthy are the enriched corpora and the tools for analyzing discourse structure that his research has made available to the scientific community.

Date: Friday, February 28, 2014
Time: 10:30 a.m.
Where: Faculty of Informatics, Room 3.2
Title:

“Text summarization using discourse knowledge. Text simplification and co-reference”

Content:

Besides working on discourse structure, Thiago A.S. Pardo has also taken part in building several automatic summarization tools, and he will tell us about that experience. Among others, these are some of those tools:

Summarization extension to Google Chrome – extension for on-line news summarization, based on RSumm system

TextTiling for Portuguese – topical segmentation tool adapted to news texts in Brazilian Portuguese, based on the work of Hearst (1997)

CSTSumm – a multi-document summarizer based on CST information (see README.txt in the rar file)

CSTNews – a corpus with 50 clusters of news texts – in Portuguese – with their multi-document summaries, as well as several discourse and semantic annotations

TeMário 2006 – 150 news texts and the corresponding human summaries, which complement the original TeMário corpus, resulting in a corpus of 250 texts for summarization purposes

DMSumm – Discourse Modeling SUMMarizer

NeuralSumm – NEURAL network for SUMMarization (for scientific texts) – with tools for training the system with new data, if necessary

GistSumm – GIST SUMMarizer

Talk: V. Kordoni. Automatic extraction of multiword terms (2011/11/25)

ixa — Mon, 14 Nov 2011 16:32:54 +0000

Detecting and understanding multiword terms is not an easy problem. Researcher Valia Kordoni, who is visiting us from the renowned DFKI laboratory in Saarbrücken, Germany, will talk to us about it: how to extract multiword terms automatically and how to use them to build multilingual grammars.

Topic: Automated Annotation and Acquisition of Linguistic Knowledge for Efficient Multilingual Grammar Engineering (automatic extraction of multiword terms for building multilingual grammars).
Place: Room 3.2, Faculty of Informatics
Speaker: Valia Kordoni (LT-Lab DFKI GmbH & Dept. of Computational Linguistics, Saarland University).
Date: November 25
Time: 16:00-18:00

Abstract

In this talk, I mainly deal with automated acquisition of linguistic knowledge as a means of enhancing robustness of lexicalised grammars for real life applications. The case study I focus on for the best part of this talk is Multiword Expressions (henceforward MWEs). Specifically, in the first part of the talk I am taking a closer look at the linguistic properties of MWEs, in particular, their lexical, syntactic, as well as semantic characteristics. The term Multiword Expressions has been used to describe expressions for which the syntactic or semantic properties of the whole expression cannot be derived from its parts (cf., Sag et al., 2002), including a large number of related but distinct phenomena, such as phrasal verbs (e.g., “come along”), nominal compounds (e.g., “frying pan”), institutionalised phrases (e.g., “bread and butter”), and many others. Jackendoff (1997) estimates the number of MWEs in a speaker’s lexicon to be comparable to the number of single words.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (cf., Sag et al., 2002). For instance, some MWEs are fixed, and do not present internal variation, such as “ad hoc”, while others allow different degrees of internal variability and modification, such as “spill beans” (“spill several/musical/mountains of beans”). With the observations about the linguistic properties of MWEs at hand, I turn in the second part of the talk to methods for the automated acquisition of these properties for robust grammar engineering. To this effect, I first investigate the hypothesis that MWEs can be detected by the distinct statistical properties of their component words, regardless of their type, comparing various statistical measures, a procedure which leads to extremely interesting conclusions. I then investigate the influence of the size and quality of different corpora, using the BNC and the Web search engines Google and Yahoo. I conclude that, in terms of language usage, web-generated corpora are fairly similar to more carefully built corpora, like the BNC, indicating that the lack of control and balance of these corpora is probably compensated by their size. Then, I show a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources. To this effect, I first discuss two main approaches commonly employed in NLP for treating MWEs: the words-with-spaces approach, which models an MWE as a single lexical entry and can adequately capture fixed MWEs like “by and large”, and compositional approaches, which treat MWEs by general and compositional methods of linguistic analysis, being able to capture more syntactically flexible MWEs, like “rock boat”, which cannot be satisfactorily captured by a words-with-spaces approach, since this would require lexical entries to be added for all the possible variations of an MWE (e.g., “rock/rocks/rocking this/that/his…boat”).
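
The contrast between the two approaches can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the toy lemma table, the function names and the `max_gap` parameter are hypothetical and are not taken from the talk or from any grammar engineering toolkit.

```python
from itertools import product

# Toy lemma table (hypothetical); a real system would use a morphological analyser.
LEMMAS = {"rocks": "rock", "rocking": "rock", "rocked": "rock"}


def lemma(word):
    """Return the lemma of a word, falling back to the word itself."""
    return LEMMAS.get(word, word)


def words_with_spaces_lexicon(verb_forms, modifiers, noun):
    """Words-with-spaces: one fixed lexical entry per surface variant."""
    return {f"{verb} {mod} {noun}" for verb, mod in product(verb_forms, modifiers)}


def matches_compositionally(tokens, verb_lemma, noun, max_gap=2):
    """Compositional view: the verb lemma followed by the noun,
    with at most max_gap words in between."""
    for i, token in enumerate(tokens):
        if lemma(token) == verb_lemma and noun in tokens[i + 1 : i + 2 + max_gap]:
            return True
    return False


lexicon = words_with_spaces_lexicon(
    ["rock", "rocks", "rocking"], ["this", "that", "his"], "boat"
)
print(len(lexicon))                  # 9 entries for only 3 verb forms x 3 modifiers
print("rocked the boat" in lexicon)  # False: this variant was never listed
print(matches_compositionally("she rocked the boat".split(), "rock", "boat"))  # True
```

The listed lexicon grows multiplicatively with every verb form and modifier and still misses unseen variants, whereas the single lemma-based rule covers them.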
On this basis, I argue that the process of the automatic addition of extracted MWEs to existing linguistic resources improves qualitatively if a more compositional approach to automated grammar/lexicon extension is adopted.
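
To illustrate the statistical detection hypothesis mentioned in the abstract, the following Python sketch ranks adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which statistical measures were compared, so the choice of PMI is an assumption made here for illustration, as are the function name `pmi_scores`, the `min_count` threshold and the toy token list.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI: log2(P(w1, w2) / (P(w1) * P(w2))).

    Pairs seen fewer than min_count times are skipped, since PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy corpus: "ad hoc" always occurs together, "the" combines freely.
toy_tokens = "the ad hoc plan the ad hoc rule the plan the rule".split()
toy_scores = pmi_scores(toy_tokens)
for pair, score in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy corpus the pair ("ad", "hoc") gets the highest score because its words rarely occur apart, which is the intuition behind using association measures to propose MWE candidates.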

Finally, I also propose that the methods developed for the acquisition of linguistic knowledge in the case of the English MWEs can be tuned to enhance robustness of lexicalised grammars for languages with richer morphology and freer word order, as is the case of German, and can benefit from gold standard syntactically and semantically annotated corpora, for the (semi-automated) development of which I am briefly showing a very simple statistical ranking model which significantly improves treebanking efficiency by prompting human annotators to the most relevant linguistic annotation decisions.