EUSTAGGER LITE

Lemmatizer/tagger for Basque

Eustagger Lite is a robust, wide-coverage morphological analyzer and Part-of-Speech tagger for Basque. It is an adapted version of Eustagger, a lemmatizer/tagger for Basque. It processes a text morphosyntactically in the following steps: tokenization, segmentation, application of the word grammar, treatment of multiword expressions and morphosyntactic disambiguation.

The morphological segmentation of words is based on a set of two-level rules converted into finite-state transducers. The analysis is performed in two main phases and yields all the possible analyses of each word in the text. The first phase is the standard analyzer, which analyzes and generates standard-language words based on a general lexicon and the corresponding rules for morphotactics and morphophonological changes. The second phase is the guesser, which analyzes words whose lemmas do not belong to that lexicon. Compared to Eustagger, this adaptation lacks the module for linguistic variants, which has been discarded to reduce the complexity of the process and make it more efficient.

After segmenting the word into its constituent morphemes, the word-grammar processor analyzes and elaborates the sequential intraword information in order to build the information of the word as a whole. For the detection of multiword expressions, we have integrated, for simplicity, a reduced version of the Eustagger processor, which detects only the most common expressions.

Once all possible morphological analyses have been assigned to each token or multi-token unit, PoS tagging and lemmatization must be performed in order to assign the correct lemma and grammatical category to each token, taking the context into account. The disambiguation is based on linguistic knowledge as well as statistical information. First, a set of Constraint Grammar rules is used to discard some analyses. After that, stochastic HMM disambiguation is applied to choose the final analysis.
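As a rough illustration of the two-stage disambiguation described above (rules first, then a stochastic choice), here is a minimal Python sketch. It is not Eustagger Lite's code: the rule, the tag names, the toy English sentence and the transition probabilities are all made up for the example, and the HMM is reduced to tag-transition scores with uniform emissions. The real tool relies on Constraint Grammar rules and a trained HMM.

```python
from math import log

# Toy input: each token comes with all of its candidate (lemma, tag) analyses.
# Hypothetical English data, used only to keep the example readable.
sentence = [
    ("the", [("the", "DET")]),
    ("saw", [("saw", "NOUN"), ("see", "VERB")]),
    ("cuts", [("cut", "VERB"), ("cut", "NOUN")]),
]

# Assumed tag-transition probabilities; "<s>" marks the sentence start.
TRANS = {
    ("<s>", "DET"): 0.6,
    ("DET", "NOUN"): 0.7,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
}


def rule_filter(tokens):
    """Stage 1: a single constraint-style rule that discards analyses.

    Toy rule: drop VERB readings right after an unambiguous DET,
    but never remove the last remaining analysis of a token.
    """
    out, prev_tags = [], set()
    for form, analyses in tokens:
        kept = analyses
        if prev_tags == {"DET"}:
            without_verbs = [a for a in analyses if a[1] != "VERB"]
            if without_verbs:
                kept = without_verbs
        out.append((form, kept))
        prev_tags = {tag for _, tag in kept}
    return out


def viterbi(tokens, trans, floor=1e-6):
    """Stage 2: choose one analysis per token by the best-scoring tag path."""
    if not tokens:
        return []
    best = []  # best[i][k] = (log score, index of best previous analysis)
    for i, (_, analyses) in enumerate(tokens):
        column = []
        for _, tag in analyses:
            if i == 0:
                column.append((log(trans.get(("<s>", tag), floor)), None))
            else:
                column.append(max(
                    (best[i - 1][j][0] + log(trans.get((ptag, tag), floor)), j)
                    for j, (_, ptag) in enumerate(tokens[i - 1][1])
                ))
        best.append(column)
    # Follow the back-pointers from the best final analysis.
    k = max(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(tokens) - 1, -1, -1):
        path.append(tokens[i][1][k])
        k = best[i][k][1]
    return path[::-1]


filtered = rule_filter(sentence)
chosen = viterbi(filtered, TRANS)
print(chosen)
```

In this toy run the rule removes the verb reading of "saw" after the determiner, and the transition scores then prefer the verb reading of "cuts" after a noun. Running the rules first shrinks the search space that the statistical step has to consider.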

Eustagger has been evaluated on the test set of the EPEC corpus (50,000 words), obtaining an accuracy of 95.17% on PoS tagging and 91.89% when all morphological information is considered. Although the results of this reduced tool have yet to be confirmed, we expect them to be similar.

Eustagger Lite has been developed by the IXA NLP group of the University of the Basque Country.

The Eustagger Lite package includes ixa-pipe-pos-eu, a version of Eustagger Lite adapted to work as an ixaKat tool. ixaKat is a modular chain of Natural Language Processing tools for Basque, and ixa-pipe-pos-eu is the first tool in this linguistic processing chain. The tool takes raw text as input and outputs the lemma, the PoS tag and the morphological information for each token in NAF format. You will find more information about ixaKat tools by following this link: ixaKat tools.
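To give an idea of how such NAF output might be consumed, the following Python sketch reads the lemma, PoS tag and morphological information of each term from a NAF-like XML string. The element and attribute names (wf, term, lemma, pos, morphofeat, span/target) follow the general NAF layout as we understand it. The sample document, its tag values and the morphofeat strings are illustrative placeholders, not actual ixa-pipe-pos-eu output, so check them against a real file before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written, NAF-like sample. All values are illustrative placeholders.
SAMPLE_NAF = """<NAF version="v1.naf">
  <text>
    <wf id="w1" sent="1">Etxea</wf>
    <wf id="w2" sent="1">handia</wf>
    <wf id="w3" sent="1">da</wf>
  </text>
  <terms>
    <term id="t1" lemma="etxe" pos="IZE" morphofeat="placeholder-1">
      <span><target id="w1"/></span>
    </term>
    <term id="t2" lemma="handi" pos="ADJ" morphofeat="placeholder-2">
      <span><target id="w2"/></span>
    </term>
    <term id="t3" lemma="izan" pos="ADT" morphofeat="placeholder-3">
      <span><target id="w3"/></span>
    </term>
  </terms>
</NAF>"""


def read_terms(naf_xml):
    """Return one (form, lemma, pos, morphofeat) tuple per term element."""
    root = ET.fromstring(naf_xml)
    # Map word-form ids to their surface text.
    words = {wf.get("id"): wf.text for wf in root.iter("wf")}
    rows = []
    for term in root.iter("term"):
        # A term may span several word forms (e.g. a multiword expression).
        form = " ".join(words[t.get("id")] for t in term.iter("target"))
        rows.append((form, term.get("lemma"), term.get("pos"),
                     term.get("morphofeat")))
    return rows


terms = read_terms(SAMPLE_NAF)
for form, lemma, pos, feats in terms:
    print(form, lemma, pos, feats, sep="\t")
```

Joining the targets of each span keeps multi-token terms together, which matters because the tagger also treats multiword expressions as single units.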

News:

2016.03.4 We've imported the Eustagger Lite source code from IXA's internal SVN server to GitHub.

Mailing List

If you have any questions or problems, please send a message to the following mailing list: Eustagger mailing list.

Source code

The git source code repository is on GitHub. Using git, you can get the whole repository by running:

git clone https://github.com/ixa-ehu/eustagger-lite

Download

You can download snapshots of the code and the necessary dependencies here.

License

All the original code produced for Eustagger Lite is licensed under the GPL v3 free license.

This software uses some external libraries that have their own licenses and copyright owners:

PCRE++: Copyright (C) 2002-2003, Thomas Linden. GNU Lesser General Public License
VISL CG-3: Copyright (C) 2007-2013, GrammarSoft ApS. GNU General Public License v3
SWI Prolog: Copyright (C) 2008, University of Amsterdam. GNU General Public License v2
Foma: Copyright (C) 2008-2012, Mans Hulden. GNU General Public License v2
Freeling: Copyright (C) 2004, TALP Research Center, Universitat Politecnica de Catalunya. GNU General Public License v3
Boost: Copyright (C) 2004-2006, Joe Coder. Boost Software License

Documentation

At the moment the documentation is very limited, and we are working to improve it. Meanwhile, you can check the building instructions available here.

References

If you use Eustagger Lite, please cite the following paper in your academic work:

Ezeiza, N., Alegria, I., Arriola, J. M., Urizar, R., & Aduriz, I. 1998. Combining stochastic and rule-based methods for disambiguation in agglutinative languages. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1, pp. 380-384. Association for Computational Linguistics.

Other publications:

Alegria I., Aranzabe M., Ezeiza A., Ezeiza N., Urizar R. 2002. Robustness and customisation in an analyser/lemmatiser for Basque. LREC-2002 Customizing knowledge in NLP applications Workshop.
Alegria I., Artola X., Sarasola K., Urkia M. 1996. Automatic morphological analysis of Basque. Literary & Linguistic Computing Vol. 11, No. 4, 193-203. Oxford University Press. Oxford.

Patents:

EUSLEM (2001) 00/2002/2475

Contact

You may contact us at ixa.teknikaria at ehu.es

Page last edited: 2017.08.30