ixa-pipe-pos-eu

ixa-pipe-pos-eu is a robust, wide-coverage morphological analyzer and PoS tagger. It is an adapted version of Eustagger, a lemmatizer/tagger for Basque, and is implemented in C++.

It is the first module of the linguistic processing chain. The tool takes raw text as input and outputs the lemma, the PoS tag and the morphological information for each token in NAF format.

Download

You can download a pre-compiled binary package for the latest stable version from the following links:

Source code

The source code for the latest development version can be downloaded or cloned from the Eustagger Lite page on GitHub.

License

All the original code produced for ixa-pipe-pos-eu is licensed under the GPL v3 free software license.

This software uses external libraries, each with its own license and copyright owner:

How to cite

If you use the ixa-pipe-pos-eu tool, please cite the following paper in your academic work:

Arantxa Otegi, Nerea Ezeiza, Iakes Goenaga and Gorka Labaka. A Modular Chain of NLP Tools for Basque. In Proceedings of the 19th International Conference on Text, Speech and Dialogue - TSD 2016, Brno, Czech Republic, volume 9924 of Lecture Notes in Artificial Intelligence, pp. 93-100. 2016

Platform and requirements

The ready-to-use packages are available only for Linux.

To use the tool on other platforms, download the source code and compile it. Because it has dependencies, you must install some additional libraries and programs beforehand. Follow the instructions in the INSTALL file.

Installation

Once you have downloaded the pre-compiled binary package, decompress the file. The executable is then ready to use, with no installation step.

If you want to compile the source code, follow the instructions in the INSTALL file.

How to use

The executable ixa-pipe-pos-eu.sh runs the ixa-pipe-pos-eu tool. It takes no arguments.

The tool reads its input from standard input, which must be UTF-8 encoded plain text. You can therefore obtain the lemmas, PoS tags and morphological information for a plain text file with the following command:
> cat test.txt | sh ixa-pipe-pos-eu/ixa-pipe-pos-eu.sh
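
The same pipeline can also be driven from a script. The following is a minimal sketch in Python, assuming the package was unpacked into a directory named ixa-pipe-pos-eu under the current working directory, as in the command above. The function name run_tagger and its command parameter are illustrative and not part of the tool.

```python
import subprocess

# Assumed location of the launcher script, taken from the command above.
DEFAULT_COMMAND = ("sh", "ixa-pipe-pos-eu/ixa-pipe-pos-eu.sh")


def run_tagger(text, command=DEFAULT_COMMAND):
    """Send UTF-8 plain text to the tagger on stdin and return its stdout as a string."""
    result = subprocess.run(
        list(command),
        input=text.encode("utf-8"),  # the tool expects UTF-8 on standard input
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")


# Example call (requires the unpacked package):
# naf = run_tagger("Donostiako Zinemaldiko sail ofizialean lehiatuko da Handia filma.")
```

Passing the command as a parameter keeps the sketch usable if the package lives elsewhere: supply the actual path to ixa-pipe-pos-eu.sh instead of the default.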

The output is written to standard output in NAF format, encoded in UTF-8. In the NAF document, the linguistic information is marked up in the text and terms elements, as shown in the example below (the input sentence is "Donostiako Zinemaldiko sail ofizialean lehiatuko da Handia filma."):
<text>
   <wf id="w1" offset="0" length="10" sent="1" para="1">Donostiako</wf>
   <wf id="w2" offset="11" length="11" sent="1" para="1">Zinemaldiko</wf>
   <wf id="w3" offset="23" length="4" sent="1" para="1">sail</wf>
   <wf id="w4" offset="28" length="10" sent="1" para="1">ofizialean</wf>
   <wf id="w5" offset="39" length="9" sent="1" para="1">lehiatuko</wf>
   <wf id="w6" offset="49" length="2" sent="1" para="1">da</wf>
   <wf id="w7" offset="52" length="6" sent="1" para="1">Handia</wf>
   <wf id="w8" offset="59" length="5" sent="1" para="1">filma</wf>
   <wf id="w9" offset="64" length="1" sent="1" para="1">.</wf>
</text>
<terms>
   <!-- Donostiako -->
   <term id="t1" lemma="Donostia" morphofeat="NL0LS000" pos="R" case="IZE LIB PLU- GEL NUMS MUGM ZERO HAS_MAI @<IZLG @IZLG>">
     <span>
       <target id="w1"/>
     </span>
   </term>
   <!-- Zinemaldiko -->
   <term id="t2" lemma="zinemaldi" morphofeat="NC0LS000" pos="N" case="IZE ARR GEL NUMS MUGM ZERO HAS_MAI @<IZLG @IZLG>">
     <span>
       <target id="w2"/>
     </span>
   </term>
   <!-- sail -->
   <term id="t3" lemma="sail" morphofeat="NC000000" pos="N" case="IZE ARR BIZ- ZERO @KM>">
     <span>
       <target id="w3"/>
     </span>
   </term>
  ...
</terms>
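
To work with this output programmatically, each term element can be joined back to its word form through the target ids. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes that the complete output is a well-formed XML document with a single root element (written here as NAF) and that any < character inside an attribute value is escaped as &lt; in the actual file. The names read_terms and SAMPLE are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example above, wrapped in an assumed root element.
SAMPLE = """<NAF>
<text>
   <wf id="w1" offset="0" length="10" sent="1" para="1">Donostiako</wf>
   <wf id="w2" offset="11" length="11" sent="1" para="1">Zinemaldiko</wf>
</text>
<terms>
   <term id="t1" lemma="Donostia" morphofeat="NL0LS000" pos="R" case="IZE LIB PLU- GEL NUMS MUGM ZERO HAS_MAI @&lt;IZLG @IZLG>">
     <span><target id="w1"/></span>
   </term>
   <term id="t2" lemma="zinemaldi" morphofeat="NC0LS000" pos="N">
     <span><target id="w2"/></span>
   </term>
</terms>
</NAF>"""


def read_terms(naf_xml):
    """Return (word form, lemma, pos, morphofeat) tuples from a NAF string."""
    root = ET.fromstring(naf_xml)
    # Map each word-form id (e.g. "w1") to its surface text.
    words = {wf.get("id"): wf.text for wf in root.iter("wf")}
    rows = []
    for term in root.iter("term"):
        # A term may point at several word forms; join them with spaces.
        ids = [target.get("id") for target in term.iter("target")]
        surface = " ".join(words.get(i, "") for i in ids)
        rows.append((surface, term.get("lemma"), term.get("pos"), term.get("morphofeat")))
    return rows


for row in read_terms(SAMPLE):
    print("\t".join(row))
```

Searching with iter() rather than fixed paths keeps the sketch independent of the exact nesting of the text and terms elements in the full document.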

Contact

Arantxa Otegi, arantza.otegi@ehu.eus
Nerea Ezeiza, n.ezeiza@ehu.eus