IXA pipes is a modular set of Natural Language Processing tools (or pipes) that provide easy access to NLP technology for several languages. It offers robust and efficient linguistic annotation to both researchers and non-NLP experts, with the aim of lowering the barriers to using NLP technology, whether for research purposes or for small industrial developers and SMEs. The ixa pipes can be used as they are, or their modularity can be exploited to pick and change individual components. The tools are developed by the IXA NLP Group of the University of the Basque Country.


If you use the ixa pipes tools or the models, please cite this paper:

Rodrigo Agerri, Josu Bermudez and German Rigau (2014): "IXA pipeline: Efficient and Ready to Use Multilingual NLP tools", in: Proceedings of the 9th Language Resources and Evaluation Conference (LREC2014), 26-31 May, 2014, Reykjavik, Iceland.

ixa-pipe-tok: Tokenizer and Segmenter for several languages.

ixa-pipe-pos: Statistical POS tagger and lemmatizer for Basque, Dutch, English, French, Galician, German, Italian and Spanish.

ixa-pipe-nerc: Named Entity Recognition tagger for Basque, Spanish, English, German, Dutch and Italian; Opinion Target Extraction (OTE) for English.

ixa-pipe-chunk: Probabilistic chunker for Basque and English.

ixa-pipe-parse: Probabilistic constituent parser for Spanish and English.

Every ixa pipe can be up and running after two simple steps. The tools require Java 1.7+ to run and are designed to come with batteries included: no system configuration is required and no third-party dependencies need to be installed. The modules will run on any platform as long as a JVM 1.7+ is available.

IXA pipes are just a set of processes chained by their standard streams, so that the output of each process feeds directly into the next one as input. The Unix pipes metaphor has been applied to NLP tools by adopting a very simple and well-known data-centric architecture, in which every module/pipe is interchangeable with any other tool as long as it reads and writes the required data format via the standard streams.
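The chaining idea can be illustrated with a minimal Python sketch. It is not part of the ixa pipes distribution: the function name run_pipeline and the two stand-in stages are hypothetical, and the stages are tiny Python one-liners rather than real ixa pipes modules. Unlike a shell pipe, this sketch runs the stages one after another instead of concurrently, but the data flow is the same: each stage's standard output becomes the next stage's standard input.

```python
import subprocess
import sys


def run_pipeline(stages, text):
    """Run each command in turn, feeding its stdout to the next one's stdin."""
    data = text.encode("utf-8")
    for cmd in stages:
        # check=True raises CalledProcessError if a stage exits with an error.
        result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")


# Hypothetical stand-in stages: any program that reads stdin and writes
# stdout could take their place, which is what makes stages interchangeable.
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
exclaim = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().strip() + '!')"]

if __name__ == "__main__":
    print(run_pipeline([upper, exclaim], "hello pipes\n"))
```

Swapping a stage only requires replacing one entry in the list, provided the replacement reads and writes the same format.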

Both the input and the output of the modules must be formatted in NAF, the data format used to represent and pipe linguistic annotations. Our Java modules all use the kaflib library for easy NAF integration.
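As a rough illustration of how a tool outside the Java modules might inspect such a document, the following Python sketch counts the entries in each top-level annotation layer of an XML document. It assumes NAF is serialized as XML with one child element per layer; the sample fragment, its element and attribute names, and the function name layer_summary are illustrative assumptions rather than the actual NAF schema, and the sketch does not use kaflib.

```python
import xml.etree.ElementTree as ET


def layer_summary(xml_text):
    """Map each top-level element (assumed to be a layer) to its child count."""
    root = ET.fromstring(xml_text)
    return {layer.tag: len(layer) for layer in root}


# Illustrative fragment only: element and attribute names are assumptions,
# not a definitive rendering of the NAF format.
SAMPLE = """<NAF version="v3">
  <text>
    <wf id="w1" sent="1">Hello</wf>
    <wf id="w2" sent="1">world</wf>
  </text>
  <terms>
    <term id="t1" lemma="hello"/>
  </terms>
</NAF>"""

if __name__ == "__main__":
    print(layer_summary(SAMPLE))
```

In a real pipeline the document would arrive on standard input instead of coming from a constant, and a module would add its own layer before writing the result to standard output.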