ZT corpusa

Morphosyntactically-tagged Science and Technology corpus.

Corpus of science and technology texts (morphosyntactically tagged).

Description (en):

The ZT Corpus (Basque Corpus of Science and Technology) is a tagged collection of specialised texts in Basque, intended as a major resource for research and development on written technical Basque: terminology, syntax and style. It has two parts: a balanced 1.6 million-word part whose annotation has been revised by hand, and an automatically tagged 6 million-word part. New tools were built to support the construction of the ZTC: tools for corpus compilation and corpus annotation, and a dedicated interface for advanced queries. The corpus was released in December 2006 and was developed by Elhuyar and the Ixa NLP Research Group (online consultation: http://www.ZTcorpusa.eus). The ZT Corpus stands out among Basque corpora for several reasons: it is the first specialised corpus in Basque; it was designed to be a methodological and functional reference for future projects (i.e. a national corpus for Basque); it is the first corpus in Basque annotated in a TEI-P4 compliant XML format; it is the first written corpus in Basque to be distributed by ELDA; and it has a user-friendly, sophisticated query interface. The corpus carries two kinds of annotation: a structural annotation and a stand-off linguistic annotation.

Description (translated from Basque):

The Science and Technology Corpus, or ZT corpus, is a structured and tagged collection of Basque texts in the field of science and technology. Its main aim is to serve as a resource for studying how Basque is used in those fields.

It is a specialised corpus, built jointly by the IXA group of the UPV/EHU and the Elhuyar Foundation.

As for its composition, the corpus draws on works in the field of science and technology published between 1990 and 2002. The corpus is classified by domain (field of knowledge) and by genre (text type).

The corpus is tagged both for the structure and format of the text and linguistically. The linguistic tagging was done with advanced technology for the automatic processing of Basque (the IXA group's Eustagger tagger). The lemma and the category/subcategory of every word in the text have been tagged. This first version of the corpus contains 8.5 million words, of which 1.9 million have been manually reviewed, disambiguated and corrected.
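
The description above refers to a stand-off linguistic layer that records the lemma and category/subcategory of each word separately from the structurally annotated text. The following is a minimal Python sketch of that general idea only. The token ids, field names and category labels are illustrative assumptions; they do not reproduce the corpus's actual TEI-P4 encoding or the Eustagger tag set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A word in the base text, identified by a stable id (hypothetical scheme)."""
    id: str
    form: str


@dataclass(frozen=True)
class Annotation:
    """A linguistic annotation kept apart from the text, pointing at a token id."""
    target: str        # id of the token this annotation describes
    lemma: str
    category: str      # illustrative label, not the real tag set
    subcategory: str   # illustrative label, not the real tag set


def resolve(tokens, annotations):
    """Join the stand-off layer back onto the text by token id."""
    by_target = {a.target: a for a in annotations}
    # Tokens without an annotation are paired with None.
    return [(t.form, by_target.get(t.id)) for t in tokens]


# Base text layer: only forms and ids.
tokens = [Token("w1", "zientziaren"), Token("w2", "corpusa")]

# Stand-off layer: refers to the tokens by id instead of wrapping them.
annotations = [
    Annotation("w1", "zientzia", "NOUN", "common"),
    Annotation("w2", "corpus", "NOUN", "common"),
]

merged = resolve(tokens, annotations)
for form, ann in merged:
    print(form, ann.lemma, ann.category, ann.subcategory)
```

Keeping the linguistic layer separate in this way means it can be revised or replaced, for example after manual disambiguation, without touching the text and its structural markup.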

Online

Link for online access or download:

http://www.ztcorpusa.eus/cgi-bin/kontsulta.py?mota=arrunta

Type:

Corpora

Contact person:

Xabier Artola

Contact person email:

xabier.artola@ehu.eus

Research group:

IXA-UPV/EHU

Basque

Grammars and language models

EDGK

Rule-based Dependency Grammar for Basque

BERTeus

BERT language model for Basque

Tools and services

Averell

Averell is a Python library and command line interface to download and to standardize corpora from ten multi-lingual poetry repositories

Jollyjumper

Jollyjumper is our enjambment detection Python library for Spanish

Rantanplan

Rantanplan is a Python library for the automated scansion of Spanish poetry

PoetryLab app

PoetryLab: An Open Source Toolkit for the Analysis of Spanish Poetry Corpora

PDMapping

Tool for documenting and analyzing speakers' judgments about spatial and sociocultural linguistic variation.

Ferramenta On-Line de ExpeRimentación PerceptivA (FOLErPa)

FOLERPA is an online tool for carrying out perceptual experiments.

Cartografía dos apelidos de Galicia

Research tool for the study of the geographical distribution of surnames in Galicia.

Vocabulary analyzer Web Service

This web service calculates different lexicometric measures and displays them graphically (tokens, types, hapaxes & type/token ratio).

Pedersen's Ngram Statistics

Pedersen's Ngram Statistics Package

UPF Freeling-based part-of-speech tagger

This is the UPF Freeling-based part-of-speech tagger.

Dependency relation analysis

This web service performs dependency parsing using Bohnet's graph-based parser. The input is plain text or CoNLL format. The supported languages are English and Spanish.

Freeling Named Entity Recognition - NER

Freeling-based Named Entity Recognition - NER

WSD-IXA

Word-Sense Disambiguation

Ixa pipes

Multilingual NLP tools

ixaKat

A modular chain of Natural Language Processing tools for Basque

Maltixa

Statistical Syntactic analyzer for Basque

Eustagger

Morphosyntactic tagger for Basque

Xuxen

Spelling and grammar checker for Basque

BASYQUE

A web application to analyse syntactic variation of Basque dialects

Analhitza

Category analyzer