Automatic terminology extraction and its application to Basque

1. Introduction

In recent years work has begun in several languages on tools for automatic terminology extraction from technical texts, although human intervention is still required to make the final selection from the automatically proposed terms. Examples of such tools include LEXTER (Bourigault, 92), Termight at AT&T (Dagan & Church, 94), TERMS at IBM (Justeson & Katz, 95) and NPtool (Arppe, 95). Their areas of application fall into two main groups: information indexing and the compilation of terminological glossaries. In areas where terminology develops dynamically, such as computer science, it is almost impossible to carry out effective terminological work without a tool of this type.

If a similar tool is to be developed for Basque, the obstacles will be greater: the unification process of the language is not yet complete, the research carried out so far is limited, and Basque is an agglutinative language.

2. Terminology extraction

Obtaining a formal, complete definition of a term is a hard task, but that is precisely what a major part of this work consists of: defining the characteristics of terms. To obtain technical terms from a corpus, a combination of NLP techniques (based on linguistic knowledge) and statistical techniques is usually employed.

2.1. Linguistic Techniques

Linguistic techniques are used mainly for the initial selection of candidate terms. Morpho-syntactic models are usually applied, so it is advisable for the text to be already analysed, or at least tagged; the results depend heavily on the quality of the linguistic tools used. Nevertheless, in some projects neither morphological nor syntactic analysis is carried out (Su et al., 96).

Lemmatisation is linked to morphological analysis and disambiguation. In highly inflected languages, dealing only with the surface form of words gives poor results: lemmatisation is necessary. Linguistic knowledge is also of prime importance in the standardisation of terminology: candidate terms must be discriminated, because some of them may form part of longer units.

2.2. Statistical Techniques

In most projects statistical methods are used to filter the candidate terms that match the linguistic model. The methods applied vary widely from project to project: the simplest is to require a minimum absolute frequency (Justeson & Katz, 95), though several probabilistic formulae are generally combined.

2.3. Results

The results obtained do not yet permit fully automatic extraction, so a balance must be struck between recall and precision. In this balance preference is given to recall, provided a person is available to prune the candidate list afterwards. To obtain a recall of 95%, precision usually falls to around 50%; and for a precision of 85%, recall does not even reach 35%.
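By way of illustration, the following minimal Python sketch shows the kind of morpho-syntactic filtering described in section 2.1, applied to already tagged text. The ADJ* NOUN+ pattern, the tagset and the input format are simplifying assumptions made for this example; the systems cited above use considerably richer morpho-syntactic models.

```python
import re

def candidate_terms(tagged):
    """Extract maximal ADJ* NOUN+ sequences from (form, lemma, tag) triples.

    The ADJ* NOUN+ pattern is an illustrative assumption, not the model
    used by any of the systems cited in the text.
    """
    # One character per token: A = adjective, N = noun, x = anything else.
    key = "".join({"ADJ": "A", "NOUN": "N"}.get(tag, "x")
                  for _, _, tag in tagged)
    # Each regex match over the tag string delimits one candidate term.
    return [tagged[m.start():m.end()] for m in re.finditer(r"A*N+", key)]

# Toy tagged sentence (English for readability).
sentence = [("relational", "relational", "ADJ"),
            ("databases", "database", "NOUN"),
            ("store", "store", "VERB"),
            ("data", "data", "NOUN")]
for cand in candidate_terms(sentence):
    print(" ".join(lemma for _, lemma, _ in cand))
# relational database
# data
```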
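The statistical stage of section 2.2 can be sketched in the same spirit, using the simplest criterion mentioned above: a minimum absolute frequency over the candidate list, in the style of Justeson & Katz (95). The threshold value is an arbitrary choice for the example; real systems generally combine several probabilistic formulae rather than a bare cut-off.

```python
from collections import Counter

def frequency_filter(candidates, min_freq=3):
    """Keep candidate terms whose absolute corpus frequency reaches min_freq.

    candidates: iterable of lemma-sequence tuples, e.g. ("relational",
    "database"). min_freq = 3 is an example threshold, not a recommendation.
    """
    counts = Counter(candidates)
    return {term: freq for term, freq in counts.items() if freq >= min_freq}

# Example: the pair occurring three times survives, the hapax does not.
stream = [("relational", "database")] * 3 + [("spurious", "candidate")]
print(frequency_filter(stream))
# {('relational', 'database'): 3}
```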
3. Application to Basque

The IXA Group intends to develop a tool of this type for Basque. The morphological analyser has already been developed (Alegria et al., 96), the lemmatiser/tagger is almost complete (Aduriz et al., 96), and work has been done on surface syntax. While these tools are being finished we must work on the modelling of technical terms, i.e. we must determine their characteristics. To that end, the principal models must be derived from existing technical dictionaries by means of statistical techniques.

We do not yet have any results, but we believe that the model will be wider than the noun phrase. In the choice of technical terms, the case of internal declension may prove decisive.
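A small, hypothetical example may make concrete why lemmatisation (section 2.1) is decisive for an agglutinative language such as Basque: a single term surfaces under many case endings, so counting surface forms scatters its frequency, while counting lemmas recovers it. The analysed forms below are invented for the illustration and are not output of the IXA tools.

```python
from collections import Counter

# (surface form, lemma) pairs as a lemmatiser might return them; the
# term "datu-base" (database) appears under three different case endings.
analysed = [("datu-basea", "datu-base"),    # absolutive singular
            ("datu-basean", "datu-base"),   # inessive singular
            ("datu-baseetan", "datu-base")] # inessive plural

print(Counter(form for form, _ in analysed))
# Counter({'datu-basea': 1, 'datu-basean': 1, 'datu-baseetan': 1})
print(Counter(lemma for _, lemma in analysed))
# Counter({'datu-base': 3})  -- one term, frequency 3
```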