| Id | Segment | Tagger | CU |
| 1 | Automatic terminology extraction and its application to Basque | A1 | |
| 2 | 1. Introduction | A1 | |
| 3 | In recent years, work has begun on developing instruments in several languages for automatic terminology extraction from technical texts, | A1 | |
| 4 | though human intervention is still required to make the final selection from the terms automatically chosen. | A1 | |
| 5 | As an example we can cite the following instruments: LEXTER (Bourigault, 92), Termight by AT&T (Church & Dagan, 94), TERMS by IBM (Justeson & Katz, 95) and NPtool (Arppe, 95). | A1 | |
| 6 | Their areas of application can be divided into two main groups: information indexing and the compilation of terminological glossaries. | A1 | |
| 7 | In areas where terminology is developing dynamically, such as computer science, it is almost impossible to carry out effective terminological work without an instrument of this type. | A1 | |
| 8 | If a similar instrument is to be developed for Basque | A1 | |
| 9 | we shall come up against more major drawbacks, | A1 | |
| 10 | because the unifying process of the language has not been completed, | A1 | |
| 11 | 2. Terminology extraction | A1 | |
| 12 | It is a hard task to obtain a formal, complete definition of a term, | A1 | |
| 13 | but that is precisely what a major part of this work consists of: defining the characteristics of terms. | A1 | |
| 14 | To obtain technical terms from the corpus, a combination of NLP techniques (based on linguistic knowledge) and statistical techniques is usually used. | A1 | |
| 15 | lemmatisation will be necessary. | A1 | |
| 16 | 2.1. Linguistic Techniques | A1 | |
| 17 | Linguistic techniques are used basically to make the initial selection of terms. | A1 | |
| 18 | Morpho-syntactic models are usually used, | A1 | |
| 19 | so it is advisable to have the text already analysed or at least labelled. | A1 | |
| 20 | The results are conditioned heavily by the quality of the linguistic tool used. | A1 | |
| 21 | In any event, in some projects neither morphological nor syntactic analysis is carried out (Su et al., 96). | A1 | |
| 22 | Lemmatisation is linked to morphological analysis and the removal of ambiguities. | A1 | |
| 23 | In complex inflected languages poor results will ensue if only the formal aspect of words is dealt with: | A1 | |
| 24 | Linguistic knowledge is also of prime importance in the standardisation of terminology: | A1 | |
| 25 | 2.2. Statistical Techniques | A1 | |
| 26 | because some of them may form part of longer units. | A1 | |
| 27 | In most projects statistical methods have been used to reduce the set of candidate terms that match the linguistic model. | A1 | |
| 28 | The methods applied vary widely from project to project, | A1 | |
| 29 | so the simplest idea is to require a minimum absolute frequency (Justeson & Katz, 95), | A1 | |
| 30 | though several probabilistic formulae are generally combined. | A1 | |
| 31 | 2.3. Results | A1 | |
| 32 | The results obtained are not yet those required for fully automatic extraction. | A1 | |
| 33 | A balance must be found between recall and precision. | A1 | |
| 34 | In this balance preference is given to recall, | A1 | |
| 35 | provided there is a person who can carry out the terminology reduction. | A1 | |
| 36 | To obtain a recall of 95%, precision is usually reduced to 50%, | A1 | |
| 37 | 3. Application to Basque | A1 | |
| 38 | The IXA Group intends to develop a tool of this type for Basque. | A1 | |
| 39 | and for a precision of 85%, recall is not reduced even to 35%. | A1 | |
| 40 | The morphological analyser is already being prepared (Alegria et al., 96), | A1 | |
| 41 | the lemmatizer/labeller is almost completed (Aduriz et al., 96) | A1 | |
| 42 | and work has been done on surface-level syntax. | A1 | |
| 43 | While these tools are being prepared, | A1 | |
| 44 | we must work on the modelling of technical terms, | A1 | |
| 45 | i.e. we must deduce their characteristics. | A1 | |
| 46 | To that end, basing work on existing technical dictionaries and using statistical techniques, principal models must be obtained. | A1 | |
| 47 | We do not yet have any results, | A1 | |
| 48 | but we believe that the model will be wider than the noun phrase. | A1 | |
| 49 | In the choice of technical terms, the case of internal declension may prove decisive. | A1 | |
| 50 | research carried out is limited | A1 | |
| 51 | and Basque is an agglutinative language. | A1 | |
| 52 | a discrimination between terms must be made, | A1 | |
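
The minimum absolute frequency filter mentioned in segment 29 and the balance between recall and precision described in segments 33 to 36 can be illustrated with a small sketch. This is only a toy example under stated assumptions, not the method of any of the cited projects: the function names, the threshold values and the candidate and reference lists are all hypothetical.

```python
from collections import Counter


def frequent_candidates(candidates, min_freq=2):
    """Keep candidate terms whose absolute frequency is at least min_freq.

    `candidates` is assumed to be a list of (lemmatised) candidate terms
    already selected by some linguistic pattern; that step is not shown.
    """
    counts = Counter(candidates)
    return {term for term, n in counts.items() if n >= min_freq}


def precision_recall(extracted, reference):
    """Return (precision, recall) of extracted terms against a reference set."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical candidate occurrences and a hand-made reference glossary.
candidates = [
    "term extraction", "term extraction",
    "noun phrase", "noun phrase",
    "final selection",
    "hard task",
]
reference = {"term extraction", "noun phrase", "final selection"}

for threshold in (1, 2):
    kept = frequent_candidates(candidates, min_freq=threshold)
    p, r = precision_recall(kept, reference)
    print(f"min_freq={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold removes the non-term "hard task" but also loses the valid term "final selection", which is the kind of trade-off between precision and recall that a human reviewer has to resolve.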