Semeval 2007
4th International Workshop on Semantic Evaluations

Task #1: Evaluating WSD on Cross-Language Information Retrieval
A Semeval task run in collaboration with the Cross-Language Evaluation Forum (CLEF)

Release of features in Semcor 1.6
January 23, 2007

[Additional data files]

We provide some of the widely used WSD features in a word-to-word fashion (Agirre et al. 2006) in order to make participation easier. These features will be available for both topics and documents (test data), as well as for all the words with frequency above 10 in Semcor 1.6 (which can be taken as the training data for supervised WSD systems). This document relates to the latter. The document and topic features are released separately.

[Features file format]

Each file contains the extracted features for each occurrence of the target word in the document collection (or topic collection). Each occurrence starts with a header, followed by its feature instances (feature_type value), one per line. The header includes the word, PoS and unique identifier of the occurrence, plus a sense identifier.

This is a sample set of lines for a single occurrence. "..." means some lines have been skipped:

    NounModifier halfway_NN
    big_lem_func_+1 the leader
    big_lem_func_-1 leader at
    big_pos_+1 DT NNS
    big_pos_-1 NNS IN
    big_wf_func_+1 the leaders
    ...
    pedersen_bigr up with
    pedersen_bigr won with
    post_J_lem two-under-par
    post_J_wf two-under-par
    post_N_lem halfway
    ...
    prev_R_wf too
    prev_V_lem be
    ...
    trig_lem_func_+1 with the leader
    trig_pos_0 DT NNS IN
    trig_wf_func_+1 with the leaders
    ...
    unigram leaders
    ...
    win_cont_lem_4w be
    win_cont_lem_4w cut
    win_cont_lem_4w halfway
    ...
    win_cont_lem_context Dubai
    win_cont_lem_context Dunbar
    win_cont_lem_context Gary
    win_cont_lem_context Oldcorn
    ...

[Header]

The first line of each occurrence is a header (e.g. ) which contains the information of the occurrence (the target word, document collection, and the offset in the document collection), plus an optional sense identifier. The sense identifier is present in training data, but not in test data (where it is void).

Code pattern:

for example:

DocId identifies the document, in the form Corpus-subdirectory-file-ID, e.g.

    GH95-1-2-25

(see 00-README.txt in the trial data for more information)

[Feature types]

We grouped feature types in three sets:

- Local collocations: Bigrams and trigrams formed with the words around the target. These features are built from lemmas, word forms, or PoS tags (PoS tagging was performed with the fnTBL toolkit (Ngai & Florian, 2001)). Other local features are those formed with the previous/posterior lemma/word form in the context for each main PoS. E.g. the feature "prev_V_lem stand" would indicate that the target word is preceded by the verb stand.

- Syntactic dependencies: Syntactic dependencies were extracted using heuristic patterns and regular expressions defined over the PoS tags around the target (this software was kindly provided by David Yarowsky's group at Johns Hopkins University). The following relations were used: object, subject, noun-modifier, preposition, and sibling. E.g. "list OBJ petition".

- Bag-of-word features: We extract the lemmas of the content words in the whole context, and in a (+-)4-word window around the target. We also obtain salient bigrams in the context, with the methods and the software described in (Pedersen, 2001). E.g. the feature "context_bigr visionary eyes" would express that "visionary eyes" has been found to be relevant for the target word, and has been seen in the given context.
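
As an illustration of the format described above, here is a minimal Python sketch that reads the feature lines of one occurrence and sorts them into the three feature sets. It is not part of the released software. It assumes that the feature type is the first whitespace-separated token of a line and that the rest of the line is the value, that "..." lines only appear in samples, and that corpus names in a DocId are followed by exactly three hyphen-separated fields. All function and class names are hypothetical. Header parsing is left out because the header pattern is not shown in this document.

```python
from collections import defaultdict
from typing import NamedTuple


class DocId(NamedTuple):
    """Fields of a DocId such as GH95-1-2-25 (hypothetical container)."""
    corpus: str
    subdirectory: str
    file: str
    ident: str


def parse_doc_id(doc_id: str) -> DocId:
    # Assumption: the last three hyphens separate subdirectory, file and ID.
    corpus, subdirectory, file, ident = doc_id.rsplit("-", 3)
    return DocId(corpus, subdirectory, file, ident)


# Feature type codes as listed in this document.
SYNTACTIC_TYPES = {
    "DominatingNoun", "NounModifier", "Object", "ObjectTo",
    "ObjectToPreposition", "Preposition", "Sibling", "SubjectTo",
}
LOCAL_PREFIXES = ("unigram", "big_", "trig_", "prev_", "post_")
BAG_OF_WORDS_PREFIXES = ("win_cont_lem_", "pedersen_bigr")


def feature_group(feature_type: str) -> str:
    """Map a feature type code to one of the three feature sets."""
    if feature_type in SYNTACTIC_TYPES:
        return "syntactic"
    if feature_type.startswith(LOCAL_PREFIXES):
        return "local"
    if feature_type.startswith(BAG_OF_WORDS_PREFIXES):
        return "bag_of_words"
    return "unknown"


def parse_feature_line(line: str) -> tuple:
    """Split 'feature_type value' on the first run of whitespace."""
    parts = line.strip().split(None, 1)
    value = parts[1] if len(parts) > 1 else ""
    return parts[0], value


def group_features(lines) -> dict:
    """Group the feature lines of one occurrence (header excluded)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line == "...":  # skip blanks and sample ellipses
            continue
        feature_type, value = parse_feature_line(line)
        groups[feature_group(feature_type)].append((feature_type, value))
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "NounModifier halfway_NN",
        "big_pos_+1 DT NNS",
        "...",
        "win_cont_lem_4w cut",
        "pedersen_bigr won with",
    ]
    for group, features in group_features(sample).items():
        print(group, features)
    print(parse_doc_id("GH95-1-2-25"))
```

Keeping the value as a single string (rather than splitting it into tokens) preserves multi-word values such as bigrams and trigrams exactly as they appear in the file.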

[Feature type codes]

- Local collocations
  - Unigram (word form)
      unigram
  - Bigrams (lemmas, word forms, PoS tags)
      big_lem_cont_+1 (content words)
      big_lem_cont_-1
      big_lem_func_+1 (function words)
      big_lem_func_-1
      big_wf_cont_+1
      big_wf_cont_-1
      big_wf_func_+1
      big_wf_func_-1
      big_pos_+1
      big_pos_-1
  - Trigrams (lemmas, word forms, PoS tags)
      trig_lem_cont_+1
      trig_lem_cont_-1
      trig_lem_cont_0
      trig_lem_func_+1
      trig_lem_func_-1
      trig_lem_func_0
      trig_wf_cont_+1
      trig_wf_cont_-1
      trig_wf_cont_0
      trig_wf_func_+1
      trig_wf_func_-1
      trig_wf_func_0
      trig_pos_+1
      trig_pos_-1
      trig_pos_0
  - Previous/posterior lemma/word form (J=adjective; N=noun; R=adverb; V=verb)
      post_J_lem
      post_J_wf
      post_N_lem
      post_N_wf
      post_R_lem
      post_R_wf
      post_V_lem
      post_V_wf
      prev_J_lem
      prev_J_wf
      prev_N_lem
      prev_N_wf
      prev_R_lem
      prev_R_wf
      prev_V_lem
      prev_V_wf

- Syntactic dependencies
      DominatingNoun
      NounModifier
      Object
      ObjectTo
      ObjectToPreposition
      Preposition
      Sibling
      SubjectTo

- Bag-of-words features
      win_cont_lem_4w
      win_cont_lem_context
      pedersen_bigr

[Authors]

Eneko Agirre (e.agirre@ehu.es)
Oier Lopez de Lacalle (jibloleo@si.ehu.es)
David Martínez (davidm@csse.unimelb.edu.au)

[Acknowledgements]

Thanks to Florian et al. for the use of their syntactic dependency extractor.

[References]

Agirre, E., O. Lopez de Lacalle Lekuona, and D. Martinez. 2006. Exploring feature set combinations for WSD. In Proceedings of the annual meeting of the SEPLN, Spain.

Florian, Radu, Silviu Cucerzan, Charles Schafer, and David Yarowsky. 2002. Combining classifiers for word sense disambiguation. Natural Language Engineering, 8(4):327-341.

Pedersen, T. 2001. A Decision Tree of Bigrams is an Accurate Predictor of Word Sense. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01), Pittsburgh, PA.