Semeval 2007 4th International Workshop on Semantic Evaluations

Task #1: Evaluating WSD on Cross-Language Information Retrieval

A Semeval task run in collaboration
with the Cross-Language Evaluation Forum (CLEF)

Release of features in Semcor 1.6
January 23, 2007


[Additional data files]

We provide some widely used WSD features in a word-to-word
fashion (Agirre et al. 2006) in order to make participation
easier. These features will be available for both topics and documents
(test data), as well as for all the words with frequency above 10 in
Semcor 1.6 (which can be taken as the training data for supervised WSD
systems). This document covers the latter. The document and topic
features are released separately.


[Features file format]

Each file contains the extracted features for each occurrence of the
target word in the document collection (or topic collection). Each
occurrence starts with a header line, followed by its feature
instances (feature_type value), one per line. The header gives the
word, PoS and unique identifier of the occurrence, plus a sense
identifier.

This is a sample set of lines for a single occurrence. "..." means some
lines have been skipped:

<leader.n.1#1>
NounModifier halfway_NN
big_lem_func_+1 the leader
big_lem_func_-1 leader at
big_pos_+1 DT NNS
big_pos_-1 NNS IN
big_wf_func_+1 the leaders
...
pedersen_bigr up with
pedersen_bigr won with
post_J_lem two-under-par
post_J_wf two-under-par
post_N_lem halfway
...
prev_R_wf too
prev_V_lem be
...
trig_lem_func_+1 with the leader
trig_pos_0 DT NNS IN
trig_wf_func_+1 with the leaders
...
unigram leaders
...
win_cont_lem_4w be
win_cont_lem_4w cut
win_cont_lem_4w halfway
...
win_cont_lem_context Dubai
win_cont_lem_context Dunbar
win_cont_lem_context Gary
win_cont_lem_context Oldcorn
...
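
As an illustration of the format above, here is a minimal Python sketch
of a reader that groups feature lines under the header that precedes
them. The function name read_occurrences is hypothetical, and the
sketch assumes that only header lines both start with "<" and end with
">", and that the feature type is the first whitespace-separated token
on a line. It is not part of the released software.

```python
from typing import Iterable, List, Tuple

# One occurrence: the header text (without "<" and ">") and its
# (feature_type, value) pairs in file order.
Occurrence = Tuple[str, List[Tuple[str, str]]]


def read_occurrences(lines: Iterable[str]) -> List[Occurrence]:
    """Group feature lines under the header line that precedes them."""
    occurrences: List[Occurrence] = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":
            # Blank lines and the "..." elision marker used in this
            # README carry no feature.
            continue
        if line.startswith("<") and line.endswith(">"):
            occurrences.append((line[1:-1], []))
            continue
        if not occurrences:
            raise ValueError("feature line before any header: %r" % line)
        # The value is the rest of the line and may contain spaces,
        # e.g. "big_pos_+1 DT NNS".
        parts = line.split(None, 1)
        feature_type = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        occurrences[-1][1].append((feature_type, value))
    return occurrences


sample = """\
<leader.n.1#1>
NounModifier halfway_NN
big_lem_func_+1 the leader
big_pos_+1 DT NNS
...
pedersen_bigr up with
unigram leaders
win_cont_lem_4w be
win_cont_lem_4w cut
"""

occurrences = read_occurrences(sample.splitlines())
header, features = occurrences[0]
print(header)       # leader.n.1#1
print(features[1])  # ('big_lem_func_+1', 'the leader')
```

Keeping the pairs in a list rather than a dictionary preserves repeated
feature types such as win_cont_lem_4w, which occur once per value.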


[Header]

The first line of each occurrence is a header
(e.g. <leader.n.GH95-1-2-25.1#1>) which identifies the occurrence
(the target word, its part of speech, the document identifier, and the
sentence number), plus an optional sense identifier. The sense
identifier is present in training data but not in test data, where the
field is left empty.

Code pattern:

  <target-word.Part-of-Speech.DocId.Sentence-number#sense-identifier>

for example:

  <leader.n.GH95-1-2-25.1#>

DocId contains the information of the identification of the document:

  Corpus-subdirectory-file-ID

  GH95-1-2-25

 (see 00-README.txt in the trial data for more information)
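
The header pattern can be split into its fields as in the following
Python sketch. The names Header and parse_header are hypothetical. The
sketch assumes the target word and PoS contain no "." and that the last
dot-separated field is the sentence number; a short header such as
<leader.n.1#1> is treated here as having no DocId, which is an
assumption rather than documented behaviour.

```python
from typing import NamedTuple, Optional


class Header(NamedTuple):
    word: str
    pos: str
    doc_id: Optional[str]   # None when the header carries no DocId
    sentence: str
    sense: Optional[str]    # None when the sense field is empty (test data)


def parse_header(line: str) -> Header:
    """Split '<word.pos.DocId.sentence#sense>' into its fields."""
    text = line.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("not a header line: %r" % line)
    # The sense identifier is whatever follows the last '#'.
    body, sep, sense = text[1:-1].rpartition("#")
    if not sep:
        raise ValueError("missing '#' in header: %r" % line)
    fields = body.split(".")
    if len(fields) < 3:
        raise ValueError("too few fields in header: %r" % line)
    word, pos, sentence = fields[0], fields[1], fields[-1]
    # Anything between the PoS and the sentence number is the DocId.
    doc_id = ".".join(fields[2:-1]) or None
    return Header(word, pos, doc_id, sentence, sense or None)


print(parse_header("<leader.n.GH95-1-2-25.1#>"))
```

For the example header above, this gives the word "leader", the PoS
"n", the DocId "GH95-1-2-25", the sentence number "1" and no sense.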


[Feature types]

We grouped feature types in three sets:

Local collocations:

  Bigrams and trigrams formed with the words around the target. These
  features are built from lemmas, word forms, or PoS tags (PoS
  tagging was performed with the fnTBL toolkit (Ngai & Florian 2001)).
  Other local features are formed with the previous/posterior
  lemma/word form in the context for each main PoS. E.g. the feature
  "prev_V_lem stand" would indicate that the target word is preceded
  by the verb stand.

Syntactic dependencies: 

  Syntactic dependencies were extracted using heuristic patterns, and
  regular expressions defined with the PoS tags around the target
  (this software was kindly provided by David Yarowsky's group, from
  the Johns Hopkins University). The following relations were used:
  object, subject, noun-modifier, preposition, and sibling.
  E.g. "list OBJ petition".

Bag-of-word features: 

  We extract the lemmas of the content words in the whole context, and
  in a +/-4-word window around the target. We also obtain salient
  bigrams in the context, with the methods and the software described
  in (Pedersen, 2001). E.g. the feature "pedersen_bigr visionary eyes"
  would express that "visionary eyes" has been found to be relevant
  for the target word, and has been seen in the given context.
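
To make the window-based bag-of-words feature concrete, the following
Python sketch collects the lemmas of content words within four tokens
of the target. It is only an approximation under stated assumptions:
the function name, the Penn-style tag prefixes used to decide what
counts as a content word, the exclusion of the target itself, and the
example sentence are all illustrative and are not taken from the
extraction software actually used.

```python
from typing import List, Tuple

WINDOW = 4
# Assumption: content words are nouns, verbs, adjectives and adverbs,
# recognised by the first two letters of a Penn-style PoS tag.
CONTENT_TAG_PREFIXES = {"NN", "VB", "JJ", "RB"}


def window_lemma_features(tokens: List[Tuple[str, str]],
                          target: int) -> List[Tuple[str, str]]:
    """Return (feature_type, lemma) pairs for content words near the target.

    tokens is a list of (lemma, pos_tag) pairs; target is the index of
    the target word, which is itself skipped.
    """
    start = max(0, target - WINDOW)
    end = min(len(tokens), target + WINDOW + 1)
    features = []
    for i in range(start, end):
        if i == target:
            continue
        lemma, tag = tokens[i]
        if tag[:2] in CONTENT_TAG_PREFIXES:
            features.append(("win_cont_lem_4w", lemma))
    return features


# Made-up context, not a sentence from the corpus.
tokens = [("the", "DT"), ("player", "NN"), ("be", "VBD"), ("too", "RB"),
          ("far", "RB"), ("behind", "IN"), ("the", "DT"), ("leader", "NNS"),
          ("at", "IN"), ("the", "DT"), ("cut", "NN")]
for feature_type, lemma in window_lemma_features(tokens, target=7):
    print(feature_type, lemma)
```

The whole-context variant would be the same loop without the window
bounds.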


[Feature type codes]

-Local collocations
   
  -Unigram (word form)
     unigram

  -Bigrams (lemmas, word forms, PoS tags):

     big_lem_cont_+1   (content words)
     big_lem_cont_-1
     big_lem_func_+1   (function words)
     big_lem_func_-1
     big_wf_cont_+1
     big_wf_cont_-1
     big_wf_func_+1
     big_wf_func_-1
     big_pos_+1
     big_pos_-1

 
  -Trigrams (lemmas, word forms, PoS tags)

     trig_lem_cont_+1
     trig_lem_cont_-1
     trig_lem_cont_0
     trig_lem_func_+1
     trig_lem_func_-1
     trig_lem_func_0	
     trig_wf_cont_+1
     trig_wf_cont_-1
     trig_wf_cont_0
     trig_wf_func_+1
     trig_wf_func_-1
     trig_wf_func_0
     trig_pos_+1
     trig_pos_-1
     trig_pos_0

  -Previous/posterior lemma/word form
   (J=adjective; N=noun; R=adverb; V=verb)
    
     post_J_lem    
     post_J_wf
     post_N_lem
     post_N_wf
     post_R_lem
     post_R_wf
     post_V_lem
     post_V_wf
     prev_J_lem
     prev_J_wf
     prev_N_lem
     prev_N_wf
     prev_R_lem
     prev_R_wf	
     prev_V_lem	
     prev_V_wf


-Syntactic dependencies
   
     DominatingNoun
     NounModifier
     Object		
     ObjectTo	
     ObjectToPreposition
     Preposition
     Sibling
     SubjectTo


-Bag-of-words features
  
     win_cont_lem_4w
     win_cont_lem_context
     pedersen_bigr
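
When experimenting with feature set combinations, it can be handy to
map each feature type code back to one of the three sets listed above.
A possible Python sketch follows; the function name feature_group and
the group labels are hypothetical, and the prefix test for local
collocations assumes that no other code starts with "big_", "trig_",
"prev_" or "post_".

```python
SYNTACTIC = {
    "DominatingNoun", "NounModifier", "Object", "ObjectTo",
    "ObjectToPreposition", "Preposition", "Sibling", "SubjectTo",
}
BAG_OF_WORDS = {"win_cont_lem_4w", "win_cont_lem_context", "pedersen_bigr"}
LOCAL_PREFIXES = ("big_", "trig_", "prev_", "post_")


def feature_group(code: str) -> str:
    """Return 'local', 'syntactic' or 'bag_of_words' for a feature type code."""
    if code in SYNTACTIC:
        return "syntactic"
    if code in BAG_OF_WORDS:
        return "bag_of_words"
    if code == "unigram" or code.startswith(LOCAL_PREFIXES):
        return "local"
    raise ValueError("unknown feature type code: %r" % code)


for code in ("trig_pos_0", "NounModifier", "pedersen_bigr"):
    print(code, feature_group(code))
```

Filtering the (feature_type, value) pairs of an occurrence by group
then selects one feature set at a time.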


[Authors]

Eneko Agirre (e.agirre@ehu.es)
Oier Lopez de Lacalle (jibloleo@si.ehu.es)
David Martínez (davidm@csse.unimelb.edu.au)

[Acknowledgements]
Thanks to Florian et al. for the use of their syntactic dependency extractor.

[References]

Agirre, E., O. Lopez de Lacalle Lekuona, and D. Martinez. 2006.
Exploring feature set combinations for WSD.
In Proceedings of the annual meeting of the SEPLN, Spain.

Florian, Radu, Silviu Cucerzan, Charles Schafer, and David Yarowsky. 2002.
Combining classifiers for word sense disambiguation.
Natural Language Engineering, 8(4):327-341.

Pedersen, T. 2001. 
A Decision Tree of Bigrams is an Accurate Predictor of Word Sense.
In Proceedings of the Second Meeting of the North American Chapter 
of the Association for Computational Linguistics (NAACL-01), Pittsburgh, PA.