Task #2: Evaluating Word Sense Induction and Discrimination Systems

The competition is over.

  • You can get the scorer scripts, systems and gold standard keyfiles and some baselines here.

Mailing list

You can browse the e-mail discussion on the task.


Datasets and formats

The dataset will comprise the texts from the English lexical-sample task in SemEval-2007 (task 17).

The inputs and outputs of participant systems will follow the usual Senseval-3 format, with one difference: the labels for senses in the output can be arbitrary symbols. Please note that the output will contain instances of different words, so the label of each induced sense must be unique across words. For instance, let's assume that one participant system has induced 2 senses for the noun "brother" (named brother.n.C0, brother.n.C1) and 3 senses for the verb "shake" (named shake.v.C0, shake.v.C1 and shake.v.C2). These are example outputs for a sample of instances of both words:

brother.n brother.n.00001 brother.n.C1
brother.n brother.n.00002 brother.n.C0 brother.n.C1
shake.v shake.v.00001 shake.v.C2/0.4 shake.v.C0/0.5 shake.v.C1/0.1
shake.v shake.v.00002 shake.v.C2/914 shake.v.C0/817

In the first line the system assigns sense brother.n.C1 to instance brother.n.00001 with weight 1 (the default). In the second line the system assigns equal weight to senses brother.n.C0 and brother.n.C1 (each gets the default weight of 1). In the last two lines the weight is given explicitly for the senses of shake. Weights don't need to add up to one, but they must be positive. Senses not mentioned in the line get weight 0. Check this site for more details on formats.
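The line format described above can be read with a few lines of Python. This is only a minimal sketch, not the official scorer: the function name parse_answer_line is hypothetical, and it assumes whitespace-separated fields and a "/" between a sense label and its optional weight, as in the example lines.

```python
def parse_answer_line(line):
    """Parse one output line into (word, instance_id, {sense: weight}).

    Assumed layout (taken from the examples above):
        word instance_id sense[/weight] [sense[/weight] ...]
    A sense without an explicit weight gets the default weight 1.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected word, instance id and a sense: {line!r}")
    word, instance_id, labels = fields[0], fields[1], fields[2:]

    weights = {}
    for label in labels:
        sense, sep, raw_weight = label.rpartition("/")
        if not sep:
            # No "/" in the label: the whole token is the sense name.
            sense, weight = label, 1.0
        else:
            weight = float(raw_weight)
        if weight <= 0:
            raise ValueError(f"weights must be positive: {label!r}")
        weights[sense] = weight
    return word, instance_id, weights


if __name__ == "__main__":
    print(parse_answer_line("shake.v shake.v.00002 shake.v.C2/914 shake.v.C0/817"))
```

Senses that do not appear in the returned dictionary are the ones that implicitly carry weight 0.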

We interpret the results as a hard clustering task, with systems assigning the sense with maximum weight. In case of ties, we interpret that the system is forming a new sense which is a combination of those senses in the tie. For the example above:

  • instance brother.n.00001 is assigned brother.n.C1 as the induced sense
  • instance brother.n.00002 is assigned brother.n.C0_brother.n.C1 as the induced sense  
  • instance shake.v.00001 is assigned shake.v.C0 as the induced sense
  • instance shake.v.00002 is assigned shake.v.C2 as the induced sense
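The hard-clustering interpretation above can be sketched as follows. The function name hard_sense is hypothetical, and two details are assumptions inferred from the brother.n.00002 example rather than stated rules: tied senses are joined with an underscore, and they are joined in sorted order.

```python
def hard_sense(weights):
    """Return the induced sense for one instance under hard clustering.

    `weights` maps each sense label to its positive weight. The sense with
    the maximum weight wins; tied senses are merged into one combined sense
    (joined with "_" in sorted order, an assumption based on the example).
    """
    top = max(weights.values())
    tied = sorted(sense for sense, w in weights.items() if w == top)
    return "_".join(tied)


if __name__ == "__main__":
    print(hard_sense({"brother.n.C0": 1.0, "brother.n.C1": 1.0}))
    print(hard_sense({"shake.v.C2": 0.4, "shake.v.C0": 0.5, "shake.v.C1": 0.1}))
```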

We recommend that participants return all induced senses per instance with associated weights, as these will be used for the second variety of evaluation (see below).


These are the steps to be followed by participants (see also important dates below):
  1. register on the SemEval website
  2. download the data from the SemEval website
  3. participants have 2 weeks to induce the "senses", tag the whole data with those "senses" and upload the results to the SemEval website


Organizers will return the evaluation in two varieties:
  1. clustering-style evaluation. We interpret the gold standard (GS) as a clustering solution: all examples tagged with a given sense in the GS form a class. The examples returned by participants that share the "sense" tag with maximum weight are the clusters. We compare the participants' clusters on the test data with the classes in the gold standard, and compute F-score as usual (Agirre et al. 2006). In case of ties (or multiple sense tags in the GS), a new sense will be formed.
  2. mapping to the GS sense inventory: organizers use the training/test split of the data (as defined in task 17) to map the participants' "senses" into the official sense inventory. Using this mapping, the organizers convert the participants' results into the official sense inventory, and compute the usual precision and recall measures. See (Agirre et al. 2006) for more details.
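As an illustration of the clustering-style evaluation, the sketch below computes one common definition of the clustering F-score: for each gold class, take the best F-measure over all clusters, then average over classes weighted by class size. Whether the official scorer uses exactly this definition is an assumption; the function name clustering_fscore and the instance and label names are hypothetical.

```python
from collections import Counter


def clustering_fscore(gold, induced):
    """Class-size-weighted best-match F-score between classes and clusters.

    `gold` and `induced` map the same instance ids to a single label each
    (gold sense and hard-assigned induced sense, respectively).
    """
    n = len(gold)
    class_sizes = Counter(gold.values())
    cluster_sizes = Counter(induced.values())
    overlap = Counter((gold[i], induced[i]) for i in gold)

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    gold = {"i1": "s1", "i2": "s1", "i3": "s2", "i4": "s2"}
    induced = {"i1": "C0", "i2": "C0", "i3": "C1", "i4": "C1"}
    print(clustering_fscore(gold, induced))
```

Under this definition, a system that puts every instance in a single cluster gets perfect recall but low precision for each class, which is why this variety tends to reward a number of induced senses close to that of the GS.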

The first evaluation variety gives better scores to the induced senses most similar to the GS senses (e.g. a similar number of senses). The second evaluation variety allows for comparison with other kinds of systems. It does not necessarily favor systems inducing senses similar to the GS. We have used this framework to evaluate graph-based sense-induction techniques in (Agirre et al. 2006).
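The mapping step of the second evaluation variety could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the organizers' procedure: it estimates P(GS sense | induced sense) from hard-assigned training instances, then scores each GS sense for a test instance by the weighted sum over its induced senses. The function names learn_mapping and map_instance and all labels are hypothetical.

```python
from collections import defaultdict


def learn_mapping(train_induced, train_gold):
    """Estimate P(gold sense | induced sense) from the training split.

    Both arguments map training instance ids to a single label.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for instance, cluster in train_induced.items():
        counts[cluster][train_gold[instance]] += 1

    mapping = {}
    for cluster, row in counts.items():
        total = sum(row.values())
        mapping[cluster] = {sense: c / total for sense, c in row.items()}
    return mapping


def map_instance(weights, mapping):
    """Convert one test instance's induced-sense weights into a gold sense.

    Returns None when none of the induced senses was seen in training.
    """
    scores = defaultdict(float)
    for cluster, weight in weights.items():
        for sense, prob in mapping.get(cluster, {}).items():
            scores[sense] += weight * prob
    if not scores:
        return None
    # Iterate in sorted order so that score ties are broken deterministically.
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    mapping = learn_mapping(
        {"t1": "C0", "t2": "C0", "t3": "C1"},
        {"t1": "s1", "t2": "s1", "t3": "s2"},
    )
    print(map_instance({"C0": 0.4, "C1": 0.6}, mapping))
```

Because the weighted sum uses every induced sense listed for an instance, returning all senses with their weights gives the mapping more information than returning only the top sense.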

We strongly encourage participants to discuss and propose alternative evaluation strategies, on the condition that they make use of the available lexical-sample data.

Download area

This section will contain evaluation software, useful scripts, complementary materials, baseline systems, etc. but not the datasets proper. The datasets are available at the main site for download.

Systems and Results

This section will be completed after the competition.


The timing for this task can be summarized in the following steps:

  1. participants register on the 26th of Feb.
  2. deadline for submission is the 1st of Apr.
  3. participants can choose when to download and submit in this timeframe (26th of Feb. to 1st of Apr.), but will only have 2 weeks for submitting the results starting from the download date.


We thank Ted Pedersen and Phil Edmonds for comments on this task proposal.


Pedersen, T. Unsupervised Corpus-Based Methods for WSD. In Agirre, E. and Edmonds, P. (Eds.) "Word Sense Disambiguation: Algorithms and applications". Springer, 2006.

Agirre E., Lopez de Lacalle Lekuona O., Martinez D., Soroa A. 2006. Two graph-based algorithms for state-of-the-art WSD. Proceedings of EMNLP 2006.

 For more information, visit the SemEval-2007 home page.