Task #2: Evaluating Word Sense Induction and Discrimination Systems
The competition is over.
Datasets and formats
The dataset will consist of the texts from the English lexical-sample task in SemEval-2007 (task 17).
The input and outputs of participant systems will follow the usual Senseval-3 format, with one difference: the labels for senses in the output can be arbitrary symbols. Please note that the output will consist of instances from different words, and thus the label of each induced sense must be unique across words. For instance, let's assume that one participant system has induced 2 senses for the noun "brother" (named brother.n.C0, brother.n.C1) and 3 senses for the verb "shake" (named shake.v.C0, shake.v.C1 and shake.v.C2). These are example outputs for a sample of instances of both words:
brother.n brother.n.00001 brother.n.C1
In the first line the system assigns sense brother.n.C1 to instance brother.n.00001 with weight 1 (default). In the second line the system assigns equal weight to senses brother.n.C0 and brother.n.C1 (1 by default). In the last two lines the weight is explicitly given for the senses of shake. Weights don't need to add to one, but must be positive. Senses not mentioned in the line will get weight 0. Check this site for more details on formats.
We interpret the results as a hard clustering task, with systems assigning the sense with maximum weight. In case of ties, we interpret that the system is forming a new sense which is a combination of those senses in the tie. For the example above:
We recommend that participants return all induced senses per instance with associated weights, as these will be used for the second variety of evaluation (see below).
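The hard-clustering interpretation described above can be sketched in Python as follows. This is only an illustrative sketch, not the organizers' scoring software. It assumes that explicit weights are written as `sense/weight` (the usual Senseval-3 answer convention), and the function names, the `+` separator used to name a combined sense, and the instance ids other than brother.n.00001 are hypothetical.

```python
def parse_line(line):
    """Split one answer line into (lemma, instance_id, {sense: weight}).

    Assumption: an explicit weight is written as "sense/weight";
    a sense given without a weight gets the default weight 1.
    """
    lemma, instance_id, *answers = line.split()
    weights = {}
    for answer in answers:
        sense, sep, weight = answer.partition("/")
        value = float(weight) if sep else 1.0
        if value <= 0:
            raise ValueError(f"weight must be positive: {answer}")
        weights[sense] = value
    return lemma, instance_id, weights


def hard_label(weights):
    """Return the sense with maximum weight.

    Tied senses are merged into one combined sense. Joining the sorted
    labels with "+" is a hypothetical naming choice for this sketch.
    """
    top = max(weights.values())
    tied = sorted(sense for sense, w in weights.items() if w == top)
    return "+".join(tied)


if __name__ == "__main__":
    # Only the first line comes from the example above; the others are
    # made-up instances used to show ties and explicit weights.
    sample = [
        "brother.n brother.n.00001 brother.n.C1",
        "brother.n brother.n.00002 brother.n.C0 brother.n.C1",
        "shake.v shake.v.00001 shake.v.C2/0.8 shake.v.C1/0.2",
    ]
    for line in sample:
        lemma, instance_id, weights = parse_line(line)
        print(instance_id, hard_label(weights))
```

Senses that do not appear on a line are simply absent from the dictionary, which matches the rule that unmentioned senses get weight 0.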
Participation
These are the steps to be followed by participants (see also important dates below):
Evaluation
Organizers will return the evaluation in two varieties:
The first evaluation variety gives better scores to the induced senses most similar to the gold-standard (GS) senses (e.g. a similar number of senses). The second evaluation variety allows for comparison with other kinds of systems. It does not necessarily favor systems inducing senses similar to the GS. We have used this framework to evaluate graph-based sense-induction techniques in (Agirre et al. 2006).
We strongly encourage participants to discuss and propose alternative evaluation strategies, on the condition that they make use of the available lexical-sample data.
This section will contain evaluation software, useful scripts, complementary materials, baseline systems, etc. but not the datasets proper. The datasets are available at the main site for download.
System and Results
This section will be completed after the competition.
The timing for this task can be summarized in the following steps:
We thank Ted Pedersen and Phil Edmonds for comments on this task proposal.
Pedersen, T. 2006. Unsupervised Corpus-Based Methods for WSD. In Agirre, E. and Edmonds, P. (Eds.), "Word Sense Disambiguation: Algorithms and Applications". Springer.
Agirre, E., Lopez de Lacalle Lekuona, O., Martinez, D., Soroa, A. 2006. Two graph-based algorithms for state-of-the-art WSD. Proceedings of EMNLP 2006.
For more information, visit the SemEval-2007 home page.