STS (*SEM 2013 SHARED TASK)

Atlanta, June 13-14

Participants will submit systems that examine the degree of semantic equivalence between two sentences. The goal of the STS task is to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on NLP applications. We particularly encourage submissions from the lexical semantics, summarization, machine translation evaluation metric, and textual entailment communities.

The task will follow a design similar to last year's SemEval pilot, but instead of providing train and test data from the same datasets, we will provide all of the 2012 data as training data, and the test data will be drawn from related but different datasets. This setting is more realistic, and teams that prefer to train and test on the same datasets can still run such experiments using the 2012 data.

There will be two tasks this year:

  • The core task
  • A pilot task on typed-similarity between semi-structured records

Core task

Given two sentences, s1 and s2, participants will return a similarity score quantifying how similar s1 and s2 are. Participants will also provide a confidence score indicating how confident they are in the result returned for each pair. The output of participant systems will be compared to the manual scores, which range from 0 (no relation) to 5 (semantic equivalence).

The test data will include the following datasets:

  • Paraphrase sentence pairs
  • MT evaluation pairs, including those from HyTER graphs and GALE HTER data
  • Gloss pairs

Note that there is no new training data this year; participants can use the 2012 data for training. Please check the Data tab in the menu on the left for all the data and details on the core task.
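
For illustration only, the sketch below shows a trivial word-overlap baseline that maps a sentence pair onto the 0-5 scale and attaches a rough confidence value; it is not an official baseline or submission format, and the 0-1 confidence range is an assumption.

    # Illustrative baseline only (not an official system, scorer, or
    # submission format). Scores a sentence pair on the 0-5 STS scale
    # using word overlap; the 0-1 confidence range is assumed.

    def sts_baseline(s1, s2):
        """Return (similarity in [0, 5], confidence in [0, 1])."""
        t1, t2 = set(s1.lower().split()), set(s2.lower().split())
        if not t1 or not t2:
            return 0.0, 0.0
        # Dice coefficient over word types, rescaled to 0-5.
        dice = 2.0 * len(t1 & t2) / (len(t1) + len(t2))
        # Crude confidence: longer pairs provide more evidence.
        confidence = min(1.0, (len(t1) + len(t2)) / 20.0)
        return 5.0 * dice, confidence

    print(sts_baseline("A man is playing a guitar.",
                       "A man plays the guitar."))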

Pilot task on typed-similarity

In addition, we will hold a pilot task on typed-similarity between semi-structured records. The types of similarity to be studied include location, author, people involved, time, events or actions, subject, and description. Since this is a new task, we have only just released training data for the pilot. Please check the Data tab in the menu on the left for the trial data, training data, and details on this pilot task.
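
Purely as a hypothetical illustration of per-type scoring, the sketch below represents each record as a dictionary from field name to text and scores every similarity type separately on the 0-5 scale; the record representation and field names are assumptions, not the released data format.

    # Hypothetical sketch: per-type similarity over two semi-structured
    # records represented as {field name: text} dictionaries. The record
    # format is an assumption and does not reflect the released data.

    SIMILARITY_TYPES = ["location", "author", "people involved", "time",
                        "events or actions", "subject", "description"]

    def field_similarity(a, b):
        """Word-overlap similarity between two field values, on 0-5."""
        t1, t2 = set(a.lower().split()), set(b.lower().split())
        if not t1 or not t2:
            return 0.0
        return 5.0 * 2.0 * len(t1 & t2) / (len(t1) + len(t2))

    def typed_similarity(rec1, rec2):
        """Return one 0-5 score per similarity type; missing fields score 0."""
        return {t: field_similarity(rec1.get(t, ""), rec2.get(t, ""))
                for t in SIMILARITY_TYPES}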

Evaluation

After in-house theoretical and empirical analysis, we have selected Pearson correlation as the main evaluation metric. The core task will be evaluated according to the weighted mean of the correlations across the three evaluation datasets. The pilot task will be evaluated according to the mean across the similarity types.
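
For concreteness, the snippet below computes a Pearson correlation per dataset and combines the three correlations with a weighted mean; the weights (proportional to dataset size) and the toy scores are assumptions made only for illustration.

    # Sketch of the evaluation described above. The weighting scheme
    # (proportional to dataset size) and the toy scores are assumptions.
    import numpy as np

    def pearson(gold, system):
        """Pearson correlation between gold and system scores."""
        return float(np.corrcoef(gold, system)[0, 1])

    # Hypothetical gold/system scores for the three core test sets.
    datasets = {
        "paraphrase": ([4.0, 2.5, 0.5, 3.0], [3.6, 2.8, 1.0, 3.2]),
        "MT":         ([3.0, 1.0, 4.5, 2.0], [2.5, 1.5, 4.0, 2.2]),
        "gloss":      ([5.0, 2.0, 3.5, 1.0], [4.5, 2.5, 3.0, 1.4]),
    }

    corr = {name: pearson(g, s) for name, (g, s) in datasets.items()}
    size = {name: len(g) for name, (g, _) in datasets.items()}

    # Weighted mean of Pearson across the evaluation datasets.
    weighted_mean = (sum(corr[n] * size[n] for n in datasets)
                     / sum(size.values()))
    print(corr, weighted_mean)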

Open source pipeline

New this year, the organizers provide STS common, an open-source shared annotation and inference pipeline for STS.

Further resources

A comprehensive list of evaluation tasks and datasets, software, and papers related to STS can be found at http://www-nlp.stanford.edu/wiki/STS, a collaboratively maintained site open to the STS community.

Important dates

Nov 11: Initial training dataset
Jan 3: Trial dataset, with documentation and scorer
Jan 30: Training dataset for typed similarity
Feb 15: Registration for the task closes
Mar 1: Start of evaluation period
Mar 19: End of evaluation period (23:59, UTC-11)

Apr 12: Paper due (23:59 PST)
Apr 20: Reviews due
Apr 29: Camera-ready versions due

Announcements

- *SEM program (incl. papers)
- *SEM registration open
- System runs available
- Gold standard data now available
- Results now available
- Train data for pilot on typed similarity available
- Trial data available
- STS selected as shared task of *SEM 2013
- Please join the mailing list for updates