A Web Corpus and Topic Signatures for all WordNet 1.6 Nominal Senses (v 1.0)

We have constructed a web corpus for all WordNet 1.6 noun senses. Its construction is inspired by the "monosemous relatives" method proposed by Leacock et al. (1998). The method relies on information in WordNet to retrieve examples from large corpora or the web. The retrieved examples might not contain the target word, but they do contain a word that is closely related to the target word sense. We have shown that these examples are very useful for Word Sense Disambiguation (Martinez et al., 2008).

In our case, we used the following relations to obtain the monosemous relatives: synonyms, hypernyms, direct and indirect hyponyms, and siblings. For instance, the first sense of channel has a monosemous synonym, "transmission channel"; all occurrences of this synonym in any corpus can be taken as references to the first sense of channel.
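The selection of monosemous relatives can be sketched as follows. This is only an illustration under stated assumptions: the sense counts and the relatives table are toy data invented for the example (only the "transmission channel" synonym comes from the text above), and the names SENSE_COUNT, RELATIVES and monosemous_relatives are hypothetical, not part of the released resource.

```python
# Toy data, invented for illustration; not real WordNet 1.6 content,
# except for the "transmission channel" synonym mentioned in the text.
SENSE_COUNT = {
    "channel": 2,               # polysemous target word
    "transmission channel": 1,  # monosemous synonym
    "medium": 3,                # polysemous hypernym, must be skipped
    "example_hyponym": 1,       # placeholder for a monosemous hyponym
}

# Relatives of each target sense, grouped by the relation types used here.
RELATIVES = {
    "channel#1": {
        "synonym": ["transmission channel"],
        "hypernym": ["medium"],
        "hyponym": ["example_hyponym"],
        "sibling": [],
    },
}


def monosemous_relatives(sense_id):
    """Return (relation, word) pairs whose word has exactly one sense."""
    found = []
    for relation, words in RELATIVES.get(sense_id, {}).items():
        for word in words:
            if SENSE_COUNT.get(word, 0) == 1:
                found.append((relation, word))
    return found


if __name__ == "__main__":
    # Each returned word could be used as a query; its occurrences are
    # then taken as examples of the target sense.
    print(monosemous_relatives("channel#1"))
```

Polysemous relatives such as the toy hypernym are skipped because their occurrences could not be attributed to a single sense.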

The snippets returned by Google (up to 1,000 per query) are processed and filtered. A sentence is discarded if any of the following holds: it is shorter than 6 words, its number of non-alphanumeric characters is greater than half its number of words, or it has more uppercase words than lowercase words.
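A minimal sketch of this filtering heuristic is given below. The text does not specify tokenization or what exactly counts as an uppercase word, so the sketch makes assumptions: words are split on whitespace, whitespace is not counted as a non-alphanumeric character, and a word is uppercase or lowercase when all of its cased characters are. The function name keep_sentence is hypothetical.

```python
def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence survives the three discard rules."""
    words = sentence.split()  # assumption: whitespace tokenization

    # Rule 1: discard sentences shorter than 6 words.
    if len(words) < 6:
        return False

    # Rule 2: discard if non-alphanumeric characters (whitespace excluded,
    # by assumption) outnumber half the number of words.
    non_alnum = sum(1 for ch in sentence
                    if not ch.isalnum() and not ch.isspace())
    if non_alnum > len(words) / 2:
        return False

    # Rule 3: discard if uppercase words outnumber lowercase words.
    upper = sum(1 for w in words if w.isupper())
    lower = sum(1 for w in words if w.islower())
    if upper > lower:
        return False

    return True


if __name__ == "__main__":
    for s in [
        "The transmission channel carries the signal to the receiver.",
        "Too short to keep.",
        "BUY NOW CHEAP CHANNEL DEALS click here today",
    ]:
        print(keep_sentence(s), "|", s)
```

Under these assumptions a capitalized word such as "The" counts as neither uppercase nor lowercase, so ordinary sentence-initial capitals do not affect the third rule.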

Based on this web corpus, we have built topic signatures for all polysemous WordNet 1.6 nouns. Topic signatures are context vectors built for word senses and concepts: they associate a topical vector with each word sense. The dimensions of these topical vectors are the words in the vocabulary, and the weights try to capture the relatedness of each word to the target word sense.
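To make the idea of a topic signature concrete, here is a small sketch that builds one weighted word vector per sense from a toy sense corpus. The weighting function is an assumption chosen for simplicity (term frequency times the log of the inverse fraction of senses whose examples contain the word); the text above does not state which weighting was used for the released signatures. The function name topic_signatures and the example sentences are hypothetical.

```python
import math
from collections import Counter


def topic_signatures(sense_corpus):
    """Map each sense to a list of (word, weight), highest weight first.

    sense_corpus: dict from sense id to a list of example sentences.
    Weight (assumed, tf-idf style): tf * log(n_senses / df), where df is
    the number of senses whose examples contain the word.
    """
    # Word counts per sense (lowercased, whitespace tokenization).
    counts = {
        sense: Counter(w for s in sentences for w in s.lower().split())
        for sense, sentences in sense_corpus.items()
    }
    n_senses = len(counts)

    # In how many senses' examples does each word occur?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    signatures = {}
    for sense, c in counts.items():
        weights = {w: tf * math.log(n_senses / df[w]) for w, tf in c.items()}
        # Sort by descending weight, then alphabetically for stable output.
        signatures[sense] = sorted(weights.items(),
                                   key=lambda kv: (-kv[1], kv[0]))
    return signatures


if __name__ == "__main__":
    toy_corpus = {
        "channel#1": ["the signal travels over the transmission channel",
                      "the channel carries the signal"],
        "channel#2": ["the boat crossed the channel",
                      "the river channel was deep"],
    }
    for sense, sig in topic_signatures(toy_corpus).items():
        print(sense, sig[:3])
```

With this weighting, words shared by every sense (such as "the" in the toy corpus) receive weight zero, while words specific to one sense rise to the top of its signature.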

Thus, in this work we present a publicly available resource which comprises both automatically extracted examples for all WordNet 1.6 noun senses and topic signatures built from those examples. We gathered around 700 sentences for each noun in WordNet. When the monosemous relatives are used to build a sense corpus for polysemous words, they yield an average of around 3,500 sentences per word sense. The topic signatures thus constructed contain around 4,500 words per word sense.

Contact: Oier Lopez de Lacalle (IXA NLP group)

Demos:

Topic signatures for all WN 1.6 nominal senses (demo1)

Filtered and lemmatized topic signatures for all WN 1.6 nominal senses (demo2)

Filtered and lemmatized topic signatures for 20 nouns in Senseval-2, with WN 1.7.1 senses (demo3)

Download:

Four (very large) resources are available:

Snippets from Google for all nouns in WordNet 1.6. Chunks: 1 (1G) 2 (1G) 3 (1G) 4 (1G) 5 (417M) README.txt

Sense corpus for all nominal senses of polysemous nouns in WordNet 1.6. Chunks: 1 (1G) 2 (1G) 3 (1G) 4 (1G) 5 (1G) 6 (1G) 7 (1G) 8 (59M) README.txt

Topic signatures for all nominal senses of polysemous nouns in WordNet 1.6: 2.0G README.txt

Filtered and lemmatized topic signatures for all nominal senses of polysemous nouns in WordNet 1.6: 1.5G README.txt

See also:

Disambiguated topic signatures, also known as KNOWNET (Cuadros & Rigau, 2008)

References

Agirre, E., E. Alfonseca, and O. Lopez, 2004.
Approximating hierarchy-based similarity for wordnet nominal synsets using topic signatures
Proc. of the 2nd Global WordNet Conference
(pdf)

Agirre, E. and O. Lopez, 2003.
Clustering wordnet word senses
Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP’03)
(pdf)

Agirre E., Lopez de Lacalle Lekuona O. 2004
Publicly available topic signatures for all WordNet nominal senses
Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC). Lisbon, Portugal
(postscript)

Agirre, Eneko and David Martinez, 2004.
The effect of bias on an automatically-built word sense corpus
Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC)
(pdf)

Agirre E., Martinez D., 2004.
Unsupervised WSD based on automatically retrieved examples: The importance of bias
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Barcelona, Spain.
(pdf)

Cuadros M. and Rigau G., 2008
KnowNet: building a large net of knowledge from the web
The 22nd International Conference on Computational Linguistics (Coling'08), Manchester, UK.
(pdf)

Martinez, D., Agirre, E., and Lopez de Lacalle O., 2008
On the use of automatically acquired examples for all-nouns WSD
Journal of Artificial Intelligence Research, 79-107, vol. 33. ISSN 1076-9757. (pdf).

Leacock, Claudia, Martin Chodorow, and George A. Miller, 1998.
Using corpus statistics and wordnet relations for sense identification
Computational Linguistics, 24(1):147-165.
