A Web Corpus and Topic Signatures for all WordNet 1.6 Nominal Senses (v 1.0)
We have constructed a web corpus for all WordNet 1.6 noun senses. To construct it, we drew on the "monosemous relatives" method proposed by Leacock et al. (1998). The method relies on information in WordNet to retrieve examples from large corpora or the web. The retrieved examples might not contain the target word, but they do contain a word that is closely related to the target word sense. We have shown that these examples are very useful for Word Sense Disambiguation (Martinez et al., 2008).
In our case, we used the following relations to obtain the monosemous relatives: synonyms, hypernyms, direct and indirect hyponyms, and siblings. For instance, the first sense of channel has a monosemous synonym, "transmission channel"; all occurrences of this synonym in any corpus can be taken as references to the first sense of channel.
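The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the resource: the toy lexicon, the sense identifiers (`channel#1`, `channel#2`), the sense counts, and both function names are hypothetical, and only the "transmission channel" synonym comes from the text above.

```python
# Hypothetical toy lexicon: sense id -> related words grouped by relation.
# Only "transmission channel" is taken from the text; the rest is invented.
TOY_RELATIONS = {
    "channel#1": {
        "synonyms": ["transmission channel"],
        "hypernyms": ["communication"],
    },
    "channel#2": {
        "synonyms": ["canal"],
        "hypernyms": ["passage"],
    },
}

# Hypothetical sense counts (invented for illustration, not WordNet 1.6 values).
TOY_SENSE_COUNTS = {
    "transmission channel": 1,
    "communication": 3,
    "canal": 1,
    "passage": 6,
}


def monosemous_relatives(sense, relations, sense_counts):
    """Return (relation, word) pairs whose word has exactly one sense."""
    found = []
    for relation, words in relations.get(sense, {}).items():
        for word in words:
            if sense_counts.get(word) == 1:
                found.append((relation, word))
    return found


def label_examples(sentences, sense, relatives):
    """Tag each sentence that mentions a monosemous relative with the sense.

    Matching is a plain lowercase substring test, which is an assumption
    made to keep the sketch short.
    """
    labelled = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(word in lowered for _, word in relatives):
            labelled.append((sense, sentence))
    return labelled
```

Calling `monosemous_relatives("channel#1", TOY_RELATIONS, TOY_SENSE_COUNTS)` keeps only the relatives with a single sense, and `label_examples` then assigns the target sense to every sentence mentioning one of them, even though the sentence need not contain "channel" on its own.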
The snippets returned by Google (up to 1,000 per query) are processed and filtered. A sentence is discarded if any of the following holds: it is shorter than 6 words, its number of non-alphanumeric characters is greater than its number of words divided by two, or it has more uppercase words than lowercase words.
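The filtering heuristic above could be written roughly as follows. This is a minimal sketch, not the original filter: the function name `keep_sentence` is hypothetical, and it assumes whitespace tokenization, that whitespace does not count as a non-alphanumeric character, and that an "uppercase word" is one written entirely in capitals. The text does not specify any of these details.

```python
def keep_sentence(sentence):
    """Return True if the sentence passes the three filtering conditions."""
    words = sentence.split()  # assumption: whitespace tokenization

    # Condition 1: discard sentences shorter than 6 words.
    if len(words) < 6:
        return False

    # Condition 2: discard if non-alphanumeric characters outnumber
    # half the word count (whitespace is not counted, by assumption).
    non_alnum = sum(
        1 for ch in sentence if not ch.isalnum() and not ch.isspace()
    )
    if non_alnum > len(words) / 2:
        return False

    # Condition 3: discard if fully uppercase words outnumber fully
    # lowercase ones (mixed-case words count as neither, by assumption).
    upper = sum(1 for w in words if w.isupper())
    lower = sum(1 for w in words if w.islower())
    if upper > lower:
        return False

    return True
```

Under these assumptions, a plain sentence such as "The transmission channel carried the signal clearly" is kept, while short fragments, symbol-heavy strings, and all-caps lines are dropped.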
Based on this web corpus, we have built topic signatures for all polysemous WordNet 1.6 nouns. Topic signatures are context vectors built for word senses and concepts: they associate a topical vector with each word sense. The dimensions of these topical vectors are the words in the vocabulary, and the weights try to capture the relatedness of each word to the target word sense.
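To make the data structure concrete, the sketch below builds one weighted word vector per sense from a sense corpus. The weighting is a placeholder chosen for brevity, not the measure used for the released signatures, which this page does not specify: each word's weight is the share of its occurrences that fall in the examples of the given sense. The function name `build_topic_signatures` and the sense identifiers are hypothetical.

```python
from collections import Counter


def build_topic_signatures(sense_corpus):
    """Map each sense to a {word: weight} vector.

    sense_corpus: dict of sense id -> list of example sentences.
    Weight (placeholder, an assumption): occurrences of the word in this
    sense's examples divided by its occurrences across all senses.
    """
    # Per-sense word counts, using lowercase whitespace tokens.
    counts = {
        sense: Counter(w for s in sentences for w in s.lower().split())
        for sense, sentences in sense_corpus.items()
    }

    # Word counts summed over every sense in the corpus.
    total = Counter()
    for sense_counts in counts.values():
        total.update(sense_counts)

    return {
        sense: {w: n / total[w] for w, n in sense_counts.items()}
        for sense, sense_counts in counts.items()
    }
```

With this placeholder weighting, words that occur only in the examples of one sense receive weight 1.0, and words shared evenly between two senses receive 0.5, so the highest-weighted dimensions are the ones most specific to the sense.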
Thus, in this work we present a publicly available resource which comprises both automatically extracted examples for all WordNet 1.6 noun senses and topic signatures built from those examples. We gathered around 700 sentences for each noun in WordNet. When the monosemous relatives are used to build a sense corpus for polysemous words, they yield an average of around 3,500 sentences per word sense. The resulting topic signatures contain around 4,500 words per word sense.
Contact:
Oier Lopez de Lacalle
(IXA NLP group)
Demos:
Topic signatures for all WN 1.6 nominal senses (demo1)
Filtered and lemmatized topic signatures for all WN 1.6 nominal senses (demo2)
Filtered and lemmatized topic signatures for 20 nouns in Senseval-2, with WN 1.7.1 senses (demo3)
Download:
Four (very large) resources are available:
Snippets from Google for all nouns in WordNet 1.6. Chunks: 1 (1G) 2 (1G) 3 (1G) 4 (1G) 5 (417M) README.txt
Sense corpus for all nominal senses of polysemous nouns in WordNet 1.6. Chunks: 1 (1G) 2 (1G) 3 (1G) 4 (1G) 5 (1G) 6 (1G) 7 (1G) 8 (59M) README.txt
Topic signatures for all nominal senses of polysemous nouns in WordNet 1.6: 2.0G README.txt
Filtered and lemmatized topic signatures for all nominal senses of polysemous nouns in WordNet 1.6: 1.5G README.txt
See also:
Disambiguated topic signatures, also known as KNOWNET (Cuadros & Rigau, 2008)
References
Agirre, E., E. Alfonseca, and O. Lopez, 2004.
Approximating hierarchy-based similarity for WordNet nominal synsets using topic signatures
Proceedings of the 2nd Global WordNet Conference
(pdf)
Agirre, E. and O. Lopez, 2003.
Clustering WordNet word senses
Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP'03)
(pdf)
Agirre, E. and O. Lopez de Lacalle Lekuona, 2004.
Publicly available topic signatures for all WordNet nominal senses
Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC). Lisbon, Portugal
(postscript)
Agirre, E. and D. Martinez, 2004.
The effect of bias on an automatically-built word sense corpus
Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC)
(pdf)
Agirre, E. and D. Martinez, 2004.
Unsupervised WSD based on automatically retrieved examples: The importance of bias
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Barcelona, Spain
(pdf)
Cuadros, M. and G. Rigau, 2008.
KnowNet: building a large net of knowledge from the web
Proceedings of the 22nd International Conference on Computational Linguistics (Coling'08). Manchester, UK
(pdf)
Martinez, D., E. Agirre, and O. Lopez de Lacalle, 2008.
On the use of automatically acquired examples for all-nouns WSD
Journal of Artificial Intelligence Research, vol. 33, pp. 79-107. ISSN 1076-9757
(pdf)
Leacock, C., M. Chodorow, and G. A. Miller, 1998.
Using corpus statistics and WordNet relations for sense identification
Computational Linguistics, 24(1):147-165