Follow the instructions below in order to construct the bilingual embeddings described in [1].

1. Downloading the resources
----------------------------

Download the needed resources by clicking the link in the section "Creating new embeddings from scratch" and uncompress them by typing

$ tar -xjf biemb-resources.tar.bz2

You will find the following structure:

* constraints/            bilingual constraints

  This folder contains bilingual constraints (.cst extension) for most of the language pairs used in the experiments in [1]. Due to licensing issues, bilingual constraints involving Italian cannot be distributed. Obtain ItalWordNet first and, once you have signed the license agreement, we will send you the corresponding constraints.

* wordnets/               WordNet dictionaries and graph

  This folder contains monolingual and bilingual wordnet dictionaries (.lex extension) for all languages except Italian. It also includes a file with the WordNet graph relations (wn30g_eng_lkb.txt), which comprises all relations of WordNet 3.0, including gloss relations. To get the files for Italian, please obtain ItalWordNet first and, once you have signed the license agreement, we will send you the corresponding graphs and dictionaries.

* mapping-dictionaries/   mapping dictionaries

  This folder contains the mapping dictionaries (.dict extension) which are needed for the MAP method (see below). Once again, the mappings involving Italian are missing, but you can ask for them once you have the proper license.

* stemmer/                a stemmer for Basque

2. Pre-requisites
-----------------

- In order to execute the stemmer for Basque, install the FOMA compiler from https://fomafst.github.io/ and set the $fomapath variable in "stemmer/eu-stemmer.pl" (line 8) to the path where FOMA is installed.

- Download and install UKB 2.1 from http://ixa2.si.ehu.es/ukb/ukb_2.1.tgz

- Download word2vec with SkipGram modified to use constraints from https://github.com/JosuGoiko/word2vec_constraints

- Download the bilingual mapping script from https://github.com/artetxem/vecmap

3. Building bilingual corpora
-----------------------------

3.1 TXT

- Download the ES, IT and EN monolingual Wikipedia dumps from:
  http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/

- Get the xml2txt.pl script from here:
  https://www.dropbox.com/s/p3ta9spzfviovk0/xml2txt.pl?dl=0

  Extract the text from the XML by executing the following:

  $ perl xml2txt.pl -nomath -notables INPUT.xml > OUTPUT.txt

- Lowercase the result.

- The EU corpus is composed of two parts, one derived from the Wikipedia dump of 2016-07-04 and the other from the ElHuyar Web Corpus [2]. The Wikipedia part can be obtained here:
  http://ixa2.si.ehu.es/ukb/euwiki-20160704.stem.lc.txt.bz2

  The ElHuyar Web Corpus [2] can be obtained from its authors by email (Igor Leturia). Lowercase it and stem it using the "eu-stemmer.pl" script in the "stemmer" folder:

  $ cat ElhuyarWebCorpus.txt | perl stemmer/eu-stemmer.pl > ElhuyarWebCorpus.stem.txt

  Finally, join the Wikipedia and ElHuyar Web corpora and shuffle the result.

- For each language pair, randomly select a subset of the larger monolingual corpus, according to the corpus sizes given in table 10 of [1].

- For each language pair, concatenate the two monolingual corpora and shuffle the lines (a possible way to script these last steps is sketched below).
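The exact commands for these last steps are not spelled out above; the following is a minimal sketch assuming GNU coreutils, where N and the file names are placeholders (adjust N if the sizes in table 10 of [1] are given in tokens rather than lines):

# lowercase the extracted text (UTF-8 aware)
$ perl -CSD -ne 'print lc' OUTPUT.txt > OUTPUT.lc.txt

# randomly pick N lines from the larger monolingual corpus
$ shuf -n N LARGER.lc.txt > LARGER.subset.txt

# concatenate the two monolingual corpora and shuffle the lines
$ cat LARGER.subset.txt SMALLER.lc.txt | shuf > TXT-PAIR.txt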
3.2 KB

- Get the Basque, English and Spanish WordNet dictionaries from the "wordnets" folder (see above), where there are also bilingual dictionaries for each language pair.

- Compile the wordnet graph using UKB (the binary file is platform-dependent):

  $ ./compile_kb -o wordnets/wn30g.csr64 wordnets/wn30g_eng_lkb.txt

- For each language pair involving Italian, create synthetic corpora by executing the following commands (wnit.csr64 stands for the graph that also contains the Italian WordNet relations, which you can compile analogously once you have obtained the Italian files; see the parameters below):

  $ ukb_walkandprint --srand 5555 100000000 -K wnit.csr64 -D wordnets/EN+IT.lex > KB-ENIT.txt
  $ ukb_walkandprint --srand 4445 124000000 -K wnit.csr64 -D wordnets/ES+IT.lex > KB-ESIT.txt
  $ ukb_walkandprint --srand 4444 60000000 -K wnit.csr64 -D wordnets/EU+IT.lex > KB-EUIT.txt

- For the rest of the language pairs, create synthetic corpora by executing the following commands (see the parameters below):

  $ ukb_walkandprint --srand 4443 --multilang 112000000 -K wordnets/wn30g.csr64 -D wordnets/EN+ES.lex > KB-ENES.txt
  $ ukb_walkandprint --srand 4443 --multilang 42000000 -K wordnets/wn30g.csr64 -D wordnets/EN+EU.lex > KB-ENEU.txt
  $ ukb_walkandprint --srand 4444 --multilang 70000000 -K wordnets/wn30g.csr64 -D wordnets/EU+ES.lex > KB-ESEU.txt

- To remove the language tags from the synthetic corpora, execute the following:

  $ sed -e 's/\#lang1//g;s/\#lang2//g' bilingual-corpora.orig.txt > bilingual-corpora.txt

  In the above command, "lang1" and "lang2" refer to the language tags ("eu", "it", "es" and "en") of the two languages that make up the bilingual corpus. For example, to remove the language tags from the ENEU corpus:

  $ sed -e 's/\#en//g;s/\#eu//g' KB-ENEU.orig.txt > KB-ENEU.txt

3.3 HYB

- For each language pair, and given the two monolingual TXT corpora and the bilingual KB corpus described above: first select subsets at random according to the sizes in table 11 of [1], then concatenate the three corpora and shuffle the lines (the sketch at the end of section 3.1 can be reused here with three input corpora).

4. Building bilingual embeddings
--------------------------------

4.1 JOINT and JOINTC

- You need the bilingual constraints from the "constraints" folder (see above).

- JOINT: for each language pair (PAIR) and corpus type (TYPE, one of TXT, KB, HYB) execute the following, one run per combination and 18 runs in total, using the same parameters as in [3] (a loop over all combinations is sketched at the end of this subsection):

  $ ./word2vec_constraints -train TYPE-PAIR.txt -output JOINT-TYPE-PAIR.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0

  For instance, this is the call for the hybrid ENES corpus:

  $ ./word2vec_constraints -train HYB-ENES.txt -output JOINT-HYB-ENES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0

- JOINTC: for each language pair and corpus type (TXT, KB, HYB) execute the following, again 18 runs in total:

  $ ./word2vec_constraints -train TYPE-PAIR.txt -output JOINTC-TYPE-PAIR.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0 -read-simconstr constraints/PAIR.cst -lambdasim 0.01

  For instance, this is the call for the text ESEU corpus:

  $ ./word2vec_constraints -train TXT-ESEU.txt -output JOINTC-TXT-ESEU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0 -read-simconstr constraints/ESEU.cst -lambdasim 0.01
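These runs can be launched one by one, but a small shell loop over all pair/type combinations may be more convenient. The sketch below is not part of the original instructions; it assumes the bilingual corpora are named TYPE-PAIR.txt and the constraint files PAIR.cst, as in the examples above, and that the constraints for the Italian pairs have already been obtained under license:

for PAIR in ENES ENEU ENIT ESEU ESIT EUIT; do
  for TYPE in TXT KB HYB; do
    # JOINT: SkipGram on the bilingual corpus alone
    ./word2vec_constraints -train ${TYPE}-${PAIR}.txt -output JOINT-${TYPE}-${PAIR}.emb \
      -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
    # JOINTC: the same corpus plus the bilingual constraints
    ./word2vec_constraints -train ${TYPE}-${PAIR}.txt -output JOINTC-${TYPE}-${PAIR}.emb \
      -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0 \
      -read-simconstr constraints/${PAIR}.cst -lambdasim 0.01
  done
done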
4.2 MAP

- You need the mapping dictionaries from the "mapping-dictionaries" folder (see above).

- Gather the monolingual corpora pairs:

  - TXT: the monolingual corpora mentioned above (3.1).

  - KB: in order to create monolingual synthetic corpora as in [4], download UKB 2.1, the monolingual dictionaries and the graphs as above, and execute the following commands:

    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/wn30g.csr64 -D wordnets/EU.lex 36000000 > KB-EU.txt
    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/wn30g.csr64 -D wordnets/EN.lex 56000000 > KB-EN.txt
    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/wn30g.csr64 -D wordnets/ES.lex 72000000 > KB-ES.txt
    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/itwn.csr64 -D wordnets/IT.lex 50000000 > KB-IT.txt

  - HYB: collect the monolingual TXT and KB corpora as described in the previous two points. For each language pair a hybrid corpus of a different size is produced (see Section 5.4.3 in [1] for how to set the sizes), selecting subsets at random and then shuffling the TXT and KB subsets together. These are the suggested file names for the resulting hybrid corpora:

    HYB-EN.txt
    HYB-EN.ENES.txt
    HYB-EN.ENIT.txt
    HYB-EN.ESIT.txt
    HYB-ES.txt
    HYB-ES.ESIT.txt
    HYB-ES.ESEU.txt
    HYB-EU.txt
    HYB-EU.EUIT.txt
    HYB-IT.txt

- For each monolingual corpus, execute the following to produce separate monolingual embeddings, using the same parameters as in [3] (a loop over these runs is sketched after the list):

  $ ./word2vec_constraints -train KB-EN.txt -output KB-EN.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train KB-ES.txt -output KB-ES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train KB-EU.txt -output KB-EU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train KB-IT.txt -output KB-IT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-EN.txt -output TXT-EN.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-ES.txt -output TXT-ES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-EU.txt -output TXT-EU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-IT.txt -output TXT-IT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.txt -output HYB-EN.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.ENES.txt -output HYB-EN.ENES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.ENIT.txt -output HYB-EN.ENIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.ESIT.txt -output HYB-EN.ESIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-ES.txt -output HYB-ES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-ES.ESIT.txt -output HYB-ES.ESIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-ES.ESEU.txt -output HYB-ES.ESEU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EU.txt -output HYB-EU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EU.EUIT.txt -output HYB-EU.EUIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-IT.txt -output HYB-IT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
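As a convenience, the 18 runs above could also be launched from a single shell loop. The sketch below is not part of the original instructions; it simply iterates over the corpus names listed above:

for CORPUS in KB-EN KB-ES KB-EU KB-IT TXT-EN TXT-ES TXT-EU TXT-IT \
              HYB-EN HYB-EN.ENES HYB-EN.ENIT HYB-EN.ESIT HYB-ES HYB-ES.ESIT \
              HYB-ES.ESEU HYB-EU HYB-EU.EUIT HYB-IT; do
  # one SkipGram run per monolingual corpus, same parameters as in [3]
  ./word2vec_constraints -train ${CORPUS}.txt -output ${CORPUS}.emb \
    -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
done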
- Compute the mapping of the monolingual embeddings:

  $ python3 map_embeddings.py --orthogonal SRC.emb TRG.emb SRC.PAIR.mapped.emb TRG.PAIR.mapped.emb -d mapping-dictionaries/BILINGUAL-DICTIONARY.dict

  For instance, this is the call for mapping the Spanish KB embeddings onto their English counterpart:

  $ python3 map_embeddings.py --orthogonal KB-ES.emb KB-EN.emb KB-ES.ENES.mapped.emb KB-EN.ENES.mapped.emb -d mapping-dictionaries/ES2EN.dict

  The target embedding space has to be the language with more resources. Note that the resulting mapped embeddings are the ones to use when computing cross-lingual similarity.

References
----------

[1] Goikoetxea, J., Agirre, E., Soroa, A. Bilingual Embeddings with Random Walks over Multilingual WordNets. Under review.

[2] Leturia, I. Evaluating Different Methods for Automatically Collecting Large General Corpora for Basque from the Web. COLING 2012.

[3] Goikoetxea, J., Agirre, E., Soroa, A. Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet. AAAI 2016.

[4] Goikoetxea, J., Agirre, E., Soroa, A. Random Walks and Neural Network Language Models on Knowledge Bases. NAACL 2015.