Follow the instructions below in order to construct the bilingual embeddings described in [1].

1. Downloading the resources
----------------------------

Download the needed resources by clicking the link in the section "Creating new embeddings from scratch" and uncompress them by typing

$ tar -xjf biemb-resources.tar.bz2

You will find the following structure:

* constraints/            bilingual constraints

  This folder contains bilingual constraints (.cst extension) for most of the language pairs used in the experiments in [1]. Due to licensing issues, bilingual constraints involving Italian cannot be distributed. Obtain ItalWordNet first and, once you have signed the license agreement, we will send you the corresponding constraints.

* wordnets/               WordNet dictionaries and graph

  This folder contains monolingual and bilingual wordnet dictionaries (.lex extension) for all languages except Italian. It also includes a file with the WordNet graph relations (wn30g_eng_lkb.txt), which comprises all relations of WordNet 3.0, including gloss relations. To get the files for Italian, please obtain ItalWordNet first and, once you have signed the license agreement, we will send you the corresponding graphs and dictionaries.

* mapping-dictionaries/   mapping dictionaries

  This folder contains the mapping dictionaries (.dict extension) which are needed for the MAP method (see below). Once again, the mappings involving Italian are missing, but you can ask for them once you have the proper license.

* stemmer/                a stemmer for Basque

2. Pre-requisites
-----------------

- In order to execute the stemmer for Basque, install the FOMA compiler from https://fomafst.github.io/ and set the $fomapath variable in "stemmer/eu-stemmer.pl" (line 8) to the path where FOMA is installed.

- Download and install UKB 2.1 from http://ixa2.si.ehu.es/ukb/ukb_2.1.tgz

- Download word2vec with SkipGram modified to use constraints from https://github.com/JosuGoiko/word2vec_constraints

- Download the bilingual mapping script from https://github.com/artetxem/vecmap

3. Building bilingual corpora
-----------------------------

3.1 TXT

- Download the ES, IT and EN monolingual Wikipedia dumps from:
  http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/

- Get the xml2txt.pl script from here:
  https://www.dropbox.com/s/p3ta9spzfviovk0/xml2txt.pl?dl=0

  Extract the text from the XML by executing the following:

  $ perl xml2txt.pl -nomath -notables INPUT.xml > OUTPUT.txt

- Lowercase the result.

- The EU corpus is composed of two parts, one derived from the Wikipedia dump of 2016-07-04 and the other from the ElHuyar Web Corpus [2]. The Wikipedia part can be obtained here:
  http://ixa2.si.ehu.es/ukb/euwiki-20160704.stem.lc.txt.bz2

  The ElHuyar Web Corpus [2] can be obtained from its authors by email (Igor Leturia). Lowercase it and stem it using the "eu-stemmer.pl" script in the "stemmer" folder:

  $ cat ElhuyarWebCorpus.txt | perl stemmer/eu-stemmer.pl > ElhuyarWebCorpus.stem.txt

  Finally, join the Wikipedia and ElHuyar Web corpora and shuffle the result.

- For each language pair, randomly select a subset of the larger monolingual corpus, according to the corpus sizes given in table 10 of [1].

- For each language pair, concatenate the two monolingual corpora and shuffle the lines (a possible way to script these last steps is sketched below).
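The exact commands for these last steps are not spelled out above; the following is a minimal sketch assuming GNU coreutils, where N and the file names are placeholders (adjust N if the sizes in table 10 of [1] are given in tokens rather than lines):

# lowercase the extracted text (UTF-8 aware)
$ perl -CSD -ne 'print lc' OUTPUT.txt > OUTPUT.lc.txt

# randomly pick N lines from the larger monolingual corpus
$ shuf -n N LARGER.lc.txt > LARGER.subset.txt

# concatenate the two monolingual corpora and shuffle the lines
$ cat LARGER.subset.txt SMALLER.lc.txt | shuf > TXT-PAIR.txt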
3.2 KB

- Get the Basque, English and Spanish WordNet dictionaries from the "wordnets" folder (see above), where there are also bilingual dictionaries for each language pair.

- Compile the wordnet graph using UKB (the binary file is platform-dependent):

  $ ./compile_kb -o wordnets/wn30g.csr64 wordnets/wn30g_eng_lkb.txt

- For each language pair involving Italian, create synthetic corpora by executing the following commands (wnit.csr64 stands for the graph that also contains the Italian WordNet relations, which you can compile analogously once you have obtained the Italian files; see the parameters below):

  $ ukb_walkandprint --srand 5555 100000000 -K wnit.csr64 -D wordnets/EN+IT.lex > KB-ENIT.txt
  $ ukb_walkandprint --srand 4445 124000000 -K wnit.csr64 -D wordnets/ES+IT.lex > KB-ESIT.txt
  $ ukb_walkandprint --srand 4444 60000000 -K wnit.csr64 -D wordnets/EU+IT.lex > KB-EUIT.txt

- For the rest of the language pairs, create synthetic corpora by executing the following commands (see the parameters below):

  $ ukb_walkandprint --srand 4443 --multilang 112000000 -K wordnets/wn30g.csr64 -D wordnets/EN+ES.lex > KB-ENES.txt
  $ ukb_walkandprint --srand 4443 --multilang 42000000 -K wordnets/wn30g.csr64 -D wordnets/EN+EU.lex > KB-ENEU.txt
  $ ukb_walkandprint --srand 4444 --multilang 70000000 -K wordnets/wn30g.csr64 -D wordnets/EU+ES.lex > KB-ESEU.txt

- To remove the language tags from the synthetic corpora, execute the following:

  $ sed -e 's/\#lang1//g;s/\#lang2//g' bilingual-corpora.orig.txt > bilingual-corpora.txt

  In the above command, "lang1" and "lang2" refer to the language tags ("eu", "it", "es" and "en") of the two languages that make up the bilingual corpus. For example, to remove the language tags from the ENEU corpus:

  $ sed -e 's/\#en//g;s/\#eu//g' KB-ENEU.orig.txt > KB-ENEU.txt

3.3 HYB

- For each language pair, and given the two monolingual TXT corpora and the bilingual KB corpus described above: first select subsets at random according to the sizes in table 11 of [1], then concatenate the three corpora and shuffle the lines (the sketch at the end of section 3.1 can be reused here with three input corpora).

4. Building bilingual embeddings
--------------------------------

4.1 JOINT and JOINTC

- You need the bilingual constraints from the "constraints" folder (see above).

- JOINT: for each language pair (PAIR) and corpus type (TYPE, one of TXT, KB, HYB) execute the following, one run per combination and 18 runs in total, using the same parameters as in [3] (a loop over all combinations is sketched at the end of this subsection):

  $ ./word2vec_constraints -train TYPE-PAIR.txt -output JOINT-TYPE-PAIR.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0

  For instance, this is the call for the hybrid ENES corpus:

  $ ./word2vec_constraints -train HYB-ENES.txt -output JOINT-HYB-ENES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0

- JOINTC: for each language pair and corpus type (TXT, KB, HYB) execute the following, again 18 runs in total:

  $ ./word2vec_constraints -train TYPE-PAIR.txt -output JOINTC-TYPE-PAIR.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0 -read-simconstr constraints/PAIR.cst -lambdasim 0.01

  For instance, this is the call for the text ESEU corpus:

  $ ./word2vec_constraints -train TXT-ESEU.txt -output JOINTC-TXT-ESEU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0 -read-simconstr constraints/ESEU.cst -lambdasim 0.01
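These runs can be launched one by one, but a small shell loop over all pair/type combinations may be more convenient. The sketch below is not part of the original instructions; it assumes the bilingual corpora are named TYPE-PAIR.txt and the constraint files PAIR.cst, as in the examples above, and that the constraints for the Italian pairs have already been obtained under license:

for PAIR in ENES ENEU ENIT ESEU ESIT EUIT; do
  for TYPE in TXT KB HYB; do
    # JOINT: SkipGram on the bilingual corpus alone
    ./word2vec_constraints -train ${TYPE}-${PAIR}.txt -output JOINT-${TYPE}-${PAIR}.emb \
      -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
    # JOINTC: the same corpus plus the bilingual constraints
    ./word2vec_constraints -train ${TYPE}-${PAIR}.txt -output JOINTC-${TYPE}-${PAIR}.emb \
      -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0 \
      -read-simconstr constraints/${PAIR}.cst -lambdasim 0.01
  done
done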
4.2 MAP

- You need the mapping dictionaries from the "mapping-dictionaries" folder (see above).

- Gather the monolingual corpora pairs:

  - TXT: the monolingual corpora mentioned above (3.1).

  - KB: in order to create monolingual synthetic corpora as in [4], download UKB 2.1, the monolingual dictionaries and the graphs as above, and execute the following commands:

    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/wn30g.csr64 -D wordnets/EU.lex 36000000 > KB-EU.txt
    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/wn30g.csr64 -D wordnets/EN.lex 56000000 > KB-EN.txt
    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/wn30g.csr64 -D wordnets/ES.lex 72000000 > KB-ES.txt
    $ ukb_walkandprint --dict_weight --wemit_prob 1 -K wordnets/itwn.csr64 -D wordnets/IT.lex 50000000 > KB-IT.txt

  - HYB: collect the monolingual TXT and KB corpora as described in the previous two points. For each language pair a hybrid corpus of a different size is produced (see Section 5.4.3 in [1] for how to set the sizes), selecting subsets at random and then shuffling the TXT and KB subsets together. These are the suggested file names for the resulting hybrid corpora:

    HYB-EN.txt
    HYB-EN.ENES.txt
    HYB-EN.ENIT.txt
    HYB-EN.ESIT.txt
    HYB-ES.txt
    HYB-ES.ESIT.txt
    HYB-ES.ESEU.txt
    HYB-EU.txt
    HYB-EU.EUIT.txt
    HYB-IT.txt

- For each monolingual corpus, execute the following to produce separate monolingual embeddings, using the same parameters as in [3] (a loop over these runs is sketched after the list):

  $ ./word2vec_constraints -train KB-EN.txt -output KB-EN.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train KB-ES.txt -output KB-ES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train KB-EU.txt -output KB-EU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train KB-IT.txt -output KB-IT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-EN.txt -output TXT-EN.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-ES.txt -output TXT-ES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-EU.txt -output TXT-EU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train TXT-IT.txt -output TXT-IT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.txt -output HYB-EN.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.ENES.txt -output HYB-EN.ENES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.ENIT.txt -output HYB-EN.ENIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EN.ESIT.txt -output HYB-EN.ESIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-ES.txt -output HYB-ES.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-ES.ESIT.txt -output HYB-ES.ESIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-ES.ESEU.txt -output HYB-ES.ESEU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EU.txt -output HYB-EU.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-EU.EUIT.txt -output HYB-EU.EUIT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
  $ ./word2vec_constraints -train HYB-IT.txt -output HYB-IT.emb -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
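As a convenience, the 18 runs above could also be launched from a single shell loop. The sketch below is not part of the original instructions; it simply iterates over the corpus names listed above:

for CORPUS in KB-EN KB-ES KB-EU KB-IT TXT-EN TXT-ES TXT-EU TXT-IT \
              HYB-EN HYB-EN.ENES HYB-EN.ENIT HYB-EN.ESIT HYB-ES HYB-ES.ESIT \
              HYB-ES.ESEU HYB-EU HYB-EU.EUIT HYB-IT; do
  # one SkipGram run per monolingual corpus, same parameters as in [3]
  ./word2vec_constraints -train ${CORPUS}.txt -output ${CORPUS}.emb \
    -size 300 -window 5 -sample 0 -negative 5 -hs 0 -binary 0 -cbow 0
done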
- Compute the mapping of the monolingual embeddings:

  $ python3 map_embeddings.py --orthogonal SRC.emb TRG.emb SRC.PAIR.mapped.emb TRG.PAIR.mapped.emb -d mapping-dictionaries/BILINGUAL-DICTIONARY.dict

  For instance, this is the call for mapping the Spanish KB embeddings onto their English counterpart:

  $ python3 map_embeddings.py --orthogonal KB-ES.emb KB-EN.emb KB-ES.ENES.mapped.emb KB-EN.ENES.mapped.emb -d mapping-dictionaries/ES2EN.dict

  The target embedding space has to be the language with more resources. Note that the resulting mapped embeddings are the ones to use when computing cross-lingual similarity.

References
----------

[1] Goikoetxea, J., Agirre, E., Soroa, A. Bilingual Embeddings with Random Walks over Multilingual WordNets. Under review.

[2] Leturia, I. Evaluating Different Methods for Automatically Collecting Large General Corpora for Basque from the Web. COLING 2012.

[3] Goikoetxea, J., Agirre, E., Soroa, A. Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet. AAAI 2016.

[4] Goikoetxea, J., Agirre, E., Soroa, A. Random Walks and Neural Network Language Models on Knowledge Bases. NAACL 2015.