We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. We also distribute three new word analogy datasets, for French, Hindi and Polish.
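For reference, a roughly equivalent training setup with the fasttext Python package is sketched below; data.txt is a placeholder corpus path, and plain CBOW here does not reproduce the position-weights used for the distributed models:

import fasttext

# CBOW with the hyperparameters listed above: dimension 300,
# character n-grams of length 5, window of size 5, 10 negatives.
model = fasttext.train_unsupervised(
    'data.txt',   # placeholder: path to a plain-text training corpus
    model='cbow',
    dim=300,
    minn=5,       # min length of character n-grams
    maxn=5,       # max length of character n-grams
    ws=5,         # window size
    neg=10,       # number of negatives sampled
)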

Download directly with command line or from Python

In order to download with the command line or from Python code, you must have installed the Python package as described here.

Command line:

$ ./download_model.py en
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
 (19.78%) [=========>                         ]

Once the download is finished, use the model as usual:

$ ./fasttext nn cc.en.300.bin 10
Query word?

Python:

>>> import fasttext.util
>>> fasttext.util.download_model('en', if_exists='ignore')
>>> ft = fasttext.load_model('cc.en.300.bin')



Adapt the dimension

The pre-trained word vectors we distribute have dimension 300. If you need a smaller size, you can use our dimension reducer. In order to use that feature, you must have installed the Python package as described here.
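This kind of reduction is essentially a PCA-style projection of the embedding matrix onto its top principal directions; a minimal numpy sketch of the idea (an illustration only, not the actual reduce_model implementation):

import numpy as np

def pca_reduce(vectors, target_dim):
    # vectors: (vocab_size, dim) matrix, one row per word
    centered = vectors - vectors.mean(axis=0)
    # top principal directions via SVD
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # keep the first target_dim directions
    return centered @ vt[:target_dim].T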

For example, in order to get vectors of dimension 100:

Command line:

$ ./reduce_model.py cc.en.300.bin 100
Loading model
Reducing matrix dimensions
Saving model
cc.en.100.bin saved

Then you can use the cc.en.100.bin model file as usual.

Python:

>>> import fasttext
>>> import fasttext.util
>>> ft = fasttext.load_model('cc.en.300.bin')
>>> ft.get_dimension()
300
>>> fasttext.util.reduce_model(ft, 100)
>>> ft.get_dimension()
100

Then you can use the ft model object as usual:

>>> ft.get_word_vector('hello').shape
(100,)
>>> ft.get_nearest_neighbors('hello')
[(0.775576114654541, u'heyyyy'), (0.7686290144920349, u'hellow'), (0.7663413286209106, u'hello-'), (0.7579624056816101, u'heyyyyy'), (0.7495524287223816, u'hullo'), (0.7473770380020142, u'.hello'), (0.7407292127609253, u'Hiiiii'), (0.7402616739273071, u'hellooo'), (0.7399682402610779, u'hello.'), (0.7396857738494873, u'Heyyyyy')]

or save it for later use:

>>> ft.save_model('cc.en.100.bin')
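The scores returned by get_nearest_neighbors are cosine similarities, so you can sanity-check them by hand; a minimal numpy sketch, assuming ft is the reduced model from above:

import numpy as np

def cosine(a, b):
    # cosine similarity between two word vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# should land close to the 0.7495... score reported for 'hullo' above
cosine(ft.get_word_vector('hello'), ft.get_word_vector('hullo'))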



Format

The word vectors are available in both binary and text formats.

Using the binary models, vectors for out-of-vocabulary words can be obtained with

$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt

where the file oov_words.txt contains out-of-vocabulary words.
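Since the subword n-grams are stored in the binary model, the Python API can produce the same out-of-vocabulary vectors; a minimal sketch, assuming cc.en.300.bin was downloaded as above:

>>> import fasttext
>>> ft = fasttext.load_model('cc.en.300.bin')
>>> ft.get_word_vector('helloooo').shape  # OOV word, vector built from character n-grams
(300,)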

In the text format, each line contains a word followed by its vector. Values are space-separated, and words are sorted by frequency in descending order. These text models can easily be loaded in Python using the following code:

import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data
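For example, assuming the English text file cc.en.300.vec has been downloaded and decompressed:

data = load_vectors('cc.en.300.vec')
# each vector is stored as a lazy map object; wrap it in list()
# if you need to index into it, e.g. list(data['hello'])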

Tokenization

We used the Stanford word segmenter for Chinese, MeCab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.

More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.

License

The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.

References

If you use these word vectors, please cite the following paper:

E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

Evaluation datasets

The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.
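As a sketch of how such datasets are used, the Python API exposes analogy queries of the form A - B + C (the model file name follows the naming above; the example triple is ours, not taken from the datasets):

>>> import fasttext
>>> ft = fasttext.load_model('cc.fr.300.bin')
>>> ft.get_analogies('paris', 'france', 'italie')  # expect something like 'rome' near the top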

Models

The models can be downloaded from: