Fasttext: How to recreate the English pretrained word vectors using enwik9

Created on 3 Mar 2017 · 3 Comments · Source: facebookresearch/fastText

I want to expand the list of languages that have pretrained word vectors, and I'd like to know how the vectors in this list were generated. I have enwik9 and the wikifil.pl Perl script to clean up the Wikipedia XML. If I run

fasttext skipgram -input wiki_dataset_cleaned.txt -output model

is that sufficient to recreate the English word vectors model that you've published in the list? If I do the same thing with any other language in the Wikipedia languages list, how would I test the quality of the output model?

poppingtonic

Most helpful comment

For those who need to call it on strings at runtime in Python:

https://gist.github.com/bittlingmayer/7139a6a75ba0dbbc3a06325394ae3a13

bittlingmayer on 27 Jul 2017

All 3 comments

@poppingtonic It may be sufficient for your purposes, but you can easily do much better. This question has been asked here before, though I forget by whom. As far as I know, Facebook has not given a concrete answer on how the preprocessing was done :(

What I can tell you, though, is that the pre-trained models in the repository were NOT created with the provided wikifil.pl script. wikifil.pl is not very sophisticated: it strips all special characters from your input and leaves you with only the lowercase letters a-z. That gives worse results than the provided pre-trained binaries, which are Unicode and include special characters. For English, wikifil.pl might be OK for your purposes, but other languages will suffer in proportion to how heavily they rely on characters outside a-z.
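To make the effect concrete, here is a minimal Python sketch of the a-z filtering described above. It is only an approximation for illustration, not a port of wikifil.pl (the real script does more, such as dealing with the XML markup), and the function name `keep_ascii_letters` is made up for this example.

```python
import re


def keep_ascii_letters(text: str) -> str:
    """Lowercase the text and replace every run of non a-z characters with one space."""
    lowered = text.lower()
    return re.sub(r"[^a-z]+", " ", lowered).strip()


if __name__ == "__main__":
    # English survives mostly intact.
    print(keep_ascii_letters("Word vectors, pre-trained!"))  # word vectors pre trained
    # German loses its umlauts and sharp s, so words are split into fragments.
    print(keep_ascii_letters("Füße über Straße"))  # f e ber stra e
```

The second call shows why languages with many non-ASCII letters end up with a vocabulary of meaningless fragments under this kind of filter.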

As for the enwik9 archive, be aware that it is small (only the first 10^9 bytes of an English Wikipedia dump) and will not come close to the quality of the pre-trained vectors, which were generated from a recent, complete Wikipedia dump (about 10-13 GB compressed and 60 GB uncompressed).

That said, if you are serious about making your own vectors from Wikipedia, you simply need to get a recent dump, available here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

If you are using Linux or macOS, you can pipe the output of bzip2 directly into your preprocessor script of choice, so you never have to unpack the archive at all. A good wiki preprocessor that does a much better job for most languages is available here: https://github.com/attardi/wikiextractor
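If your preprocessing lives in Python rather than in a shell pipeline, the same idea can be sketched with the standard library's `bz2` module, which decompresses while reading. This is a minimal sketch; the helper name `stream_dump_lines` is hypothetical, and it assumes the dump is UTF-8 encoded.

```python
import bz2
from typing import Iterator


def stream_dump_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from a .bz2 file, decompressing on the fly.

    The archive is never unpacked to disk; only a small buffer is held in memory.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Each yielded line can then be handed to whatever cleaning step you prefer, which mirrors piping `bzip2 -dc` into a script on the command line.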

P.S.: The parameters used for the pre-trained models are stored in the .bin file itself. If you want a baseline to start from, you can read them out easily (see args.cc lines 207-221). Or... make up your own! It is a good idea to tinker with these hyperparameters until you find what works best for your particular application.
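As a rough illustration of reading them out, here is a Python sketch that parses the start of a .bin file. Everything about the layout is an assumption recalled from the fastText C++ source and not confirmed in this thread: twelve 32-bit integers followed by one double, little-endian, optionally preceded by a magic number and a version in newer files. Check it against args.cc for your fastText version before relying on it. The function name `read_args` is made up.

```python
import struct
from typing import BinaryIO

# ASSUMPTION: magic number and field order as recalled from the fastText source.
MAGIC = 793712314
FIELDS = ("dim", "ws", "epoch", "minCount", "neg", "wordNgrams",
          "loss", "model", "bucket", "minn", "maxn", "lrUpdateRate")


def read_args(stream: BinaryIO) -> dict:
    """Read the hyperparameter header from an open .bin stream (assumed layout)."""
    first = struct.unpack("<i", stream.read(4))[0]
    values = {}
    if first == MAGIC:
        # Newer files: magic number, then format version, then the arguments.
        values["version"] = struct.unpack("<i", stream.read(4))[0]
        ints = struct.unpack("<12i", stream.read(48))
    else:
        # Older files: no prefix, so the integer already read is `dim`.
        ints = (first,) + struct.unpack("<11i", stream.read(44))
    values.update(zip(FIELDS, ints))
    # The sampling threshold `t` is stored as a double after the integers.
    values["t"] = struct.unpack("<d", stream.read(8))[0]
    return values
```

You would call it as `read_args(open("model.bin", "rb"))`, where `model.bin` stands for whichever model file you downloaded. If the printed values look implausible (for example a negative `dim`), the layout assumption does not hold for that file.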

ardeego on 7 May 2017

Hello @poppingtonic,

We now have a script called 'get-wikimedia.sh' that you can use to download and process a recent Wikipedia dump of any language. This script applies the preprocessing we used to create the published word vectors.

The parameters we used to build the word vectors are the default skip-gram settings, except for a dimensionality of 300, as indicated at the top of the list of word vectors (we now understand that this could be more visible).
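For anyone driving the training from Python, the settings above can be sketched as a small wrapper around the command-line tool. This assumes a `fasttext` binary is on your PATH; the helper names and the file names in the usage note are placeholders, not part of fastText.

```python
import subprocess
from typing import List


def skipgram_command(input_path: str, output_prefix: str, dim: int = 300) -> List[str]:
    """Build the CLI call: default skip-gram settings with only -dim overridden."""
    return [
        "fasttext", "skipgram",
        "-input", input_path,
        "-output", output_prefix,
        "-dim", str(dim),
    ]


def train_skipgram(input_path: str, output_prefix: str) -> None:
    """Run the training; raises CalledProcessError if fasttext exits with an error."""
    subprocess.run(skipgram_command(input_path, output_prefix), check=True)
```

Calling `train_skipgram("wiki.en.txt", "wiki.en")` would then run fastText on the preprocessed text and write the model files under the given output prefix.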

Thanks,
Christian