Fasttext: Pretrained models not "cleaned"?

Created on 5 Jan 2018  ·  5 Comments  ·  Source: facebookresearch/fastText

I hope this is the right place for my question. If not, feel free to tell me so.

Yesterday I started working with the fastText pretrained model for the German language.
I noticed that there are several different tokens for "berlin", such as ""berlin", "berlin/", "berlin/deutschland", and "berlin/heidelberg". The first two suggest to me that the tokens were not really cleaned beforehand. Why was that decision made, and how can I clean them afterwards?
As I understand it, fastText learns a word's vector by looking at its neighboring words, so there should be no real difference between ""berlin" and "berlin/". Can I simply average their vectors, or should I pick one of them?

The third and fourth look like some kind of clarification of which Berlin is meant, or an enumeration of different cities. Normally I would expect the / to be replaced by a space so that the words are split. Can I clean this up afterwards, or should I ignore those keys?
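The averaging idea from the question can be sketched as follows. This is only an illustration under stated assumptions: the vectors are already loaded into a plain dict of token to list of floats, "variants" means tokens that differ only by leading or trailing ASCII punctuation, and every variant gets equal weight. The function name `merge_variants` is hypothetical and not part of fastText. Whether an unweighted average actually gives a better vector than picking the most frequent variant is something to check empirically.

```python
import string
from collections import defaultdict


def merge_variants(vectors):
    """Average the vectors of tokens that differ only by surrounding punctuation.

    `vectors` maps token -> list of floats. Tokens such as '"berlin' and
    'berlin/' are both normalized to 'berlin'. Tokens with punctuation in the
    middle (e.g. 'berlin/heidelberg') keep their own key.
    """
    groups = defaultdict(list)
    for token, vec in vectors.items():
        # string.punctuation covers ASCII only; typographic quotes are not stripped.
        # Tokens made purely of punctuation keep their original form.
        key = token.strip(string.punctuation) or token
        groups[key].append(vec)

    merged = {}
    for key, vecs in groups.items():
        # Component-wise mean over all variants of this key (equal weights).
        merged[key] = [sum(column) / len(vecs) for column in zip(*vecs)]
    return merged
```

Calling `merge_variants` on a dict that contains `"berlin`, `berlin/`, and `berlin` returns a single `berlin` entry, while `berlin/heidelberg` is left alone and would need a separate decision (split on `/`, or drop).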

Most helpful comment

I have found the same issue: there are symbols within the tokens in the aligned word embeddings. Below I compare the number of tokens containing '…', '»', or '\xa0—' across eight languages, for the aligned word embeddings (https://fasttext.cc/docs/en/aligned-vectors.html) and the non-aligned word embeddings (https://fasttext.cc/docs/en/english-vectors.html).

The non-aligned word embeddings seem to be much cleaner and may be a better option if you don't have a cross-lingual task.

pattern | English | Chinese | Japanese | Russian | Spanish | French | German | Italian
-- | -- | -- | -- | -- | -- | -- | -- | --
Aligned Word Embeddings |   |   |   |   |   |   |   |  
total number of tokens | 2,519,370 | 332,647 | NA | 1,888,423 | 985,667 | 1,152,449 | 2,275,233 | 871,053
'…' | 1,327 | 1 | NA | 4,818 | 325 | 4,260 | 1,160 | 176
'»' | 736 | 1 | NA | 140,548 | 17,783 | 14,403 | 1,862 | 5,185
'\xa0—' | 1 | 0 | NA | 84,165 | 3 | 9 | 1 | 0
Un-aligned Word Embeddings |   |   |   |   |   |   |   |  
total number of tokens | 2,000,000 | 2,000,000 | 2,000,000 | 2,000,000 | 2,000,000 | 2,000,000 | 2,000,000 | 2,000,000
'…' | 1 | 209 | 1,204 | 1 | 1 | 1 | 1 | 1
'»' | 1 | 162 | 110 | 1 | 1 | 1 | 1 | 1
'\xa0—' | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
wiki-news-300d-1M |   |   |   |   |   |   |   |  
total number of tokens | 999,994 |   |   |   |   |   |   |  
'…' | 1 |   |   |   |   |   |   |  
'»' | 1 |   |   |   |   |   |   |  
'\xa0—' | 0 |   |   |   |   |   |   |  
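Counts like those in the table could be produced with a short script along these lines. It is a sketch, not necessarily how the numbers above were obtained. It assumes the vectors are in fastText's text (`.vec`) format, where the first line holds the vocabulary size and dimension and every following line starts with the token, followed by a space and the vector components. The function name `count_pattern_tokens` is hypothetical.

```python
from collections import Counter

# The three patterns from the table: ellipsis, right guillemet,
# and a non-breaking space followed by a long dash.
PATTERNS = ["\u2026", "\u00bb", "\xa0\u2014"]


def count_pattern_tokens(lines, patterns=PATTERNS):
    """Return (total_tokens, counts) for an iterable of .vec-style lines.

    counts[p] is the number of tokens that contain pattern p.
    """
    counts = Counter({p: 0 for p in patterns})
    total = 0
    for i, line in enumerate(lines):
        if i == 0:
            continue  # assumed header line: "<vocab size> <dimension>"
        # Split on the ASCII space only: str.split() with no argument would
        # also split on '\xa0', which can occur inside a token.
        token = line.split(" ", 1)[0]
        total += 1
        for p in patterns:
            if p in token:
                counts[p] += 1
    return total, counts
```

To run it on a file, pass the open file object, for example `with open("vectors.vec", encoding="utf-8", newline="\n") as f: total, counts = count_pattern_tokens(f)`, where `vectors.vec` is a placeholder path. Passing `newline="\n"` keeps Python from treating a stray carriage return inside a token as a line break.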

All 5 comments

This is also true for the French model. The Unicode conversion suggested in fastText/python/README.md does not seem to have been applied.

I guess we should download the original dataset, clean everything, and re-train the model on our own.

I've read somewhere that the script used to obtain the text for the pretrained models is the same as the "get-wikimedia.sh" script. It applies a preprocessing step to extract the text from the XML file. I guess that preprocessing is not quite accurate and leaves several symbols behind; in French, for example, it leaves some { characters.

You can download the text file using get-wikimedia.sh and then apply another script to clean the file again (for example, transforming all unwanted characters into spaces).
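A second cleaning pass of that kind might look like the sketch below. It assumes that "unwanted" means every character that is neither a letter nor a digit in the Unicode sense, and that lowercasing is desired; both are choices of this example, not something get-wikimedia.sh or fastText prescribes. The names `clean_line` and `clean_file` are hypothetical.

```python
import unicodedata


def clean_line(line, lowercase=True):
    """Replace every non-alphanumeric character with a space and collapse runs of spaces."""
    line = unicodedata.normalize("NFC", line)  # compose accents so 'é' is one character
    if lowercase:
        line = line.lower()
    # str.isalnum() is Unicode-aware, so letters such as 'é' or 'ß' are kept.
    chars = [ch if ch.isalnum() else " " for ch in line]
    return " ".join("".join(chars).split())


def clean_file(src_path, dst_path):
    """Clean a UTF-8 text file line by line, dropping lines that end up empty."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = clean_line(line)
            if cleaned:
                dst.write(cleaned + "\n")
```

Because the file is processed line by line, the line structure of the input is preserved, and leftover markup such as `{` or `/` becomes a token boundary instead of part of a token.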

I have the same problem: I want to quantize the German vectors and am wondering how the wiki dump was cleaned. If I clean it with wikifil.pl, all characters like "öäüß" are deleted from the text because of the cleaning command tr/a-z/ /cs;. It would also be useful to know whether fastText divides the text into sentences when building word vectors. According to the tutorial, one has to pass a single text file, but the formatting of that file is not specified. When cleaning with wikifil.pl, the whole text ends up on one line, so how does fastText split the sequences in that case?

@tocab I just modified my script, changing the line tr/a-z/ /cs; to tr/a-zäöüß/ /cs;, which worked pretty well.
When I compared my model to the pretrained model (https://fasttext.cc/docs/en/pretrained-vectors.html) using fastText's nearest-neighbors function, it seemed that the cleaning script they used doesn't remove the German quotation marks, which might be related to this problem.
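For readers who are not familiar with Perl, the effect of the two `tr` commands can be mimicked in Python. In `tr/SET/ /cs`, `c` selects the complement of the set and `s` squeezes each run of replaced characters into a single space. The helper `tr_complement_squeeze` below is a hypothetical name, and the sketch assumes the text is a decoded Unicode string with precomposed umlauts.

```python
import re


def tr_complement_squeeze(text, keep):
    """Mimic Perl's tr/<keep>/ /cs: each run of characters outside `keep` becomes one space.

    `keep` is inserted into a regex character class as-is, e.g. "a-z".
    """
    return re.sub(f"[^{keep}]+", " ", text)


sample = "gr\u00f6\u00dfe \u201eberlin\u201c 2018"  # größe „berlin“ 2018

# Original wikifil.pl set: the umlauts and ß are lost.
original = tr_complement_squeeze(sample, "a-z")

# Modified set from the comment above: German letters survive, while
# quotation marks and digits are still replaced.
modified = tr_complement_squeeze(sample, "a-z\u00e4\u00f6\u00fc\u00df")
```

Here `original` is `"gr e berlin "` and `modified` is `"größe berlin "`. One thing to check in your own script: the set contains lowercase umlauts only, so if lowercasing covers only A-Z, a capitalized "Über" would still lose its first letter. In Python, calling `text.lower()` first avoids this.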
