Fasttext: Are there specific format and text preprocessing steps recommended to train FastText for word representation?

Created on 15 Aug 2018 · 2 Comments · Source: facebookresearch/fastText

Are there text preprocessing steps recommended for learning word representations with FastText? For example, is tokenization required? Also, should the text be lowercased?

I have a large, clean corpus of scientific text from which I want to learn word representations with FastText. Will preprocessing improve the quality of the representations?

Also, the sample file in the FastText tutorial has all the words on a single line. Is that required, or does the tool accept text spread over multiple lines as well?

negacy

Most helpful comment

Hi @negacy,

As preprocessing, we recommend tokenizing the training data. This will definitely improve the quality of the representations.

Lower casing is optional, and depends on your application (whether you want different word representations for the upper- and lower-cased words).

It is not required to have all the words on a single line, and we recommend keeping the original newlines from your dataset.

Best,
Edouard.

EdouardGrave on 15 Aug 2018

❤3 👍3
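
The recommendations above (tokenize, optionally lowercase, keep the original newlines) could be sketched roughly as follows. This is only an illustration, not the fastText authors' preprocessing: the helper names tokenize_line and preprocess are hypothetical, and the regex tokenizer is an assumption that simply separates punctuation from words. For scientific text you may prefer a tokenizer that keeps hyphenated terms or chemical names intact.

```python
import re

# Hypothetical helpers for illustration; none of this is part of fastText.
# Assumption: a "token" is either a run of word characters or a single
# punctuation character. Real corpora may need a more careful tokenizer.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_line(line, lowercase=False):
    """Split one line into space-separated tokens."""
    if lowercase:
        line = line.lower()
    return " ".join(_TOKEN_RE.findall(line))


def preprocess(text, lowercase=False):
    """Tokenize every line while keeping the original line breaks."""
    return "\n".join(tokenize_line(line, lowercase) for line in text.split("\n"))


sample = "Hello, World!\nDNA-binding proteins."
cleaned = preprocess(sample, lowercase=True)
print(cleaned)
```

The output of preprocess can be written to a text file and used as the training input. Whether to pass lowercase=True depends, as noted above, on whether upper- and lower-cased forms should get separate representations.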

Hello @EdouardGrave,
In my dataset I have both \n and \n\n. I removed only the \n\n before training. Then during the testing phase I ran print_results(*model.test(FILE_TEST)) and everything worked fine, but when I do predicted = model.predict(text) on the complete corpus I get this error:

 File "/usr/local/lib/python3.6/dist-packages/fasttext-0.8.22-py3.6-linux-x86_64.egg/fastText/FastText.py", line 126, in check
    "predict processes one line at a time (remove \'\\n\')"
ValueError: predict processes one line at a time (remove '\n')

Does model.test do some kind of preprocessing that removes the \n? What are the differences in preprocessing between model.test and model.predict? Will I get different results depending on this preprocessing?
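
One way to avoid the ValueError shown above is to split the corpus into individual lines and call predict once per line. The sketch below is hedged: predict_per_line and split_into_lines are hypothetical helper names, and fake_predict is a stand-in that only imitates the newline check from the traceback so the example runs without a trained model. As far as I understand, model.test reads its input file one line per example, so newlines there act as separators rather than as part of the text, but that is my reading and is not confirmed in this thread.

```python
def split_into_lines(text):
    """Return the non-empty lines of `text`, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def predict_per_line(predict_fn, text):
    """Call `predict_fn` once per non-empty line and pair each line with its result."""
    return [(line, predict_fn(line)) for line in split_into_lines(text)]


def fake_predict(line):
    """Stand-in for model.predict (assumption: same one-line restriction)."""
    if "\n" in line:
        raise ValueError("predict processes one line at a time (remove '\\n')")
    return (("__label__demo",), (1.0,))


corpus = "first example\n\nsecond example\n"
results = predict_per_line(fake_predict, corpus)
for line, prediction in results:
    print(line, prediction)
```

With a real model, the call would presumably be predict_per_line(model.predict, text), where model is the trained object from the post above.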