Are there text preprocessing steps recommended for learning word representations with FastText? For example, is tokenization required? Also, should the text be in lower case?
I have a large, clean corpus of scientific text from which I would like to learn word representations using FastText, and I wonder whether preprocessing would improve the quality of the representations.
Also, the sample file in the FastText tutorial has all the words on a single line. Is that required, or does the tool accept multiple lines of text as well?
Hi @negacy,
As preprocessing, we recommend tokenizing the training data. This will definitely improve the quality of the representations.
Lower casing is optional and depends on your application (whether you want different word representations for the upper- and lower-cased forms of a word).
It is not required to have all the words on a single line, and we recommend keeping the original newlines from your dataset.
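The advice above could be sketched roughly as follows. This is a minimal illustration, not the FastText authors' preprocessing: the regex tokenizer and the helper names preprocess_line and preprocess_text are assumptions made for this example, and a proper tokenizer may suit scientific text better.

```python
import re

# Assumption: a simple regex tokenizer that separates punctuation from words.
# This is a stand-in for whatever tokenizer fits your corpus.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess_line(line, lowercase=False):
    """Tokenize one line and join the tokens with single spaces."""
    if lowercase:
        # Optional: only lower-case if you do not need separate
        # representations for upper- and lower-cased words.
        line = line.lower()
    return " ".join(_TOKEN_RE.findall(line))


def preprocess_text(text, lowercase=False):
    """Preprocess line by line so the original newlines are preserved."""
    return "\n".join(preprocess_line(line, lowercase) for line in text.split("\n"))
```

The output of preprocess_text can be written to a file and used as the training input; whether to pass lowercase=True depends on the application, as noted above.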
Best,
Edouard.
Hello @EdouardGrave,
In my dataset I have both \n and \n\n. I removed only the \n\n before training. During the testing phase I ran print_results(*model.test(FILE_TEST)) and everything worked fine, but when I call predicted = model.predict(text) on the complete corpus I get this error:
File "/usr/local/lib/python3.6/dist-packages/fasttext-0.8.22-py3.6-linux-x86_64.egg/fastText/FastText.py", line 126, in check
"predict processes one line at a time (remove \'\\n\')"
ValueError: predict processes one line at a time (remove '\n')
Does model.test do some kind of preprocessing that removes the \n? What are the differences in preprocessing between model.test and model.predict? Will I get different results depending on this preprocessing?
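Since the error message says that predict processes one line at a time, one possible workaround is to split the text on newlines and call predict once per line. The sketch below assumes model is an already loaded FastText model whose predict method accepts a single line without '\n'; predict_per_line is a hypothetical helper name, and skipping blank lines is a choice made for this example, not a statement about what model.test does internally.

```python
def predict_per_line(model, text):
    """Call model.predict once per non-empty line of text.

    Assumption: model.predict accepts a single string with no newline,
    as the ValueError above requires.
    """
    results = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            # Blank lines (for example those left by "\n\n") are skipped here.
            continue
        results.append(model.predict(line))
    return results
```

With this helper, predicted = predict_per_line(model, text) returns one prediction per non-empty line, in the order the lines appear.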