wikifil.pl strips consecutive spaces, but that makes the resulting file hard to read.
If the input is one sentence per line, how does it affect the result of fasttext?
Does it require removing all punctuation?
Does fasttext accept non-English letters (such as Greek letters) embedded in English?
Sometimes whether a letter is upper-case or lower-case carries information about the word. Does the input have to be all lowercase?
Thanks.
Hi @pengyu,
FastText does not require the removal of all punctuation. FastText accepts any text input encoded in UTF-8. You are correct that for some applications keeping the casing is important (the input does not have to be lowercase).
Best,
Edouard.
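For readers who want a readable input file, here is a minimal preprocessing sketch consistent with the answer above. It is an assumption about one reasonable cleanup, not what wikifil.pl does and not part of fastText: the names `normalize_line` and `normalize_text` are hypothetical. It collapses runs of spaces and tabs inside each line, drops empty lines, and leaves casing, punctuation, and non-ASCII letters untouched.

```python
import re

# Matches runs of spaces or tabs, but never newlines.
_SPACES = re.compile(r"[ \t]+")


def normalize_line(line: str) -> str:
    """Collapse runs of spaces/tabs; keep case, punctuation and non-ASCII letters."""
    return _SPACES.sub(" ", line).strip()


def normalize_text(text: str) -> str:
    """Normalize each line on its own so one sentence stays on one line."""
    lines = (normalize_line(raw) for raw in text.splitlines())
    # Drop lines that are empty after normalization.
    return "\n".join(line for line in lines if line)
```

When writing the result to disk, open the file with `encoding="utf-8"`, since UTF-8 is the encoding the answer above says FastText accepts.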
@EdouardGrave Could you also answer this question:
If the input is one sentence per line, how does it affect the result of fasttext?
I guess I want to know if word samples are selected across linebreaks. Thanks!
Hi @cpury,
Word samples are not selected across newlines. Thus, words from the previous and next lines do not influence the learning on the current line (please see #518 for a longer answer to this question).
Best,
Edouard
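To illustrate what "not selected across newlines" means, here is a simplified sketch of context-window sampling done line by line. It is not fastText's actual code: the function name `context_pairs` and the fixed `window` parameter are assumptions for illustration, and real training involves further details that are omitted here.

```python
def context_pairs(lines, window=2):
    """Return (center, context) word pairs, with windows confined to each line."""
    pairs = []
    for line in lines:
        tokens = line.split()
        for i, center in enumerate(tokens):
            # The window is clipped to the current line's boundaries,
            # so it can never reach into the previous or next line.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs
```

With the input `["a b", "c d"]`, the pair `("b", "c")` never appears even though `b` and `c` are adjacent in the file, because they sit on different lines. This is why one-sentence-per-line input keeps words from neighbouring sentences out of each other's context.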