Fasttext: About the input format of `fasttext`

Created on 24 Feb 2018 · 3 Comments · Source: facebookresearch/fastText

wikifil.pl strips off consecutive spaces, but that makes the resulting file hard to read.

If the input is one sentence per line, how does it affect the result of fasttext?

Does it require the removal of all punctuation?

Does fasttext accept non-English letters (such as Greek letters) embedded in English?

Sometimes whether a letter is upper-case or lower-case carries information about the word. Does the input have to be all lowercase?

Thanks.

All 3 comments

Hi @pengyu,

FastText does not require the removal of all punctuation. FastText accepts any text input in UTF-8 format. You are correct that for some applications, keeping the casing is important (the input does not have to be lowercase).

Best,
Edouard.
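To make the answer above concrete, here is a minimal sketch in plain Python (it does not call fastText). It assumes tokens are separated by whitespace, so punctuation attached to a word stays part of that token and case is preserved. The helper `lowercase_and_split_punct` is a hypothetical, optional preprocessing step shown only for comparison; nothing in this thread says fastText needs it.

```python
import string


def whitespace_tokens(line):
    # Assumption: tokens are whatever sits between whitespace characters.
    return line.split()


def lowercase_and_split_punct(line):
    # Hypothetical optional normalization: lowercase the text and put
    # spaces around ASCII punctuation so each mark becomes its own token.
    out = []
    for ch in line.lower():
        out.append(f" {ch} " if ch in string.punctuation else ch)
    return " ".join("".join(out).split())


# Mixed case, punctuation, and a Greek letter embedded in English text.
raw = "The α-helix binds ATP, not GTP."

# The line survives a UTF-8 round trip unchanged.
assert raw.encode("utf-8").decode("utf-8") == raw

raw_tokens = whitespace_tokens(raw)
norm_tokens = whitespace_tokens(lowercase_and_split_punct(raw))

print(raw_tokens)
print(norm_tokens)
```

With the raw line, `ATP,` and `GTP.` are distinct tokens from `ATP` and `GTP`. Whether to normalize is an application choice: keeping the text as-is preserves case information, while normalizing reduces the number of distinct tokens.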

@EdouardGrave Could you also give us feedback on this question:

If the input is one sentence per line, how does it affect the result of fasttext?

I guess I want to know if word samples are selected across linebreaks. Thanks!

Hi @cpury,

Word samples are not selected across newlines. Thus, words from the previous and next lines do not influence the learning on the current line (please see #518 for a longer answer to this question).

Best,
Edouard
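The effect of line breaks described above can be sketched in plain Python. This is a simplified illustration, not fastText's actual training code: it assumes a fixed symmetric window and whitespace tokenization, and it ignores details such as subsampling. The function name `context_pairs` and the `window` parameter are hypothetical.

```python
def context_pairs(lines, window=2):
    # Build (target, context) pairs, treating each line independently so
    # that a window never reaches into the previous or next line.
    pairs = []
    for line in lines:
        tokens = line.split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, tokens[j]))
    return pairs


one_sentence_per_line = ["the cat sat", "dogs bark loudly"]
single_line = [" ".join(one_sentence_per_line)]

split_pairs = context_pairs(one_sentence_per_line)
joined_pairs = context_pairs(single_line)

# "sat" and "dogs" are adjacent only when both sentences share one line.
print(("sat", "dogs") in split_pairs)
print(("sat", "dogs") in joined_pairs)
```

Under these assumptions, putting one sentence per line means the last words of one sentence are never used as context for the first words of the next.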
