fastText: How should the text file be structured for training?

Created on 22 Oct 2018 · 8 Comments · Source: facebookresearch/fastText

Hi.
How should the text file be structured for training?
Are new lines allowed?
Are blank lines ok?
Should there be one sentence per line?

Thanks
Philip

Usage

All 8 comments

Hi @PhilipMay,

The following is the structure of a file used for supervised training:

 __label__A This is sentence 1. 
 __label__B This is sentence 2. 

For unsupervised training, the file can have lines without labels:

This is sentence 1. 
This is sentence 2.

  • Yes, new lines are allowed.
  • Not sure about blank lines, but you can do a quick experiment and check.
  • Please see #545 for your last question.
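
As a rough sketch of the supervised layout above, the following Python builds the contents of a training file with one labeled example per line. The helper names (`to_supervised_line`, `build_training_text`) are made up for illustration and are not part of fastText. Collapsing embedded newlines and skipping empty examples are cautious assumptions on my part, not documented requirements.

```python
def to_supervised_line(label: str, text: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    # Collapse all whitespace runs (including embedded newlines) into single
    # spaces so the example cannot spill over onto a second line.
    cleaned = " ".join(text.split())
    return f"__label__{label} {cleaned}"


def build_training_text(examples) -> str:
    """Join (label, text) pairs into the contents of a training file."""
    lines = [
        to_supervised_line(label, text)
        for label, text in examples
        if text.strip()  # assumption: drop empty examples instead of writing blank lines
    ]
    return "\n".join(lines) + "\n"


examples = [
    ("A", "This is sentence 1."),
    ("B", "This is\nsentence 2."),  # embedded newline gets flattened
    ("A", "   "),                   # empty example gets skipped
]
training_text = build_training_text(examples)
print(training_text, end="")
```

Writing `training_text` to a file such as `train.txt` (UTF-8) should give something you can pass to `fasttext supervised -input train.txt -output model`.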

Best,
Anubhav

@a11apurva ok. Thanks. :-)

@a11apurva: For unsupervised training, should there be a space in front of the period?
More like this:

This is sentence 1 .
This is sentence 2 .

Hi @PhilipMay,

In general, it is a good idea to apply some sort of pre-processing / text normalization to your data before feeding it to fastText. This will make the learned classifier more robust. Examples of such pre-processing include tokenization, lowercasing, number/date normalization, etc. The kind of pre-processing you should apply is application (and language) dependent, which is why we decided not to include it in fastText. For example, casing might be important for some applications, but not for others.

Here are a few third party tokenizers: the Stanford Tokenizer, the Moses Tokenizer.
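
To make this concrete, here is a very small normalization sketch in plain Python. It is only an illustration under simple assumptions (English-like text, a handful of punctuation marks, every run of digits mapped to `0`). The function name `normalize` is hypothetical, and a real tokenizer such as the ones mentioned above handles many more cases.

```python
import re


def normalize(text: str) -> str:
    """Crude pre-processing: lowercase, normalize digits, split off punctuation."""
    # Lowercasing; skip this step if casing matters for your application.
    text = text.lower()
    # Replace every run of digits with a single placeholder token.
    text = re.sub(r"\d+", "0", text)
    # Surround common punctuation marks with spaces so they become separate tokens.
    text = re.sub(r'([.!?,;:()"])', r" \1 ", text)
    # Collapse the extra whitespace introduced above.
    return " ".join(text.split())


print(normalize("This is sentence 1."))
print(normalize("Hello, World! It costs 25 dollars."))
```

With this kind of tokenization the period ends up as its own token, separated from the preceding word by a space.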

Best,
Edouard.

@EdouardGrave I think we can close this one as we have the answer provided by @a11apurva

Ok. Thanks.

When training word embeddings, do new lines change anything, i.e. can a context cross a new line?

No, a context does not cross new lines (#518).
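
To illustrate what that means for your input file, here is a toy sketch that enumerates (center, context) word pairs separately for each line. This is not fastText's actual code: the function name `context_pairs` is made up, the window size is fixed for simplicity, and tokens are split on whitespace only.

```python
def context_pairs(text: str, window: int = 2):
    """Return (center, context) word pairs, never pairing words from different lines."""
    pairs = []
    for line in text.splitlines():
        tokens = line.split()
        for i, center in enumerate(tokens):
            # The window is clipped at the start and end of the current line.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs("a b\nc d", window=2)
print(pairs)
```

Here `b` and `c` sit next to each other in the file but never form a pair, because they are on different lines. If two sentences should share context, they need to be on the same line.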
