fastText: How should the text file be structured for training?

Created on 22 Oct 2018 · 8 Comments · Source: facebookresearch/fastText

Hi.
How should the text file be structured for training?
Are new lines allowed?
Are blank lines ok?
Should there be one sentence per line?

Thanks
Philip

Usage

PhilipMay

Most helpful comment

Hi @PhilipMay,

In general, it is a good idea to apply some sort of pre-processing / text normalization on your data before feeding it to fastText. This will make the learned classifier more robust. Examples of such pre-processing include tokenization, lowercasing, numbers/date normalization, etc. The kind of pre-processing you should apply is application (and language) dependent, which is why we decided not to include it in fastText. For example, casing might be important for some applications, but not for others.

Here are a few third party tokenizers: the Stanford Tokenizer, the Moses Tokenizer.

Best,
Edouard.

EdouardGrave on 15 Jan 2019

👍5
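
As a rough illustration of the kind of pre-processing mentioned in this comment, here is a minimal sketch in plain Python. It lowercases, replaces numbers with a placeholder, and separates punctuation from words. The function name `normalize`, the `<num>` placeholder, and the punctuation set are assumptions made for this example; none of them come from fastText, and a real tokenizer such as the ones named above will handle many more cases.

```python
import re

# Hypothetical placeholder for numbers; not a fastText convention.
NUMBER_TOKEN = "<num>"


def normalize(text, lowercase=True):
    """Apply a few simple, application-dependent normalization steps."""
    if lowercase:
        text = text.lower()
    # Replace digit runs (optionally with . or , separators) by the placeholder.
    text = re.sub(r"\d+(?:[.,]\d+)*", NUMBER_TOKEN, text)
    # Surround common punctuation with spaces so "sentence." becomes "sentence ."
    text = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
    # Collapse repeated whitespace into single spaces.
    return " ".join(text.split())
```

Whether to lowercase or to mask numbers depends on the application, which is why `lowercase` is left as a switch here.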

All 8 comments

Hi @PhilipMay,

The following is the structure of a file used for supervised training:

 __label__A This is sentence 1. 
 __label__B This is sentence 2.

For unsupervised training the file can have lines without labels.

This is sentence 1. 
This is sentence 2.

Yes, new lines are allowed.
Not sure about blank lines, but you can do a quick experiment and check.
Please see #545 for your last question.

Best,
Anubhav

a11apurva on 22 Nov 2018

👍3
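
The file layout described in this comment can be sketched in Python as follows. This is only a sketch under a few assumptions: the helper names are hypothetical, each example carries a single label, and any line breaks inside an example are collapsed so that every example ends up on exactly one line.

```python
# Label prefix as shown in the examples in this thread.
LABEL_PREFIX = "__label__"


def format_supervised_line(label, text):
    """Return one supervised training line: the label first, then the text."""
    # Collapse internal whitespace, including newlines, so the example stays on one line.
    flat_text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {flat_text}"


def format_unsupervised_line(text):
    """Return one unsupervised training line: just the text, no label."""
    return " ".join(text.split())


def write_training_file(path, lines):
    """Write one training example per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

For example, `format_supervised_line("A", "This is sentence 1.")` produces the first supervised line shown above.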

@a11apurva ok. Thanks. :-)

PhilipMay on 22 Nov 2018

@a11apurva: For unsupervised training, should there be a space in front of the period?
More like this:

This is sentence 1 .
This is sentence 2 .

PhilipMay on 8 Jan 2019

@EdouardGrave I think we can close this one, as we have the answer provided by @a11apurva.

fclesio on 27 Aug 2019

Ok. Thanks.

PhilipMay on 30 Aug 2019

When training word embeddings, do new lines change anything, i.e. does a context cross new lines?

djstrong on 11 Oct 2019

A context does not cross new lines (#518).

djstrong on 11 Oct 2019
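
To make the point about contexts concrete, here is a small Python sketch that builds (target, context) word pairs one line at a time. It is an illustration of the behavior described in this comment, not fastText's actual training code; the function name `context_pairs` and the fixed `window` parameter are assumptions for this example.

```python
def context_pairs(lines, window=2):
    """Yield (target, context) word pairs, handling each line on its own."""
    for line in lines:
        tokens = line.split()
        for i, target in enumerate(tokens):
            # The window is clipped at the start and end of the current line,
            # so it never reaches into the previous or the next line.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    yield target, tokens[j]
```

With this scheme, the last word of one line and the first word of the next never form a pair, however large the window is. Putting one sentence or one document per line therefore determines which words can share a context.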
