Hi.
How should the text file be structured for training?
Are newlines allowed?
Are blank lines ok?
Should there be one sentence per line?
Thanks
Philip
Hi @PhilipMay ,
The following is the structure of a file used for supervised training:
__label__A This is sentence 1.
__label__B This is sentence 2.
For unsupervised training, the file can have lines without labels:
This is sentence 1.
This is sentence 2.
Best,
Anubhav
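For anyone generating these files from code, here is a minimal Python sketch of the two layouts described above. The helper names (format_supervised, format_unsupervised) are hypothetical and not part of fastText. Collapsing internal whitespace and skipping empty entries are my own assumptions, meant to keep every example on exactly one line.

```python
def format_supervised(examples):
    """Return file content with one '__label__<label> <text>' example per line.

    `examples` is an iterable of (label, text) pairs. Hypothetical helper,
    not part of fastText.
    """
    lines = []
    for label, text in examples:
        # Assumption: collapse newlines and repeated spaces so that each
        # example occupies exactly one line.
        clean = " ".join(text.split())
        if clean:  # Assumption: skip examples with no text.
            lines.append(f"__label__{label} {clean}")
    return "\n".join(lines) + "\n"


def format_unsupervised(sentences):
    """Return file content with one unlabeled sentence per line."""
    lines = [" ".join(s.split()) for s in sentences]
    # Assumption: drop empty entries so the file contains no blank lines.
    return "\n".join(line for line in lines if line) + "\n"


if __name__ == "__main__":
    print(format_supervised([("A", "This is sentence 1."),
                             ("B", "This is sentence 2.")]), end="")
    print(format_unsupervised(["This is sentence 1.",
                               "This is sentence 2."]), end="")
```

The returned strings can be written to a text file (UTF-8) and that file passed to fastText as the training input.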
@a11apurva ok. Thanks. :-)
@a11apurva: For unsupervised training, should there be a space in front of the "."?
More like this:
This is sentence 1 .
This is sentence 2 .
Hi @PhilipMay,
In general, it is a good idea to apply some sort of pre-processing / text normalization to your data before feeding it to fastText. This will make the learned classifier more robust. Examples of such pre-processing include tokenization, lowercasing, number/date normalization, etc. The kind of pre-processing you should apply is application (and language) dependent, which is why we decided not to include it in fastText. For example, casing might be important for some applications, but not for others.
Here are a few third-party tokenizers: the Stanford Tokenizer, the Moses Tokenizer.
Best,
Edouard.
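To make the pre-processing steps mentioned above concrete, here is a rough Python sketch using only the standard library. It is not what fastText or the tokenizers named above do internally. The function name normalize, the regex-based tokenization, and the "<num>" placeholder are all assumptions chosen for illustration. A dedicated tokenizer will handle many more cases (abbreviations, URLs, language-specific rules).

```python
import re

# Assumption: a token is either a run of word characters or a single
# punctuation character. Real tokenizers are considerably more careful.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")
_DIGITS_RE = re.compile(r"\d+")


def normalize(text, lowercase=True, number_token="<num>"):
    """Tokenize, optionally lowercase, and replace digit-only tokens.

    Hypothetical helper for illustration. Apply it to the text part of a
    line only, not to a '__label__...' prefix, since lowercasing would
    change the label.
    """
    if lowercase:
        text = text.lower()
    tokens = _TOKEN_RE.findall(text)
    # Replace tokens made up entirely of digits with a placeholder.
    tokens = [number_token if _DIGITS_RE.fullmatch(t) else t for t in tokens]
    # Joining with single spaces separates punctuation from words.
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("This is sentence 1."))
```

Whether to lowercase or to replace numbers depends on the application, so both are exposed as parameters here rather than hard-coded.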
@EdouardGrave I think we can close this one as we have the answer provided by @a11apurva
Ok. Thanks.
When training word embeddings, do newlines change anything, i.e. can a context cross a newline?
No, a context does not cross newlines (#518).
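To illustrate what this means in practice, here is a simplified Python sketch that builds (center, context) word pairs one line at a time. This is not fastText's actual code. The function name context_pairs and the fixed window size are assumptions, and other details of real training are omitted. It only shows that words on different lines never end up in the same context.

```python
def context_pairs(text, window=2):
    """Return (center, context) pairs, computed independently for each line.

    Hypothetical illustration only: each line is treated as its own
    sequence, so a window never reaches into the previous or next line.
    """
    pairs = []
    for line in text.splitlines():
        tokens = line.split()
        for i, center in enumerate(tokens):
            # Clamp the window to the boundaries of the current line.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    for pair in context_pairs("a b\nc d"):
        print(pair)
```

With the input "a b" and "c d" on separate lines, "b" and "c" are never paired even though they are adjacent in the file. So how you split your text into lines determines which words can share a context.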