Hello,
Today I found that version 0.1.0 (the version downloaded in December) does not give the same prediction probabilities depending on whether a line ends with a newline (\n). In fact, the probabilities are affected by the \n itself. Moreover, this problem affects the label predicted for words that were not seen during training.
To understand better, here is an example using creat89 and edge hill as input. It should be noted that the word creat89 doesn't exist in the trained model, thus its word vector is all zeros. First, using the STDIN method:
creat89 -> __label__A 1 __label__B 1.95313e-08
edge hill -> __label__A 0.585938 __label__B 0.412109
If we use a file where the input is (note there is no newline at the end of the file):
creat89
edge hill
We get the following output from fastText:
__label__A 1 __label__B 1.95313e-08
__label__A 1.95313e-08 __label__B 0.998047
As we can see, the probabilities for edge hill changed because there is no newline at the end. So what happens when we have the following file (again, no newline at the end of the file):
creat89
edge hill
creat89
We get the following output from fastText:
__label__A 1 __label__B 1.95313e-08
__label__A 0.585938 __label__B 0.412109
[JUST A NEW LINE PRINTED]
We can observe that edge hill again gets the same probabilities as with STDIN. However, while the first creat89 gets the same probabilities as with STDIN, for the second one fastText did not print anything, and the reason is the missing \n. In fact, if the file or the STDIN input is only a newline character (\n), we get the following prediction:
__label__A 1 __label__B 1.95313e-08
There is clearly a problem here: \n is being treated as a character that belongs to the string, while it is not part of it.
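One way to picture the behaviour reported above: if the tokenizer turns a newline into a token of its own (the fastText documentation describes an end-of-sentence token, </s>, being appended when a newline is read), then the same text with and without a trailing \n yields different token lists, and an input made only of unseen words is classified from that token alone. The sketch below is a toy tokenizer written purely for illustration; toyTokenize and its logic are assumptions, not fastText's implementation.

```javascript
// Toy whitespace tokenizer: NOT fastText's code, only an illustration.
// Assumption: a newline is emitted as its own end-of-sentence token.
const EOS = '</s>';

function toyTokenize(text) {
  const tokens = [];
  let word = '';
  for (const ch of text) {
    if (ch === '\n') {
      if (word) tokens.push(word);
      tokens.push(EOS); // the newline becomes a token of the example
      word = '';
    } else if (ch === ' ' || ch === '\t') {
      if (word) tokens.push(word);
      word = '';
    } else {
      word += ch;
    }
  }
  if (word) tokens.push(word); // input ended without a newline
  return tokens;
}

console.log(toyTokenize('edge hill\n')); // [ 'edge', 'hill', '</s>' ]
console.log(toyTokenize('edge hill'));   // [ 'edge', 'hill' ]
console.log(toyTokenize('\n'));          // [ '</s>' ]
```

Under this assumption, a line containing only creat89 followed by \n and a line containing only \n carry the same known tokens, which would be consistent with both getting identical probabilities in the outputs above.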
I've just confirmed this myself. My predictions change depending on the presence of a trailing newline character.
I've also noticed that manually stripping non-alphanumeric characters from the input and collapsing repeated whitespace also changes the classification result. Things are much more accurate now, and the above issue is also gone after doing this.
Yes, I understand that. In fact, we need to apply some kind of pre-processing to the documents or text before using them in fastText (either for training or prediction). However, the system should not treat the newline as part of the document if it is also used to trigger the prediction. Moreover, we can see that the prediction can be triggered by either the newline or the EOF; nonetheless, the EOF is not used as part of the text and does not affect the prediction.
I'm also worried about the prediction for unseen words, which are in fact being predicted from the newline character. This could change if word n-grams were enabled for the classifier methods too.
Hi @creat89,
Thank you for reporting this issue.
As of now, fastText assumes that each example (either at train or test time) is ended by the end of line character \n, and we thus advise to make sure that it is the case in your data. We will make fastText more robust to this in a future release.
Best,
Edouard
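Following the advice above, one option is to normalize the input before handing it to fastText so that every example ends with exactly one \n. This is a minimal sketch; normalizeLineEndings is a hypothetical helper name and is not part of fastText.

```javascript
// Hypothetical helper (not part of fastText): make every example
// end with exactly one "\n", whatever line endings the input used.
function normalizeLineEndings(text) {
  const lines = text
    .replace(/\r\n?/g, '\n') // CRLF or lone CR -> LF
    .split('\n');
  // A trailing newline leaves an empty last piece; drop it so we
  // do not emit an extra empty example.
  if (lines.length > 0 && lines[lines.length - 1] === '') {
    lines.pop();
  }
  return lines.map((line) => line + '\n').join('');
}

// The file from the report, with no newline at the end:
console.log(JSON.stringify(normalizeLineEndings('creat89\nedge hill')));
// "creat89\nedge hill\n"
```

Input that already ends with a newline is returned unchanged, so the helper can be applied to training and test data alike.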
@EdouardGrave let's suppose we would like to model the \n EOL character, as well as other special control characters or non-alphanumeric symbols. Recent results suggest that modeling characters such as the newline or the whitespace between tokens improves the accuracy of the tokenization phase (see https://github.com/google/sentencepiece#whitespace-is-treated-as-a-basic-symbol).
What I have done in my text pre-processing pipeline is to apply this normalization function:
// StringUtil (trim, removeDiacritics, isDoubleByte) is my own helper module, not shown here.
var removeAllPunctations = function (text, newline = " ") {
  text = StringUtil.trim(text);
  text = StringUtil.removeDiacritics(text);
  text = text.toLowerCase()
    .replace(/(?:\\[rn]|[\r\n]+)+/g, newline) // literal "\r" / "\n" sequences and real CR/LF
    .replace(/\t+/g, '\t').replace(/\t\s/g, ' ').replace(/\t/g, ' ') // tabs
    .replace(/\\+/g, ' ') // backslash
    .replace(/”/g, ' ') // right double quote
    .replace(/“/g, ' ') // left double quote
    .replace(/'/g, ' ') // single quote
    .replace(/"/g, ' ') // double quote
    .replace(/‘/g, ' ') // left single quote
    .replace(/\./g, ' ') // dot (this also covers three-dot ellipses)
    .replace(/,/g, ' ') // comma
    .replace(/\(/g, ' ') // left parenthesis
    .replace(/\)/g, ' ') // right parenthesis
    .replace(/\[/g, ' ') // left square bracket
    .replace(/\]/g, ' ') // right square bracket
    .replace(/</g, ' ') // less-than sign
    .replace(/>/g, ' ') // greater-than sign
    .replace(/«/g, ' ') // LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    .replace(/»/g, ' ') // RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    .replace(/!/g, ' ') // exclamation mark
    .replace(/\?/g, ' ') // question mark
    .replace(/;/g, ' ') // semicolon
    .replace(/:/g, ' ') // colon
    .replace(/…/g, ' '); // ellipsis character
  var isUnicode = StringUtil.isDoubleByte(text);
  if (!isUnicode) {
    // Replace every remaining run of non-alphanumeric ASCII with a space.
    // Note: this also strips a non-alphanumeric newline placeholder such as "|".
    text = text.replace(/[^a-zA-Z0-9]+/g, ' ');
  }
  return text;
}; // removeAllPunctations
It's quite similar to the normalization function that you use in your examples, except that it handles non-Latin-1 characters with a double-byte guard, removes diacritics, and optionally replaces the newline with a specific placeholder character (like |) or a character sequence chosen to avoid collisions (say, __CLRF__). My question is: what happens in this case, considering a skipgram model with the minimum subword length set to 2 or 3?
Will the newline placeholder symbol (a character or a sequence of characters) affect the embedding? If so, how should one treat text where the newline delimiter is meaningful (it delimits one or more paragraphs of a semantically coherent text document, hence the model must see this training example as a whole document)? Thanks in advance for your help!
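For the multi-paragraph case, one possible approach, sketched here under assumptions, is to flatten each document onto a single line and mark paragraph breaks with a whitespace-delimited placeholder, so that fastText sees one example per line. flattenDocument is a hypothetical helper, and the default placeholder is the __CLRF__ sequence mentioned above. Because the placeholder is then an ordinary token, it would presumably be embedded like any other word, so whether it helps or hurts would need to be checked empirically.

```javascript
// Hypothetical helper: put a whole multi-paragraph document on one line.
// Assumption: the placeholder never occurs in the real text.
function flattenDocument(doc, placeholder = '__CLRF__') {
  return doc
    .trim()
    // each run of line breaks becomes one standalone placeholder token
    .replace(/(?:\r\n|\r|\n)+/g, ` ${placeholder} `)
    // collapse the runs of spaces/tabs this may have created
    .replace(/[ \t]+/g, ' ')
    // terminate the example with a single newline
    + '\n';
}

console.log(JSON.stringify(flattenDocument('first paragraph\n\nsecond paragraph\n')));
// "first paragraph __CLRF__ second paragraph\n"
```

The spaces around the placeholder keep it from fusing with neighbouring words, which matters if it is meant to be learned as a token in its own right.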