I think it would be good to have an option that treats newlines as boundaries, to prevent training across lines.
I trained word vectors on a dataset with one entry per line, so any across-line correlations would be noise. The loss was low, but it could be lower if the model weren't trying to predict across lines.
(I assume that newlines don't affect training... although I'd be happy to be told otherwise.)
Hi @tom-adsfund,
fastText does use newlines to separate examples (for both supervised and unsupervised modes). Thus, when learning word vectors using skipgram or cbow, the words from the previous and next lines do not influence the learning of the current line. See line 347 of dictionary.cc for the corresponding code (newlines are replaced by EOS in the readWord method).
Best,
Edouard
@EdouardGrave OK, that's excellent. Thanks for the reply.
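For anyone landing here later, the behaviour described above can be illustrated with a small sketch. This is not fastText's code: `skipgram_pairs` is a hypothetical helper written for this thread, it uses a fixed window, and it ignores subsampling, subword n-grams, and the EOS token itself. It only shows the idea that context windows stop at line boundaries.

```python
def skipgram_pairs(text, window=2):
    """Return (center, context) pairs; hypothetical helper, not part of fastText.

    Each line is treated as an independent example, so a context window
    never reaches into the previous or next line.
    """
    pairs = []
    for line in text.split("\n"):
        tokens = line.split()
        for i, center in enumerate(tokens):
            # Clamp the window to the current line only.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


corpus = "the cat sat\nstock prices fell"
pairs = skipgram_pairs(corpus)

# "sat" ends line 1 and "stock" starts line 2, but they are never paired.
print(("sat", "stock") in pairs)
print(("cat", "sat") in pairs)
```

With one entry per line, as in the original question, no pair ever mixes words from two different entries.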