As tangentially noted by @kootenpv in #129, there is no delimiter, no clear way to know which part of the output line is the sentence, and which part is the vector.
For example:
This is not a test. -0.039222 -0.002648 -0.028442 0.039124 -0.0020073 -0.0052479 -0.020197 -0.028812 0.035525 -0.00065622 0.057748 0.026362 0.038559 0.10918 0.034084 -0.086161 0.01623 -0.064122
Imagine if the sentence is 'This is not a test. -0.039222'. The correct parse can be inferred by looking across all the lines, but that is quite a bit of code and may error out or make a false assumption.
So for now, code that uses this output should know what the sentence was, or know the number of dimensions.
I don't have a solution for this; I just wanted to note how insane it is.
I have a (really slow) workaround: reverse each line, split on spaces and keep the leading fields that hold the vector (the count depends on the -dim you set), then reverse again and write the result to a file.
./fasttext print-sentence-vectors model.bin < sentences.txt | rev | cut -d ' ' -f 1-301 | rev > docvecs
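The same idea can be sketched in Python without reversing anything: if the number of dimensions is known, the last `dim` whitespace-separated tokens of each line are the vector and everything before them is the sentence. This is only a sketch; the function name `split_sentence_and_vector` is made up for illustration, and it assumes every output line ends with exactly `dim` numeric values.

```python
def split_sentence_and_vector(line, dim):
    """Split one output line into (sentence, vector), given a known dimension.

    Assumption: the last `dim` whitespace-separated tokens are the vector.
    Runs of whitespace inside the sentence are collapsed to single spaces.
    """
    if dim <= 0:
        raise ValueError("dim must be a positive integer")
    tokens = line.split()
    if len(tokens) < dim:
        raise ValueError("line has fewer than dim tokens")
    sentence = " ".join(tokens[:-dim])
    vector = [float(token) for token in tokens[-dim:]]
    return sentence, vector


# A sentence that itself ends in a number is handled correctly,
# because the tokens are counted from the right.
sentence, vector = split_sentence_and_vector(
    "This is not a test. -0.039222 0.5 0.25", dim=2
)
print(repr(sentence), vector)
```

Counting from the right is what removes the ambiguity: a number at the end of the sentence stays in the sentence as long as `dim` is correct.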
Hello @bittlingmayer,
Thank you for your post. This has been fixed within recent commits. You will now only see the vector, but not the sentence itself. For example:
$ ./fasttext print-sentence-vectors model.bin
one two
2.6772 -3.0886
I'm going to close this issue now, but please feel encouraged to reopen it at any time if you don't consider this issue to be resolved.
Thanks,
Christian
Hello @cpuhrsch
Which release are you referring to? I just downloaded version v0.1.0 (2nd Dec) and the sentence is still there!
Thanks,
Mohammad
It looks like the fix works only for supervised models. If I train a skipgram model and run print-sentence-vectors on it, the input sentences (from stdin) still appear in the output.
IMHO this is somewhat resolved by the fact that there are now official Python bindings, so bash scripts are no longer necessary.
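For anyone going that route, here is a minimal sketch using the `fasttext` Python module's `load_model` and `get_sentence_vector`. The file names `model.bin` and `sentences.txt` and the helper names `sentence_vectors` and `main` are placeholders of mine, not anything from the project. Because the sentence and the vector come back as separate values, the delimiter problem does not arise, and you can pick your own separator (a tab here) if you still want a text file.

```python
def sentence_vectors(model, sentences):
    """Pair each non-empty sentence with its vector.

    `model` is assumed to expose get_sentence_vector(text), as the
    object returned by fasttext.load_model does.
    """
    pairs = []
    for sentence in sentences:
        # Strip the trailing newline; the binding expects one line at a time.
        text = sentence.strip()
        if text:
            pairs.append((text, model.get_sentence_vector(text)))
    return pairs


def main(model_path="model.bin", sentences_path="sentences.txt"):
    # Imported here so the helper above can be used without fasttext installed.
    import fasttext

    model = fasttext.load_model(model_path)
    with open(sentences_path, encoding="utf-8") as handle:
        for text, vector in sentence_vectors(model, handle):
            # A tab separates the sentence from the vector unambiguously,
            # as long as the sentences themselves contain no tabs.
            print(text + "\t" + " ".join(str(value) for value in vector))
```

Calling `main()` with your own paths prints one tab-separated line per sentence.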