Running from CLI:
$ fasttext supervised -input my_data.txt -output model
model.bin is 4M
while from the Python bindings:
import fastText
ft = fastText.train_supervised(
    input='my_data.txt',
    thread=1,
    lr=0.5,
    epoch=1,
    wordNgrams=1
)
ft.save_model('model.bin')
model.bin is 804M.
I am using Python 2.7.
Hello @montenegrodr,
Thank you for your post. I've not been able to reproduce this issue on my end. Here is a detailed example that I used to try and reproduce this issue. I added some newlines for readability.
$ cat result_py/example.py
import fastText as ft
m = ft.train_supervised(input="data/dbpedia.train",
                        dim=10,
                        lr=0.1,
                        wordNgrams=2,
                        minCount=1,
                        bucket=10000000,
                        epoch=5,
                        thread=4)
m.save_model('result_py/dbpedia.bin')
$ python2.7 result_py/example.py
Read 32M words
Number of words: 803537
Number of labels: 14
Progress: 100.0% words/sec/thread: 2215201 lr: 0.000000 loss: 0.097203 eta: 0h0m 14m
$ stat -c%s result_py/dbpedia.bin
447481878
$ ./classification-example.sh
make: Nothing to be done for `opt'.
Read 32M words
Number of words: 803537
Number of labels: 14
Progress: 100.0% words/sec/thread: 2506776 lr: 0.000000 loss: 0.096848 eta: 0h0m 14m
N 70000
P@1 0.985
R@1 0.985
Number of examples: 70000
$ stat -c%s result/dbpedia.bin
447481878
The extension and the CLI were compiled with the same version of gcc (4.8.5), on Linux.
Please let me know if this issue remains, and please add some more information about your environment. I suggest we use the above example to try to reproduce this issue, but feel free to supply a different example that I can use to reproduce it on my end.
Thank you,
Christian
Thanks @cpuhrsch for your prompt reply. I got the same results as you with the dbpedia dataset, but something odd happens when the dataset is small. Could you try running the snippet below:
import fastText as ft
import os
with open('my_data.txt', 'w') as f:
    f.write('__label__1 foo\n__label__2 bar\n')
m = ft.train_supervised(input='my_data.txt')
m.save_model('model.bin')
print os.stat('model.bin').st_size
On my machine this outputs:
800002206
thanks.
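For context, the reported size is close to what a full table of hash buckets would occupy on its own. The sketch below is only a rough estimate: it assumes fastText's defaults of bucket=2000000 and dim=100 and 4-byte floats, none of which are stated in this thread, and the helper name is hypothetical. It ignores the dictionary, the output matrix, and file headers.

```python
# Rough estimate of the bytes taken by the hashed-bucket rows of the input matrix.
# Assumptions (not stated in the thread): bucket=2000000 and dim=100 are the
# defaults, and each weight is stored as a 4-byte float.
BYTES_PER_FLOAT = 4

def bucket_table_bytes(bucket=2000000, dim=100):
    """Bytes used by `bucket` rows of `dim` 4-byte floats (hypothetical helper)."""
    return bucket * dim * BYTES_PER_FLOAT

estimate = bucket_table_bytes()
reported = 800002206  # size printed by the snippet above

print(estimate)             # 800000000
print(reported - estimate)  # 2206
```

Under these assumptions the bucket rows alone account for all but about 2 KB of the file, which would explain why a two-line training file still produces an ~800 MB model.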
@montenegrodr I have the same issue. Could you print your wc -l < my_data.txt and size for comparison to mine, thanks!
@loretoparisi:
$ stat my_data.txt
File: ‘my_data.txt’
Size: 30 Blocks: 8 IO Block: 4096 regular file
Device: 10303h/66307d Inode: 2663873 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 1001/ robson) Gid: ( 1001/ robson)
Access: 2017-11-13 10:28:53.860070944 +0000
Modify: 2017-11-13 10:28:53.840070943 +0000
Change: 2017-11-13 10:28:53.840070943 +0000
Birth: -
$ wc -l < my_data.txt
2
EDIT: this is the my_data.txt from the second script. I no longer have the my_data.txt from the first post.
Hello @montenegrodr,
I have indeed been able to reproduce this issue on my end and am working on a fix. I plan to push this asap. Thank you for posting this and helping us make the Python bindings more stable / better.
Thanks,
Christian
Hello @montenegrodr,
I found the bug. For now you can fix it locally by setting bucket to 0 if wordNgrams is less than or equal to 1 and maxn is 0, which is always the case for the default settings. I'll push the fix soon.
Thanks,
Christian
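The local workaround described above can be sketched as a small helper. The function name is hypothetical and the fallback of 2000000 is assumed to be the library default. The helper only encodes the stated condition (wordNgrams less than or equal to 1 and maxn equal to 0) and leaves it to the caller to pass the result as the bucket argument.

```python
def workaround_bucket(word_ngrams=1, maxn=0, bucket=2000000):
    """Return the bucket value to pass explicitly until the fix is installed.

    Hypothetical helper: if neither word n-grams nor character n-grams are
    in use, return 0 so that no bucket rows are allocated; otherwise keep
    the requested bucket count unchanged.
    """
    if word_ngrams <= 1 and maxn == 0:
        return 0
    return bucket

# Example of use (needs the fastText Python bindings, so it is left as a comment):
# import fastText as ft
# m = ft.train_supervised(input='my_data.txt', wordNgrams=1,
#                         bucket=workaround_bucket(word_ngrams=1, maxn=0))
```

Once a build containing the pushed fix is installed, passing bucket explicitly like this should no longer be needed for the default settings.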
Hello @montenegrodr,
The fix has been pushed. I'm closing the issue for now, but please feel free to reopen it at any point if this issue remains. Thank you again for posting this and being an active member of our community!
Thanks,
Christian
@cpuhrsch hi, I maintain the R wrapper (https://github.com/pommedeterresautee/fastrtext/) and I have issues too with bucket settings in supervised mode with minn and maxn set (when maxn and minn are not set there is no issue).
Should I understand the fix (https://github.com/facebookresearch/fastText/commit/35679dc3d3b7cf3c641aebadafb2ab7f0fc0a2ca) as not using bucket parameter when learning in supervised mode when maxn and minn are set?
Hi @cpuhrsch
This issue is still reproducible.
from fastText import train_supervised
model = train_supervised(input='test.txt', epoch=50, lr=0.1, wordNgrams=2, verbose=2, minCount=1)
model.save_model('model.bin')
Output
Read 0M words
Number of words: 4
Number of labels: 2
Progress: 100.0% words/sec/thread: 567 lr: 0.000000 avg.loss: 0.643121 ETA: 0h 0m 0s
File size
-rw-r--r-- 1 staff 763M May 31 22:59 test.dump
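Note that this run sets wordNgrams=2, so the condition from the earlier workaround (wordNgrams less than or equal to 1) does not hold, and the bucket rows are presumably kept on purpose to store the hashed bigrams. 763M is what roughly 800,000,000 bytes looks like in MiB, which is what bucket × dim × 4 bytes gives under the assumed defaults bucket=2000000 and dim=100. If a smaller file matters more than n-gram capacity, lowering bucket is one option. The sketch below, with hypothetical helper names, picks the largest bucket that fits a size budget.

```python
BYTES_PER_FLOAT = 4  # assumption: weights are stored as 4-byte floats

def max_bucket_for_budget(budget_bytes, dim=100):
    """Largest bucket count whose rows fit in budget_bytes (bucket rows only)."""
    return budget_bytes // (dim * BYTES_PER_FLOAT)

def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / float(2 ** 20)

# Assumed defaults: bucket=2000000, dim=100.
default_bytes = 2000000 * 100 * BYTES_PER_FLOAT
print(int(round(to_mib(default_bytes))))  # 763

# Bucket count that keeps the bucket rows under about 80 MB at dim=100.
print(max_bucket_for_budget(80 * 10 ** 6))  # 200000
```

The resulting value would be passed as bucket=... to train_supervised, as in the dbpedia example earlier in the thread. A smaller bucket means more hash collisions between n-grams, so accuracy may change.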