Fasttext: save_model() generates huge file

Created on 9 Nov 2017 · 9 Comments · Source: facebookresearch/fastText

Running from CLI:

$ fasttext supervised -input my_data.txt -output model

model.bin is 4M

while:

ft = fastText.train_supervised(
    input='my_data.txt',
    thread=1,
    lr=0.5,
    epoch=1,
    wordNgrams=1
)

ft.save_model('model.bin')

model.bin is 804M.

I am using Python 2.7.

All 9 comments

Hello @montenegrodr,

Thank you for your post. I've not been able to reproduce this issue on my end. Here is a detailed example that I used to try and reproduce this issue. I added some newlines for readability.

$ cat result_py/example.py
import fastText as ft

m = ft.train_supervised(input="data/dbpedia.train",
                        dim=10,
                        lr=0.1,
                        wordNgrams=2,
                        minCount=1,
                        bucket=10000000,
                        epoch=5,
                        thread=4)
m.save_model('result_py/dbpedia.bin')

$ python2.7 result_py/example.py
Read 32M words
Number of words:  803537
Number of labels: 14
Progress: 100.0%  words/sec/thread: 2215201  lr: 0.000000  loss: 0.097203  eta: 0h0m 14m

$ stat -c%s result_py/dbpedia.bin
447481878

$ ./classification-example.sh
make: Nothing to be done for `opt'.
Read 32M words
Number of words:  803537
Number of labels: 14
Progress: 100.0%  words/sec/thread: 2506776  lr: 0.000000  loss: 0.096848  eta: 0h0m 14m
N       70000
P@1     0.985
R@1     0.985
Number of examples: 70000

$ stat -c%s result/dbpedia.bin
447481878

The extension and the CLI were compiled with the same version of gcc (4.8.5), on Linux.

Please let me know if this issue remains and please add some additional information about your environment. I suggest we use above example to try and reproduce this issue, but please feel free to supply a different example that I can use to reproduce this issue on my end.

Thank you,
Christian

Thanks @cpuhrsch for your prompt reply. I got the same results as you with the dbpedia dataset, but something odd happens when the dataset is small. Could you try running the snippet below:

import fastText as ft
import os

with open('my_data.txt', 'w') as f:
    f.write('__label__1 foo\n__label__2 bar\n')

m = ft.train_supervised(input='my_data.txt')
m.save_model('model.bin')
print(os.stat('model.bin').st_size)  # size of the saved model in bytes

On my machine this prints:

800002206

thanks.
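
The reported size is close to what an input matrix with one row per hash bucket would occupy. A rough check, assuming fastText's defaults of dim=100 and bucket=2000000 and 4-byte floats (these values are assumptions; the thread does not state them):

```python
# Rough size check. ASSUMPTIONS (not stated in the thread):
# default dim=100, default bucket=2000000, weights stored as 4-byte floats.
ASSUMED_DIM = 100
ASSUMED_BUCKET = 2000000
BYTES_PER_FLOAT = 4

# Size printed by the snippet above for the two-line dataset.
REPORTED_BYTES = 800002206


def bucket_rows_bytes(bucket, dim, bytes_per_float=BYTES_PER_FLOAT):
    """Bytes needed for `bucket` rows of `dim` floats each."""
    return bucket * dim * bytes_per_float


bucket_bytes = bucket_rows_bytes(ASSUMED_BUCKET, ASSUMED_DIM)
remainder = REPORTED_BYTES - bucket_bytes

print("bucket rows: {} bytes".format(bucket_bytes))
print("everything else: {} bytes".format(remainder))
```

Under these assumptions almost the whole file would be bucket rows, which is consistent with the size not depending on how small the dataset is.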

@montenegrodr I have the same issue. Could you post the output of wc -l < my_data.txt and the file size, so I can compare them with mine? Thanks!

@loretoparisi:

$ stat my_data.txt 
  File: ‘my_data.txt’
  Size: 30              Blocks: 8          IO Block: 4096   regular file
Device: 10303h/66307d   Inode: 2663873     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1001/  robson)   Gid: ( 1001/  robson)
Access: 2017-11-13 10:28:53.860070944 +0000
Modify: 2017-11-13 10:28:53.840070943 +0000
Change: 2017-11-13 10:28:53.840070943 +0000
 Birth: -
$ wc -l < my_data.txt
2

EDIT: this is the my_data.txt from the second script. I no longer have the my_data.txt from the first post.

Hello @montenegrodr,

I have indeed been able to reproduce this issue on my end and am working on a fix. I plan to push this asap. Thank you for posting this and helping us make the Python bindings more stable / better.

Thanks,
Christian

Hello @montenegrodr,

I found the bug. For now you can fix it locally by setting bucket to 0 if wordNgrams is less than or equal to 1 and maxn is 0, which is always the case for the default settings. I'll push the fix soon.

Thanks,
Christian
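
A minimal sketch of the local workaround described above, for anyone on a build without the fix. The helper names workaround_kwargs and train_with_workaround are hypothetical, and the default values in the signature are assumptions about fastText's defaults, not something stated in this thread:

```python
def workaround_kwargs(wordNgrams=1, maxn=0, bucket=2000000, **other):
    """Build keyword arguments for train_supervised with the workaround applied.

    Hypothetical helper. The defaults above are assumed, not taken from the thread.
    bucket is forced to 0 when wordNgrams <= 1 and maxn == 0, as suggested above.
    """
    if wordNgrams <= 1 and maxn == 0:
        bucket = 0
    kwargs = dict(other)
    kwargs.update(wordNgrams=wordNgrams, maxn=maxn, bucket=bucket)
    return kwargs


def train_with_workaround(path, **params):
    """Train a supervised model with the adjusted bucket (needs fastText installed)."""
    import fastText as ft  # module name as used in this thread
    return ft.train_supervised(input=path, **workaround_kwargs(**params))


# Default settings: the bucket is dropped.
print(workaround_kwargs()["bucket"])
# Word bigrams requested: the bucket is left alone.
print(workaround_kwargs(wordNgrams=2)["bucket"])
```

Usage would then be m = train_with_workaround('my_data.txt') followed by m.save_model('model.bin').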

Hello @montenegrodr,

The fix has been pushed. I'm closing the issue for now, but please feel free to reopen it at any point if this issue remains. Thank you again for posting this and being an active member of our community!

Thanks,
Christian

@cpuhrsch hi, I maintain the R wrapper (https://github.com/pommedeterresautee/fastrtext/) and I also have issues with the bucket setting in supervised mode when minn and maxn are set (when maxn and minn are not set there is no issue).
Should I understand the fix (https://github.com/facebookresearch/fastText/commit/35679dc3d3b7cf3c641aebadafb2ab7f0fc0a2ca) as not using the bucket parameter when training in supervised mode with maxn and minn set?

Hi @cpuhrsch,
This issue is still reproducible.

model = train_supervised(input='test.txt', epoch=50, lr=0.1, wordNgrams=2, verbose=2, minCount=1)
model.save_model('model.bin')

Output

Read 0M words 
Number of words:  4 
Number of labels: 2 
Progress: 100.0% words/sec/thread:     567 lr:  0.000000 avg.loss:  0.643121 ETA:   0h 0m 0s

File size
-rw-r--r-- 1 staff 763M May 31 22:59 test.dump
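
Note that this run uses wordNgrams=2, so the condition from the workaround earlier in the thread (wordNgrams <= 1 and maxn == 0) does not hold. It seems plausible that the bucket rows are still allocated here for hashed word n-grams, in which case passing a smaller bucket value would be the way to shrink the file. A rough estimate, assuming dim=100 and 4-byte floats (both assumptions, and ignoring the vocabulary and output layer):

```python
# Rough estimate of bucket storage for several bucket values.
# ASSUMPTIONS: dim=100 (assumed default), 4-byte floats.
ASSUMED_DIM = 100
BYTES_PER_FLOAT = 4


def bucket_mib(bucket, dim=ASSUMED_DIM):
    """Approximate size in MiB of `bucket` rows of `dim` floats."""
    return bucket * dim * BYTES_PER_FLOAT / float(2 ** 20)


# 2000000 is the assumed default; the smaller values are illustrative only.
for bucket in (2000000, 200000, 20000):
    print("bucket={:>8}: about {:.1f} MiB".format(bucket, bucket_mib(bucket)))
```

The estimate for the assumed default lands near the 763M listed above, so the size would be explained by the bucket setting rather than by the four-word vocabulary.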
