Fasttext: save_model() generates huge file

Created on 9 Nov 2017 · 9 Comments · Source: facebookresearch/fastText

Running from CLI:

$ fasttext supervised -input my_data.txt -output model

model.bin is 4M

while:

import fastText

ft = fastText.train_supervised(
    input='my_data.txt',
    thread=1,
    lr=0.5,
    epoch=1,
    wordNgrams=1
)

ft.save_model('model.bin')

model.bin is 804M.

I am using python2.7

Source

montenegrodr

Most helpful comment

Hello @montenegrodr,

The fix has been pushed. I'm closing the issue for now, but please feel free to reopen it at any point if this issue remains. Thank you again for posting this and being an active member of our community!

Thanks,
Christian

cpuhrsch on 15 Nov 2017

All 9 comments

Hello @montenegrodr,

Thank you for your post. I've not been able to reproduce this issue on my end. Here is a detailed example that I used to try and reproduce this issue. I added some newlines for readability.

$ cat result_py/example.py
import fastText as ft

m = ft.train_supervised(input="data/dbpedia.train",
                        dim=10,
                        lr=0.1,
                        wordNgrams=2,
                        minCount=1,
                        bucket=10000000,
                        epoch=5,
                        thread=4)
m.save_model('result_py/dbpedia.bin')

$ python2.7 result_py/example.py
Read 32M words
Number of words:  803537
Number of labels: 14
Progress: 100.0%  words/sec/thread: 2215201  lr: 0.000000  loss: 0.097203  eta: 0h0m 14m

$ stat -c%s result_py/dbpedia.bin
447481878

$ ./classification-example.sh
make: Nothing to be done for `opt'.
Read 32M words
Number of words:  803537
Number of labels: 14
Progress: 100.0%  words/sec/thread: 2506776  lr: 0.000000  loss: 0.096848  eta: 0h0m 14m
N       70000
P@1     0.985
R@1     0.985
Number of examples: 70000

$ stat -c%s result/dbpedia.bin
447481878

The extension and the CLI were both compiled on Linux with the same version of gcc, 4.8.5.

Please let me know if this issue remains, and please add some more information about your environment. I suggest we use the above example to try to reproduce it, but feel free to supply a different example that I can run on my end.

Thank you,
Christian

cpuhrsch on 10 Nov 2017

Thanks @cpuhrsch for your prompt reply. I got the same results as you using the dbpedia dataset, but something odd happens when the dataset is small. Could you try running the snippet below?

import fastText as ft
import os

with open('my_data.txt', 'w') as f:
    f.write('__label__1 foo\n__label__2 bar\n')

m = ft.train_supervised(input='my_data.txt')
m.save_model('model.bin')
print(os.stat('model.bin').st_size)

On my machine this outputs:

800002206

thanks.

montenegrodr on 10 Nov 2017
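
A back-of-the-envelope check suggests where the roughly 800 MB comes from. The sketch below assumes fastText's default settings of dim=100 and bucket=2000000 and 4-byte floats (none of which are stated in the thread), and counts only the rows reserved for hashed n-gram buckets in the input matrix. The helper name is hypothetical.

```python
# Rough size of a dense float32 matrix, in bytes.
# Assumptions (not stated in the thread): default dim=100, default
# bucket=2000000, and 4 bytes per stored value.
def matrix_bytes(rows, dim, bytes_per_value=4):
    return rows * dim * bytes_per_value

bucket_bytes = matrix_bytes(rows=2000000, dim=100)
observed = 800002206  # size reported above for the two-line dataset

print(bucket_bytes)             # 800000000
print(observed - bucket_bytes)  # 2206
```

Under these assumptions the bucket rows alone account for all but about 2 KB of the file, which is consistent with the bucket table being allocated even though the default settings use no n-gram features.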

@montenegrodr I have the same issue. Could you post the output of wc -l < my_data.txt and the file size so I can compare them with mine? Thanks!

loretoparisi on 13 Nov 2017

@loretoparisi:

$ stat my_data.txt 
  File: ‘my_data.txt’
  Size: 30              Blocks: 8          IO Block: 4096   regular file
Device: 10303h/66307d   Inode: 2663873     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1001/  robson)   Gid: ( 1001/  robson)
Access: 2017-11-13 10:28:53.860070944 +0000
Modify: 2017-11-13 10:28:53.840070943 +0000
Change: 2017-11-13 10:28:53.840070943 +0000
 Birth: -

$ wc -l < my_data.txt
2

EDIT: this is the my_data.txt from the second script. I no longer have the my_data.txt from the first post.

montenegrodr on 13 Nov 2017

Hello @montenegrodr,

I have indeed been able to reproduce this issue on my end and am working on a fix. I plan to push this asap. Thank you for posting this and helping us make the Python bindings more stable / better.

Thanks,
Christian

cpuhrsch on 15 Nov 2017

Hello @montenegrodr,

I found the bug. For now you can fix it locally by setting bucket to 0 if wordNgrams is less than or equal to 1 and maxn is 0, which is always the case for the default settings. I'll push the fix soon.

Thanks,
Christian

cpuhrsch on 15 Nov 2017
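
The local workaround described above can be sketched as follows. This is only an illustration of the stated condition, not the actual patch: the helper name is hypothetical, the default of 2000000 for bucket is an assumption, and the fastText call is left commented out so the snippet runs without the bindings installed.

```python
# Hypothetical helper: pick the bucket value to pass explicitly.
# Rule taken from the comment above: use 0 buckets when
# wordNgrams <= 1 and maxn == 0.
def workaround_bucket(word_ngrams=1, maxn=0, bucket=2000000):
    if word_ngrams <= 1 and maxn == 0:
        return 0
    return bucket  # assumed default; n-gram buckets are needed here

params = {"input": "my_data.txt", "wordNgrams": 1}
params["bucket"] = workaround_bucket(params["wordNgrams"], maxn=0)
print(params["bucket"])

# With the Python bindings installed:
# import fastText as ft
# m = ft.train_supervised(**params)
# m.save_model("model.bin")
```

Passing bucket explicitly keeps the workaround visible at the call site, so it is easy to drop once a fixed build is installed.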

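Hello @montenegrodr,

The fix has been pushed. I'm closing the issue for now, but please feel free to reopen it at any point if this issue remains. Thank you again for posting this and being an active member of our community!

Thanks,
Christian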

cpuhrsch on 15 Nov 2017

@cpuhrsch hi, I maintain the R wrapper (https://github.com/pommedeterresautee/fastrtext/) and I have issues too with bucket settings in supervised mode with minn and maxn set (when maxn and minn are not set there is no issue).
Should I understand the fix (https://github.com/facebookresearch/fastText/commit/35679dc3d3b7cf3c641aebadafb2ab7f0fc0a2ca) as not using bucket parameter when learning in supervised mode when maxn and minn are set?

pommedeterresautee on 20 Nov 2017

Hi @cpuhrsch
This issue is still reproducible.

model = train_supervised(input='test.txt', epoch=50, lr=0.1, wordNgrams=2, verbose=2, minCount=1)
model.save_model('model.bin')

Output

Read 0M words 
Number of words:  4 
Number of labels: 2 
Progress: 100.0% words/sec/thread:     567 lr:  0.000000 avg.loss:  0.643121 ETA:   0h 0m 0s

File size
-rw-r--r-- 1 staff 763M May 31 22:59 test.dump

erhanbaris on 31 May 2020
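
One possible reading of this last report: with wordNgrams=2 the condition from the earlier workaround (wordNgrams less than or equal to 1 and maxn equal to 0) no longer holds, so the hashed n-gram buckets are actually in use and a large file may be expected rather than a regression. The sketch below estimates the bucket rows for a few bucket values, assuming the default dim=100, a default bucket of 2000000, and 4-byte floats. Both helper names are hypothetical.

```python
# Estimated bytes taken by the n-gram bucket rows of the input matrix.
# Assumptions: 4-byte floats, default dim=100, default bucket=2000000.
def bucket_rows_bytes(bucket, dim=100, bytes_per_value=4):
    return bucket * dim * bytes_per_value

def to_mib(n_bytes):
    return n_bytes / float(2 ** 20)

for bucket in (2000000, 200000, 20000):
    size = bucket_rows_bytes(bucket)
    print("bucket=%d -> %d bytes (~%.0f MiB)" % (bucket, size, to_mib(size)))
```

With the assumed defaults the first line works out to about 763 MiB, in line with the listing above. Passing a smaller bucket value to train_supervised should shrink the file, at the cost of more hash collisions between n-grams.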
