Transformers: [model weights caching] model upload doesn't check model weights hash

Created on 2 Sep 2020 · 15 comments · Source: huggingface/transformers

I have re-uploaded model weights via transformers-cli upload and noticed that when I tried to use the model, the new weights were not re-downloaded; the cached version was used instead.

The problem seems to come from the fact that the other uploaded files haven't changed, only the model weights.

I double-checked that the md5sum of the old weights file is different from that of the new one.
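For reference, the checksum comparison above can be reproduced with a small helper (the file paths in the comment are hypothetical, not the actual cache paths):

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths: compare old vs newly uploaded weights.
# md5sum("old/pytorch_model.bin") != md5sum("new/pytorch_model.bin")
```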

I re-uploaded the whole folder using:

transformers-cli upload fsmt-wmt19-en-de

If I hunt down the cached files (not an easy task) and delete them, it does re-download the new version.

If I diff the cached weights file against the updated file (which gets re-downloaded once I move the original cached file out of the way), they aren't the same:

Binary files
before/d97352d9f1f96ee4c6055f203812035b4597258a837db1f4f0803a2932cc3071.53ce64c7097bfcd85418af04a21b4a897c78c8440de3af078e577727ad9de3a0
and
after/d97352d9f1f96ee4c6055f203812035b4597258a837db1f4f0803a2932cc3071.53ce64c7097bfcd85418af04a21b4a897c78c8440de3af078e577727ad9de3a0
differ
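The cache filename above appears to follow a <sha256(url)>.<sha256(etag)> pattern, which would explain why both copies get the same name: the URL is unchanged and the CDN keeps serving the stale ETag. A sketch of that mapping (an assumption based on the filename shape, not the library's exact internals):

```python
import hashlib
from typing import Optional

def url_to_cache_filename(url: str, etag: Optional[str] = None) -> str:
    """Map a URL (and optional ETag) to a cache filename of the form
    <sha256(url)>.<sha256(etag)>, as seen in the diff above."""
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if etag is not None:
        filename += "." + hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return filename
```

With this scheme, a stale ETag from the CDN produces the same filename as before, so the old cached weights shadow the newly uploaded ones.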

Could we please include the model weights file in the hash calculation?

Thank you.

wontfix

All 15 comments

I can confirm it was previously checking the model weights and re-downloading if the weights had been changed. Investigating.

This is due to the CDN caching files for 24 hours. After 24 hours it should download your file, but if you want it now you can set the use_cdn flag to False. You can see the documentation for this here.

Thank you for the hint, @LysandreJik. So from_pretrained(mname, use_cdn=False)

But that might be tricky for end users who won't know that the code base has changed while the model weights they get are out of sync.

Is there a way to signal CDN to invalidate the cache for some files? It could then be done from the upload util.

FWIW, I wrote a one-liner to force a cache update for the 4 models I'm working on at the moment.

PYTHONPATH="src" python -c 'from transformers import AutoModel; [AutoModel.from_pretrained("stas/fsmt-wmt19-"+p, use_cdn=False) for p in ["en-ru","ru-en","en-de","de-en"]]'

I now have that in my script, so I don't need to think about it.

@LysandreJik, unfortunately this doesn't solve the issue

AutoModel.from_pretrained(mname, use_cdn=False)

Indeed forces a download of the recently updated model, but if this flag is no longer used in the application, it falls back to the CDN-cached version and ends up using the wrong one.

So, basically, this results in 2 copies (different hashes) sitting in the cache dir.

And normal usage w/o using use_cdn=False looks up the old version and not the new one. (so things like run_eval.py still use the old one)

Thanks.

can you run AutoModel.from_pretrained(mname, use_cdn=False) in a debugger and check whether the downloaded url is a https://cdn.huggingface.co or a https://s3.amazonaws.com/models.huggingface.co url?
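To make the debugging suggestion above concrete, a sketch of how the two endpoints mentioned in this thread could map to a weights URL (the URL layout here is an assumption based only on the hosts quoted above, not the library's actual resolution logic):

```python
# Hosts quoted earlier in the thread.
CDN_PREFIX = "https://cdn.huggingface.co"
S3_PREFIX = "https://s3.amazonaws.com/models.huggingface.co"

def weights_url(model_id: str, use_cdn: bool = True) -> str:
    """Sketch: pick the CDN or the raw S3 host for a model's weights file."""
    prefix = CDN_PREFIX if use_cdn else S3_PREFIX
    return f"{prefix}/{model_id}/pytorch_model.bin"
```

Stepping through from_pretrained in a debugger and checking which of these two hosts appears in the resolved URL tells you whether the CDN is in the loop.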

I can do that, but I already checked that it downloads the updated model w/ use_cdn=False. But then if you run it again w/o use_cdn=False it ignores the new download and uses the old model again (if I delete the cached version, it redownloads the old cached version w/o use_cdn=False).

Oh yeah ok, I see. Can you run_eval.py on a local folder path then?

Can you run_eval.py on a local folder path then?

Yes. Except others can't as they don't have my local copy.

e.g. @sshleifer wants to eval my PR https://github.com/huggingface/transformers/pull/6940, but now has to wait till tomorrow for CDN to expire (or hack around it).

Last night I uploaded an experimental model, which proved to be invalid, but I thought I had re-downloaded it fine since it appeared to work, and made a PR; except I was testing against the stale cached version, which happened to be a good one.

Can we please re-open this ticket? It hasn't been resolved.

Can we add a --no_cdn boolean flag to run_eval.py that would then call AutoModelForSeq2SeqLM.from_pretrained(use_cdn=False)?
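A minimal sketch of such a flag (the --no_cdn option is hypothetical, as is wiring it into run_eval.py; only use_cdn itself comes from this thread):

```python
import argparse

parser = argparse.ArgumentParser(description="run_eval.py sketch")
parser.add_argument("model_name")
parser.add_argument(
    "--no_cdn", action="store_true",
    help="Pass use_cdn=False to from_pretrained to bypass the CDN cache.",
)
args = parser.parse_args(["stas/fsmt-wmt19-en-de", "--no_cdn"])
# Hypothetical wiring:
# model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name, use_cdn=not args.no_cdn)
```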

In our dev workflow we mostly don't use the cdn while the files are still in-flux. Cloudfront invalidation comes with its own set of issues so it's better to view cdn as a means to distribute permanent files. (for this reason we don't serve config.json files from Cloudfront)

Can we add a --no_cdn boolean flag to run_eval.py that would then call AutoModelForSeq2SeqLM.from_pretrained(use_cdn=False)?

It could be done. I have a feeling there will then be others.

Perhaps an alternative solution would be to introduce an env var that would transparently override the CDN cache in any situation, without needing to change every script? TRANSFORMERS_USE_CDN=False?
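A sketch of how such an override could be read (the TRANSFORMERS_USE_CDN variable is the proposal above, not an existing setting; the truthy/falsy parsing is an assumption):

```python
import os

def use_cdn_default() -> bool:
    """Read a hypothetical TRANSFORMERS_USE_CDN env var; anything other
    than an explicit false-like value keeps the CDN enabled."""
    val = os.environ.get("TRANSFORMERS_USE_CDN", "true").strip().lower()
    return val not in ("0", "false", "no", "off")
```

from_pretrained could then default use_cdn to use_cdn_default(), so a dev shell could export TRANSFORMERS_USE_CDN=False once instead of patching every script.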

In our dev workflow we mostly don't use the cdn while the files are still in-flux. Cloudfront invalidation comes with its own set of issues so it's better to view cdn as a means to distribute permanent files. (for this reason we don't serve config.json files from Cloudfront)

Understood!

How do you let others test the model files? Put them on Dropbox or something and share the link?

No, just S3 links!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

