Transformers: cannot access pretrained vocab file on S3

Created on 30 Nov 2018 · 4 comments · Source: huggingface/transformers

Hi, thanks for developing this well-made PyTorch version of BERT.
Unfortunately, the pretrained vocab files are not reachable.

The error traceback is below.

File "/usr/local/lib/python3.6/dist-packages/pytorch_pretrained_bert/tokenization.py", line 124, in from_pretrained
resolved_vocab_file = cached_path(vocab_file)
File "/usr/local/lib/python3.6/dist-packages/pytorch_pretrained_bert/file_utils.py", line 88, in cached_path
return get_from_cache(url_or_filename, cache_dir)
File "/usr/local/lib/python3.6/dist-packages/pytorch_pretrained_bert/file_utils.py", line 178, in get_from_cache
.format(url, response.status_code))
OSError: HEAD request failed for url https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt with status code 404
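For reference, a minimal sketch of a call that produces this traceback (assuming pytorch_pretrained_bert is installed; the 'bert-base-uncased' shortcut is the standard model name):

from pytorch_pretrained_bert import BertTokenizer

# Resolving the 'bert-base-uncased' shortcut issues a HEAD request against the
# S3 vocab URL; when that file is missing, the call fails with the OSError above.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')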


All 4 comments

I have the same issue.

OSError: HEAD request failed for url https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt with status code 404

It would be nice to be able to cache the vocab files as well as the model weights out of the box.

I found a temporary solution for this issue.
The BertTokenizer.from_pretrained method accepts a local file path instead of a model name, e.g. BertTokenizer.from_pretrained('/dir/to/vocab/bert-base-uncased-vocab.txt')

The vocab txt file can be downloaded from the Google BERT repo.
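A minimal sketch of this workaround, assuming the vocab file has already been downloaded from the Google BERT repo (the local path is a placeholder):

from pytorch_pretrained_bert import BertTokenizer

# Point from_pretrained at the local vocab file instead of a model shortcut,
# so no request to S3 is made. The path below is a placeholder.
tokenizer = BertTokenizer.from_pretrained('/dir/to/vocab/bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("Hello, world!")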

The files are back. Sorry, that was a mistake made while adding the new models.

I found a temporary solution for this issue.
The BertTokenizer.from_pretrained method accepts a local file path instead of a model name, e.g. BertTokenizer.from_pretrained('/dir/to/vocab/bert-base-uncased-vocab.txt')

Well, this solution doesn't seem to be working now; I get

OSError: Model name 'path/to/model/vocab.txt' was not found in tokenizers model name list (bart-large, bart-large-mnli, bart-large-cnn, bart-large-xsum). We assumed 'path/to/model/vocab.txt' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
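In newer transformers versions, the message above indicates that a local path should be a directory containing all of the tokenizer's vocabulary files, not a single txt file. One way to set that up, sketched under the assumption that the hub is reachable at least once (the model identifier matches the bart-large-cnn checkpoint from the error message; the directory name is a placeholder):

from transformers import AutoTokenizer

# Download the tokenizer once and save all of its files (vocab.json, merges.txt,
# tokenizer config, ...) into a local directory.
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
tokenizer.save_pretrained('/path/to/local_tokenizer')

# Later, load directly from that directory; no remote lookup is performed.
tokenizer = AutoTokenizer.from_pretrained('/path/to/local_tokenizer')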
