I need to run the package on a machine without internet access, so I copied the ".pytorch_pretrained_bert" cache folder over from one machine to the other.
I installed anaconda3 and tried to run tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103').
Got this error:
Model name 'transfo-xl-wt103' was not found in model name list (transfo-xl-wt103). We assumed 'transfo-xl-wt103' was a path or url but couldn't find files https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin at this path or url.
Do I need to copy anything else to the second machine to make it load from the cache folder?
Ubuntu 16.04, pytorch 1.0
This is most likely because the script cannot find the vocabulary file, so you should download the vocab first and copy it over. Then, when you load the tokenizer, you need to point it at the directory containing the vocab. If your vocab is in "/tmp/transformer_xl/" you do:
tokenizer = TransfoXLTokenizer.from_pretrained('/tmp/transformer_xl')
This works, it's just not what I expected. I copied over everything in .pytorch_pretrained_bert and thought it would load without any parameters. Now I have a bunch of files named like the one below, and I have to figure out which model each one belongs to.
"12642ff7d0279757d8356bfd86a729d9697018a0c93ad042de1d0d2cc17fd57b.e9704971f27275ec067a00a67e6a5f0b05b4306b3f714a96e9f763d8fb612671"
I will add a section in the readme detailing how to load a model from drive.
Basically, you can just download the models and vocabulary from our S3 by following the links at the top of each file (modeling_transfo_xl.py and tokenization_transfo_xl.py for Transformer-XL) and put them in one directory, using the filenames also indicated at the top of each file.
Here is the process in your case:
mkdir model
cd model
wget -O pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-pytorch_model.bin
wget -O config.json https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-config.json
wget -O vocab.bin https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-vocab.bin
# optional, only if you run the evaluation script run_transfo_xl.py which uses the pre-processed wt103 corpus:
# wget -O corpus.bin https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-corpus.bin
Now just load the model and tokenizer by pointing to this directory:
tokenizer = TransfoXLTokenizer.from_pretrained('./model/')
model = TransfoXLModel.from_pretrained('./model/')
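From there it behaves like any other pretrained load; a quick sanity check might look like this (a sketch, assuming, as in the library's docstrings, that TransfoXLModel returns batch-first hidden states plus memories):

import torch

# Run a short sentence through the locally loaded tokenizer and model
model.eval()
text = "the quick brown fox jumps over the lazy dog"
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
input_ids = torch.tensor([token_ids])
with torch.no_grad():
    last_hidden, mems = model(input_ids)
print(last_hidden.shape)  # expected: (batch size, sequence length, hidden size)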
I'll see if I can relax the requirement on the internet connection in the next release.
The network connection check has been relaxed in the now merged #500.
It will be included in the next PyPI release (probably next week).
In the meantime you can install from master.
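Something like this should do it (assuming a pip install straight from the GitHub master branch is possible in your environment):

pip install git+https://github.com/huggingface/pytorch-pretrained-BERT.git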
I have a similar issue with the BERT multilingual cased model:
ERROR - pytorch_pretrained_bert.modeling - Model name 'bert-base-multilingual-cased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese).
We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Then I tried to execute the following code block in my Jupyter notebook:
import requests
url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz"
response = requests.get(url, allow_redirects=True, verify=False)
I had to set verify to False, because otherwise I get an SSL certificate error. But even now this does not work, because the site is blocked by our company security settings, i.e. response.status_code returns 403.
Is there a possibility that you might publish the file in your github repo or that we could load the model from somewhere else?
Strange, I can't reproduce this.
I've checked again that every model is public on our S3.
Can you try again?
I retried, once using the Google TensorFlow Hub address and once with the Amazon S3 address for the BERT model.
I specified the proxy information like this:
proxyDict = { "http" : "http://<proxy-user>:<proxy-password>@<proxy-domain>",
"https" : "http://<proxy-user>:<proxy-password>@<proxy-domain>"}
with our company-specific settings for proxy-user, proxy-password and proxy-domain.
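(As an aside, I could also have set the standard proxy environment variables instead of passing a proxies dict, since requests picks those up; the placeholder values below are hypothetical.)

import os

# Standard proxy environment variables, honoured by requests and urllib
os.environ["HTTP_PROXY"] = "http://<proxy-user>:<proxy-password>@<proxy-domain>"
os.environ["HTTPS_PROXY"] = "http://<proxy-user>:<proxy-password>@<proxy-domain>"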
Then I executed the following code:
import requests
url_google = "https://tfhub.dev/google/bert_multi_cased_L-12_H-768_A-12/1"
url_amazons3 = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz"
response_1 = requests.get(url_google, allow_redirects=True, verify=False, proxies = proxyDict)
response_2 = requests.get(url_amazons3, allow_redirects=True, verify=False, proxies = proxyDict)
print("response status code for google address: {}".format(response_1.status_code))
print("response status code for amazon s3 address: {}".format(response_2.status_code))
and this is what I got:
response status code for google address: 200
response status code for amazon s3 address: 403
So unfortunately, it does not seem to work out for me. I might use the convert function you provided, but it would be nicer to be able to load the model directly from the S3.
Is it only for this model (bert-base-multilingual-cased) or are you blocked from accessing all the pretrained models and tokenizers?
I am blocked from accessing all the pretrained models.
I tested it by looping through the values of the PRETRAINED_MODEL_ARCHIVE_MAP dictionary, and all requests returned status code 403.
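Roughly like this (a sketch of what I ran, reusing the proxyDict from above):

import requests
from pytorch_pretrained_bert.modeling import PRETRAINED_MODEL_ARCHIVE_MAP

# Probe every pretrained model archive URL through the company proxy
for name, url in PRETRAINED_MODEL_ARCHIVE_MAP.items():
    response = requests.get(url, allow_redirects=True, verify=False,
                            proxies=proxyDict, stream=True)
    print("{}: {}".format(name, response.status_code))
    response.close()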
I haven't tried it yet but maybe torch hub could help (#506)
Can you try to update to PyTorch 1.1.0 (to get torch.hub) and test this:
import torch
tokenizer = torch.hub.load('ailzhang/pytorch-pretrained-BERT:hubconf', 'bertTokenizer', 'bert-base-cased', do_basic_tokenize=False, force_reload=False)
Well, it does not seem to work.
I had to add
import urllib.request

# Route urllib downloads (used by torch.hub) through the company proxy
proxy_support = urllib.request.ProxyHandler({
    "http": "http://<proxy-user>:<proxy-password>@<proxy-domain>",
    "https": "http://<proxy-user>:<proxy-password>@<proxy-domain>"})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
into ~/my-virtual-env/lib/python3.6/site-packages/torch/hub.py, with my-virtual-env being my pip virtual environment.
Then executing the command you suggested prints the following to the console:
Downloading: "https://github.com/ailzhang/pytorch-pretrained-BERT/archive/hubconf.zip" to /home/U118693/.cache/torch/hub/hubconf.zip
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
Model name 'bert-base-cased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt' was a path or url but couldn't find any file associated to this path or url.
There is now a folder ailzhang_pytorch-pretrained-BERT_hubconf in the /home/U118693/.cache/torch/hub/ directory, but there still seem to be issues finding that bert-base-cased-vocab.txt file.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.