Well, I am trying to generate embeddings for a long sentence, and I get this error:
Traceback (most recent call last):
all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)
File "/Users/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/Users/venv/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 611, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/Users/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/Users/venv/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 196, in forward
position_embeddings = self.position_embeddings(position_ids)
File "/Users/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/Users/venv/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 110, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/Users/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1110, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /Users/soumith/code/builder/wheel/pytorch-src/aten/src/TH/generic/THTensorMath.cpp:352
I find that max_position_embeddings (default size 512) is being exceeded. It is taken from the config that is downloaded as part of the initial step. Initially the download went to the default location PYTORCH_PRETRAINED_BERT_CACHE, where I could not find any config.json, only the model file and vocab.txt (named with random characters). I then downloaded to a specific local location with the cache_dir param, and there I had the same problem of not finding bert_config.json.
I also found a JSON file in both the default cache and the local cache, named with junk characters. When I opened it, all I could see was this:
{"url": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz", "etag": "\"61343686707ed78320e9e7f406946db2-49\""}
Any help with modifying the config.json would be appreciated.
Or, if this is caused by something else, please let me know.
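For reference, here is a minimal sketch of the kind of call that triggers the error, using the pytorch_pretrained_bert API from the traceback; the long_text value is just a stand-in, and the truncation to 512 word pieces is only there to show what makes the call succeed:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# stand-in long input; anything over 512 word pieces hits the index error
long_text = "some very long document " * 200

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize(long_text)
print(len(tokens))  # > 512 here, which is what triggers the RuntimeError

# truncating to the model's max_position_embeddings (512) makes the call succeed
tokens = tokens[:512]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
input_mask = torch.ones_like(input_ids)

with torch.no_grad():
    all_encoder_layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)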
The problem is that the max_position_embeddings default size is 512 and my input exceeds it, as I mentioned. For now I have just made a hack by hard-coding it directly in the modeling.py file 😅. I still need to know where to find the bert_config.json file, since changing it there would be the correct way of doing it.
The config file is located in the .tar.gz archive that is downloaded, cached, and then extracted on the fly when you create a BertModel instance with the static from_pretrained() constructor.
You'll see a log message like:
extracting archive file /home/USERNAME/.pytorch_pretrained_bert/bert-base-cased.tar.gz to temp dir /tmp/tmp96bkwrj0
If you extract that archive yourself, you'll find the bert_config.json file. The thing, though, is that it doesn't make sense to modify this file, as it is tied to the pretrained models. If you increase max_position_embeddings in the config, you won't be able to use the pretrained models.
Instead, you will have to train a model from scratch, which may or -- more likely -- may not be feasible depending on the hardware you have access to.
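If you want to look at it yourself, here is a rough sketch of pulling bert_config.json out of the cached archive by hand; the paths below are placeholders, so point archive_path at the hashed filename you actually see in your cache directory:

import json
import tarfile

# placeholder paths: the cached archive lives in ~/.pytorch_pretrained_bert
# (or your cache_dir) under a long hashed name
archive_path = "/Users/you/.pytorch_pretrained_bert/<hashed-archive-name>"
extract_dir = "/tmp/bert-base-uncased"

with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(extract_dir)  # yields bert_config.json and pytorch_model.bin

with open(extract_dir + "/bert_config.json") as f:
    config = json.load(f)
print(config["max_position_embeddings"])  # 512 for the released BERT checkpoints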
Yeah, as you said, while debugging I noticed that each time the .tar.gz file was extracted to a new temp cache location and the model was fetched from there. Even then I was not able to find the JSON file where it was extracted. Also, I think max_position_embeddings is not tied to the model training, because when I changed its value (before loading the model with torch.load) like this
config.__dict__['max_position_embeddings'] = 2048
from 512 to 2048 (in a hard-coded way), the code ran properly without any error.
The lines in modeling.py suggest that it can be customised if required, but I don't see a way to parameterise it so that it gets changed while fetching the config, because it is loaded like this.
It would be great if customisation were supported for the applicable options.
It does not make sense to customize these options when using pretrained models; it only makes sense when training your own model from scratch.
You cannot use the pretrained models with another max_position_embeddings than 512, because the pretrained models contain pretrained embeddings for 512 positions.
The original transformer paper introduced a positional encoding which allows extrapolation to arbitrary input lengths, but this was not used in BERT.
You can override max_position_embeddings, but this won't have any effect. The model will probably run fine for shorter inputs, but you will get a RuntimeError: cuda runtime error (59) for an input longer than 512 word pieces, because the embedding lookup here will attempt to use an index that is too large.
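A quick way to see the hard limit is to inspect the pretrained position embedding table itself; sketched below for bert-base-uncased, where the table has exactly 512 rows, so any position index of 512 or more is out of range:

from pytorch_pretrained_bert import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

# the pretrained lookup table has one row per position, 512 in total
print(model.embeddings.position_embeddings.weight.shape)
# -> torch.Size([512, 768]) for bert-base; the hidden size differs for bert-large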
Indeed, it doesn't make sense to go over 512 tokens for a pre-trained model.
If you have longer text, you should try the sliding window approach detailed on the original Bert repo: https://github.com/google-research/bert/issues/66
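For what it's worth, here is a rough sketch of that sliding-window idea on top of pytorch_pretrained_bert; the window and stride values are only illustrative, and how you combine the per-window outputs (mean, max, task-specific pooling) is up to you:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def encode_long_text(text, window=510, stride=255):
    # window=510 leaves room for the [CLS] and [SEP] tokens in each chunk
    pieces = tokenizer.tokenize(text)
    outputs = []
    for start in range(0, len(pieces), stride):
        chunk = ['[CLS]'] + pieces[start:start + window] + ['[SEP]']
        input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(chunk)])
        input_mask = torch.ones_like(input_ids)
        with torch.no_grad():
            layers, _ = model(input_ids, token_type_ids=None, attention_mask=input_mask)
        outputs.append(layers[-1])  # final encoder layer for this window
        if start + window >= len(pieces):
            break
    return outputs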