I'm not sure whether this is a bug, so I tagged it as a question. I want to load the Bert embeddings by calling
from flair.embeddings import BertEmbeddings
bert_embeddings = BertEmbeddings('bert-base-multilingual-uncased')
It gives the following error:
Model name 'bert-base-multilingual-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt' was a path or url but couldn't find any file associated to this path or url.
What I do not understand:
1) The string I pass as argument clearly IS in the list.
2) When I open the link, the text file seems to contain lots of weird tokens and special characters.
Why is that?
I think you need to make sure that you're using a recent version of pytorch-pretrained-bert, so you should try pip install --upgrade pytorch-pretrained-bert :)
Thanks. I did that, the error persists though. It is still showing me the same error message...
Hm, that is strange. I just ran the code on a fresh Colab notebook and it works. Did you install from pip or are you working on the master branch?
I used
pip install flair
pip install --upgrade pytorch-pretrained-bert
with Python 3.7 and PyTorch 1.0.1
Maybe there's an older version of flair installed. Could you try running pip install --upgrade flair?
No, pip install --upgrade flair tells me that all requirements are up-to-date...
This is strange, here's what I tried to reproduce it:
$ python3.7 -m venv /tmp/flair-venv
$ source /tmp/flair-venv/bin/activate
(flair-venv) $ pip install --upgrade flair
(flair-venv) $ pip install --upgrade pytorch-pretrained-bert
(flair-venv) $ python
Python 3.7.1 (default, Oct 22 2018, 11:21:55)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from flair.embeddings import BertEmbeddings
>>> bert_embeddings = BertEmbeddings('bert-base-multilingual-uncased')
>>>
Could you give us more information about your Python environment? :)
I am using Python 3.7 via a remote interpreter on Ubuntu 18.04 with conda version 4.6.14.
Please let me know if any other specific information is relevant.
Is there a link where I can directly download the Bert embeddings as used by Flair?
Hi, what version of pytorch-pretrained-bert do you have?
import pytorch_pretrained_bert
pytorch_pretrained_bert.__version__
>>> import pytorch_pretrained_bert
>>> pytorch_pretrained_bert.__version__
'0.6.2'
This might be related to a bug in pytorch_pretrained_bert v0.6.2.
I cannot reproduce OP's error with
flair v0.4.1
pytorch_pretrained_bert v0.6.2
BertEmbeddings('bert-base-multilingual-uncased')
but I get an AttributeError (which, I must admit, would be a different issue) when embedding a sentence:
File "flair/models/sequence_tagger_model.py", line 300, in predict
tags, _ = self.forward_labels_and_loss(batch, sort=False)
File "flair/models/sequence_tagger_model.py", line 268, in forward_labels_and_loss
feature, lengths, tags = self.forward(sentences, sort=sort)
File "flair/models/sequence_tagger_model.py", line 315, in forward
self.embeddings.embed(sentences)
File "flair/embeddings.py", line 130, in embed
embedding.embed(sentences)
File "flair/embeddings.py", line 63, in embed
self._add_embeddings_internal(sentences)
File "flair/embeddings.py", line 1143, in _add_embeddings_internal
max([self.tokenizer.tokenize(sentence.to_tokenized_string()) for sentence in sentences], key=len))
File "flair/embeddings.py", line 1143, in <listcomp>
max([self.tokenizer.tokenize(sentence.to_tokenized_string()) for sentence in sentences], key=len))
File "pytorch_pretrained_bert/tokenization.py", line 109, in tokenize
if self.do_basic_tokenize:
AttributeError: 'BertTokenizer' object has no attribute 'do_basic_tokenize'
However, everything works fine with pytorch_pretrained_bert v0.6.1. So I guess the whole thing might solve itself with v0.6.3?
BTW: I don't know what's going wrong here, because the BertTokenizer _does_ have an attribute do_basic_tokenize, but this is the wrong place to start discussing that anyway.
@Janinanu Is there any version information found for flair when you execute:
import flair
print(flair.__version__)
in your virtual environment?
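Since several version checks were requested in this thread, they can be collected in one go. The following is only a minimal sketch: the helper names `installed_version` and `environment_report` are made up for illustration and are not part of flair, and it assumes each package exposes a `__version__` attribute (as flair and pytorch_pretrained_bert do in the snippets above).

```python
import importlib


def installed_version(module_name):
    """Return module.__version__, or None if the module cannot be
    imported or does not expose a __version__ attribute.

    Hypothetical helper for illustration, not part of flair.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def environment_report(module_names=("flair", "torch", "pytorch_pretrained_bert")):
    """Map each module name to its reported version (or None)."""
    return {name: installed_version(name) for name in module_names}


if __name__ == "__main__":
    # Print one line per package so the output can be pasted into the issue.
    for name, version in environment_report().items():
        print(f"{name}: {version or 'not importable or no __version__'}")
```

Running this inside the same interpreter that raises the error (for example the remote conda interpreter) shows whether that interpreter actually sees the upgraded packages.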
@severinsimmler Could you provide a full code snippet for that error? I would really like to reproduce it (maybe we can add some nice unit tests for those cases) :)
@stefan-it, I think I just found a fix for my bug in the flair code, will make a PR with some more details :)
Sorry, false alarm... I definitely can't reproduce OP's error, and the following example works out just fine with the versions I mentioned above:
>>> from flair.data import Sentence
>>> from flair.embeddings import BertEmbeddings
>>> sentence = Sentence("This is a sentence.")
>>> embedding = BertEmbeddings("bert-base-multilingual-cased")
>>> embedding.embed(sentence)
My use case was loading a sequence tagger model _trained_ with pytorch_pretrained_bert v0.6.1, but _predicting_ with v0.6.2:
>>> from flair.data import Sentence
>>> from flair.models import SequenceTagger
>>> tagger = SequenceTagger.load_from_file("model.pt")
>>> sentence = Sentence("This is a sentence.")
>>> tagger.predict(sentence)
AttributeError: 'BertTokenizer' object has no attribute 'do_basic_tokenize'
The AttributeError makes sense: the BertTokenizer from v0.6.1 (the one loaded from model.pt) had no do_basic_tokenize attribute, but the object in v0.6.2 does.
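The mechanism can be illustrated without flair or BERT at all. This is a rough sketch under the assumption that the saved model restores the tokenizer through Python's pickle machinery (which `torch.save` uses by default): by default, unpickling creates the object without calling `__init__` and then copies the saved attributes back. `OldTokenizer`, `NewTokenizer` and `restore_like_pickle` are hypothetical stand-ins, not the real pytorch_pretrained_bert classes.

```python
class OldTokenizer:
    """Stand-in for the tokenizer class at the time the model was saved."""

    def __init__(self):
        self.vocab = {"hello": 0}


class NewTokenizer:
    """Stand-in for the same class after an upgrade added an attribute."""

    def __init__(self):
        self.vocab = {"hello": 0}
        self.do_basic_tokenize = True  # attribute introduced by the upgrade

    def tokenize(self, text):
        # The upgraded method relies on the new attribute.
        if self.do_basic_tokenize:
            return text.split()
        return [text]


def restore_like_pickle(cls, state):
    """Mimic default unpickling: build the object without calling
    __init__, then copy the saved attributes into it."""
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj


# State as it would have been saved with the old class definition.
saved_state = dict(OldTokenizer().__dict__)

# Restoring it against the new class definition skips NewTokenizer.__init__,
# so do_basic_tokenize is never set on the restored object.
restored = restore_like_pickle(NewTokenizer, saved_state)

try:
    restored.tokenize("This is a sentence.")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

A freshly constructed `NewTokenizer()` works, which matches the observation that creating BertEmbeddings directly is fine while a model saved under the older version fails at prediction time.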
For some reason, it now works. I don't know why or how, though. Thanks everyone :)