Model I am using (Bert, XLNet ...):
BERTweet
Steps to reproduce the behavior:
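Presumably the failing call is the standard Auto load (the same call is reproduced further down this thread), which then raises the error below:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")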
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
Expected behavior: the tokenizer should load correctly.
https://huggingface.co/vinai/bertweet-base?text=Paris+is+the+%3Cmask%3E+of+France.
transformers version: 2.10.0

This model is defined as a RoBERTa model, but its tokenizer seems to be a WordPiece tokenizer (based on the vocab.txt file), whereas RoBERTa uses a byte-level BPE.
This is not currently supported out of the box by our AutoTokenizer/AutoModel features (model type ≠ tokenizer type), nor by our Pipelines, but I'd like to support this in the future.
For now, you'll have to initialize this tokenizer + model independently.
BertTokenizer.from_pretrained("...")
AutoModel.from_pretrained("...")
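With the checkpoint from this thread substituted for the "..." placeholders, that workaround would look roughly like this (a sketch, not a verified recipe):

from transformers import AutoModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")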
Also cc'ing model author @datquocnguyen
I am working on it (I just uploaded the model to Hugging Face yesterday).
I will create pull requests soon, so that users can make use of the following scripts:
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweet = BertweetModel.from_pretrained("vinai/bertweet-base")
Please stay tuned!
@julien-c @datquocnguyen Thanks for your answer.
I just tried the AutoModel and got a weird "CUDA illegal memory access" error after 2 steps. It works fine with other models such as ELECTRA or RoBERTa. I do not know whether it is related to some wrong encoding from the tokenizer (I am using the fairseq tokenizer, as the tokenizer from Hugging Face does not work even with BertTokenizer) or something else.
Update: I may have found the issue. It may come from the max length, which seems to be 130, unlike the regular BERT base model; I was using a longer sequence length.
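If that is the cause, capping the sequence length at encoding time should avoid it; a minimal sketch with a Hugging Face tokenizer and the 2.x-era encode_plus API, assuming a ~130-position limit:

enc = tokenizer.encode_plus(
    tweet,                    # a single input string
    max_length=128,           # stay within the model's ~130 position embeddings
    pad_to_max_length=True,   # pad shorter inputs to the same length
    return_tensors="pt",
)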
I am working on it (I just uploaded the model to Hugging Face yesterday).
I will create pull requests soon, so that users can make use of the following scripts:
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweet = BertweetModel.from_pretrained("vinai/bertweet-base")
Please stay tuned!
Looking forward to it !
@nightlessbaron @Shiro-LK @julien-c FYI, I have just created pull request #6129 for adding BERTweet and PhoBERT to transformers.
@nightlessbaron In case you want to use BERTweet right away, you might have a look at this fork https://github.com/datquocnguyen/transformers
Cheers,
Dat.
@datquocnguyen Looks like the error still exists. From https://huggingface.co/vinai/bertweet-base, I run
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
It gives:
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
Also, https://huggingface.co/vinai/bertweet-base?text=Paris+is+the+%3Cmask%3E+of+France gives an error
I had the same error @steveguang had. Is there any solution?
@steveguang @SergioBarretoJr your issue has now been solved.
Also, @Shiro-LK @nightlessbaron: please check https://github.com/datquocnguyen/transformers
@julien-c Please help review pull request #6129. BERTweet now works in Auto mode and without the additional fastBPE dependency.
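Once that PR is merged, the Auto classes from the original report should work along these lines (a sketch; use_fast=False is included on the assumption that only a slow Python tokenizer ships with the model):

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")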
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This was solved by @datquocnguyen