Model I am using (Bert, XLNet ...):
BERTweet
Steps to reproduce the behavior:
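Presumably the failing call is the standard Auto load (the same call is reproduced further down this thread), which then raises the error below:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")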
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
Expected behavior: the tokenizer should load correctly.
https://huggingface.co/vinai/bertweet-base?text=Paris+is+the+%3Cmask%3E+of+France.
transformers version: 2.10.0

This model is defined as a RoBERTa model, but its tokenizer seems to be a WordPiece tokenizer (based on the vocab.txt file), whereas RoBERTa uses a byte-level BPE.
This is not currently supported out of the box by our AutoTokenizer/AutoModel features (model type ≠ tokenizer type), nor by our Pipelines, but I'd like to support this in the future.
For now, you'll have to initialize this tokenizer + model independently.
BertTokenizer.from_pretrained("...")
AutoModel.from_pretrained("...")
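With the checkpoint from this thread substituted for the "..." placeholders, that workaround would look roughly like this (a sketch, not a verified recipe):

from transformers import AutoModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")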
Also cc'ing model author @datquocnguyen
I am working on it (I just uploaded the model to Hugging Face yesterday).
I will create pull requests soon, so that users can make use of the following scripts:
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweet = BertweetModel.from_pretrained("vinai/bertweet-base")
Please stay tuned!
@julien-c @datquocnguyen Thanks for your answer.
I just tried the AutoModel and got a weird "CUDA illegal memory access" error after 2 steps. It works fine with other models such as ELECTRA or RoBERTa. I do not know whether it is related to some wrong encoding from the tokenizer (I am using the fairseq tokenizer, as the tokenizer from Hugging Face does not work even with BertTokenizer) or something else.
Update: I may have found the issue. It may come from the max length, which seems to be 130, unlike the regular BERT base model; I was using a longer sequence length.
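If that is the cause, capping the sequence length at encoding time should avoid it; a minimal sketch with a Hugging Face tokenizer and the 2.x-era encode_plus API, assuming a ~130-position limit:

enc = tokenizer.encode_plus(
    tweet,                    # a single input string
    max_length=128,           # stay within the model's ~130 position embeddings
    pad_to_max_length=True,   # pad shorter inputs to the same length
    return_tensors="pt",
)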
I am working on it (I just uploaded the model to Hugging Face yesterday).
I will create pull requests soon, so that users can make use of the following scripts:
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweet = BertweetModel.from_pretrained("vinai/bertweet-base")
Please stay tuned!
Looking forward to it !
@nightlessbaron @Shiro-LK @julien-c FYI, I have just created pull request #6129 for adding BERTweet and PhoBERT to transformers.
@nightlessbaron In case you want to use BERTweet right away, you might have a look at this fork https://github.com/datquocnguyen/transformers
Cheers,
Dat.
@datquocnguyen Looks like the error still exists. From https://huggingface.co/vinai/bertweet-base, I run
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
It gives:
OSError: Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
Also, https://huggingface.co/vinai/bertweet-base?text=Paris+is+the+%3Cmask%3E+of+France gives an error
I had the same error @steveguang had. Is there any solution?
@steveguang @SergioBarretoJr your issue has now been solved.
Also, @Shiro-LK @nightlessbaron: please check https://github.com/datquocnguyen/transformers
@julien-c Please help review pull request #6129. BERTweet now works in Auto mode and without the additional fastBPE dependency.
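Once that PR is merged, the Auto classes from the original report should work along these lines (a sketch; use_fast=False is included on the assumption that only a slow Python tokenizer ships with the model):

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")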
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This was solved by @datquocnguyen