Transformers: google/reformer-enwik8 tokenizer was not found in tokenizers model name list

Created on 14 Jul 2020 · 4 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Reformer

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The tasks I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [ ] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Open https://huggingface.co/google/reformer-enwik8
  2. Look at "Hosted inference API"

The model's tokenizer cannot be found; I'm getting the same error in scripts as the one displayed on your webpage:

⚠️ This model could not be loaded by the inference API. ⚠️ Error loading tokenizer Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url. OSError("Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.")

Expected behavior

The tokenizer loads without issues.

Environment info

  • transformers version: latest
  • Platform: your own
  • Python version: ?
  • PyTorch version (GPU?): ?
  • Tensorflow version (GPU?): ?
  • Using GPU in script?: ?
  • Using distributed or parallel set-up in script?: ?
Label: wontfix

Most helpful comment

google/reformer-enwik8 is the only model that is a character-level language model and does not need a tokenizer. If you take a look here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 , you can see that the model does not need a tokenizer but a simple Python encode and decode function.

@julien-c @mfuntowicz - how do you think we can include char LMs in pipelines? Should we maybe introduce an is_char_lm config variable? Or just wrap a dummy tokenizer around the Python encode and decode functions?

All 4 comments

That's because only the crime-and-punishment model has an uploaded tokenizer.

google/reformer-enwik8 is the only model that is a character-level language model and does not need a tokenizer. If you take a look here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 , you can see that the model does not need a tokenizer but a simple Python encode and decode function.
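For readers unfamiliar with char LMs: a minimal sketch of what such encode/decode helpers can look like, following the byte-to-id pattern shown on the model card. The `+2` offset (reserving ids 0 and 1 for padding/special tokens) is taken from that card; the exact function signatures here are illustrative, not the library's API.

```python
# Sketch of character/byte-level "tokenization" for a char LM such as
# google/reformer-enwik8: raw bytes map directly to ids, so no trained
# tokenizer (spiece.model etc.) is needed.

def encode(text, offset=2):
    """Map each UTF-8 byte of `text` to an integer id, shifted by `offset`
    so that ids 0 and 1 remain free for padding/special tokens."""
    return [b + offset for b in text.encode("utf-8")]

def decode(ids, offset=2):
    """Map ids back to bytes, skipping any special ids below `offset`."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8")
```

A round trip like `decode(encode("Hello")) == "Hello"` holds, and padding ids (0) are simply dropped on decode.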

@julien-c @mfuntowicz - how do you think we can include char LMs in pipelines? Should we maybe introduce an is_char_lm config variable? Or just wrap a dummy tokenizer around the Python encode and decode functions?

Add an optional tokenizer_class attribute to config.json that overrides the tokenizer class instantiated when calling .from_pretrained()?
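For illustration, such an override might look like this in config.json (hypothetical at the time of this issue; the field name and the value shown are assumptions about how the proposal could be spelled):

```json
{
  "model_type": "reformer",
  "tokenizer_class": "ReformerTokenizer"
}
```

With this, .from_pretrained() could dispatch to the named tokenizer class (or to a dummy char-level wrapper) instead of guessing from the model type.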

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
