Model I am using (Bert, XLNet ...): Reformer
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The tasks I am working on is:
Steps to reproduce the behavior:
The model's tokenizer cannot be found; I'm getting the same error in scripts as the one displayed on your webpage:
⚠️ This model could not be loaded by the inference API. ⚠️
Error loading tokenizer Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url. OSError("Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.")
Tokenizer loaded without issues
transformers version: latest

That's because only the crime-and-punishment model has an uploaded tokenizer.
google/reformer-enwik8 is the only model that is a character-level language model and does not need a tokenizer. If you take a look here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 , you can see that the model does not need a tokenizer but simple Python encode and decode functions.
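For reference, a minimal sketch of what such character-level encode/decode functions might look like. The byte offset and pad id below are assumptions for illustration; the canonical torch-based versions live on the model card linked above.

```python
# Sketch of char-level encode/decode for a model like google/reformer-enwik8.
# Assumptions: pad id 0, and byte values shifted by 2 so they don't collide
# with special ids. The model card has the authoritative implementation.

PAD_TOKEN_ID = 0  # assumed padding id
OFFSET = 2        # assumed shift past the special ids

def encode(text: str) -> list[int]:
    # Map each UTF-8 byte directly to a token id, shifted past special ids.
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    # Drop special ids (< OFFSET), shift the remaining bytes back, decode.
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode(
        "utf-8", errors="ignore"
    )

print(decode(encode("hello enwik8")))  # → "hello enwik8"
```

Because the vocabulary is just shifted bytes, no vocabulary file (`spiece.model`) is needed at all, which is why the generic tokenizer lookup fails for this checkpoint.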
@julien-c @mfuntowicz - how do you think we can include char LMs in pipelines? Should we maybe introduce an is_char_lm config variable? Or just wrap a dummy tokenizer around the Python encode and decode functions?
Add a tokenizer_class optional attribute to config.json which overrides the type of Tokenizer that's instantiated when calling .from_pretrained()?
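As a sketch, that could be an extra key in the checkpoint's config.json (the key name and value here are hypothetical, just illustrating the proposal — `.from_pretrained()` would read it and instantiate that tokenizer class instead of the architecture's default):

```json
{
  "model_type": "reformer",
  "tokenizer_class": "ReformerCharTokenizer"
}
```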
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.