Transformers: google/reformer-enwik8 tokenizer was not found in tokenizers model name list

Created on 14 Jul 2020 · 4 comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Reformer

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The tasks I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [ ] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Open https://huggingface.co/google/reformer-enwik8
  2. Look at "Hosted inference API"

The model's tokenizer cannot be found; I'm getting the same error in scripts as the one displayed on your webpage:

⚠️ This model could not be loaded by the inference API. ⚠️ Error loading tokenizer Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url. OSError("Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.")

Expected behavior

The tokenizer loads without issues.

Environment info

  • transformers version: latest
  • Platform: your own
  • Python version: ?
  • PyTorch version (GPU?): ?
  • Tensorflow version (GPU?): ?
  • Using GPU in script?: ?
  • Using distributed or parallel set-up in script?: ?
Label: wontfix

Most helpful comment

google/reformer-enwik8 is the only model that is a character-level language model and does not need a tokenizer. If you take a look here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 , you can see that the model does not need a tokenizer but a simple Python encode and decode function.

@julien-c @mfuntowicz - how do you think we can include char LMs in pipelines? Should we maybe introduce an is_char_lm config variable? Or just wrap a dummy tokenizer around the Python encode and decode functions?

All 4 comments

That's because only the crime-and-punishment model has an uploaded tokenizer.

google/reformer-enwik8 is the only model that is a character-level language model and does not need a tokenizer. If you take a look here: https://huggingface.co/google/reformer-enwik8#reformer-language-model-on-character-level-and-trained-on-enwik8 , you can see that the model does not need a tokenizer but a simple Python encode and decode function.
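For readers unfamiliar with char LMs: a minimal sketch of what such encode/decode helpers can look like, following the byte-to-id pattern shown on the model card. The `+2` offset (reserving ids 0 and 1 for padding/special tokens) is taken from that card; the exact function signatures here are illustrative, not the library's API.

```python
# Sketch of character/byte-level "tokenization" for a char LM such as
# google/reformer-enwik8: raw bytes map directly to ids, so no trained
# tokenizer (spiece.model etc.) is needed.

def encode(text, offset=2):
    """Map each UTF-8 byte of `text` to an integer id, shifted by `offset`
    so that ids 0 and 1 remain free for padding/special tokens."""
    return [b + offset for b in text.encode("utf-8")]

def decode(ids, offset=2):
    """Map ids back to bytes, skipping any special ids below `offset`."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8")
```

A round trip like `decode(encode("Hello")) == "Hello"` holds, and padding ids (0) are simply dropped on decode.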

@julien-c @mfuntowicz - how do you think we can include char LMs in pipelines? Should we maybe introduce an is_char_lm config variable? Or just wrap a dummy tokenizer around the Python encode and decode functions?

Add an optional tokenizer_class attribute to config.json that overrides the tokenizer class instantiated when calling .from_pretrained()?
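For illustration, such an override might look like this in config.json (hypothetical at the time of this issue; the field name and the value shown are assumptions about how the proposal could be spelled):

```json
{
  "model_type": "reformer",
  "tokenizer_class": "ReformerTokenizer"
}
```

With this, .from_pretrained() could dispatch to the named tokenizer class (or to a dummy char-level wrapper) instead of guessing from the model type.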

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
