Transformers: Tokenizer not found after conversion from TF checkpoint to PyTorch

Created on 14 Aug 2019 · 3 comments · Source: huggingface/transformers

🐛 Bug

Model I am using (Bert, XLNet, ...): GPT2

Language I am using the model on (English, Chinese, ...): English

The problem arises when using:

  • [x] the official example scripts: run_generation.py, convert_tf_checkpoint_to_pytorch.py
  • [ ] my own modified scripts: (give details)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: Text generation. I fine-tuned a GPT-2 model using TensorFlow and converted the checkpoint to PyTorch with the convert_tf_checkpoint_to_pytorch.py script. Running run_generation.py from the examples folder then fails with an error. It seems the tokenizer is not loaded from the converted model. (Maybe it is not saved?)

To Reproduce

Steps to reproduce the behavior:

  1. Have a TensorFlow checkpoint.
  2. Convert it with `python -m pytorch_transformers gpt2 path/to/checkpoint path/to/save/model`
  3. Run `python run_generation.py --model_type gpt2 --model_name_or_path path/to/saved/model --top_p 0.9 --prompt "Hello Huggingface"`

This results in the following error:

```
Traceback (most recent call last):
  File "run_generation.py", line 195, in <module>
    main()
  File "run_generation.py", line 175, in main
    context_tokens = tokenizer.encode(raw_text)
AttributeError: 'NoneType' object has no attribute 'encode'
```

Expected behavior

Text generation, as when using "gpt2" as model_name_or_path.

Environment

  • OS: Windows 10
  • Python version: 3.7
  • PyTorch version: 1.1
  • PyTorch Transformers version (or branch): 1.0
  • Using GPU? Yes, but it doesn't work on CPU either
  • Distributed or parallel setup? No
  • Any other relevant information:

Additional context

I managed to get it working by substituting the loading of the tokenizer with "gpt2"; that way the tokenizer is loaded not from my fine-tuned model but from the cache of the 117M version. Is the tokenizer actually trained?
Right now I have three files in the model folder: config.json, pytorch_model.bin, and vocab.bpe. Am I missing a file?
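
A minimal sketch of that workaround, assuming the pytorch_transformers 1.0 API and the placeholder paths from the steps above: save the stock gpt2 tokenizer files into the converted model directory, so run_generation.py can load the model and tokenizer from the same path without modification.

```python
# Sketch, not the official fix: write the stock gpt2 tokenizer files
# (vocab.json, merges.txt) into the converted model directory so that
# run_generation.py finds them next to config.json and pytorch_model.bin.
# "path/to/saved/model" is the placeholder directory from the steps above.
import os

from pytorch_transformers import GPT2Tokenizer

save_dir = "path/to/saved/model"
os.makedirs(save_dir, exist_ok=True)  # save_pretrained expects an existing dir

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # cached 117M vocabulary
tokenizer.save_pretrained(save_dir)
```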

wontfix

All 3 comments

Hi, no, the tokenizer is not trained. You can just load the original gpt2 one.
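
A minimal sketch of that suggestion, under the pytorch_transformers 1.0 API: load the fine-tuned weights from the converted directory, but the tokenizer from the original gpt2 files.

```python
# Sketch: the fine-tuned weights come from the converted directory, while
# the tokenizer (which is never trained) comes from the stock gpt2 release.
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("path/to/saved/model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

context_tokens = tokenizer.encode("Hello Huggingface")  # list of token ids
```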

Shouldn't the tokenizer then be loaded from args.model_type and not args.model_name_or_path? Or do they differ from gpt2 to gpt2-medium?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
