Transformers: Tokenizer not found after conversion from TF checkpoint to PyTorch

Created on 14 Aug 2019 · 3 comments · Source: huggingface/transformers

🐛 Bug

Model I am using (Bert, XLNet, ...): GPT2

Language I am using the model on (English, Chinese, ...): English

The problem arises when using:

  • [x] the official example scripts: run_generation.py, convert_tf_checkpoint_to_pytorch.py
  • [ ] my own modified scripts: (give details)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: Text generation. I fine-tuned a GPT-2 model using TensorFlow and converted the checkpoint to PyTorch with the convert_tf_checkpoint_to_pytorch.py script. Running run_generation.py from the examples folder then fails with an error. It seems the tokenizer is not loaded from the converted model. (Maybe it is not saved?)

To Reproduce

Steps to reproduce the behavior:

  1. Have a TensorFlow checkpoint.
  2. Convert it with `python -m pytorch_transformers gpt2 path/to/checkpoint path/to/save/model`
  3. Run `python run_generation.py --model_type gpt2 --model_name_or_path path/to/saved/model --top_p 0.9 --prompt "Hello Huggingface"`

This results in the following error:

```
Traceback (most recent call last):
  File "run_generation.py", line 195, in <module>
    main()
  File "run_generation.py", line 175, in main
    context_tokens = tokenizer.encode(raw_text)
AttributeError: 'NoneType' object has no attribute 'encode'
```

Expected behavior

Text generation, as when using "gpt2" as model_name_or_path.

Environment

  • OS: Windows 10
  • Python version: 3.7
  • PyTorch version: 1.1
  • PyTorch Transformers version (or branch): 1.0
  • Using GPU? Yes, but it doesn't work on CPU either
  • Distributed or parallel setup? No
  • Any other relevant information:

Additional context

I managed to get it working by substituting the loading of the tokenizer with "gpt2"; that way the tokenizer is loaded not from my fine-tuned model but from the cache of the 117M version. Is the tokenizer actually trained?
Right now I have three files in the model folder: config.json, pytorch_model.bin, and vocab.bpe. Am I missing a file?
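
A minimal sketch of that workaround, assuming the pytorch_transformers 1.0 API and the placeholder paths from the steps above: save the stock gpt2 tokenizer files into the converted model directory, so run_generation.py can load the model and tokenizer from the same path without modification.

```python
# Sketch, not the official fix: write the stock gpt2 tokenizer files
# (vocab.json, merges.txt) into the converted model directory so that
# run_generation.py finds them next to config.json and pytorch_model.bin.
# "path/to/saved/model" is the placeholder directory from the steps above.
import os

from pytorch_transformers import GPT2Tokenizer

save_dir = "path/to/saved/model"
os.makedirs(save_dir, exist_ok=True)  # save_pretrained expects an existing dir

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # cached 117M vocabulary
tokenizer.save_pretrained(save_dir)
```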

wontfix

All 3 comments

Hi, no, the tokenizer is not trained. You can just load the original gpt2 one.
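
A minimal sketch of that suggestion, under the pytorch_transformers 1.0 API: load the fine-tuned weights from the converted directory, but the tokenizer from the original gpt2 files.

```python
# Sketch: the fine-tuned weights come from the converted directory, while
# the tokenizer (which is never trained) comes from the stock gpt2 release.
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("path/to/saved/model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

context_tokens = tokenizer.encode("Hello Huggingface")  # list of token ids
```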

Shouldn't the tokenizer then be loaded from args.model_type and not args.model_name_or_path? Or do they differ from gpt2 to gpt2-medium?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
