Transformers: 01_how-to-train.ipynb broken

Created on 17 Jun 2020 · 8 comments · Source: huggingface/transformers

🐛 Bug

To reproduce

Steps to reproduce the behavior:

  1. Go to https://github.com/huggingface/transformers/tree/master/examples
  2. Click the colab for language-modeling: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
  3. Run notebook

Expected behavior

The notebook finishes successfully.

What I get instead is:

Exception                                 Traceback (most recent call last)
<ipython-input-5-52625a7c86e5> in <module>()
      1 get_ipython().system('mkdir EsperBERTo')
----> 2 tokenizer.save("EsperBERTo")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    330                 A path to the destination Tokenizer file
    331         """
--> 332         return self._tokenizer.save(path, pretty)
    333 
    334     def to_str(self, pretty: bool = False):

Exception: Is a directory (os error 21)

Environment info

  • transformers version: 2.11.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0+cu101 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: NA
  • Using distributed or parallel set-up in script?: NA

All 8 comments

@orestisfl thanks for raising this. I was also scratching my head, since the same tokenizer.save("EsperBERTo") call worked for me a few days ago and no longer does. I could save the tokenizer using tokenizer.save("EsperBERTo/vocab.txt"), but then I can't load it back. If I try to load it, I get:

TypeError: sep_token not found in the vocabulary

I'm using BertWordPieceTokenizer (not the ByteLevelBPETokenizer used in your example).

I just checked: it works with tokenizers version 0.7.0, but I currently have version 0.8.0rc1 installed.
I'll downgrade to 0.7.0 for now.
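
A minimal sketch of how to get the old behavior back for the WordPiece case (assuming tokenizers >= 0.8.0, where save_model() takes over what save() used to do, as explained in the next comment; corpus.txt is a placeholder training file):

# Sketch for the WordPiece case, assuming tokenizers >= 0.8.0.
# corpus.txt is a placeholder; substitute your own training file.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=30_000)

# save_model() writes the plain vocab.txt that save() used to produce,
# so loading works the same way as before (the directory must exist)
tokenizer.save_model("EsperBERTo")  # -> EsperBERTo/vocab.txt
reloaded = BertWordPieceTokenizer("EsperBERTo/vocab.txt")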

Was this breaking change intended, @n1t0?

Yes, tokenizers 0.8.0 introduces full tokenizer serialization, whereas before it saved only the "model" (vocab.json + merges.txt for BPE). The save method should now be used like this: .save("tokenizer.json"), which saves the entire tokenizer to a single JSON file.
We need to update the notebook to use this new serialization method, but in the meantime, the only change needed to make it work exactly like before is to replace:

!mkdir EsperBERTo
tokenizer.save("EsperBERTo")

with

!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")
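
For reference, a minimal sketch of both serialization paths side by side (assuming tokenizers >= 0.8.0, the oscar.eo.txt corpus from the notebook, and that the EsperBERTo directory already exists):

# Sketch of both paths, assuming tokenizers >= 0.8.0
from tokenizers import ByteLevelBPETokenizer, Tokenizer
from transformers import RobertaTokenizerFast

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["oscar.eo.txt"], vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Old behavior, now under save_model(): writes vocab.json + merges.txt,
# which the transformers fast tokenizer loads as before
tokenizer.save_model("EsperBERTo")
roberta_tok = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

# New in 0.8.0: the entire tokenizer state goes into one JSON file...
tokenizer.save("EsperBERTo/tokenizer.json")
# ...which the generic Tokenizer class can load back directly
restored = Tokenizer.from_file("EsperBERTo/tokenizer.json")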

Mind updating it before we forget? Thanks!

Sure, updated it with the quick change I mentioned. Will do a better update later.

Hey there, thanks for the quick fix!
The notebook now crashes for me during training, however:

AttributeError                            Traceback (most recent call last)
<ipython-input-19-0c647bc3a8b8> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'trainer.train()')

11 frames
<decorator-gen-60> in time(self, line, cell, local_ns)

<timed eval> in <module>()

/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py in <listcomp>(.0)
    112         probability_matrix = torch.full(labels.shape, self.mlm_probability)
    113         special_tokens_mask = [
--> 114             self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    115         ]
    116         probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)

AttributeError: 'RobertaTokenizerFast' object has no attribute 'get_special_tokens_mask'

Let me know if I should open a separate issue.

This one is for me (this method was actually not working as intended under the hood for fast tokenizers...).
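
Until that fix lands, one possible workaround (a sketch only, not the upstream fix) is to hand the data collator the slow Python tokenizer, which does implement get_special_tokens_mask:

# Workaround sketch for transformers 2.11: the slow (Python) tokenizer
# implements get_special_tokens_mask, so the MLM data collator can use it
# while the fast tokenizer is being fixed upstream.
from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

slow_tokenizer = RobertaTokenizer.from_pretrained("./EsperBERTo", max_len=512)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=slow_tokenizer, mlm=True, mlm_probability=0.15
)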

@thomwolf - just to confirm, I tried the change you made and it fixes a problem for me. Thanks!

AttributeError: 'BertTokenizerFast' object has no attribute 'get_special_tokens_mask'