Steps to reproduce the behavior: run the language-modeling notebook at https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

Expected behavior: the notebook finishes successfully.

What I get instead is:
```
Exception                                 Traceback (most recent call last)
<ipython-input-5-52625a7c86e5> in <module>()
      1 get_ipython().system('mkdir EsperBERTo')
----> 2 tokenizer.save("EsperBERTo")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    330             A path to the destination Tokenizer file
    331         """
--> 332         return self._tokenizer.save(path, pretty)
    333
    334     def to_str(self, pretty: bool = False):

Exception: Is a directory (os error 21)
```
transformers version: 2.11.0

@orestisfl thanks for raising this, I was also scratching my head: the same `tokenizer.save("EsperBERTo")` worked for me a few days ago and doesn't anymore. I could save the tokenizer using `tokenizer.save("EsperBERTo/vocab.txt")`, but then I can't load it. If I try to load it, I get:

```
TypeError: sep_token not found in the vocabulary
```
I'm using `BertWordPieceTokenizer` (not the `ByteLevelBPETokenizer` used in your example).
It worked with tokenizers version 0.7.0; I just checked, and I currently have version 0.8.0rc1 installed.
I'll downgrade to 0.7.0 for now.
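For anyone else following along, the downgrade in the Colab is just the following (you may need to restart the runtime afterwards):

```
!pip install tokenizers==0.7.0
```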
Was this BC break intended, @n1t0?
Yes, tokenizers 0.8.0 introduces full tokenizer serialization, whereas before only the "model" was saved (vocab.json + merges.txt for BPE). So the save method should now be used like this: `.save("tokenizer.json")`, and it saves the entire tokenizer to a JSON file.
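As an illustration, here is a minimal sketch of the new round trip, reusing the training call from the notebook; note that `Tokenizer.from_file` as the matching load call is my assumption about this release, so double-check it against the 0.8.0 docs:

```python
from tokenizers import ByteLevelBPETokenizer, Tokenizer

# Train a tokenizer as in the notebook (file and settings are the notebook's own).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# New in 0.8.0: save() writes the whole tokenizer (model, normalizer,
# pre-tokenizer, special tokens, ...) into a single JSON file.
tokenizer.save("tokenizer.json")

# Assumption: reload the full tokenizer later from that one file.
restored = Tokenizer.from_file("tokenizer.json")
```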
We need to update the Notebook to use this new serialization method, but in the meantime, the only thing needed to make it work exactly like before is to replace:
```
!mkdir EsperBERTo
tokenizer.save("EsperBERTo")
```

with

```
!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")
```
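For context, `save_model` restores the 0.7.0 behaviour of writing only the model files (vocab.json and merges.txt for this BPE model), so the notebook's existing load step should keep working unchanged; a sketch, assuming the notebook's paths:

```python
from tokenizers import ByteLevelBPETokenizer

# Reload the saved model files exactly as the notebook already does.
tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)
```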
Mind updating it before we forget? Thanks!
Sure, updated it with the quick change I mentioned. Will do a better update later.
Hey there, thanks for the quick fix!
The notebook now crashes for me during training, however:
```
AttributeError                            Traceback (most recent call last)
<ipython-input-19-0c647bc3a8b8> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'trainer.train()')

11 frames
<decorator-gen-60> in time(self, line, cell, local_ns)

<timed eval> in <module>()

/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py in <listcomp>(.0)
    112         probability_matrix = torch.full(labels.shape, self.mlm_probability)
    113         special_tokens_mask = [
--> 114             self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    115         ]
    116         probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)

AttributeError: 'RobertaTokenizerFast' object has no attribute 'get_special_tokens_mask'
```
Let me know if I should make a separate issue.
This one is for me (this method was actually not working as intended under the hood for Fast-tokenizers...)
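In the meantime, one possible stopgap (my assumption, not the maintainers' fix) is to load the slow, pure-Python tokenizer in the notebook instead of the Rust-backed fast one, since the slow class implements `get_special_tokens_mask`:

```python
from transformers import RobertaTokenizer

# Stopgap sketch: swap RobertaTokenizerFast for the slow RobertaTokenizer.
# Assumes the notebook's ./EsperBERTo tokenizer directory and max_len setting.
tokenizer = RobertaTokenizer.from_pretrained("./EsperBERTo", max_len=512)
```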
@thomwolf - just to confirm, I tried the change you made and it fixes a problem for me. Thanks!
```
AttributeError: 'BertTokenizerFast' object has no attribute 'get_special_tokens_mask'
```