Steps to reproduce the behavior: run the language-modeling notebook at https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

Expected behavior: the notebook finishes successfully.

What I get instead is:
```
Exception                                 Traceback (most recent call last)
<ipython-input-5-52625a7c86e5> in <module>()
      1 get_ipython().system('mkdir EsperBERTo')
----> 2 tokenizer.save("EsperBERTo")

/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
    330             A path to the destination Tokenizer file
    331         """
--> 332         return self._tokenizer.save(path, pretty)
    333
    334     def to_str(self, pretty: bool = False):

Exception: Is a directory (os error 21)
```
transformers version: 2.11.0

@orestisfl thanks for raising this, I was also scratching my head: the same `tokenizer.save("EsperBERTo")` worked for me a few days ago and doesn't anymore. I could save the tokenizer using `tokenizer.save("EsperBERTo/vocab.txt")`, but then I can't load it. If I try to load it, I get:

```
TypeError: sep_token not found in the vocabulary
```
I'm using `BertWordPieceTokenizer` (not the `ByteLevelBPETokenizer` used in your example).
It worked with tokenizers version 0.7.0; I just checked, and I currently have version 0.8.0rc1 installed.
I'll downgrade to 0.7.0 for now.
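For anyone else following along, the downgrade in the Colab is just the following (you may need to restart the runtime afterwards):

```
!pip install tokenizers==0.7.0
```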
Was this BC break intended, @n1t0?
Yes, tokenizers 0.8.0 introduces full tokenizer serialization, whereas before only the "model" was saved (vocab.json + merges.txt for BPE). So the save method should now be used like this: `.save("tokenizer.json")`, and it saves the entire tokenizer to a JSON file.
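As an illustration, here is a minimal sketch of the new round trip, reusing the training call from the notebook; note that `Tokenizer.from_file` as the matching load call is my assumption about this release, so double-check it against the 0.8.0 docs:

```python
from tokenizers import ByteLevelBPETokenizer, Tokenizer

# Train a tokenizer as in the notebook (file and settings are the notebook's own).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# New in 0.8.0: save() writes the whole tokenizer (model, normalizer,
# pre-tokenizer, special tokens, ...) into a single JSON file.
tokenizer.save("tokenizer.json")

# Assumption: reload the full tokenizer later from that one file.
restored = Tokenizer.from_file("tokenizer.json")
```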
We need to update the Notebook to use this new serialization method, but in the meantime, the only thing needed to make it work exactly like before is to replace:
```
!mkdir EsperBERTo
tokenizer.save("EsperBERTo")
```

with

```
!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")
```
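For context, `save_model` restores the 0.7.0 behaviour of writing only the model files (vocab.json and merges.txt for this BPE model), so the notebook's existing load step should keep working unchanged; a sketch, assuming the notebook's paths:

```python
from tokenizers import ByteLevelBPETokenizer

# Reload the saved model files exactly as the notebook already does.
tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)
```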
Mind updating it before we forget? Thanks!
Sure, updated it with the quick change I mentioned. Will do a better update later.
Hey there, thanks for the quick fix!
The notebook now crashes for me during training, however:
```
AttributeError                            Traceback (most recent call last)
<ipython-input-19-0c647bc3a8b8> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'trainer.train()')

11 frames
<decorator-gen-60> in time(self, line, cell, local_ns)

<timed eval> in <module>()

/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py in <listcomp>(.0)
    112         probability_matrix = torch.full(labels.shape, self.mlm_probability)
    113         special_tokens_mask = [
--> 114             self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    115         ]
    116         probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)

AttributeError: 'RobertaTokenizerFast' object has no attribute 'get_special_tokens_mask'
```
Let me know if I should make a separate issue.
This one is for me (this method was actually not working as intended under the hood for Fast-tokenizers...)
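In the meantime, one possible stopgap (my assumption, not the maintainers' fix) is to load the slow, pure-Python tokenizer in the notebook instead of the Rust-backed fast one, since the slow class implements `get_special_tokens_mask`:

```python
from transformers import RobertaTokenizer

# Stopgap sketch: swap RobertaTokenizerFast for the slow RobertaTokenizer.
# Assumes the notebook's ./EsperBERTo tokenizer directory and max_len setting.
tokenizer = RobertaTokenizer.from_pretrained("./EsperBERTo", max_len=512)
```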
@thomwolf - just to confirm, I tried the change you made and it fixes a problem for me. Thanks!
```
AttributeError: 'BertTokenizerFast' object has no attribute 'get_special_tokens_mask'
```