I was trying to train a language model by following this guide: https://huggingface.co/blog/how-to-train . The tokenizer trained successfully, but when I pass the trained tokenizer to LineByLineTextDataset, it shows the error below.

I have the same problem following the same example. Is it possible to adapt the LineByLineTextDataset class?
Same issue here
Hi, could you provide your software versions (transformers and tokenizers)?
print(transformers.__version__)
3.3.1
print(tokenizers.__version__)
0.9.0.rc1
transformers==3.3.1 has a strict dependency on tokenizers==0.8.1.rc2. Using that version in the colab for that script, I don't get that error.
Do you get the same error when running the colab?
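If you want to pin the matching pair explicitly, something like pip install transformers==3.3.1 tokenizers==0.8.1.rc2 should put you on the combination mentioned above.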

I have the right versions, but I still couldn't get it running @LysandreJik.
I am trying to use a BERT model, so I used the BPE tokenizer as is, instead of wrapping it with RobertaTokenizerFast.
Alternatively, if I use BertTokenizerFast, I get an error saying 'sep_token' is missing. How can I adapt this code for a BERT model instead of RoBERTa?
What am I missing here? Thanks for helping out.
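If you specifically want BERT, one route that should work is to train a WordPiece tokenizer instead of byte-level BPE. A minimal untested sketch, assuming the BertWordPieceTokenizer from tokenizers and a placeholder corpus path:

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# Train a WordPiece tokenizer, which is what BERT expects.
# BertWordPieceTokenizer defaults to the BERT special tokens
# ([PAD], [UNK], [CLS], [SEP], [MASK]), so 'sep_token' is defined.
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files="your_corpus.txt", vocab_size=5000, min_frequency=2)  # hypothetical path

# save_model writes vocab.txt into the given directory
tokenizer.save_model(".")

# Load the trained vocabulary with the fast BERT tokenizer from transformers
bert_tokenizer = BertTokenizerFast(vocab_file="vocab.txt")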
The same for me, the error persists with
print(transformers.__version__)
3.3.1
print(tokenizers.__version__)
0.8.1.rc2
Same thing here with transformers 3.3.1 and tokenizers 0.8.1.rc2.
I have the same issue with transformers 3.4.0 and tokenizers 0.9.2
Same issue with
tokenizers==0.9.2
transformers==3.4.0
Self-contained script to reproduce the error:
file_path = "test.txt"
with open(file_path, "w") as f:
    lorem_ipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n " \
        "Pellentesque ultrices scelerisque sem, lobortis laoreet nisi semper eget.\n " \
        "Curabitur egestas hendrerit neque, et rhoncus enim vulputate blandit.\n Nunc efficitur " \
        "posuere neque id ornare.\n Sed viverra nisi nec pulvinar accumsan. Nulla faucibus arcu " \
        "nisl, non bibendum libero congue eu.\n Mauris eget dignissim arcu, sed porttitor nunc. " \
        "Vivamus venenatis nisl ac leo maximus, in aliquam risus auctor.\n "
    f.write(lorem_ipsum)

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=file_path, vocab_size=20, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

from datasets import load_dataset

datasets = load_dataset('text', data_files=file_path)
# This line raises the error: ByteLevelBPETokenizer has no __call__ method,
# so it cannot be used like a transformers tokenizer here
datasets = datasets.map(lambda e: tokenizer(e['text']))
@n1t0 do you have some advice here?
Hmm, that's right, we have not yet incorporated a way to load a custom tokenizer from tokenizers in transformers.
I'll work on this in the coming days/weeks when @n1t0 has some time for #8073.
I guess the simplest way to use a custom tokenizer from the tokenizers library in transformers would be to add a new CustomTokenizer-specific class with some docs.
Another way would be to have a __call__ method in tokenizers identical to the one in transformers, but here @n1t0 is the master, not me.
I think the only viable way to really support the tokenizers from tokenizers, is to wrap them in what is expected throughout transformers: a PreTrainedTokenizerBase.
I used to be able to advise a way to do it (cf. here), but this recently changed, so it doesn't seem possible anymore without changing the private _tokenizer attribute.
We could also add a __call__ to tokenizers, and we probably will at some point, but that wouldn't fix this problem. Also, I think it's important to note that even with a __call__ method, the inputs/outputs would most probably still be different, so it couldn't be used as a drop-in replacement.
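For example, even today the two libraries return different things for similar calls. A minimal illustration with a trained ByteLevelBPETokenizer:

# tokenizers: encode() returns an Encoding object with attributes
encoding = tokenizer.encode("Lorem ipsum dolor sit amet")
print(encoding.ids)     # token ids as a plain list
print(encoding.tokens)  # the string tokens

# transformers: calling a tokenizer returns a dict-like BatchEncoding,
# e.g. {"input_ids": [...], "attention_mask": [...]}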
It should be possible to do something like this for now:
# Save the tokenizer you trained
tokenizer.save("byte-level-BPE.tokenizer.json")

# Load it using transformers
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")
And then you should be able to use it with the LineByLineTextDataset.
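For completeness, a sketch of the full round trip; the add_special_tokens mapping is an assumption, since the JSON file stores the token strings but does not tell transformers which role each one plays:

from transformers import LineByLineTextDataset, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")

# Tell transformers which trained token plays which role
# (these names match the special_tokens used in the repro script above)
tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
})

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="test.txt",
    block_size=128,
)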
Yes, it works! There is no error when I use TextDataset or LineByLineTextDataset.