I was trying to train a language model by following this guide: https://huggingface.co/blog/how-to-train . The tokenizer trained successfully, but when I pass the trained tokenizer to LineByLineTextDataset, it shows the error below.

I have the same problem following the same example. Is it possible to adapt the LineByLineTextDataset class?
Same issue here
Hi, could you provide your software versions (transformers and tokenizers)?
print(transformers.__version__)
3.3.1
print(tokenizers.__version__)
0.9.0.rc1
transformers==3.3.1 has a strict dependency on tokenizers==0.8.1.rc2. Using that version in the colab for that script, I don't get that error.
Do you get the same error when running the colab?
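If you want to pin the matching pair explicitly, something like pip install transformers==3.3.1 tokenizers==0.8.1.rc2 should put you on the combination mentioned above.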

I have the right versions, but I still couldn't get it running @LysandreJik.
I am trying to use a BERT model, so I used the BPE tokenizer as is, instead of wrapping it with RobertaTokenizerFast.
Alternatively, if I use BertTokenizerFast, I get an error saying 'sep_token' is missing. How can I adapt this code for a BERT model instead of RoBERTa?
What am I missing here? Thanks for helping out.
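If you specifically want BERT, one route that should work is to train a WordPiece tokenizer instead of byte-level BPE. A minimal untested sketch, assuming the BertWordPieceTokenizer from tokenizers and a placeholder corpus path:

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# Train a WordPiece tokenizer, which is what BERT expects.
# BertWordPieceTokenizer defaults to the BERT special tokens
# ([PAD], [UNK], [CLS], [SEP], [MASK]), so 'sep_token' is defined.
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files="your_corpus.txt", vocab_size=5000, min_frequency=2)  # hypothetical path

# save_model writes vocab.txt into the given directory
tokenizer.save_model(".")

# Load the trained vocabulary with the fast BERT tokenizer from transformers
bert_tokenizer = BertTokenizerFast(vocab_file="vocab.txt")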
The same for me, the error persists with
print(transformers.__version__)
3.3.1
print(tokenizers.__version__)
0.8.1.rc2
Same thing here with transformers 3.3.1 and tokenizers 0.8.1.rc2.
I have the same issue with transformers 3.4.0 and tokenizers 0.9.2
Same issue with
tokenizers==0.9.2
transformers==3.4.0
Self-contained script to reproduce the error:
file_path = "test.txt"
with open(file_path, "w") as f:
    lorem_ipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n " \
        "Pellentesque ultrices scelerisque sem, lobortis laoreet nisi semper eget.\n " \
        "Curabitur egestas hendrerit neque, et rhoncus enim vulputate blandit.\n Nunc efficitur " \
        "posuere neque id ornare.\n Sed viverra nisi nec pulvinar accumsan. Nulla faucibus arcu " \
        "nisl, non bibendum libero congue eu.\n Mauris eget dignissim arcu, sed porttitor nunc. " \
        "Vivamus venenatis nisl ac leo maximus, in aliquam risus auctor.\n "
    f.write(lorem_ipsum)

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=file_path, vocab_size=20, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

from datasets import load_dataset

datasets = load_dataset('text', data_files=file_path)
# This line raises the error: ByteLevelBPETokenizer has no __call__ method,
# so it cannot be used like a transformers tokenizer here
datasets = datasets.map(lambda e: tokenizer(e['text']))
@n1t0 do you have some advice here?
Hmm, that's right, we have not yet incorporated a way to load a custom tokenizer from tokenizers in transformers.
I'll work on this in the coming days/weeks when @n1t0 has some time for #8073.
I guess the simplest way to use a custom tokenizer from the tokenizers library in transformers would be to add a new CustomTokenizer-specific class with some docs.
Another way would be to have a __call__ method in tokenizers identical to the one in transformers, but here @n1t0 is the master, not me.
I think the only viable way to really support the tokenizers from tokenizers, is to wrap them in what is expected throughout transformers: a PreTrainedTokenizerBase.
I used to be able to advise a way to do it (cf. here), but this recently changed, so it doesn't seem possible anymore without changing the private _tokenizer attribute.
We could also add a __call__ to tokenizers, and we probably will at some point, but that wouldn't fix this problem. Also, I think it's important to note that even with a __call__ method, the inputs/outputs would most probably still be different, so it couldn't be used as a drop-in replacement.
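For example, even today the two libraries return different things for similar calls. A minimal illustration with a trained ByteLevelBPETokenizer:

# tokenizers: encode() returns an Encoding object with attributes
encoding = tokenizer.encode("Lorem ipsum dolor sit amet")
print(encoding.ids)     # token ids as a plain list
print(encoding.tokens)  # the string tokens

# transformers: calling a tokenizer returns a dict-like BatchEncoding,
# e.g. {"input_ids": [...], "attention_mask": [...]}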
It should be possible to do something like this for now:
# Save the tokenizer you trained
tokenizer.save("byte-level-BPE.tokenizer.json")

# Load it using transformers
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")
And then you should be able to use it with the LineByLineTextDataset.
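For completeness, a sketch of the full round trip; the add_special_tokens mapping is an assumption, since the JSON file stores the token strings but does not tell transformers which role each one plays:

from transformers import LineByLineTextDataset, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")

# Tell transformers which trained token plays which role
# (these names match the special_tokens used in the repro script above)
tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
})

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="test.txt",
    block_size=128,
)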
Yes, it works! There is no error when I use TextDataset or LineByLineTextDataset.