Transformers: TypeError: 'ByteLevelBPETokenizer' object is not callable

Created on 18 Sep 2020 · 16 comments · Source: huggingface/transformers

I was trying to train a language model by following this guide: https://huggingface.co/blog/how-to-train. The tokenizer trained successfully, but when I load the trained tokenizer in LineByLineTextDataset, it shows the error below:

[screenshot: traceback ending in TypeError: 'ByteLevelBPETokenizer' object is not callable]

Most helpful comment

It should be possible to do something like this for now:

# Save the tokenizer you trained
tokenizer.save("byte-level-BPE.tokenizer.json")

# Load it using transformers
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")

And then you should be able to use it with the LineByLineTextDataset.

All 16 comments

I have the same problem following the same example. Is it possible to adapt the LineByLineTextDataset class?

Same issue here

Hi, could you provide your software versions (transformers and tokenizers)?

print(transformers.__version__)
3.3.1

print(tokenizers.__version__)
0.9.0.rc1

transformers==3.3.1 has a strict dependency on tokenizers==0.8.1.rc2. Using that version in the colab for that script, I don't get that error.

Do you get the same error when running the colab?

[screenshot: installed versions]
I have the right versions, but I still couldn't get it running, @LysandreJik.
I am trying to use a BERT model, so I used the BPE tokenizer as is instead of the RobertaTokenizerFast step.

Alternatively, if I use BertTokenizerFast, I get an error saying 'sep_token' is missing. How can I change this code to work with a BERT model instead of RoBERTa? (One possible approach is sketched below.)

What am I missing here? Thanks for helping out.
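
As a side note on the BERT question above: the thread doesn't answer it, so the following is only a sketch. It assumes you want to train a BERT-style vocabulary from scratch with the WordPiece tokenizer from tokenizers (whose defaults already include [CLS]/[SEP]) and then load the resulting vocab.txt through BertTokenizerFast; the corpus path is a placeholder.

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# Train a WordPiece vocabulary (the tokenizer family BERT expects)
wordpiece_tokenizer = BertWordPieceTokenizer(lowercase=True)
wordpiece_tokenizer.train(
    files=["your_corpus.txt"],  # placeholder path to your training text
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wordpiece_tokenizer.save_model(".")  # writes ./vocab.txt

# BertTokenizerFast already knows the BERT special tokens ([CLS], [SEP], ...),
# so loading the trained vocab should avoid the missing 'sep_token' error
bert_tokenizer = BertTokenizerFast("vocab.txt", do_lower_case=True)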

The same for me; the error persists with

print(transformers.__version__)
3.3.1

print(tokenizers.__version__)
0.8.1.rc2

Same thing here with transformers 3.3.1 and tokenizers 0.8.1.rc2.

I have the same issue with transformers 3.4.0 and tokenizers 0.9.2

Same issue with

tokenizers==0.9.2
transformers==3.4.0

Self-contained script to reproduce the error

file_path = "test.txt"
with open(file_path, "w") as f:
    lorem_ipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n " \
                  "Pellentesque ultrices scelerisque sem, lobortis laoreet nisi semper eget.\n " \
                  "Curabitur egestas hendrerit neque, et rhoncus enim vulputate blandit.\n Nunc efficitur " \
                  "posuere neque id ornare.\n Sed viverra nisi nec pulvinar accumsan. Nulla faucibus arcu " \
                  "nisl, non bibendum libero congue eu.\n Mauris eget dignissim arcu, sed porttitor nunc. " \
                  "Vivamus venenatis nisl ac leo maximus, in aliquam risus auctor.\n "
    f.write(lorem_ipsum)
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=file_path, vocab_size=20, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
from datasets import load_dataset
datasets = load_dataset('text', data_files=file_path)
# The next line raises TypeError: 'ByteLevelBPETokenizer' object is not callable,
# because a tokenizers object is not callable the way a transformers tokenizer is
datasets = datasets.map(lambda e: tokenizer(e['text']))
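
To make the failure mode concrete, a small illustration continuing from the script above (the example sentence is arbitrary): the lambda calls the tokenizer object directly, but a tokenizers tokenizer only exposes encode()/encode_batch(); the callable interface belongs to transformers tokenizers.

# A tokenizers object is not callable, but it does expose encode(),
# which returns an Encoding with .ids and .tokens
encoding = tokenizer.encode("Lorem ipsum dolor sit amet")
print(encoding.ids)
print(encoding.tokens)

# A transformers tokenizer, by contrast, is called directly, e.g. tok("some text"),
# and returns a dict-like BatchEncoding with "input_ids", which is what the
# datasets.map lambda and LineByLineTextDataset expect here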

@n1t0 do you have some advice here?

Hmm, that's right: we have not yet incorporated a way to load a custom tokenizer from tokenizers into transformers.

I'll work on this in the coming days/weeks when @n1t0 has some time for #8073.

I guess the simplest way to use a custom tokenizer from the tokenizers library in transformers would be to add a new, dedicated CustomTokenizer class with some documentation.

Another way would be to have a __call__ method in tokenizers identical to the one in transformers, but here @n1t0 is the master, not me.

I think the only viable way to really support the tokenizers from tokenizers is to wrap them in what is expected throughout transformers: a PreTrainedTokenizerBase.
I used to be able to advise a way to do it (cf. here), but this recently changed, so it doesn't seem possible anymore without touching the private _tokenizer.

We could also add a __call__ to tokenizers, and we probably will at some point, but that wouldn't fix this problem. It's also important to note that even with a __call__ method, the inputs/outputs would most probably still be different, so it couldn't be used as a drop-in replacement.

It should be possible to do something like this for now:

# Save the tokenizer you trained
tokenizer.save("byte-level-BPE.tokenizer.json")

# Load it using transformers
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")

And then you should be able to use it with the LineByLineTextDataset.


Yes, it works! There is no error when I use TextDataset or LineByLineTextDataset.
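
For reference, here is a minimal end-to-end sketch of this workaround applied to the reproduction script above. The add_special_tokens step and the block_size value are assumptions on my part (the plain wrapper is already enough for LineByLineTextDataset, but components such as DataCollatorForLanguageModeling need to know the mask and pad tokens):

from tokenizers import ByteLevelBPETokenizer
from transformers import LineByLineTextDataset, PreTrainedTokenizerFast

# Train as in the reproduction script, then serialize the full tokenizer to JSON
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files="test.txt", vocab_size=20, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save("byte-level-BPE.tokenizer.json")

# Wrap it in a transformers fast tokenizer so that it becomes callable
wrapped_tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-BPE.tokenizer.json")

# Assumption: register which trained tokens play which special role
wrapped_tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
})

# The wrapped tokenizer can now be used wherever transformers expects one
dataset = LineByLineTextDataset(
    tokenizer=wrapped_tokenizer,
    file_path="test.txt",
    block_size=128,  # illustrative value
)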
