Transformers: BertTokenizer: ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Created on 14 Jun 2020 · 12 Comments · Source: huggingface/transformers

🐛 Bug

Information

The tokenizer I am using is BertTokenizer. I've also tried AlbertTokenizer, but it makes no difference, so I think the bug is in the base tokenizer.

The language I am using the model on is English, but I don't believe that's the issue.

The problem arises when using:

[ ] the official example scripts: (give details below)
[x] my own modified scripts: (give details below)

The task I am working on is:

[ ] an official GLUE/SQuAD task: (give the name)
[x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Version: transformers==2.11.0
Run this code

from transformers import BertModel, BertTokenizer
text = 'A quick brown fox jumps over' # Just a dummy text
BertTokenizer.encode_plus(
    text.split(' '),
    None,
    add_special_tokens = True,
    max_length = 512)

This should be the error

Traceback (most recent call last):
  File "classification.py", line 23, in <module>
    max_length = 512)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1576, in encode_plus
    first_ids = get_input_ids(text)
  File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1556, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

And yes, I've tried just inputting a string, and I still got the same error.

Expected behavior

I want the encode_plus function to return an encoded version of the input sequence.

Environment info

transformers version: 2.11.0
Platform: Windows
Python version: 3.7.4
PyTorch version (GPU?): 1.5.0+cpu
Tensorflow version (GPU?): (Not used)
Using GPU in script?: Nope
Using distributed or parallel set-up in script?: No

Source

mariusjohan

Most helpful comment

I'm still facing the same issue:
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
This happens while trying to run run_squad.py. I'm trying to train and test it with:
https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

SarangSanjayGujar-lilly on 25 Jun 2020

👍2

All 12 comments

The mistake is on me. I forgot to download the tokenizer😂

mariusjohan on 14 Jun 2020

I am getting the same error. What exactly do you mean by download the tokenizer? Doesn't it come with the transformers package?

vkrishnamurthy11 on 18 Jun 2020

I think what he meant was that he used the class, and not an instance, to encode the text. You should always instantiate the class first:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# or, with vocabfile being the path to a vocabulary file
tokenizer = BertTokenizer(vocabfile)

# now you can encode
text = 'A quick brown fox jumps over' # Just a dummy text
model_inputs = tokenizer.encode_plus(text)

LysandreJik on 18 Jun 2020

@LysandreJik, while I have you: I know this isn't the right place to ask, but here goes.

I've seen that you're about to release the Electra modeling for question answering. I've written a small script for training the Electra discriminator for question answering, and I'm about to train the model.
Would it be useful for you if I trained the model, or are you already doing that?

mariusjohan on 18 Jun 2020

👍1

Hi @mariusjohan, we welcome all models here :) The hub is a very easy way to share models. The way you're training it will surely be different to other trainings, so sharing it on the hub with details of how you trained it is always welcome!

LysandreJik on 19 Jun 2020

👍1

Ok, this is still not working for me. I am running the run_squad.py script and I keep getting the error.

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/kriviv-10T/transformers/transformers/src/transformers/data/processors/squad.py", line 142, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/kriviv-10T/transformers/transformers/src/transformers/tokenization_utils_base.py", line 1521, in encode_plus
    **kwargs,
  File "/kriviv-10T/transformers/transformers/src/transformers/tokenization_utils.py", line 356, in _encode_plus
    second_ids = get_input_ids(text_pair) if text_pair is not None else None
  File "/kriviv-10T/transformers/transformers/src/transformers/tokenization_utils.py", line 343, in get_input_ids
    f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
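
The traceback shows the error being raised for text_pair, and the message prints the rejected value as []. So in this case an empty list reached the tokenizer. A minimal sketch of the kind of check the error message describes is below. The function check_input is a hypothetical stand-in written for illustration, not the library's code, and it assumes that an empty list matches none of the accepted shapes.

```python
def check_input(text):
    """Hypothetical stand-in for the tokenizer's input check (not library code)."""
    if isinstance(text, str):
        return "string"
    # Assumption: a list/tuple must be non-empty before its items are inspected.
    if isinstance(text, (list, tuple)) and len(text) > 0:
        if all(isinstance(item, str) for item in text):
            return "list of strings"
        if all(isinstance(item, int) for item in text):
            return "list of integers"
    raise ValueError(
        f"Input {text} is not valid. Should be a string, "
        "a list/tuple of strings or a list/tuple of integers."
    )


# Try a few inputs, including the two that show up in this thread: [] and None.
for candidate in ("A quick brown fox", ["A", "quick"], [101, 102], [], None):
    try:
        print(repr(candidate), "->", check_input(candidate))
    except ValueError as err:
        print(repr(candidate), "-> ValueError:", err)
```

If this assumption holds, it may be worth checking the dataset for examples whose second sequence ends up empty before they reach the tokenizer.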

vkrishnamurthy11 on 19 Jun 2020

The reason I got the error was that I forgot to initialize the tokenizer. Because I called the method on the class, it treated my input as the self argument, so the real input argument was never supplied. And of course, my actual code was far more complex than the example I gave, so maybe check how your tokenizer is being created. Maybe also check your inputs and so on, if you haven't already. Sadly, I won't be able to look into it for a few hours.
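
The mix-up described above can be reproduced without transformers. The sketch below uses a hypothetical ToyTokenizer class (not part of the library) whose encode_plus only reports what each parameter was bound to. The call arguments mirror the ones in the original report.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not part of transformers."""

    def encode_plus(self, text, text_pair=None, **kwargs):
        # Report what `self` and `text` were bound to for this call.
        return {"self": self, "text": text}


words = "A quick brown fox jumps over".split(" ")

# Called on the class: the word list is bound to `self`, and None to `text`.
wrong = ToyTokenizer.encode_plus(words, None, add_special_tokens=True, max_length=512)
print(type(wrong["self"]).__name__, wrong["text"])

# Called on an instance: `self` is the tokenizer, and `text` is the word list.
tokenizer = ToyTokenizer()
right = tokenizer.encode_plus(words, None, add_special_tokens=True, max_length=512)
print(type(right["self"]).__name__, right["text"])
```

In the class-level call the method receives None as its text, which would explain why passing a plain string instead of a list gave the same error.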

mariusjohan on 19 Jun 2020

@vkrishnamurthy11 Did it help?

mariusjohan on 20 Jun 2020

Facing the same issue as Sarang when training on SQuAD using run_squad.py. Is this a known bug?

dhruvluci on 26 Jun 2020

Same issue here when running squad_convert_examples_to_features in my own code.

Archetype90 on 20 Jul 2020

I don't know if it helps, but in my case the cause was that I failed to use the .from_pretrained function. Maybe check for that, for example by printing out the self argument.
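
One way to make that check explicit is a small guard that fails early when a class is passed where an instance is expected. This is only a sketch: require_instance and ToyTokenizer are hypothetical names introduced for illustration, and inspect.isclass comes from the Python standard library.

```python
import inspect


def require_instance(tokenizer):
    """Hypothetical helper: raise if a class was passed instead of an instance."""
    if inspect.isclass(tokenizer):
        raise TypeError(
            f"{tokenizer.__name__} is a class, not an instance; "
            "create one first, e.g. with from_pretrained(...)"
        )
    return tokenizer


class ToyTokenizer:
    """Stand-in for a real tokenizer class, for illustration only."""


require_instance(ToyTokenizer())  # an instance passes through unchanged

try:
    require_instance(ToyTokenizer)  # the class itself is rejected
except TypeError as err:
    print(err)
```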

mariusjohan on 21 Jul 2020
