The tokenizer I am using is BertTokenizer. I've also tried AlbertTokenizer, but it makes no difference, so I'm thinking that the bug is in the base tokenizer.
Language I am using the model on is English, but I don't believe that's the issue.
Steps to reproduce the behavior:
transformers==2.11.0
from transformers import BertModel, BertTokenizer
text = 'A quick brown fox jumps over' # Just a dummy text
BertTokenizer.encode_plus(
text.split(' '),
None,
add_special_tokens = True,
max_length = 512)
Traceback (most recent call last):
File "classification.py", line 23, in <module>
max_length = 512)
File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1576, in encode_plus
first_ids = get_input_ids(text)
File "D:\Programmering\Python\lib\site-packages\transformers\tokenization_utils.py", line 1556, in get_input_ids
"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
And yes, I've tried just inputting a string, and I still got the same error.
I want the encode_plus function to return an encoded version of the input sequence.
transformers version: 2.11.0
The mistake is on me. I forgot to download the tokenizer 😂
I am getting the same error. What exactly do you mean by download the tokenizer? Doesn't it come with the transformers package?
I think what he meant was that he used the class, and not an instance, to encode text. You should always initialize the class first:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# or
tokenizer = BertTokenizer(vocabfile)
# now you can encode
text = 'A quick brown fox jumps over' # Just a dummy text
model_inputs = tokenizer.encode_plus(text)
@LysandreJik, while I have you: I know this ain't the right place to ask, but
I've seen that you're about to release the Electra modeling for question answering. I've written a small script for training the Electra discriminator for question answering, and I'm about to train the model.
So would it be useful for you if I trained the model, or are you already doing that?
Hi @mariusjohan, we welcome all models here :) The hub is a very easy way to share models. The way you're training it will surely be different to other trainings, so sharing it on the hub with details of how you trained it is always welcome!
Ok, this is still not working for me. I am running the run_squad.py script and I keep getting the error.
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/kriviv-10T/transformers/transformers/src/transformers/data/processors/squad.py", line 142, in squad_convert_example_to_features
return_token_type_ids=True,
File "/kriviv-10T/transformers/transformers/src/transformers/tokenization_utils_base.py", line 1521, in encode_plus
**kwargs,
File "/kriviv-10T/transformers/transformers/src/transformers/tokenization_utils.py", line 356, in _encode_plus
second_ids = get_input_ids(text_pair) if text_pair is not None else None
File "/kriviv-10T/transformers/transformers/src/transformers/tokenization_utils.py", line 343, in get_input_ids
f"Input {text} is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
The reason I got the error was that I forgot to initialize the tokenizer. When you call the method on the class itself, it takes your input as the self argument, so the real input argument is never supplied. And of course, my actual system was way more complex than the example I gave, so maybe check what the tokenization module is being given. Maybe also check your inputs and so on, if you haven't already. Sadly, I can't look into it for a few hours.
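To make the self mix-up concrete, here is a minimal sketch using a toy class. ToyTokenizer, its vocab and its simplified validation are all made up for illustration and are not the transformers implementation; the only point is how Python binds arguments when a method is called on the class instead of an instance.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; NOT the transformers API."""

    def __init__(self, vocab):
        self.vocab = vocab

    def encode_plus(self, text, text_pair=None):
        # Simplified validation, loosely modelled on the error message above.
        if not isinstance(text, (str, list, tuple)):
            raise ValueError(f"Input {text} is not valid.")
        words = text.split(" ") if isinstance(text, str) else list(text)
        return [self.vocab.get(word, 0) for word in words]


words = "A quick brown fox".split(" ")

# Called on the class: `words` is bound to `self` and None becomes `text`.
try:
    ToyTokenizer.encode_plus(words, None)
    class_call_error = None
except ValueError as err:
    class_call_error = str(err)

# Called on an instance: `self` is the tokenizer and `words` is `text`.
tokenizer = ToyTokenizer({"A": 1, "quick": 2, "brown": 3, "fox": 4})
ids = tokenizer.encode_plus(words)
```

In the class call the list of words never reaches the text parameter, which is why the validation fails even though the input itself looks fine.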
@vkrishnamurthy11 Did it help?
I'm still facing the same issue:
ValueError: Input [] is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
This happens while trying to run run_squad.py. I'm trying to train and test it with:
https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
Facing the same issue as Sarang when training on SQuAD using run_squad.py. Is this a known bug?
Same issue here when running squad_convert_examples_to_features in my own code.
I don't know if it helps, but in my case the reason was that I failed to use the from_pretrained function. Maybe check for that, for example by printing out the self argument.
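A rough sketch of that kind of check, in case it helps anyone debugging. describe_tokenizer_call is a hypothetical helper, not part of transformers, and the conditions are only my reading of the error message (a string, or a non-empty list/tuple of strings or of integers), so treat it as an assumption rather than the library's actual logic.

```python
import inspect


def describe_tokenizer_call(tokenizer, text):
    """Hypothetical debugging helper: list likely problems before encoding."""
    problems = []

    # A class here means from_pretrained (or the constructor) was never called.
    if inspect.isclass(tokenizer):
        problems.append("tokenizer is a class, not an instance")

    if isinstance(text, str):
        pass  # plain strings are accepted
    elif isinstance(text, (list, tuple)):
        if len(text) == 0:
            problems.append("input is an empty list/tuple")
        elif not (
            all(isinstance(item, str) for item in text)
            or all(isinstance(item, int) for item in text)
        ):
            problems.append("input mixes types or holds unsupported items")
    else:
        problems.append(f"input has unsupported type {type(text).__name__}")

    return problems


class DummyTokenizer:
    """Placeholder class used only to exercise the helper."""


class_problems = describe_tokenizer_call(DummyTokenizer, "A quick brown fox")
empty_problems = describe_tokenizer_call(DummyTokenizer(), [])
ok_problems = describe_tokenizer_call(DummyTokenizer(), ["A", "quick"])
```

Calling it right before encode_plus and printing the result should show whether the problem is the tokenizer object or the input, such as the empty list in the run_squad.py traceback.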