Model I am using: mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es
Language I am using the model on: Spanish
The problem arises when using: the official `pipeline` API for question answering
The task I am working on is: question answering (SQuAD2-es fine-tuned model)
Steps to reproduce the behavior:
```python
from transformers import pipeline

# Build a pipeline for QA
nlp = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
    tokenizer='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
)
nlp(
    {
        'question': 'que queso es?',  # "what cheese is it?"
        'context': 'Se utilizo en el dia de hoy un queso Emmental'  # "an Emmental cheese was used today"
    }
)
```
This was working two days ago.
Error log
```text
convert squad examples to features: 0%| | 0/1 [00:00, ?it/s]WARNING:transformers.tokenization_utils:Disabled padding because no padding token set (pad_token: [PAD], pad_token_id: 1).
To remove this error, you can add a new pad token and then resize model embedding:
tokenizer.pad_token = '
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 141, in squad_convert_example_to_features
    truncation_strategy="only_second" if tokenizer.padding_side == "right" else "only_first",
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 1796, in encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 1722, in batch_encode_plus
    tokens = self._tokenizer.encode(*batch_text_or_text_pairs[0])
  File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 141, in encode
    return self._tokenizer.encode(sequence, pair)
TypeError
"""
The above exception was the direct cause of the following exception:
TypeError                                 Traceback (most recent call last)
      8 nlp({
      9 'question': question,
---> 10 'context': context
     11 })
     12 )
11 frames
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in encode()
139 An Encoding
140 """
--> 141 return self._tokenizer.encode(sequence, pair)
142
143 def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
TypeError:
```
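For reference, the warning at the top of the log suggests registering a pad token and resizing the model's embedding matrix. A minimal sketch of that generic fix, assuming the standard `AutoTokenizer`/`AutoModelForQuestionAnswering` classes (note it does not address the `TypeError` itself, which comes from the fast-tokenizer code path):

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = 'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Register a padding token and grow the embedding matrix to match,
# as the warning suggests. Illustrative only: the TypeError in this
# log is caused by the fast tokenizer, not by the missing pad token.
tokenizer.pad_token = '[PAD]'
model.resize_token_embeddings(len(tokenizer))
```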
transformers version: 2.5.0

Hi @ankandrew,
Thanks for reporting the issue. Indeed, the QA pipeline is currently not compatible with fast tokenizers, for technical reasons (I'm working on a fix for this).
As a workaround for now, you can disable fast tokenizers when allocating the pipeline:
```python
nlp = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
    tokenizer=(
        'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
        {"use_fast": False}  # force the slow (pure-Python) tokenizer
    )
)
nlp(
    {
        'question': 'que queso es?',
        'context': 'Se utilizo en el dia de hoy un queso Emmental'
    }
)
```
> {'score': 0.36319364208159755, 'start': 31, 'end': 44, 'answer': 'queso Emmental'}
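Alternatively, you can instantiate the slow tokenizer yourself and pass the object to the pipeline. A minimal sketch, assuming `AutoTokenizer.from_pretrained` accepts the `use_fast` flag in your installed version:

```python
from transformers import AutoTokenizer, pipeline

model_name = 'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
# use_fast=False loads the pure-Python tokenizer, avoiding the Rust-backed
# fast implementation that the QA pipeline cannot handle yet.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
nlp = pipeline('question-answering', model=model_name, tokenizer=tokenizer)
```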
Also cc'ing @mrm8488 for information while it's in the process of being fixed
Thanks for the information!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.