Model I am using: mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es
Language I am using the model on: Spanish
The problem arises when using: the official `pipeline` API for question answering
The task I am working on is: question answering (SQuAD2-es fine-tuned model)
Steps to reproduce the behavior:
```python
from transformers import pipeline

# Build a pipeline for QA
nlp = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
    tokenizer='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
)
nlp(
    {
        'question': 'que queso es?',  # "what cheese is it?"
        'context': 'Se utilizo en el dia de hoy un queso Emmental'  # "an Emmental cheese was used today"
    }
)
```
This was working two days ago.
Error log
```text
convert squad examples to features: 0%| | 0/1 [00:00, ?it/s]WARNING:transformers.tokenization_utils:Disabled padding because no padding token set (pad_token: [PAD], pad_token_id: 1).
To remove this error, you can add a new pad token and then resize model embedding:
tokenizer.pad_token = '
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 141, in squad_convert_example_to_features
    truncation_strategy="only_second" if tokenizer.padding_side == "right" else "only_first",
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 1796, in encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 1722, in batch_encode_plus
    tokens = self._tokenizer.encode(*batch_text_or_text_pairs[0])
  File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 141, in encode
    return self._tokenizer.encode(sequence, pair)
TypeError
"""
The above exception was the direct cause of the following exception:
TypeError                                 Traceback (most recent call last)
      8 nlp({
      9 'question': question,
---> 10 'context': context
     11 })
     12 )
11 frames
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in encode()
139 An Encoding
140 """
--> 141 return self._tokenizer.encode(sequence, pair)
142
143 def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:
TypeError:
```
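For reference, the warning at the top of the log suggests registering a pad token and resizing the model's embedding matrix. A minimal sketch of that generic fix, assuming the standard `AutoTokenizer`/`AutoModelForQuestionAnswering` classes (note it does not address the `TypeError` itself, which comes from the fast-tokenizer code path):

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = 'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Register a padding token and grow the embedding matrix to match,
# as the warning suggests. Illustrative only: the TypeError in this
# log is caused by the fast tokenizer, not by the missing pad token.
tokenizer.pad_token = '[PAD]'
model.resize_token_embeddings(len(tokenizer))
```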
transformers version: 2.5.0

Hi @ankandrew,
Thanks for reporting the issue. Indeed, the QA pipeline is currently not compatible with fast tokenizers, for technical reasons (I'm working on a fix for this).
As a workaround for now, you can disable fast tokenizers when allocating the pipeline:
```python
nlp = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
    tokenizer=(
        'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
        {"use_fast": False}  # force the slow (pure-Python) tokenizer
    )
)
nlp(
    {
        'question': 'que queso es?',
        'context': 'Se utilizo en el dia de hoy un queso Emmental'
    }
)
```
> {'score': 0.36319364208159755, 'start': 31, 'end': 44, 'answer': 'queso Emmental'}
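Alternatively, you can instantiate the slow tokenizer yourself and pass the object to the pipeline. A minimal sketch, assuming `AutoTokenizer.from_pretrained` accepts the `use_fast` flag in your installed version:

```python
from transformers import AutoTokenizer, pipeline

model_name = 'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
# use_fast=False loads the pure-Python tokenizer, avoiding the Rust-backed
# fast implementation that the QA pipeline cannot handle yet.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
nlp = pipeline('question-answering', model=model_name, tokenizer=tokenizer)
```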
Also cc'ing @mrm8488 for information while it's in the process of being fixed
Thanks for the information!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.