Transformers: Using the T5 model with Hugging Face's fill-mask pipeline

Created on 26 Apr 2020  ·  11 Comments  ·  Source: huggingface/transformers

Does anyone know if it is possible to use the T5 model with Hugging Face's fill-mask pipeline? The snippet below shows how to do it with the default model, but I can't seem to figure out how to do it with the T5 model specifically.

from transformers import pipeline
nlp_fill = pipeline('fill-mask')
nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)

Trying the following, for example, raises "TypeError: must be str, not NoneType" because nlp_fill.tokenizer.mask_token is None:

nlp_fill = pipeline('fill-mask', model='t5-base', tokenizer='t5-base')
nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)
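
For what it's worth, a minimal check along these lines (assuming the t5-base tokenizer) suggests that T5 simply defines no mask token at all:

from transformers import AutoTokenizer

t5_tokenizer = AutoTokenizer.from_pretrained('t5-base')
print(t5_tokenizer.mask_token)  # prints None -- T5 defines no [MASK]-style token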

Stack Overflow question

wontfix

All 11 comments

Correct me if I'm wrong @patrickvonplaten, but I don't think T5 is trained on masked language modeling (and it does not have a mask token), so it will not work with this pipeline.

Yeah, T5 is not trained on the conventional BERT-like masked language modeling objective. It uses a special encoder-decoder masked language modeling objective (see the docs here), but this is not really supported in combination with the fill-mask pipeline at the moment.
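
For reference, this is roughly what that objective looks like at the input/target level (a sketch based on the span-corruption example in the T5 docs; the sentinel tokens <extra_id_0>, <extra_id_1>, ... stand in for the dropped spans):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')

# The encoder sees the text with whole spans replaced by sentinel tokens:
input_text = 'The <extra_id_0> walks in <extra_id_1> park'
# The decoder has to reproduce each dropped span, introduced by its sentinel:
target_text = '<extra_id_0> cute dog <extra_id_1> the <extra_id_2>'

input_ids = tokenizer.encode(input_text, return_tensors='pt')
labels = tokenizer.encode(target_text, return_tensors='pt')
# Passing these to T5ForConditionalGeneration as input_ids and labels
# (lm_labels in older transformers releases) yields the denoising loss.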

Hi @patrickvonplaten, is there any plan to support T5 with the fill-mask pipeline in the near future?

T5 is an encoder-decoder model, so I don't really see it as a fitting model for the fill-mask task.

Could we use the following workaround?

  • <extra_id_0> could be treated as the mask token
  • Candidate sequences for the masked span could be generated with code like the following:
import torch
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

T5_PATH = 't5-base' # "t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # my environment uses CPU

t5_tokenizer = T5Tokenizer.from_pretrained(T5_PATH)
t5_config = T5Config.from_pretrained(T5_PATH)
t5_mlm = T5ForConditionalGeneration.from_pretrained(T5_PATH, config=t5_config).to(DEVICE)

# Input text
text = 'India is a <extra_id_0> of the world. </s>'

encoded = t5_tokenizer.encode_plus(text, add_special_tokens=True, return_tensors='pt')
input_ids = encoded['input_ids'].to(DEVICE)

# Generate 20 candidate sequences, each with maximum length 5
outputs = t5_mlm.generate(input_ids=input_ids, 
                          num_beams=200, num_return_sequences=20,
                          max_length=5)

_0_index = text.index('<extra_id_0>')
_result_prefix = text[:_0_index]
_result_suffix = text[_0_index + len('<extra_id_0>'):]

def _filter(output, end_token='<extra_id_1>'):
    # The first token is <pad> (the decoder start token, id 0) and the second is <extra_id_0> (id 32099), so skip both
    _txt = t5_tokenizer.decode(output[2:], skip_special_tokens=False, clean_up_tokenization_spaces=False)
    if end_token in _txt:
        _end_token_index = _txt.index(end_token)
        return _result_prefix + _txt[:_end_token_index] + _result_suffix
    else:
        return _result_prefix + _txt + _result_suffix

results = list(map(_filter, outputs))
results

Output:

['India is a cornerstone of the world. </s>',
 'India is a part of the world. </s>',
 'India is a huge part of the world. </s>',
 'India is a big part of the world. </s>',
 'India is a beautiful part of the world. </s>',
 'India is a very important part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a unique part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a beautiful country in of the world. </s>',
 'India is a part of the of the world. </s>',
 'India is a small part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a country in the of the world. </s>',
 'India is a large part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a significant part of the world. </s>',
 'India is a part of the world. </s>']

@girishponkiya Thanks for your example! Unfortunately, I can't reproduce your results. I get

['India is a _0> of the world. </s>',
 'India is a  ⁇ extra of the world. </s>',
 'India is a India is  of the world. </s>',
 'India is a  ⁇ extra_ of the world. </s>',
 'India is a a  of the world. </s>',
 'India is a [extra_ of the world. </s>',
 'India is a India is an of the world. </s>',
 'India is a of the world of the world. </s>',
 'India is a India. of the world. </s>',
 'India is a is a of the world. </s>',
 'India is a India  ⁇  of the world. </s>',
 'India is a Inde is  of the world. </s>',
 'India is a ] of the of the world. </s>',
 'India is a . of the world. </s>',
 'India is a _0 of the world. </s>',
 'India is a is  ⁇  of the world. </s>',
 'India is a india is  of the world. </s>',
 'India is a India is the of the world. </s>',
 'India is a -0> of the world. </s>',
 'India is a  ⁇ _ of the world. </s>']

Tried on CPU, GPU, 't5-base' and 't5-3b' — same thing.

Could you please mention the versions of torch, transformers, and tokenizers you used?

I used the following:

  • torch: 1.5.0+cu101
  • transformers: 2.8.0
  • tokenizers: 0.7.0

The tokenizers version used by the latest transformers release has a bug. Looking at your output, I believe you are using a buggy version of tokenizers.
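
A quick way to check whether your installed tokenizers/transformers combination handles the sentinel tokens correctly is an encode/decode round trip (just a diagnostic sketch):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained('t5-base')
ids = tok.encode('<extra_id_0>', add_special_tokens=False)
print(ids)              # a healthy setup maps the sentinel to a single id, 32099
print(tok.decode(ids))  # and decodes it back to '<extra_id_0>'
# If the sentinel is split into pieces or decodes to fragments such as '_0>'
# or ' ⁇ extra', you are hitting the bug discussed above.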

@girishponkiya I'm using

transformers 2.9.0
tokenizers 0.7.0
torch 1.4.0

Tried tokenizers 0.5.2 and transformers 2.8.0 — now it works, thank you!

Thanks to @takahiro971. He pointed out this bug in #4021.

@girishponkiya thanks a lot for the code above. Your example works, but if I instead run it with the text:

text = "<extra_id_0> came to power after defeating Stalin"

I get the following error:

/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in _generate_beam_search(self, input_ids, cur_len, max_length, min_length, do_sample, early_stopping, temperature, top_k, top_p, repetition_penalty, no_repeat_ngram_size, bad_words_ids, bos_token_id, pad_token_id, eos_token_id, decoder_start_token_id, batch_size, num_return_sequences, length_penalty, num_beams, vocab_size, encoder_outputs, attention_mask)
   1354             # test that beam scores match previously calculated scores if not eos and batch_idx not done
   1355             if eos_token_id is not None and all(
-> 1356                 (token_id % vocab_size).item() is not eos_token_id for token_id in next_tokens[batch_idx]
   1357             ):
   1358                 assert torch.all(

UnboundLocalError: local variable 'next_tokens' referenced before assignment

Any ideas about the cause?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
