transformers version: 9c0afdaf7b091c341072b432ad6ee17ba7a5016b
mT5: @patrickvonplaten
Generating from mT5-small gives (nearly) empty output:
from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
article = "translate to french: The capital of France is Paris."
batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], return_tensors="pt")
output_ids = model.generate(input_ids=batch.input_ids, num_return_sequences=1, num_beams=8, length_penalty=0.1)
tokenizer.decode(output_ids[0])
>>> <pad> <extra_id_0></s>
Using the same input for T5 gives reasonable output:
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
article = "translate to french: The capital of France is Paris."
batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], return_tensors="pt")
output_ids = model.generate(input_ids=batch.input_ids, num_return_sequences=1, num_beams=8, length_penalty=0.1)
tokenizer.decode(output_ids[0])
>>> <pad> La capitale de la France est Paris.</s>
My understanding is that mT5 is trained in the same way as T5, so shouldn't it work in a very similar way?
mT5 is not pretrained on downstream tasks like T5 was - see: https://huggingface.co/transformers/master/model_summary.html#mt5
So it's not surprising that mT5 won't work well out of the box without fine-tuning.
Ah, I hadn't realised that. But in that case, wouldn't the expected output be a reconstruction of the input?
Hard to say, if the input does not include any sentinel tokens (<extra_id_1>) and one uses generate() instead of just the forward pass. It would be interesting to play around with the two pre-trained model variants though and see what differences they show...
I agree that I would only get reconstruction if the decoding setup matched training :) Can you point me at any documentation that describes what special tokens are expected? I dug around in your implementation and the official repo but couldn't see anything. The output of tokenizer.prepare_seq2seq_batch() is the same for src and tgt as well (presumably because it uses the T5 tokenizer - does it not need its own?)
Edit: Looking again, it seems like the sentinel tokens are just the equivalent of [MASK]? In which case the model should be able to reconstruct the input if it has access to the full (un-noised) sequence.
Maybe these pointers help:
mT5 is pretrained exactly like T5, only without the supervised downstream training mixed in. I think the T5 paper explains in detail how this is done.
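As a quick sanity check, one could probe the pretrained span-corruption objective directly by placing a sentinel token in the input and letting the model fill it in (a small sketch; the prompt is just an illustration and the quality of the completion is not guaranteed):
from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
# mask a span with a sentinel token, mirroring the pretraining objective
input_ids = tokenizer("The capital of France is <extra_id_0>.", return_tensors="pt").input_ids
output_ids = model.generate(input_ids=input_ids, num_beams=4, max_length=20)
print(tokenizer.decode(output_ids[0]))  # the prediction for <extra_id_0> follows the sentinel token in the output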
Does anybody have some more pointers on how to use (train) the mT5 model that has been added to master for text generation? Anything explaining how the finetuning is done in practice using Huggingface Transformers would be greatly appreciated!
Hey @Rijgersberg, what exactly do you mean by text generation? GPT-2-like open-ended text generation?
Well, not open-ended text generation in the sense of "writing", but using text-to-text generation to perform all types of different NLP tasks with little to no training. Basically what the GPT-3 paper calls "few-shot learning".
Specifically, I would be interested in replicating the "WT5?! Training Text-to-Text Models to Explain their Predictions" results in languages other than English. But I'm having some trouble understanding what the differences between the T5 and mT5 models in Transformers mean for accomplishing that task.
Hey @tomhosking, how did you use MT5ForConditionalGeneration and T5Tokenizer?
I used
pip install transformers
But it is showing
ImportError: cannot import name 'MT5ForConditionalGeneration'
How can we install it?
@parthplc You can specify the version of the package you would like to install. For me it was the experimental transformers==4.0.0rc1 and it works fine.
For training an mT5 model to generate summaries, you can check out this post. It worked for me.
[edit]
I forgot to mention: the only modification you have to make is to replace T5ForConditionalGeneration with MT5ForConditionalGeneration.
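For reference, a minimal check that the pinned version exposes the class (the version pin follows the comment above and may need updating for newer releases):
# pip install transformers==4.0.0rc1
from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
print(model.config.model_type)  # "mt5"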
> Well, not open-ended text generation in the sense of "writing", but using text-to-text generation to perform all types of different NLP tasks with little to no training. Basically what the GPT-3 paper calls "few-shot learning".
> Specifically, I would be interested in replicating the "WT5?! Training Text-to-Text Models to Explain their Predictions" results in languages other than English. But I'm having some trouble understanding what the differences between the T5 and mT5 models in Transformers mean for accomplishing that task.
In this case, I would just fine-tune mT5 with the normal causal language modeling objective, meaning:
from transformers import MT5ForConditionalGeneration, T5Tokenizer
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
mt5_tok = T5Tokenizer.from_pretrained("google/mt5-base")
input_ids = mt5_tok("explain sentiment: I went to see this movie with my husband, and we both thought the acting was terrible!", return_tensors="pt").input_ids # in the language of your choice
labels = mt5_tok("negative explanation: the acting was terrible.", return_tensors="pt").input_ids # in the language of your choice
loss = mt5(input_ids=input_ids, labels=labels).loss
I took one of the examples shown in the paper you mentioned.
In short, there is no difference in how mT5 and T5 should be fine-tuned.
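Building on the snippet above, here is a minimal sketch of a single fine-tuning step (the optimizer and learning rate are placeholders; in real batches, any padding tokens in labels should be replaced with -100 so they are ignored by the loss):
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
mt5_tok = T5Tokenizer.from_pretrained("google/mt5-base")
optimizer = torch.optim.AdamW(mt5.parameters(), lr=1e-4)  # placeholder hyperparameters

mt5.train()
input_ids = mt5_tok("explain sentiment: I went to see this movie with my husband, and we both thought the acting was terrible!", return_tensors="pt").input_ids
labels = mt5_tok("negative explanation: the acting was terrible.", return_tensors="pt").input_ids  # no padding here, so no -100 masking needed
loss = mt5(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()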
Also, @mrm8488 already successfully fine-tuned an mT5 model: https://twitter.com/mrm8488/status/1329478063768350723
Sorry to ping you here @mrm8488, but maybe you have some tips/tricks for mT5 fine-tuning?
Also pinging our T5 fine-tuning expert @patil-suraj
> Well, not open-ended text generation in the sense of "writing", but using text-to-text generation to perform all types of different NLP tasks with little to no training. Basically what the GPT-3 paper calls "few-shot learning".
I'm not sure if you can use mT5 with no training (fine-tuning), since it was not pre-trained with any supervised objective like T5.
One experiment to try is to fine-tune mT5 on the English data and see if it works for your language without any language-specific fine-tuning (in my experiments, T5 trained on English SQuAD for question generation gave interesting results for French and German without any language-specific fine-tuning).
But for better results you should fine-tune mT5 on the language-specific dataset.
And, as Patrick said, you can fine-tune mT5 and T5 the same way.
The major differences between mT5 and T5 are:
- mT5 is based on T5 1.1
- pre-trained on 101 languages
- no supervised pre-training
Hi, I slightly modified the script provided by @patil-suraj to fine-tune [T5 on SQuAD](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) and after many epochs (I think I am missing something/doing something wrong) I got 'decent' results fine-tuning mT5-small on TyDi QA for multilingual QA: https://huggingface.co/mrm8488/mT5-small-finetuned-tydiqa-for-xqa. The PR with the model card for more details is not approved yet.
> Hi, I slightly modified the script provided by @patil-suraj to fine-tune [T5 on SQuAD](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) and after many epochs (I think I am missing something/doing something wrong) I got 'decent' results fine-tuning mT5-small on TyDi QA for multilingual QA: https://huggingface.co/mrm8488/mT5-small-finetuned-tydiqa-for-xqa. The PR with the model card for more details is not approved yet.
just merged it :-) BTW, you can now directly create the model cards online - no need for PRs anymore ;-)
Hey @patil-suraj @mrm8488, how can we fine-tune mT5 for other languages? Suppose we have a translation problem for a language other than English: if we fine-tune using the T5 tokenizer, every word gets replaced with <unk> tokens. How should the fine-tuning be done? E.g.:
print(tokenizer.decode(data['source_ids']))
print(tokenizer.decode(data['target_ids']))
English to Hindi: Tell me the name of the ninth month.</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
<unk> <unk> <unk> <unk> <unk> <unk> </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
@parthplc - I don't really understand your question. Since mT5 was trained on 101 languages, its tokenizer can obviously handle all those languages, e.g.:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/mt5-small")
tok.decode(tok("Der Satz wird auch definiert als sprachliche Einheit, die aus Subjekt und Pr盲dikat besteht. Dies soll auf Aristoteles zur眉ckgehen. Entsprechend definiert die traditionelle Grammatik den Satz als bestehend aus: Satzaussage (Pr盲dikat), Satzerg盲nzung (Objekt) und Satzgegenstand (Subjekt).").input_ids) # gives no <unk> symbols
Hopefully, this makes more sense now
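And to address the Hindi example above directly, the same check with a Hindi sentence ("Tell me the name of the ninth month."; the exact sentence is only an illustration) should decode back without <unk> tokens when the mT5 tokenizer is used:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/mt5-small")
# Hindi input; with the mT5 sentencepiece vocabulary this should round-trip without <unk>
print(tok.decode(tok("नौवें महीने का नाम बताओ।").input_ids))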