Transformers: Lack of fine-tuning examples for the T5 model

Created on 18 May 2020 · 39 comments · Source: huggingface/transformers

🚀 Feature request

It seems like the examples under transformers/examples don't support T5 except for translation.

Motivation

We need more examples! It should be easy to add some for simple benchmarks.

Your contribution

None currently, but I am working on it!

Most helpful comment

@ghomasHudson @Chenyzsjtu
DecaNLP sounds good. So we can include one generative task and one non-generative.
Let's see what @patrickvonplaten says then I'll move ahead with this.

Until then, can you check my fine-tuning examples and give me some feedback? Here are the notebooks.

For SQuAD here
For (IMDB sentiment, Emotion classification, SWAG multiple choice) here

All 39 comments

I've set up T5 fine-tuning using Lightning and also HF's new Trainer. I can submit a PR for that. Would like to hear from @patrickvonplaten

It would be awesome if you could open a PR for this!

Great! I'll organize my examples and submit a PR as soon as I finish.

@Chenyzsjtu @patrickvonplaten Could you please suggest a good task for this? I've fine-tuned T5 mostly on non-generative tasks (IMDB sentiment, Emotion classification, SWAG multiple choice, SQuAD 1.1) and two generative tasks, CNN/DM and question generation. Which tasks should I consider adding?

The GLUE and SuperGLUE tasks would be an obvious choice (mainly classification though). The DecaNLP tasks also have a nice mix of classification and generation.

@Chenyzsjtu @patrickvonplaten Could you please suggest a good task for this? I've fine-tuned T5 mostly on non-generative tasks (IMDB sentiment, Emotion classification, SWAG multiple choice, SQuAD 1.1) and two generative tasks, CNN/DM and question generation. Which tasks should I consider adding?

There are many benchmarks tested in the original paper. Since we only need an example for demonstration purposes, a single task from GLUE or SuperGLUE should be enough.
Maybe MRPC? It needs fewer training steps, and it was fine-tuned on its own rather than as part of the GLUE mixture, as described in the paper. Plus, it is also the example used for BERT in examples/text-classification.
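For reference, here is a rough sketch (not from any official example) of how a single MRPC pair could be cast into T5's text-to-text format; the "mrpc sentence1: ... sentence2: ..." prefix and the "equivalent"/"not_equivalent" targets follow the convention from the T5 paper, but the sentences and checkpoint here are just illustrative:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

sentence1 = "The company bought the startup last year."
sentence2 = "Last year, the startup was acquired by the company."
label = 1  # 1 = equivalent, 0 = not_equivalent

# T5 casts classification as text generation: a prefixed input and a label word as the target
input_text = f"mrpc sentence1: {sentence1} sentence2: {sentence2}"
target_text = "equivalent" if label == 1 else "not_equivalent"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
labels = tokenizer(target_text, return_tensors="pt").input_ids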

@ghomasHudson @Chenyzsjtu
DecaNLP sounds good. So we can include one generative task and one non-generative.
Let's see what @patrickvonplaten says then I'll move ahead with this.

Until then, can you check my fine-tuning examples and give me some feedback? Here are the notebooks.

For SQuAD here
For (IMDB sentiment, Emotion classification, SWAG multiple choice) here

That's a great notebook!

Also note that you can now use our nlp library, here: https://github.com/huggingface/nlp, which will reduce your whole data preprocessing code to just a couple of lines. I think we have all the datasets you are using in your notebook(s) in the library :-).

I think @sshleifer and @julien-c have worked more on the examples lately, so they probably would know better how to integrate it. @julien-c, @sshleifer - do you think we can add a pytorch lightning T5 notebook to our examples?

@patrickvonplaten
Yes, using the nlp library makes more sense. The SQuAD notebook above uses the nlp library for data processing, just ~10 lines of code, and it also uses the HF Trainer instead of Lightning. So I have both trainers ready, Lightning as well as the HF Trainer.

IMO we should use the HF Trainer instead of Lightning since most of the examples now use it. Converting the above tasks to the HF Trainer is fairly easy.
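For readers who haven't opened the notebook, here is a rough sketch (assumed, not the notebook's exact code) of what SQuAD preprocessing with the nlp library looks like when casting examples into T5's text-to-text format:

import nlp  # the library later renamed to `datasets`

train_dataset = nlp.load_dataset("squad", split="train")

def convert_to_features(example):
    # Cast each SQuAD example into T5's text-to-text format
    input_text = "question: %s  context: %s" % (example["question"], example["context"])
    target_text = example["answers"]["text"][0]
    return {"input_text": input_text, "target_text": target_text}

train_dataset = train_dataset.map(convert_to_features)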

Only just saw the SQuAD notebook - amazing!

Ok, we had some internal discussions on how to add notebooks and decided to add a table to the README as shown in this PR: https://github.com/huggingface/transformers/pull/4441. @patil-suraj I used your SQuAD notebook as an example of how a notebook could be added. Can you check whether that's OK with you?

If that's fine with you, I'll merge the PR and you can add the other notebook for IMDB, Emotion classification, ... in a new PR. It would be awesome if you could also use nlp there, but you don't have to. Everything that's useful is welcome :-)

@patrickvonplaten
Thank you for considering this! This sounds good to me.
I'll also use the nlp library in the other notebook and open another PR for that.

@patrickvonplaten
Thank you for considering this! This sounds good to me.
I'll also use the nlp library in the other notebook and open another PR for that.

That sounds awesome :-)

I’ve also worked on an example notebook for tweet sentiment span extraction with T5 that I can share around this weekend (Kaggle competition dataset).

Would it be ok to PR this as well? Would I have to add the dataset to nlp? 🙂

For sure feel free to open a PR :-) It would be nice if you use nlp, but that's definitely not a must!
We are happy about every community notebook :-)

@patil-suraj
Thanks a lot for contributing the fine-tuning notebook!
I noticed that in the notebook your final performance for SQuAD 1.1 on t5-base is:
"{'exact_match': 81.56102175969725, 'f1': 89.96016967193422}"
but in the paper it is: F1/EM = 92.08/85.44
It seems that there is something we need to take care of here.

@Chenyzsjtu
The goal of the notebook was to get T5 working on TPU and show how we can fine-tune it for QA. So I didn't pay much attention to exact metrics. You can train it by following the learning rate and number of epochs used in the paper. That might increase it.

@Chenyzsjtu
The goal of the notebook was to get T5 working on TPU and show how we can fine-tune it for QA. So I didn't pay much attention to exact metrics. You can train it by following the learning rate and number of epochs used in the paper. That might increase it.

I will have a try. Thanks!

@Chenyzsjtu
The goal of the notebook was to get T5 working on TPU and show how we can fine-tune it for QA. So I didn't pay much attention to exact metrics. You can train it by following the learning rate and number of epochs used in the paper. That might increase it.

There is one more tiny problem...
Have you tried evaluating the very first checkpoint (the pretrained model itself) on SQuAD?
It seems that your posted fine-tuned performance
"{'exact_match': 81.56102175969725, 'f1': 89.96016967193422}"
is worse than that of the pretrained model, which is
83.122/90.958

Hmm, interesting. I'll have a look.

@patil-suraj Hi, I'm very new to T5. How can I use T5 for sentiment classification (just binary)? I want to try it on this dataset but don't know how to approach it; I have only a bit of understanding of NLP, so would anyone please advise? AFAIK, T5 is text-to-text, so if I want to do binary (numeric) classification, I have to map 1 and 0 to positive and negative.

Hi @Lincoln93
You are right: you can map 0 and 1 to "negative" and "positive" and ask the model to predict the text.
Have a look at this notebook. It shows how to fine-tune T5 for binary as well as multi-class classification.
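For a quick idea of what that looks like, here is a minimal sketch of the mapping (the "sentiment:" prefix and the t5-base checkpoint are placeholders, not necessarily what the notebook uses):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

review = "This movie was surprisingly good."
label_text = "positive"  # dataset label 1 -> "positive", 0 -> "negative"

input_ids = tokenizer("sentiment: " + review, return_tensors="pt").input_ids
labels = tokenizer(label_text, return_tensors="pt").input_ids

# Training: the usual seq2seq loss on the label word
loss = model(input_ids=input_ids, labels=labels).loss

# Inference: generate the label word and map it back to 0/1
with torch.no_grad():
    pred = tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)
predicted_label = 1 if pred.strip() == "positive" else 0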

We have a bunch of T5 notebooks now thanks to you guys :-) Closing the issue...

@patil-suraj Very cool notebooks indeed!

Hi @patil-suraj, awesome notebooks! I noticed you always call model.generate(...) to evaluate. I wonder, is there a reason for this, and is it really necessary for T5? Why not just use simple inference, model(**inputs), like BERT and others do?

Hi @patil-suraj, awesome notebooks! I noticed you always call model.generate(...) to evaluate. I wonder, is there a reason for this, and is it really necessary for T5? Why not just use simple inference, model(**inputs), like BERT and others do?

You may need n-gram generation for more correct sentences?

Hi @patil-suraj, awesome notebooks! I noticed you always call model.generate(...) to evaluate. I wonder, is there a reason for this, and is it really necessary for T5? Why not just use simple inference, model(**inputs), like BERT and others do?

Hi @saareliad, BERT models are mostly used for discriminative tasks (classification, token classification, span extraction), so you only need to call model(**inputs) once. T5, on the other hand, is a seq-to-seq generative model, which generates one token at a time.

So to sample a sequence without .generate, you would:

  1. feed in the start token as input_ids to forward
  2. sample the next token by argmax
  3. add that token to input_ids
  4. repeat until you reach max len or sample eos

This quickly becomes complicated if you want beam search or other sampling methods like top-k, top-p, temperature, etc. So .generate is actually a powerful wrapper around all the SOTA decoding methods.
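For illustration, here is a minimal, hand-rolled greedy-decoding sketch of that loop (t5-small and the translation prefix are just placeholders):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("translate English to German: How are you?", return_tensors="pt").input_ids

# Start from the decoder start token and grow the output one token at a time
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    for _ in range(16):  # max length
        logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: take the argmax
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:  # stop at end-of-sequence
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))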

Check this awesome blog post by @patrickvonplaten to see what .generate has to offer

Thanks @patil-suraj ,

If we reduce the problem just to SQuAD: if I'm not wrong, the extra .generate features are not used there at all?

For example, according to the code of your SQuAD example:

from tqdm.auto import tqdm

answers = []
for batch in tqdm(dataloader):
  outs = model.generate(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        max_length=16,
                        early_stopping=True)
  outs = [tokenizer.decode(ids) for ids in outs]
  answers.extend(outs)

Since I didn't see any beams being used for SQuAD, early_stopping=True is not needed, and what happens is, more or less, the loop you described?

I ask because I'm running into a similar problem to the one you had with TPU: I have to choose between running generate on CPU or running the aforementioned simplistic version on many (8-40) GPUs, which of course will be much faster even without using cache/past.

Hi,
Is there an example showing T5 being fine-tuned on multiple tasks, while still allowing access to the model architecture? Thanks

Hi @rabeehk
by multiple tasks, do you mean multitask training or just different tasks?
If it's the latter, then this community notebook shows how to fine-tune T5 for different tasks.

If multitask, then have a look at this project which fine-tunes T5 for question generation, QA and answer extraction.

Hi,
I mean a mixture of multiple tasks, like in the original T5 paper, run on TPU so it's efficient for large-scale training and large datasets. Is there an example/script by huggingface showing this?
Thanks a lot

I'm pretty sure there aren't any examples which replicate the multitask training used by T5. This notebook would be a good start (you'd need to select the right tasks, format them in the T5 style, use T5ForConditionalGeneration as the model, and adjust everything so it's doing things with a single seq2seq model).

I'm doing something similar as part of my research so I might have something closer at some point.

I see a script in the original authors' repo, but it doesn't have data parallelism... so it's not usable at scale...

Hi @ghomasHudson, I've done T5 multitask training using task prefixes for my question generation project, with pretty good results.

@rabeehk
You can use the Seq2SeqTrainer, which supports TPU training; just build your own multitask dataset using task prefixes.
This script shows how to use Seq2SeqTrainer, and it should be easy to modify it for your own dataset.

As T5 treats every task as text generation, we don't need any special changes to the model. We just need a multi-task dataset and the right sampling strategy if the number of examples is not balanced between tasks.
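To make that concrete, here is a hypothetical sketch of a prefixed multi-task mixture with temperature-scaled, size-proportional sampling (the prefixes, helper names and the temperature value are all illustrative, not from any existing script):

import random

def to_text_to_text(example, prefix, input_key, target_key):
    # Prepend a task prefix so a single T5 model can tell the tasks apart
    return {"source_text": f"{prefix}: {example[input_key]}",
            "target_text": str(example[target_key])}

def sample_mixture(task_datasets, num_examples, temperature=2.0):
    # task_datasets: dict mapping task name -> list of already-converted examples
    names = list(task_datasets)
    rates = [len(task_datasets[n]) ** (1.0 / temperature) for n in names]
    probs = [r / sum(rates) for r in rates]
    mixture = []
    for _ in range(num_examples):
        task = random.choices(names, weights=probs, k=1)[0]
        mixture.append(random.choice(task_datasets[task]))
    random.shuffle(mixture)
    return mixture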

Good to hear @patil-suraj - I've been getting good results out of some very basic scripts I wrote using transformers + datasets + pytorch_lightning.

I'd love to see multitask learning have proper support in huggingface with options for multiple sampling methods etc... (#4426 #4340 #6872 #7270 huggingface/datasets#217). Getting the implementation right is not trivial though.

Do you think this script is working fine?
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/hf_model.py
It seems to add this, but without data parallelism. I would greatly appreciate it if the huggingface group could have a look and try to add this script to their repository, with data parallelism. Thanks

Hi everyone,
I am looking for an example showing how to train T5 on multiple datasets using the huggingface repo, hopefully at scale.
This example
https://colab.sandbox.google.com/github/zphang/zphang.github.io/blob/master/files/notebooks/Multi_task_Training_with_Transformers_NLP.ipynb
does not show it with T5, and this example does not work:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/hf_model.py

I would appreciate suggestions for an example showing this in the huggingface repo. Thanks.
Best
Rabeeh

Hi @ghomasHudson, could you please tell me why getting the implementation right is not trivial in your view? What are the existing difficulties? Thanks

Well if you look at huggingface/datasets#217, that's probably the most complete discussion.

My impression is that part of the difficulty is finding the right level of abstraction for this functionality (should it be a dataset-level feature? Dataloader? Something else?). My current intuition is that it belongs at the DataLoader-level as you need the concept of batches. Deciding on the API is a little tricky as we have to try and allow a range of sampling methods, and create a general enough framework that we don't over-fit to the ideas of t5.

It's fairly trivial to get this working for an individual example, but a little harder to implement this properly into the library in a general way.
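As a toy illustration of the DataLoader-level idea (a hypothetical sketch, not a proposal for the library's API): each batch is drawn entirely from one task's DataLoader, with tasks picked according to given probabilities.

import random

class MultitaskLoader:
    """Interleave batches from several task DataLoaders according to sampling probabilities."""

    def __init__(self, loaders, probs, steps_per_epoch):
        self.loaders = loaders                # dict: task name -> torch DataLoader
        self.probs = probs                    # dict: task name -> sampling probability
        self.steps_per_epoch = steps_per_epoch
        self.iters = {name: iter(dl) for name, dl in loaders.items()}

    def __iter__(self):
        names = list(self.loaders)
        weights = [self.probs[n] for n in names]
        for _ in range(self.steps_per_epoch):
            task = random.choices(names, weights=weights, k=1)[0]
            try:
                batch = next(self.iters[task])
            except StopIteration:             # restart an exhausted task
                self.iters[task] = iter(self.loaders[task])
                batch = next(self.iters[task])
            yield task, batch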

Hi,
Thanks for the reply, I read it. For now, just to be able to move forward with the huggingface repo for training T5, is there any available code showing how to handle a mixture of tasks, at least in a simplified manner? Thanks a lot for your help

Hi @ghomasHudson, would it be possible for me to set up a quick 30-minute chat with you? I could not find your email to contact you directly. I would appreciate being able to ask more about dataset handling; it would be really helpful for me. Thanks.
