What is the best way to handle situations where a sequence in your dataset exceeds the max length defined for a model?
For example, if I'm working on an abstractive summarization task with a BERT model that has max_position_embeddings=512 and a tokenizer with max_len=512, how should I handle documents whose tokens exceed 512?
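For illustration, here is roughly what I mean (a minimal sketch, not my actual code; the model name and document text are just placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document = "..."  # a long article, often well over 512 tokens

token_ids = tokenizer.encode(document)
print(len(token_ids))  # e.g. 1500+, which exceeds max_position_embeddings=512
```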
Is there a recommended practice for this situation?
Thanks
Most people truncate the document at 512 tokens.
Most of the time that is enough. For example, on the CNNDM dataset the lead-3 baseline gives a pretty strong score simply by using the first 3 sentences of the article as the summary.
This indicates that most of the salient information is located at the beginning of the document (in this particular case).
But I'm also curious about possible solutions that really handle longer sequences (truncating is not really handling it...)
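A minimal sketch of the truncation approach (assuming a recent transformers version where the tokenizer call accepts truncation and max_length; the model name and text are placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document = "..."  # your (possibly very long) article text

# Keep only the first 512 tokens; everything after that is dropped.
inputs = tokenizer(
    document,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # at most (1, 512)
```

Everything past the first 512 tokens is simply discarded, which is why this only works well when the salient content sits at the start of the document.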
Good information ... thanks.
Are any of the available Transformer models capable of summarization tasks? From what I can tell they all seem geared toward classification, language modeling, and question-answering type tasks.
You can take a look at this repo:
https://github.com/nlpyang/PreSumm
Nice paper/code ... thanks much for your time and the link!
-wg
@Colanim Indeed, for newspaper articles most of the information is contained in the first sentences. This is how journalists are taught to write! The dataset does not really push the models to their limits. If only longer pieces like New Yorker articles were available in a big dataset...
@ohmeow I am currently working on implementing several seq2seq models that use transformers, and our first example will be abstractive summarization (PR #1455).
I am also curious about solutions to the finite token-length limit :)
maybe this repo also helps
https://github.com/caitian521/RCTransformer
Thanks Remi!
Yeah, I'm playing with your summarization code in huggingface as we speak. Looking great! Would be nice to have fine-tuning scripts included for reference as well.
Are you all working on implementing the extractive summarization and the double fine-tuning example for abstractive summarization from the paper?
Thanks - wg
Glad it works! This is not on the roadmap at the moment, but we may come back to it later.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.