What is the best way to handle situations where a sequence in your dataset exceeds the max length defined for a model?
For example, if I'm working on an abstractive summarization task with a BERT model that has max_position_embeddings=512 and a tokenizer with max_len=512, how should I handle documents whose tokens exceed 512?
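For illustration, here is roughly what I mean (a minimal sketch, not my actual code; the model name and document text are just placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document = "..."  # a long article, often well over 512 tokens

token_ids = tokenizer.encode(document)
print(len(token_ids))  # e.g. 1500+, which exceeds max_position_embeddings=512
```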
Is there a recommended practice for this situation?
Thanks
Most people truncate the document at 512 tokens.
Most of the time that is enough. For example, on the CNNDM dataset the lead-3 baseline gives a pretty strong score simply by using the first 3 sentences of the article as the summary.
This indicates that most of the salient information is located at the beginning of the document (in this particular case).
But I'm also curious about possible solutions that really handle longer sequences (truncating is not really handling it...)
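A minimal sketch of the truncation approach (assuming a recent transformers version where the tokenizer call accepts truncation and max_length; the model name and text are placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document = "..."  # your (possibly very long) article text

# Keep only the first 512 tokens; everything after that is dropped.
inputs = tokenizer(
    document,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # at most (1, 512)
```

Everything past the first 512 tokens is simply discarded, which is why this only works well when the salient content sits at the start of the document.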
Good information ... thanks.
Are any of the available Transformer models capable of summarization tasks? From what I can tell they all seem geared toward classification, language modeling, and question-answering type tasks.
You can take a look at this repo:
https://github.com/nlpyang/PreSumm
Nice paper/code ... thanks much for your time and the link!
-wg
@Colanim Indeed, for newspaper articles most of the information is contained in the first sentences. This is how journalists are taught to write! The dataset does not really push the models to their limits. If only longer pieces like New Yorker articles were available in a big dataset...
@ohmeow I am currently working on implementing several seq2seq models that use transformers, and our first example will be abstractive summarization (PR #1455).
I am also curious about solutions to the finite token-length limit :)
maybe this repo also helps
https://github.com/caitian521/RCTransformer
Thanks Remi!
Yeah, I'm playing with your summarization code in huggingface as we speak. Looking great! Would be nice to have fine-tuning scripts included for reference as well.
Are you all working on implementing the extractive summarization and the double fine-tuning example for abstractive summarization from the paper?
Thanks - wg
Glad it works! This is not on the roadmap at the moment, but we may come back to it later.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.