The current tokenizer encode variants (encode, batch_encode, batch_encode_plus) handle sequences longer than max_length by overflowing tokens from the right-hand side, thus restricting the length to max_length. This feature request is to add an option for the tokenizer encode methods to overflow tokens from the left-hand side as well.
For problems dealing with dialog, if one trains an intent classification or next sentence prediction model and the dialog is longer than max_length, one would like to throw away the tokens from the beginning of the conversation, as they are less relevant than the more recent messages.
This motivates the need for an encoder that works well with dialog data, where more recent tokens are more valuable.
I could change the function truncate_sequences by adding a new truncation_strategy option that truncates from the left, but I want to get feedback from the Hugging Face team about this proposal.
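For illustration, a call using such an option might look roughly like this. The strategy value "longest_first_left" is hypothetical and only shows the kind of option being requested; the exact encode_plus signature also varied across library versions at the time:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dialog_text = "user: hi  bot: hello  user: are you open on Sundays?"

# Hypothetical: "longest_first_left" does not exist in transformers; it only
# illustrates the kind of truncation_strategy value being proposed here.
encoded = tokenizer.encode_plus(
    dialog_text,
    max_length=512,
    truncation_strategy="longest_first_left",  # drop tokens from the start, not the end
    pad_to_max_length=True,
)
```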
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@misrasaurabh1 What solution do you currently use for this dialog-encoding problem?
I use something like
self.tokenizer.encode(input)[-self.block_size:]
This throws a warning about the length overflow, so I silence it via the logging module.
Also, one has to build the attention masks separately, as some models require them.
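A slightly fuller sketch of that workaround (block_size and the checkpoint name are placeholders; the logger call just silences transformers warnings globally):

```python
import logging
from transformers import AutoTokenizer

# Silence the "sequence length is longer than the specified maximum" warning,
# since we truncate by hand below anyway.
logging.getLogger("transformers").setLevel(logging.ERROR)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block_size = 512
dialog_text = "user: hi  bot: hello  user: what's the weather like today?"

# Keep only the most recent block_size tokens of the conversation.
input_ids = tokenizer.encode(dialog_text)[-block_size:]
# The attention mask has to be built by hand, since truncation happened
# outside the tokenizer.
attention_mask = [1] * len(input_ids)
```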
Indeed, we should add an option to truncate on the left!
cc @n1t0 for our sprint of September.
Perhaps add a truncation_side attribute to https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer to be consistent with padding_side.
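Something along these lines, mirroring how padding_side already works (truncation_side here is the proposal itself, not an existing attribute at the time of writing):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", padding_side="right")
tokenizer.truncation_side = "left"  # proposed attribute, analogous to padding_side

encoded = tokenizer(
    "a very long dialog history ...",
    max_length=512,
    truncation=True,      # with truncation_side="left", would drop the oldest tokens
    padding="max_length",
)
```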
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@thomwolf @n1t0 Any plan for this? I just saw this because of the bot.
I think I can do this; it seems like all the logic is here:
https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/tokenization_utils_base.py#L2766
But what about the fast 🤗 Tokenizers? Will I also need to change the Rust code?
I also noticed something that might be a bug and could be improved:
Here it loops num_tokens_to_remove times to decide how many tokens need to be truncated from each sequence, which could be calculated without looping (see the sketch after the link below).
And when stride is not 0, it seems to return up to stride * num_tokens_to_remove extra tokens in overflowing_tokens.
https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/tokenization_utils_base.py#L2801-L2803
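To illustrate the point about the loop, here is a rough sketch of the closed form, assuming the loop removes one token per step from the strictly longer sequence and that ties go to pair_ids (as in the linked code). This helper is hypothetical, not part of transformers:

```python
def truncation_counts(len_a: int, len_b: int, num_tokens_to_remove: int):
    """How many tokens to drop from ids (length len_a) and pair_ids (length
    len_b), computed arithmetically instead of looping num_tokens_to_remove
    times. Covers the two-sequence longest_first case only."""
    remove_a = remove_b = 0
    n = num_tokens_to_remove
    # Phase 1: shrink the strictly longer sequence down to the other's length.
    if len_a > len_b:
        step = min(n, len_a - len_b)
        remove_a, n = step, n - step
    elif len_b > len_a:
        step = min(n, len_b - len_a)
        remove_b, n = step, n - step
    # Phase 2: lengths are now equal, so removals alternate, starting with
    # pair_ids (the tie-breaking behaviour of the original loop).
    remove_b += (n + 1) // 2
    remove_a += n // 2
    return remove_a, remove_b
```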
It also seems odd to me that overflowing_tokens mixes tokens from ids and pair_ids. Perhaps it should be a tuple of lists when TruncationStrategy is longest_first.
Note to self: overflowing_tokens is used in squad to construct another pair when the doc is too long; stride is also used in squad. I can't find any other use of overflowing_tokens.
One piece of feedback on the impact of left truncation not being available: it becomes harder to use the datasets library, and we have to resort to Python hackery, which reduces the benefits of using the datasets library in the first place.
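For example, the left truncation currently has to live inside a user-written map function rather than in the tokenizer call itself. A sketch of what that looks like (the toy dataset, "text" column, and checkpoint name are illustrative):

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Toy dataset standing in for real dialog data.
dataset = Dataset.from_dict({"text": ["user: hi  bot: hello  user: how are you?"]})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block_size = 512

def encode_keep_recent(example):
    # Manual left truncation: the tokenizer can only drop tokens from the
    # right, so the slicing has to happen inside the map function.
    input_ids = tokenizer.encode(example["text"])[-block_size:]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded_dataset = dataset.map(encode_keep_recent)
```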