The current tokenizer encode variants (encode, batch_encode, batch_encode_plus) handle sequences longer than max_length by overflowing tokens from the right-hand side, thus restricting the length to max_length. This feature request is to add an option for the tokenizer encode methods to overflow tokens from the left-hand side as well.
For problems dealing with dialog, if one trains an intent classification or next sentence prediction model and the dialog is longer than max_length, one would like to throw away the tokens from the beginning of the conversation, as they are less relevant than the more recent messages.
This motivates the need for an encoder that works well with dialog data, where more recent tokens are more valuable.
I could change the function truncate_sequences by adding a new truncation_strategy option that truncates from the left, but I want to get feedback from the Hugging Face team about this proposal.
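For illustration, a call using such an option might look roughly like this. The strategy value "longest_first_left" is hypothetical and only shows the kind of option being requested; the exact encode_plus signature also varied across library versions at the time:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dialog_text = "user: hi  bot: hello  user: are you open on Sundays?"

# Hypothetical: "longest_first_left" does not exist in transformers; it only
# illustrates the kind of truncation_strategy value being proposed here.
encoded = tokenizer.encode_plus(
    dialog_text,
    max_length=512,
    truncation_strategy="longest_first_left",  # drop tokens from the start, not the end
    pad_to_max_length=True,
)
```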
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@misrasaurabh1 What solution do you currently use for this dialog-encoding problem?
I use something like
self.tokenizer.encode(input)[-self.block_size:]
This throws a warning about the length overflow, so I silence it via the logging module.
Also, one has to build the attention masks separately, as some models require them.
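A slightly fuller sketch of that workaround (block_size and the checkpoint name are placeholders; the logger call just silences transformers warnings globally):

```python
import logging
from transformers import AutoTokenizer

# Silence the "sequence length is longer than the specified maximum" warning,
# since we truncate by hand below anyway.
logging.getLogger("transformers").setLevel(logging.ERROR)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block_size = 512
dialog_text = "user: hi  bot: hello  user: what's the weather like today?"

# Keep only the most recent block_size tokens of the conversation.
input_ids = tokenizer.encode(dialog_text)[-block_size:]
# The attention mask has to be built by hand, since truncation happened
# outside the tokenizer.
attention_mask = [1] * len(input_ids)
```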
Indeed, we should add an option to truncate on the left!
cc @n1t0 for our sprint of September.
Perhaps add a truncation_side attribute to https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer to be consistent with padding_side.
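Something along these lines, mirroring how padding_side already works (truncation_side here is the proposal itself, not an existing attribute at the time of writing):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", padding_side="right")
tokenizer.truncation_side = "left"  # proposed attribute, analogous to padding_side

encoded = tokenizer(
    "a very long dialog history ...",
    max_length=512,
    truncation=True,      # with truncation_side="left", would drop the oldest tokens
    padding="max_length",
)
```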
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@thomwolf @n1t0 Any plan for this? I just saw this because of the bot.
I think I can do this; it seems like all the logic is here:
https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/tokenization_utils_base.py#L2766
But what about the fast 🤗 Tokenizers? Will I also need to change the Rust code?
I also noticed something that might be a bug and could be improved:
Here it loops num_tokens_to_remove times to decide how many tokens need to be truncated from each sequence, which could be calculated without looping (see the sketch after the link below).
And when stride is not 0, it seems to return up to stride * num_tokens_to_remove extra tokens in overflowing_tokens.
https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/tokenization_utils_base.py#L2801-L2803
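To illustrate the point about the loop, here is a rough sketch of the closed form, assuming the loop removes one token per step from the strictly longer sequence and that ties go to pair_ids (as in the linked code). This helper is hypothetical, not part of transformers:

```python
def truncation_counts(len_a: int, len_b: int, num_tokens_to_remove: int):
    """How many tokens to drop from ids (length len_a) and pair_ids (length
    len_b), computed arithmetically instead of looping num_tokens_to_remove
    times. Covers the two-sequence longest_first case only."""
    remove_a = remove_b = 0
    n = num_tokens_to_remove
    # Phase 1: shrink the strictly longer sequence down to the other's length.
    if len_a > len_b:
        step = min(n, len_a - len_b)
        remove_a, n = step, n - step
    elif len_b > len_a:
        step = min(n, len_b - len_a)
        remove_b, n = step, n - step
    # Phase 2: lengths are now equal, so removals alternate, starting with
    # pair_ids (the tie-breaking behaviour of the original loop).
    remove_b += (n + 1) // 2
    remove_a += n // 2
    return remove_a, remove_b
```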
It also seems odd to me that overflowing_tokens mixes tokens from ids and pair_ids. Perhaps it should be a tuple of lists when TruncationStrategy is longest_first.
Note to self: overflowing_tokens is used in squad to construct another pair when the doc is too long; stride is also used in squad. I can't find any other use of overflowing_tokens.
One piece of feedback on the impact of left truncation not being available: it becomes harder to use the datasets library, and we have to resort to Python hackery, which reduces the benefits of using the datasets library in the first place.
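For example, the left truncation currently has to live inside a user-written map function rather than in the tokenizer call itself. A sketch of what that looks like (the toy dataset, "text" column, and checkpoint name are illustrative):

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Toy dataset standing in for real dialog data.
dataset = Dataset.from_dict({"text": ["user: hi  bot: hello  user: how are you?"]})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block_size = 512

def encode_keep_recent(example):
    # Manual left truncation: the tokenizer can only drop tokens from the
    # right, so the slicing has to happen inside the map function.
    input_ids = tokenizer.encode(example["text"])[-block_size:]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded_dataset = dataset.map(encode_keep_recent)
```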