Transformers: Does max_seq_length specify the maximum number of words

Created on 10 Dec 2018 · 7 comments · Source: huggingface/transformers

I'm trying to figure out how the --max_seq_length parameter works in run_classifier. Based on the source, it seems like it represents the number of words? Is that correct?


All 7 comments

max_seq_length specifies the maximum number of tokens in the input. The number of tokens is greater than or equal to the number of words, because the tokenizer may split a word into several sub-word tokens.

For example, the following sentence:

The man hits the saxophone and demonstrates how to properly use the racquet.

is tokenized as follows:

the man hits the saxophone and demonstrates how to properly use the ra ##c ##quet .

And depending on the task, 2 or 3 additional special tokens ([CLS] and one or two [SEP]) are added to format the input, and these count toward the limit as well.
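To make the token-vs-word distinction concrete, here is a minimal sketch using the current transformers API (this thread predates it, and the exact sub-word split depends on the checkpoint's vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "The man hits the saxophone and demonstrates how to properly use the racquet."

# WordPiece splits rare words into sub-word units, so the 13 words
# above become 16 tokens (racquet -> ra ##c ##quet).
tokens = tokenizer.tokenize(sentence)
print(tokens)

# encode() also adds the special tokens, [CLS] at the start and
# [SEP] at the end, which count against max_seq_length too.
input_ids = tokenizer.encode(sentence)
print(len(tokens), len(input_ids))  # len(input_ids) == len(tokens) + 2
```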

@rodgzilla thanks!

Could we make max_seq_length smaller?

So what happens if a sentence's number of tokens is greater than max_seq_length?

Does that mean extra tokens beyond max_seq_length will get cut off?

@tsungruihon yes, just use shorter sentences.

@echan00 there is no automatic cut-off, but the tokenizer warns that your inputs are too long, and the model will throw an error. You have to limit the sequence length manually.
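As a sketch, this is roughly the manual truncation that run_classifier performs when converting examples to features (the names here are illustrative, not the script's exact code). For a single sentence, two slots are reserved for [CLS] and [SEP]; sentence pairs would need a third:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_seq_length = 128  # the value passed as --max_seq_length

long_text = "..."  # any input that may exceed the limit

tokens = tokenizer.tokenize(long_text)

# Reserve two slots for the special tokens, then truncate.
if len(tokens) > max_seq_length - 2:
    tokens = tokens[: max_seq_length - 2]

tokens = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
assert len(input_ids) <= max_seq_length
```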

Hi All,

Does that mean we cannot use BERT for classifying long documents? The documents have 5-6 paragraphs, each paragraph having 10-15 lines with about 10-12 words per line.

@SaurabhBhatia0211
You can try splitting a document into smaller chunks (e.g. paragraphs or even lines), computing an embedding for each chunk, and averaging those vectors to get a document representation, as sketched below.
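A rough sketch of that chunk-and-average idea with the current transformers API; mean-pooling the last hidden states is just one reasonable choice of chunk embedding, not the only one:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_chunk(text: str, max_len: int = 128) -> torch.Tensor:
    # Each chunk is truncated independently, so no chunk exceeds max_len.
    inputs = tokenizer(text, truncation=True, max_length=max_len,
                       return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    # Mean-pool the final hidden states into a single vector per chunk.
    return output.last_hidden_state.mean(dim=1).squeeze(0)

paragraphs = [
    "First paragraph of the document ...",
    "Second paragraph of the document ...",
]

# Average the chunk vectors to get a single document representation.
doc_embedding = torch.stack([embed_chunk(p) for p in paragraphs]).mean(dim=0)
print(doc_embedding.shape)  # torch.Size([768]) for bert-base
```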

