Transformers: Does max_seq_length specify the maximum number of words

Created on 10 Dec 2018 · 7 comments · Source: huggingface/transformers

I'm trying to figure out how the --max_seq_length parameter works in run_classifier. Based on the source, it seems like it represents the number of words? Is that correct?


All 7 comments

max_seq_length specifies the maximum number of tokens in the input. The number of tokens is greater than or equal to the number of words, because the tokenizer may split a word into several sub-word tokens.

For example, the following sentence:

The man hits the saxophone and demonstrates how to properly use the racquet.

is tokenized as follows:

the man hits the saxophone and demonstrates how to properly use the ra ##c ##quet .

And depending on the task, 2 or 3 additional special tokens ([CLS] and one or two [SEP]) are added to format the input, and these count toward the limit as well.
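To make the token-vs-word distinction concrete, here is a minimal sketch using the current transformers API (this thread predates it, and the exact sub-word split depends on the checkpoint's vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "The man hits the saxophone and demonstrates how to properly use the racquet."

# WordPiece splits rare words into sub-word units, so the 13 words
# above become 16 tokens (racquet -> ra ##c ##quet).
tokens = tokenizer.tokenize(sentence)
print(tokens)

# encode() also adds the special tokens, [CLS] at the start and
# [SEP] at the end, which count against max_seq_length too.
input_ids = tokenizer.encode(sentence)
print(len(tokens), len(input_ids))  # len(input_ids) == len(tokens) + 2
```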

@rodgzilla thanks!

Could we make max_seq_length smaller?

So what happens if a sentence's number of tokens is greater than max_seq_length?

Does that mean extra tokens beyond max_seq_length will get cut off?

@tsungruihon yes, just use shorter sentences.

@echan00 there is no automatic cut-off, but the tokenizer warns that your inputs are too long, and the model will throw an error. You have to limit the sequence length manually.
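As a sketch, this is roughly the manual truncation that run_classifier performs when converting examples to features (the names here are illustrative, not the script's exact code). For a single sentence, two slots are reserved for [CLS] and [SEP]; sentence pairs would need a third:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
max_seq_length = 128  # the value passed as --max_seq_length

long_text = "..."  # any input that may exceed the limit

tokens = tokenizer.tokenize(long_text)

# Reserve two slots for the special tokens, then truncate.
if len(tokens) > max_seq_length - 2:
    tokens = tokens[: max_seq_length - 2]

tokens = ["[CLS]"] + tokens + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
assert len(input_ids) <= max_seq_length
```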

Hi All,

Does that mean we cannot use BERT for classifying long documents? The documents have 5-6 paragraphs, each paragraph having 10-15 lines with about 10-12 words per line.

@SaurabhBhatia0211
You can try splitting a document into smaller chunks (e.g. paragraphs or even lines), computing an embedding for each chunk, and averaging those vectors to get a document representation, as sketched below.
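A rough sketch of that chunk-and-average idea with the current transformers API; mean-pooling the last hidden states is just one reasonable choice of chunk embedding, not the only one:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_chunk(text: str, max_len: int = 128) -> torch.Tensor:
    # Each chunk is truncated independently, so no chunk exceeds max_len.
    inputs = tokenizer(text, truncation=True, max_length=max_len,
                       return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    # Mean-pool the final hidden states into a single vector per chunk.
    return output.last_hidden_state.mean(dim=1).squeeze(0)

paragraphs = [
    "First paragraph of the document ...",
    "Second paragraph of the document ...",
]

# Average the chunk vectors to get a single document representation.
doc_embedding = torch.stack([embed_chunk(p) for p in paragraphs]).mean(dim=0)
print(doc_embedding.shape)  # torch.Size([768]) for bert-base
```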

