@myleott I am trying to do batch prediction after finetuning a RoBERTa sentence classification model, following the batched prediction example provided for MNLI. This is my code for batch prediction:
from fairseq.data.data_utils import collate_tokens
batch = collate_tokens([roberta.encode(line) for line in data['column']], pad_idx=1)
logprobs = roberta.predict('arxiv_head', batch)
print(logprobs.argmax(dim=1))
This yields:
ValueError: tokens exceeds maximum length: 774 > 512
Each input consists of multiple sentences, so some inputs exceed the maximum length. I was expecting this to be fine, assuming the model would automatically truncate inputs longer than 512 tokens.
How can I fix this issue? There is also the Finetuning RoBERTa on a custom classification task example; I believe it would be useful and informative to add code showing how to do batch prediction for the IMDB dataset.
Thanks.
You can truncate the sentence yourself, no?
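For example, a minimal sketch (not taken from the fairseq docs; it simply clips the encoded token tensor to the 512-token limit reported in the error above, which also drops the trailing </s> for over-length inputs):

tokens = roberta.encode(line)  # 1-D LongTensor of BPE token ids, including <s> ... </s>
tokens = tokens[:512]          # keep at most --max-positions tokens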
@lematt1991 Thanks for your response. The issue is how I can finetune RoBERTa on input sentences longer than 512 tokens, but it yields the error above when I call roberta.predict(). Any opinion on that?
You can't change the max sequence length when finetuning a model. You would need to retrain from scratch (setting --max-positions to your desired maximum length).
@lematt1991, so if I am not getting any error while finetuning, does that mean the model automatically truncates my sentences to the maximum length for training and validation? And since this does not happen at inference time, I need to truncate my sentences myself before calling roberta.predict()? Is that correct?
You can check the logs of your finetuning run, but I believe sentences longer than --max-positions are filtered out and ignored rather than truncated [1]. But yes, you will need to truncate them prior to calling roberta.predict(...).
@lematt1991 thanks. I added the --truncate-sequence flag to the python train.py command, as done in the Finetuning RoBERTa on a custom classification task example; I believe this truncates the sentences during training. But this flag does not apply to roberta.predict().
Yep, that sounds correct, but truncating the sentences yourself should be easy enough.
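For completeness, the batch-prediction snippet from the top of this issue with truncation folded in might look like this (an untested sketch; 'arxiv_head', data['column'], and pad_idx=1 are copied from that snippet, and 512 is the limit from the error message):

from fairseq.data.data_utils import collate_tokens

max_positions = 512  # model limit from the error message above

batch = collate_tokens(
    [roberta.encode(line)[:max_positions] for line in data['column']],
    pad_idx=1,
)
logprobs = roberta.predict('arxiv_head', batch)
print(logprobs.argmax(dim=1))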
Thanks. Just wanted to understand how finetuning works with longer sentences.