Flair: Text classification with BERT embeddings on long texts

Created on 15 Dec 2019 · 4 Comments · Source: flairNLP/flair

I am trying to use Flair for a text classification task. I have a dataset of long texts and have successfully used Flair with the classical combination: WordEmbeddings + DocumentLSTMEmbeddings.
Next, I would like to experiment with SOTA approaches and use BERT-like embeddings.

from flair.embeddings import BertEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# `corpus` is the classification Corpus loaded earlier (not shown)
word_embeddings = [BertEmbeddings('bert-base-multilingual-cased')]
document_embeddings = DocumentLSTMEmbeddings(word_embeddings,
                                             hidden_size=64,
                                             reproject_words=True)
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)
trainer = ModelTrainer(classifier, corpus)
trainer.train('./BERTClassifierFiles/', max_epochs=10)

After training starts, I get a RuntimeError:

RuntimeError                              Traceback (most recent call last)
<ipython-input-46-5c4870a98333> in <module>
----> 1 trainer.train('./BERTClassifierFiles/', max_epochs=10)

13 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1482         # remove once script supports set_grad_enabled
   1483         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1485 
   1486 

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

Is there an approach for classifying long texts with BERT-like embeddings? As far as I know, ELMo may be a solution, but I work with Russian, and pre-trained Russian models are not available in Flair at the moment.
I would appreciate any advice on using BERT-like embeddings for classifying long texts.

question wontfix

Source

OlegDurandin

All 4 comments

Check https://andriymulyar.com/blog/bert-document-classification .

pommedeterresautee on 15 Dec 2019

👍2
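
As far as I understand it, the general idea behind approaches like the one linked above is to split a long document into windows that each fit the model's 512-position limit, embed each window separately, and then combine the results. Here is a minimal sketch of the splitting step in plain Python, assuming the text has already been tokenized into subword pieces. The names `split_into_windows`, `max_len` and `stride` are illustrative only; they are not part of Flair or of the linked post.

```python
from typing import List


def split_into_windows(tokens: List[str], max_len: int = 510, stride: int = 255) -> List[List[str]]:
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len=510 leaves room for the [CLS] and [SEP] tokens that BERT adds,
    so each window stays within the 512-position limit.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


# Stand-in for a long, already-tokenized document (assumption: 1200 subword pieces).
tokens = [f"tok{i}" for i in range(1200)]
windows = split_into_windows(tokens)
```

Each window can then be passed to the embedding model on its own. The overlap (`stride` smaller than `max_len`) is a design choice so that text near a window boundary is seen with context on both sides; it is not required.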

> Check https://andriymulyar.com/blog/bert-document-classification .

Thank you very much!
As I understand it, there is currently no explicit way to classify long texts in Flair?

OlegDurandin on 16 Dec 2019

No. If you are using the Flair LM, there is no limit. The limit applies only to transformer-based models.

pommedeterresautee on 17 Dec 2019

👍1
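
For transformer-based models, if a long text is split into pieces that each fit the limit, the per-piece vectors still have to be combined into a single document vector before classification. One simple option is to average them. This is an assumption for illustration, not something proposed in this thread; a recurrent layer over the pieces is another possibility. The sketch below uses plain Python lists with made-up values, and `mean_pool` is a hypothetical helper, not a Flair function.

```python
from typing import List, Sequence


def mean_pool(window_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element-wise into one vector."""
    if not window_vectors:
        raise ValueError("need at least one window vector")
    dim = len(window_vectors[0])
    if any(len(v) != dim for v in window_vectors):
        raise ValueError("all window vectors must have the same length")
    n = len(window_vectors)
    return [sum(v[i] for v in window_vectors) / n for i in range(dim)]


# Toy 3-dimensional vectors standing in for per-window embeddings (made-up values).
window_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 2.0],
    [2.0, 4.0, 2.0],
]
doc_vector = mean_pool(window_vectors)
```

Averaging discards the order of the pieces, which may or may not matter for a given classification task.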

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.