BERT: fine-tuned for a document task

Created on 12 Nov 2018 · 9 Comments · Source: google-research/bert

How do I use BERT to classify a document that contains several sentences? The labels are binary (0, 1).
I think I need to use BERT to encode every sentence, use an LSTM or other RNN to generate an article-level hidden state, and then classify from that hidden state.
Any better ideas?

All 9 comments

To classify a document, just feed the entire document to BERT (i.e., treat all of the concatenated sentences as "Segment A"). You should be able to write your own DataProcessor in run_classifier.py and train the model without changing any TensorFlow code. So just set text_a to your document text and set text_b to None. You will probably want to set max_seq_length to a larger value depending on the length of your documents (up to 512).
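A minimal sketch of such a processor, for illustration only. It assumes a tab-separated file with one `label<TAB>document` row per line; the file names `train.tsv` and `dev.tsv`, the class name `DocumentProcessor`, and the `InputExample` stand-in are assumptions made so the snippet runs on its own. In the repo you would subclass the `DataProcessor` class from run_classifier.py and use its own `InputExample` instead.

```python
import csv
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputExample:
    """Stand-in with the same fields as InputExample in run_classifier.py."""
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


class DocumentProcessor:
    """Hypothetical processor: each row is `label<TAB>document text`."""

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        return self._create_examples(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        return self._create_examples(os.path.join(data_dir, "dev.tsv"), "dev")

    def get_labels(self) -> List[str]:
        # Binary labels, as in the original question.
        return ["0", "1"]

    def _create_examples(self, path: str, set_type: str) -> List[InputExample]:
        examples = []
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for i, row in enumerate(reader):
                if len(row) < 2:
                    continue  # skip blank or malformed rows
                label, document = row[0], row[1]
                examples.append(InputExample(
                    guid=f"{set_type}-{i}",
                    text_a=document,  # the whole document goes in Segment A
                    text_b=None,      # no Segment B for single-text classification
                    label=label,
                ))
        return examples
```

In run_classifier.py the processor would then be registered in the `processors` dictionary and selected with `--task_name`, with `--max_seq_length` raised as needed (up to 512).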

But what if most of my documents are longer than 512?

I have the same question. From what I have read and understood, there is no way to feed in documents longer than 512 tokens and do classification. Some other data processing is needed first so that the max length is <= 512. Right?
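One possible form of that preprocessing is to split a long document into overlapping windows that each fit within the limit. The sketch below is an illustration, not something from the BERT repo: the function name `split_into_windows` and the `stride` parameter (the step between window starts) are assumptions, and the input is assumed to be a list of WordPiece tokens already produced by the tokenizer.

```python
from typing import List


def split_into_windows(tokens: List[str], max_seq_length: int = 512,
                       stride: int = 128) -> List[List[str]]:
    """Split a token list into overlapping windows that fit the length limit."""
    # Reserve two positions for the [CLS] and [SEP] tokens of a single segment.
    window = max_seq_length - 2
    if window <= 0:
        raise ValueError("max_seq_length must be greater than 2")
    if not 0 < stride <= window:
        raise ValueError("stride must be between 1 and max_seq_length - 2")

    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks
```

Each window can then be treated as its own example. How to label the windows during training and how to combine their predictions at test time is a separate design choice that this sketch does not settle.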

We just released a preprint that describes BERT for document classification. If you guys are interested, the codebase is here.

@daemon Your paper does not answer the above problem at all. You even argue the opposite of the problem here, that 512 tokens is more than enough for document classification. The exciting (apparently open) problem is how to utilize the ideas of BERT with token sets that are significantly longer than 512 tokens (documents).

How about preprocessing each document to generate a summary, and then using the summary to fine-tune BERT? Just a thought.

@msta

Your paper does not answer the above problem at all.

My reply was a shameless plug. I never claimed to solve the problem of handling documents longer than 512.

The exciting (apparently open) problem is how to utilize the ideas of BERT with token sets that are significantly longer than 512 tokens (documents).

To my knowledge, that's indeed open. I'm not sure it matters a lot for document classification, but it's worth exploring.

What would be wrong with feeding each sentence or paragraph into BERT and then running a classifier on the pooled output of the document (or even a CNN over the concatenated BERT output of the document)? I can think of complications for back-propagation but would it be feasible?
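A rough sketch of the pooling idea, under one simplifying assumption: BERT is kept frozen, so the per-chunk vectors can be computed once up front and only the small classifier on top is trained, which sidesteps the back-propagation question. The function names, the mean pooling, and the logistic-regression head are all illustrative choices; the vectors stand for whatever fixed-size output BERT gives per sentence or paragraph.

```python
import math
from typing import List, Sequence


def mean_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-chunk vectors into a single document vector."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    dim = len(chunk_vectors[0])
    if any(len(v) != dim for v in chunk_vectors):
        raise ValueError("all chunk vectors must have the same dimension")
    n = len(chunk_vectors)
    return [sum(v[j] for v in chunk_vectors) / n for j in range(dim)]


def classify_document(chunk_vectors: Sequence[Sequence[float]],
                      weights: Sequence[float], bias: float) -> float:
    """Logistic regression on the mean-pooled vector; returns P(label = 1)."""
    doc = mean_pool(chunk_vectors)
    if len(weights) != len(doc):
        raise ValueError("weights must match the vector dimension")
    logit = sum(w * x for w, x in zip(weights, doc)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Fine-tuning BERT end to end through the pooling step would instead require holding all chunks of a document in memory for one backward pass, which is where the complications mentioned above come in.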

Trying to use BERT to build a document classifier at the moment.

@jordanparker6 were you able to solve this?
