Allennlp: How to use BERT as a model?

Created on 23 Dec 2018 · 8 Comments · Source: allenai/allennlp

Is your feature request related to a problem? Please describe.
I want to fine-tune BERT in AllenNLP, but BERT does not appear to be a registered model in the library. At present, all I can do is generate embeddings with it (#2140), which is somewhat inflexible for my purposes.
Am I missing anything?

Describe the solution you'd like
Make BERT more allennlpic, so that we can treat it as a normal component, except that we don't train it from scratch but load and fine-tune a pretrained model.

Describe alternatives you've considered
Turn to pytorch-pretrained-BERT for help (but I am a bit greedy for the convenience of AllenNLP).

Additional context
None.

WrRan

👀2

Most helpful comment

I'm having trouble integrating BERT into my POS tagger model. I think the trouble stems from the fact that BERT uses wordpieces: its output doesn't match what the POS model expects as input, i.e., one embedding per word. I believe that is where the offsets parameter comes in, but it's not entirely clear to me how to configure it properly.

It would be much clearer if there were an example in AllenNLP of the BERT model being used for a downstream task like sequence tagging. Is there such an example?

Hyperparticle on 14 Jan 2019

👍6

All 8 comments

To do this with AllenNLP, you just use BERT as your TokenEmbedder, configure it so that it gets gradients, and put whatever prediction model you want on top. This is what @joelgrus did for NER (except without the fine-tuning).

matt-gardner on 23 Dec 2018
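
As a rough illustration of the suggestion above, the relevant fragment of a training config might look like the following. It is written as a Python dict that serializes to the JSON AllenNLP reads, and it is only a fragment, not a complete config. The key names follow the `bert-pretrained` indexer and embedder as they appeared around AllenNLP 0.8; the model name `bert-base-uncased` and the use of `requires_grad` to enable fine-tuning are assumptions that should be checked against the installed version.

```python
import json

# Hypothetical model name; any model name or path the BERT indexer/embedder accepts would do.
BERT_MODEL = "bert-base-uncased"

config_fragment = {
    "dataset_reader": {
        "token_indexers": {
            "bert": {
                "type": "bert-pretrained",
                "pretrained_model": BERT_MODEL,
                # Also emit per-word offsets into the wordpiece sequence.
                "use_starting_offsets": True,
            }
        }
    },
    "model": {
        "text_field_embedder": {
            "allow_unmatched_keys": True,
            # Pass both the wordpiece ids and the offsets to the BERT embedder.
            "embedder_to_indexer_map": {"bert": ["bert", "bert-offsets"]},
            "token_embedders": {
                "bert": {
                    "type": "bert-pretrained",
                    "pretrained_model": BERT_MODEL,
                    # Assumption: this flag lets gradients flow into BERT (fine-tuning).
                    "requires_grad": True,
                }
            },
        }
    },
}

# Print the fragment as JSON so it can be pasted into a config file.
print(json.dumps(config_fragment, indent=2))
```

Whatever prediction model sits on top (a tagger, a classifier) would then consume the output of this `text_field_embedder` like any other embedding.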

Got it. Who knows what I was thinking about? :sweat_smile:
Thanks. @matt-gardner

WrRan on 23 Dec 2018

there's not an example per se, but here's the config I used to train the NER model using BERT embeddings:

https://gist.github.com/joelgrus/7cdb8fb2d81483a8d9ca121d9c617514

hopefully that's helpful

joelgrus on 14 Jan 2019

👍5

@joelgrus

Looks like I was missing bert-offsets in my embedder_to_indexer_map. That helped, thanks!

Hyperparticle on 14 Jan 2019
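
To make the role of the offsets concrete, here is a minimal, library-free sketch of the idea. The tokenizer below is a hypothetical stand-in that simply chops words into fixed-size chunks; a real wordpiece vocabulary would split differently. It also assumes the first piece of each word is the one selected; whether the first or last piece is used depends on the indexer settings.

```python
from typing import List, Tuple


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a wordpiece tokenizer: fixed-size chunks."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def tokenize_with_offsets(words: List[str]) -> Tuple[List[str], List[int]]:
    """Return the wordpieces (with [CLS]/[SEP]) and each word's first-piece index."""
    pieces = ["[CLS]"]
    offsets = []
    for word in words:
        offsets.append(len(pieces))  # position of this word's first wordpiece
        pieces.extend(toy_wordpiece(word))
    pieces.append("[SEP]")
    return pieces, offsets


words = ["Jim", "Henson", "was", "a", "puppeteer"]
pieces, offsets = tokenize_with_offsets(words)

# Pretend every wordpiece has an embedding; here the "embedding" is just its position.
piece_embeddings = list(range(len(pieces)))

# Selecting by offsets yields exactly one embedding per original word,
# which is what a word-level tagger expects.
word_embeddings = [piece_embeddings[i] for i in offsets]

print(pieces)
print(offsets)
print(len(word_embeddings) == len(words))
```

The sequence handed to the tagger therefore has the same length as the list of words, even though BERT itself saw a longer wordpiece sequence.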

@joelgrus
Thanks a lot. One more question about the NER config: when I set "label_encoding": "None" and "constrain_crf_decoding": "False", does that mean this is a BiLSTM+softmax model without a CRF?
I hope to get your reply.

aslicedbread on 7 Mar 2019

no, there's a CRF either way. the difference is that if you "constrain" the decoding, then the CRF has its weights set to disallow "impossible" transitions, like B-PER -> I-ORG

joelgrus on 12 Mar 2019
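
The notion of "impossible" transitions can be sketched without any library. The example below uses the simpler BIO scheme rather than the encoding in the config, and the large negative score is a hypothetical placeholder for however the CRF actually penalizes a disallowed transition.

```python
from typing import Dict, List, Tuple


def is_allowed_bio(from_tag: str, to_tag: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; every other transition is allowed."""
    if to_tag.startswith("I-"):
        entity = to_tag[2:]
        return from_tag in ("B-" + entity, "I-" + entity)
    return True


labels: List[str] = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

# Every (from, to) pair the constraint rules out.
forbidden: List[Tuple[str, str]] = [
    (a, b) for a in labels for b in labels if not is_allowed_bio(a, b)
]

# Hypothetical penalty: a transition score so low the decoder never picks it.
IMPOSSIBLE = -10000.0
raw_scores: Dict[Tuple[str, str], float] = {(a, b): 0.0 for a in labels for b in labels}
constrained_scores = {
    pair: (score if is_allowed_bio(*pair) else IMPOSSIBLE)
    for pair, score in raw_scores.items()
}

print(("B-PER", "I-ORG") in forbidden)
print(constrained_scores[("B-PER", "I-PER")], constrained_scores[("B-PER", "I-ORG")])
```

Without the constraint, the CRF layer is still present and still learns transition scores; it is just free to assign any score to pairs such as B-PER -> I-ORG.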

How to integrate BERT model for Reading Comprehension Problems?