Is your feature request related to a problem? Please describe.
I want to fine-tune BERT in AllenNLP, but BERT does not appear to be a registered model in this library. All I can do at present is generate embeddings with it (#2140), which is a little inflexible for me.
Am I missing anything?
Describe the solution you'd like
Make BERT more allennlpic, so we can treat it as a normal component, except that we don't train it but just load/fine-tune pretrained models.
Describe alternatives you've considered
Turn to pytorch-pretrained-BERT for help (but I am a bit greedy for the convenience of AllenNLP).
Additional context
None.
To do this with AllenNLP, you just use BERT as your TokenEmbedder, configure it so that it gets gradients, and put whatever prediction model you want on top. This is what @joelgrus did for NER (except without the fine tuning).
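As a rough illustration of the advice above, the fragment below sketches the embedder part of an experiment config as a Python dict. The key names (`bert-pretrained`, `pretrained_model`, `requires_grad`) are assumptions about the AllenNLP release in use at the time of this thread and may differ in other versions, and `bert-base-uncased` is only a placeholder model name.

```python
import json

# Hypothetical config fragment: BERT as the TokenEmbedder, with gradients enabled.
# Key names are assumptions about the AllenNLP version in use; check its docs.
text_field_embedder = {
    "token_embedders": {
        "bert": {
            "type": "bert-pretrained",
            "pretrained_model": "bert-base-uncased",  # placeholder model name
            "requires_grad": True,  # let gradients flow into BERT (fine-tuning)
        }
    }
}

# AllenNLP configs are JSON/Jsonnet, so the dict can be dumped into a config file.
config_json = json.dumps(text_field_embedder, indent=2)
print(config_json)
```

Whatever prediction model sits on top of this embedder then trains as usual; the only change for fine-tuning is that BERT's weights are no longer frozen.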
Got it. Who knows what I was thinking about? :sweat_smile:
Thanks. @matt-gardner
I'm having trouble integrating BERT into my POS tagger model. I think my trouble stems from the fact that BERT uses wordpieces: its output doesn't match what the POS model expects as input, i.e., one embedding per word. I believe this is where the offsets parameter is useful, but it's not entirely clear to me how to configure it properly.
It would be much clearer if there were an example in AllenNLP of the BERT model being used for a downstream task like sequence tagging. Is there such an example?
there's not an example per se, but here's the config I used to train the NER model using BERT embeddings:
https://gist.github.com/joelgrus/7cdb8fb2d81483a8d9ca121d9c617514
hopefully that's helpful
@joelgrus
Looks like I was missing bert-offsets in my embedder_to_indexer_map. That helped, thanks!
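For readers hitting the same wordpiece mismatch, the idea behind the offsets is to keep one wordpiece vector per original word. The sketch below illustrates that selection in plain Python with a made-up tokenization and dummy vectors. It is not AllenNLP's implementation: `wordpiece_offsets` and `select_word_vectors` are hypothetical helpers, and picking the first wordpiece of each word is just one convention (the last wordpiece is equally plausible).

```python
from typing import List, Sequence, Tuple


def wordpiece_offsets(pieces_per_word: Sequence[Sequence[str]]) -> Tuple[List[str], List[int]]:
    """Flatten per-word wordpieces and record the index of each word's first piece."""
    flat = ["[CLS]"]
    offsets = []
    for pieces in pieces_per_word:
        offsets.append(len(flat))  # position of this word's first wordpiece
        flat.extend(pieces)
    flat.append("[SEP]")
    return flat, offsets


def select_word_vectors(piece_vectors: Sequence[List[float]], offsets: Sequence[int]) -> List[List[float]]:
    """Keep one vector per word by indexing the wordpiece vectors at the offsets."""
    return [piece_vectors[i] for i in offsets]


# Toy tokenization, made up for illustration: three words, five wordpieces.
words = [["the"], ["em", "##bed", "##ding"], ["works"]]
flat, offsets = wordpiece_offsets(words)

# Dummy one-dimensional "embeddings": the vector for piece i is [i].
piece_vectors = [[float(i)] for i in range(len(flat))]
word_vectors = select_word_vectors(piece_vectors, offsets)

print(flat)
print(offsets)
print(word_vectors)
```

The tagger on top then sees exactly one vector per word, which is what a per-word label sequence requires.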
@joelgrus
Thanks a lot. One more question about the NER config: when I set "label_encoding": "None" and "constrain_crf_decoding": "False", does that mean the model is a BiLSTM + softmax without a CRF?
Hope to get your reply.
no, there's a CRF either way. the difference is that if you "constrain" the decoding, then the CRF has its weights set to disallow "impossible" transitions, like B-PER -> I-ORG
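To make the notion of "impossible" transitions concrete, here is a minimal sketch that enumerates the allowed label pairs under a BIO scheme. It is an illustration only, not the library's code: `bio_transition_allowed` is a hypothetical helper, and the label set is made up. A constrained CRF would effectively rule out every pair that this function rejects.

```python
def bio_transition_allowed(from_label: str, to_label: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; O and B-X may follow anything."""
    if to_label.startswith("I-"):
        entity = to_label[2:]
        return from_label in (f"B-{entity}", f"I-{entity}")
    return True


# Made-up label set for illustration.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

# All (from, to) pairs a constrained decoder would still permit.
allowed = [(a, b) for a in labels for b in labels if bio_transition_allowed(a, b)]

print(("B-PER", "I-ORG") in allowed)
print(("B-PER", "I-PER") in allowed)
```

With unconstrained decoding the CRF is still present; it simply has to learn from data that such transitions are unlikely, rather than being prevented from predicting them.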
How can I integrate a BERT model for reading comprehension problems?