Flair: Adding New Embeddings

Created on 11 Sep 2019 · 6 comments · Source: flairNLP/flair

Hi, and thank you so much for this amazing library. Would it be possible to add new embeddings to it, such as clinicalBERT?

https://github.com/EmilyAlsentzer/clinicalBERT

Is it something we can do ourselves once we download the respective embeddings? I would be happy to help if you could point me to how to add these.


All 6 comments

Hi @sanja7s,

this can be easily done with Flair.

Just download the clinical BERT package and extract it:

wget -O pretrained_bert_tf.tar.gz https://www.dropbox.com/s/8armk04fu16algz/pretrained_bert_tf.tar.gz\?dl\=1
tar -xzf pretrained_bert_tf.tar.gz
cd pretrained_bert_tf

I chose the Bio+Clinical BERT model; extract it with:

tar -xzf bert_pretrain_output_all_notes_150000.tar.gz

Then rename it to a PyTorch-Transformers-compatible name (that is, indicate whether the model is cased or uncased):

mv bert_pretrain_output_all_notes_150000 bert-base-clinical-cased

(I just looked at the vocabulary file to determine that they're using a cased model)

In order to be 100% compatible with PyTorch-Transformers, just rename the configuration file:

cd bert-base-clinical-cased
mv bert_config.json config.json
pwd # prints full path name - copy it and pass it to BertEmbeddings constructor later
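Before constructing BertEmbeddings, it can be worth checking that the model directory contains everything PyTorch-Transformers looks for, since a missing file surfaces later as a confusing load error. A minimal, dependency-free sketch (the expected file names are taken from this thread; the demo directory is a throwaway stand-in, not a real model):

```python
import tempfile
from pathlib import Path

# Files a PyTorch-Transformers BERT checkpoint directory is expected to contain.
REQUIRED_FILES = ["config.json", "vocab.txt", "pytorch_model.bin"]

def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]

# Demo on a throwaway directory that only has a config.json:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    print(missing_model_files(tmp))  # -> ['vocab.txt', 'pytorch_model.bin']
```

Running this against the extracted bert-base-clinical-cased directory should print an empty list; anything else means a rename or conversion step was skipped.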

Then this model can be used in Flair by passing the complete path to the BertEmbeddings constructor:

from flair.embeddings import BertEmbeddings
from flair.data import Sentence

embeddings = BertEmbeddings("./bert-base-clinical-cased")
sentence = Sentence("OpenEHR12 is a standard for interoperability specifically between different electronic patient record systems", use_tokenizer=True)

embeddings.embed(sentence)

for token in sentence.tokens: 
  print(token.embedding) 

This a) loads the clinical BERT model, b) embeds a sentence and c) prints out the embedding for each token.
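As a side note: token.embedding gives one vector per token, and a common way to get a single sentence-level vector is to average them element-wise. A library-free sketch with plain lists standing in for the token tensors (in Flair you could feed in something like token.embedding.tolist(); the toy 4-dimensional vectors below are made up for illustration):

```python
def mean_pool(token_vectors):
    """Element-wise mean of equally sized token vectors -> one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 4-dimensional "token embeddings":
tokens = [[1.0, 2.0, 0.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [2.0, 2.0, 3.0, 2.0]]
print(mean_pool(tokens))  # -> [2.0, 2.0, 1.0, 2.0]
```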

I hope this helps :)

Hi @stefan-it,

Thank you so much! It is already running as part of my model. This is amazing -- your library is the first one I have used in which everything works on the first try! Plus, there is your support. :)
Many many thanks and best wishes to you all!

OK, @stefan-it, I have another question -- I tried to reproduce the steps you suggested for clinicalBERT to add BioBERT from here:

https://github.com/naver/biobert-pretrained

I think I did everything in the same way, but I now get this error:

self.__embedding_length += embedding.embedding_length
  File "/home/xxx/anaconda3/envs/flair_CLONE/lib/python3.6/site-packages/torch/nn/modules/module.py", line 539, in __getattr__
    type(self).__name__, name))
AttributeError: 'BertEmbeddings' object has no attribute 'embedding_length'

Could you help me -- is there any simple suggestion on how to resolve this?
This is the structure of the new embedding directory:

._bert_config.json
biobert_model.ckpt.data-00000-of-00001
biobert_model.ckpt.index
biobert_model.ckpt.meta
config.json
._vocab.txt
vocab.txt

By comparison with clinicalBERT, I see it is missing these files, which seem PyTorch-related:

graph.pbtxt
pytorch_model.bin

while the config.json files are the same.

Is that perhaps the problem? Is there still a way to make these embeddings work with Flair?

Many thanks!

Hi @sanja7s,

unfortunately, the BioBERT authors only provide TensorFlow checkpoints. That's no problem, though: we just need to convert them into PyTorch-compatible weights. For that, install the latest version of pytorch-transformers (e.g. pip install --upgrade pytorch-transformers).

Download and extract a BioBERT model (I used BioBERT v1.0 (+ PubMed 200K + PMC 270K) for testing). First, rename the biobert_v1.0_pubmed_pmc directory to bert-base-biobert-cased (for better naming):

mv biobert_v1.0_pubmed_pmc bert-base-biobert-cased
cd bert-base-biobert-cased

For the next steps, make sure that you have a recent TensorFlow version installed (it is needed to read the TF checkpoints). The conversion can then be started with:

pytorch_transformers bert biobert_model.ckpt bert_config.json pytorch_model.bin

The last output lines should show something like:

INFO:pytorch_transformers.modeling_bert:Initialize PyTorch weight ['bert', 'pooler', 'dense', 'bias']
INFO:pytorch_transformers.modeling_bert:Initialize PyTorch weight ['bert', 'pooler', 'dense', 'kernel']
Save PyTorch model to pytorch_model.bin

Last step: rename the configuration file:

mv bert_config.json config.json
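If you are scripting this setup, the same rename can be done from Python with the standard library. A sketch (the throwaway directory below stands in for the extracted bert-base-biobert-cased folder; adjust the path to your setup):

```python
import tempfile
from pathlib import Path

def prepare_model_dir(model_dir):
    """Rename bert_config.json to config.json so PyTorch-Transformers finds it.

    Returns the directory listing so the result is easy to inspect.
    """
    cfg = Path(model_dir) / "bert_config.json"
    if cfg.is_file():  # no-op if the rename was already done
        cfg.rename(cfg.with_name("config.json"))
    return sorted(p.name for p in Path(model_dir).iterdir())

# Demo on a throwaway directory standing in for the model folder:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bert_config.json").write_text("{}")
    print(prepare_model_dir(tmp))  # -> ['config.json']
```

The is_file() guard makes the step safe to re-run, which the plain mv command is not.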

Then BioBERT can be loaded and used via BertEmbeddings:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings("/tmp/bert-base-biobert-cased")  # adjust path here

sentence = Sentence("Hemoglobin H inclusions can only be visualized with supravital stains and not Wright or Romanowsky stains.", use_tokenizer=True)

embeddings.embed(sentence)

for token in sentence.tokens:
  print(token.embedding)
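The token embeddings printed above can then be compared directly, for example with cosine similarity. A dependency-free sketch (in practice you would pass in something like token.embedding.tolist(), or compute this on the tensors directly; the two-dimensional vectors here are toy inputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0 (orthogonal)
```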

God bless you, @stefan-it! It worked again like a charm.
Cheers!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

