Flair: Adding New Embeddings

Created on 11 Sep 2019 · 6 comments · Source: flairNLP/flair

Hi, and thank you so much for this amazing library. Would it be possible to add new embeddings to it, such as clinicalBERT?

https://github.com/EmilyAlsentzer/clinicalBERT

Is it something we can do ourselves once we download the respective embeddings? I would be happy to help if you could point me to how to add these.


All 6 comments

Hi @sanja7s,

this can be easily done with Flair.

Just download the clinical BERT package and extract it:

wget -O pretrained_bert_tf.tar.gz https://www.dropbox.com/s/8armk04fu16algz/pretrained_bert_tf.tar.gz\?dl\=1
tar -xzf pretrained_bert_tf.tar.gz
cd pretrained_bert_tf

I chose the Bio+Clinical BERT model; extract it with:

tar -xzf bert_pretrain_output_all_notes_150000.tar.gz

Then rename it to a PyTorch-Transformers-compatible name (that is, indicate whether the model is cased or uncased):

mv bert_pretrain_output_all_notes_150000 bert-base-clinical-cased

(I just looked at the vocabulary file to determine that they're using a cased model)

In order to be 100% compatible with PyTorch-Transformers, just rename the configuration file:

cd bert-base-clinical-cased
mv bert_config.json config.json
pwd # prints full path name - copy it and pass it to BertEmbeddings constructor later
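Before constructing BertEmbeddings, it can be worth checking that the model directory contains everything PyTorch-Transformers looks for, since a missing file surfaces later as a confusing load error. A minimal, dependency-free sketch (the expected file names are taken from this thread; the demo directory is a throwaway stand-in, not a real model):

```python
import tempfile
from pathlib import Path

# Files a PyTorch-Transformers BERT checkpoint directory is expected to contain.
REQUIRED_FILES = ["config.json", "vocab.txt", "pytorch_model.bin"]

def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]

# Demo on a throwaway directory that only has a config.json:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    print(missing_model_files(tmp))  # -> ['vocab.txt', 'pytorch_model.bin']
```

Running this against the extracted bert-base-clinical-cased directory should print an empty list; anything else means a rename or conversion step was skipped.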

Then this model can be used in Flair by passing the complete path to the BertEmbeddings constructor:

from flair.embeddings import BertEmbeddings
from flair.data import Sentence

embeddings = BertEmbeddings("./bert-base-clinical-cased")
sentence = Sentence("OpenEHR12 is a standard for interoperability specifically between different electronic patient record systems", use_tokenizer=True)

embeddings.embed(sentence)

for token in sentence.tokens: 
  print(token.embedding) 

This a) loads the clinical BERT model, b) embeds a sentence and c) prints out the embedding for each token.
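As a side note: token.embedding gives one vector per token, and a common way to get a single sentence-level vector is to average them element-wise. A library-free sketch with plain lists standing in for the token tensors (in Flair you could feed in something like token.embedding.tolist(); the toy 4-dimensional vectors below are made up for illustration):

```python
def mean_pool(token_vectors):
    """Element-wise mean of equally sized token vectors -> one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 4-dimensional "token embeddings":
tokens = [[1.0, 2.0, 0.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [2.0, 2.0, 3.0, 2.0]]
print(mean_pool(tokens))  # -> [2.0, 2.0, 1.0, 2.0]
```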

I hope this helps :)

Hi @stefan-it,

Thank you so much! It is already running as part of my model. This is amazing -- your library is the first one I have used in which everything works on the first try! Plus, there is your support. :)
Many many thanks and best wishes to you all!

OK, @stefan-it, I have another question -- I tried to reproduce the steps you suggested for clinicalBERT to add BioBERT from here:

https://github.com/naver/biobert-pretrained

I think I did everything in the same way, but I now get this error:

self.__embedding_length += embedding.embedding_length
  File "/home/xxx/anaconda3/envs/flair_CLONE/lib/python3.6/site-packages/torch/nn/modules/module.py", line 539, in __getattr__
    type(self).__name__, name))
AttributeError: 'BertEmbeddings' object has no attribute 'embedding_length'

Could you help me -- is there any simple suggestion on how to resolve this?
This is the structure of the new embedding directory:

._bert_config.json
biobert_model.ckpt.data-00000-of-00001
biobert_model.ckpt.index
biobert_model.ckpt.meta
config.json
._vocab.txt
vocab.txt

By comparison with clinicalBERT, I see it is missing these files, which seem PyTorch-related:

graph.pbtxt
pytorch_model.bin

while the config.json files are the same.

Is that perhaps the problem? Is there still a way to make these embeddings work with Flair?

Many thanks!

Hi @sanja7s,

unfortunately, the BioBERT authors only provide TensorFlow checkpoints. That's no problem, though: we just need to convert them into PyTorch-compatible weights. For that, install the latest version of pytorch-transformers (e.g. pip install --upgrade pytorch-transformers).

Download and extract a BioBERT model (I used BioBERT v1.0 (+ PubMed 200K + PMC 270K) for testing). First, rename the biobert_v1.0_pubmed_pmc directory to bert-base-biobert-cased (for better naming):

mv biobert_v1.0_pubmed_pmc bert-base-biobert-cased
cd bert-base-biobert-cased

For the next steps, make sure that you have a recent TensorFlow version installed (it is needed to read the TF checkpoints). The conversion can then be started with:

pytorch_transformers bert biobert_model.ckpt bert_config.json pytorch_model.bin

The last output lines should show something like:

INFO:pytorch_transformers.modeling_bert:Initialize PyTorch weight ['bert', 'pooler', 'dense', 'bias']
INFO:pytorch_transformers.modeling_bert:Initialize PyTorch weight ['bert', 'pooler', 'dense', 'kernel']
Save PyTorch model to pytorch_model.bin

Last step: rename the configuration file:

mv bert_config.json config.json
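If you are scripting this setup, the same rename can be done from Python with the standard library. A sketch (the throwaway directory below stands in for the extracted bert-base-biobert-cased folder; adjust the path to your setup):

```python
import tempfile
from pathlib import Path

def prepare_model_dir(model_dir):
    """Rename bert_config.json to config.json so PyTorch-Transformers finds it.

    Returns the directory listing so the result is easy to inspect.
    """
    cfg = Path(model_dir) / "bert_config.json"
    if cfg.is_file():  # no-op if the rename was already done
        cfg.rename(cfg.with_name("config.json"))
    return sorted(p.name for p in Path(model_dir).iterdir())

# Demo on a throwaway directory standing in for the model folder:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bert_config.json").write_text("{}")
    print(prepare_model_dir(tmp))  # -> ['config.json']
```

The is_file() guard makes the step safe to re-run, which the plain mv command is not.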

Then BioBERT can be loaded and used via BertEmbeddings:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings("/tmp/bert-base-biobert-cased")  # adjust path here

sentence = Sentence("Hemoglobin H inclusions can only be visualized with supravital stains and not Wright or Romanowsky stains.", use_tokenizer=True)

embeddings.embed(sentence)

for token in sentence.tokens:
  print(token.embedding)
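The token embeddings printed above can then be compared directly, for example with cosine similarity. A dependency-free sketch (in practice you would pass in something like token.embedding.tolist(), or compute this on the tensors directly; the two-dimensional vectors here are toy inputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0 (orthogonal)
```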

God bless you, @stefan-it! It worked again like a charm.
Cheers!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

