Flair: Add BERT embeddings

Created on 27 Nov 2018 · 16 Comments · Source: flairNLP/flair

Would be handy to have BERT embeddings, especially for stacked embeddings etc.

feature release-0.4


All 16 comments

I trained on UD German with the multilingual model (and BERT embeddings only); the F-score was 90.1%.

Wow that was fast :) How many BERT layers did you use?

We still need to match BERT tokenization to our token structure. Currently, their tokenizer may split words that we treat as one token: for instance, the number "250,000" is split by BERT into the three word pieces "250", "," and "000", which it embeds individually. We currently just take the embedding of the first word piece as the token embedding, but this is a quick fix that will affect performance, especially for languages other than English. Once we fix this, I would expect POS tagging results to improve!
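To make the pooling idea concrete, here is a minimal sketch of 'first' vs. 'mean' pooling over word-piece vectors (the helper name and alignment format are made up for illustration; this is not the actual Flair code):

import torch

def pool_word_pieces(piece_vectors, pieces_per_token, pooling='first'):
    # piece_vectors: tensor of shape (num_pieces, hidden_size)
    # pieces_per_token: number of word pieces per original token,
    #                   e.g. [1, 3] for ["I", "250,000"] -> ["i", "250", ",", "000"]
    token_vectors = []
    offset = 0
    for n_pieces in pieces_per_token:
        pieces = piece_vectors[offset:offset + n_pieces]
        if pooling == 'first':
            token_vectors.append(pieces[0])           # use only the first word piece
        else:
            token_vectors.append(pieces.mean(dim=0))  # average over all word pieces
        offset += n_pieces
    return torch.stack(token_vectors)

# toy example: 2 tokens, the second one split into 3 word pieces
vectors = torch.randn(4, 768)
print(pool_word_pieces(vectors, [1, 3], pooling='mean').shape)  # torch.Size([2, 768])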

I used the default parameters of the BertEmbeddings class :) But I'm going to test it with more layers and follow this PR for further fixes :)

First version added to release-0.4 branch. You can call BertEmbeddings like any other embeddings:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# instantiate BERT embeddings
bert_embeddings = BertEmbeddings()

# make example sentence
sentence = Sentence('I love Berlin.', use_tokenizer=True)

# embed sentence
bert_embeddings.embed(sentence)

# print embedded tokens
for token in sentence:
    print(token)
    print(token.embedding)

This also makes it possible to mix and match BERT, Flair and ELMo embeddings with the StackedEmbeddings class:

from flair.embeddings import BertEmbeddings, FlairEmbeddings, StackedEmbeddings, WordEmbeddings

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    BertEmbeddings('bert-large-uncased'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

You can pass one of two pooling operations to BERT: 'first' and 'mean'. This affects how token embeddings are built from word pieces (either use the embedding of the first word piece, or average over all pieces). You can also choose which layers to concatenate to form the embeddings - it is currently set to the last 4 layers, which performed best for feature-based usage in the BERT paper.
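For example (just a sketch; the keyword names layers and pooling_operation are an assumption about the current release-0.4 signature, so check the class if they differ):

# assumed keyword names - adjust to the actual BertEmbeddings constructor:
# use only the last two layers and average over all word pieces per token
bert_embeddings = BertEmbeddings('bert-base-uncased',
                                 layers='-1,-2',
                                 pooling_operation='mean')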

However, our very first experiments do not give us great results with BERT embeddings. This could be because our parameter settings aren't great and more experimentation is required, or because we are doing something else wrong.

Anyway, we'd be happy to hear how they work if somebody is already experimenting with the 0.4 branch!

Hi @alanakbik, with ebfb453 I was able to train on Universal Dependencies for German. The model achieved an accuracy of 94.35, which is comparable to the state of the art (with Flair, an accuracy of 94.77 can be achieved according to the documentation).

So I guess this indicates that the implementation is correct; thanks for that!

I'm going to run more experiments with BERT embeddings now :)

Parameters tested so far:

from typing import List

from flair.embeddings import BertEmbeddings, StackedEmbeddings, TokenEmbeddings
from flair.models import SequenceTagger

# corpus, tag_dictionary and tag_type are prepared beforehand as in the
# usual flair training setup
bert_embeddings = BertEmbeddings(bert_model='bert-base-multilingual')

embedding_types: List[TokenEmbeddings] = [
    bert_embeddings
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/ud-german',
              learning_rate=0.1,
              mini_batch_size=8,
              max_epochs=500,
              embeddings_in_memory=False)

Hey this is great - thanks for letting us know! Also please keep us posted on your experiments - we're very curious to see how Flair, BERT and ELMo compare!

@alanakbik With commit 1f67c7fb458a864beb0e5aca48f3d1fc44e97e69 the BertEmbeddings layer is no longer working; the following error message appears:

    embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
  File "/mnt/flair/flair/embeddings.py", line 111, in __init__
    self.__embedding_length += embedding.embedding_length
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 518, in __getattr__
    type(self).__name__, name))
AttributeError: 'BertEmbeddings' object has no attribute 'embedding_length'

With ebfb453f5256ff5b9ed58d2d623128abed5f17ed it is working.

Hm strange. It is working for me on the current release-0.4 head. Could you try there?

I tried latest head of release-0.4 but the error message is still the same :(

Ah could it be that you are doing:

from pytorch_pretrained_bert.modeling import BertEmbeddings

instead of

from flair.embeddings import BertEmbeddings

If so, maybe we should rename the class to avoid confusion?
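If that turns out to be the problem, a quick workaround until any renaming happens is to alias the imports explicitly (a minimal sketch, assuming both packages are installed):

# keep the two identically named classes apart with explicit aliases
from flair.embeddings import BertEmbeddings as FlairBertEmbeddings
from pytorch_pretrained_bert.modeling import BertEmbeddings as BertEmbeddingLayer

embeddings = FlairBertEmbeddings('bert-base-cased')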

Hm. I used from flair.embeddings import BertEmbeddings. When I call print(dir(BertEmbeddings)), the following output is returned:

['BertInputFeatures', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_add_embeddings_internal', '_all_buffers', '_apply', '_convert_sentences_to_features', '_get_name', '_load_from_state_dict', '_slow_forward', '_tracing_name', '_version', 'add_module', 'apply', 'children', 'cpu', 'cuda', 'double', 'dump_patches', 'embed', 'embedding_length', 'embedding_type', 'eval', 'extra_repr', 'float', 'forward', 'half', 'load_state_dict', 'modules', 'named_children', 'named_modules', 'named_parameters', 'parameters', 'register_backward_hook', 'register_buffer', 'register_forward_hook', 'register_forward_pre_hook', 'register_parameter', 'share_memory', 'state_dict', 'to', 'train', 'type', 'zero_grad']

And embedding_length is in that list.
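A quick way to double-check which BertEmbeddings actually got imported (just a debugging sketch) is to look at its module and source file:

import inspect

from flair.embeddings import BertEmbeddings

# should print 'flair.embeddings' and the path to flair's embeddings.py;
# anything under pytorch_pretrained_bert would indicate a name collision
print(BertEmbeddings.__module__)
print(inspect.getfile(BertEmbeddings))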

@stefan-it Could you resolve the issue or is the error still there? If yes, we should take a closer look before the upcoming release.

@tabergma Unfortunately, the error remains. I'm using the latest commit in the release-0.4 branch (f96160cea7caaa2847ee3ec328921dad890c870c), PyTorch 1.0 and Python 3.6.6.

@stefan-it The model bert-base-multilingual is not available.

You should use one of the following:

| ID | Language | Embedding |
| ------------- | ------------- | ------------- |
| 'bert-base-uncased' | English | 12-layer, 768-hidden, 12-heads, 110M parameters |
| 'bert-large-uncased' | English | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| 'bert-base-cased' | English | 12-layer, 768-hidden, 12-heads, 110M parameters |
| 'bert-large-cased' | English | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| 'bert-base-multilingual-cased' | 102 languages | 12-layer, 768-hidden, 12-heads, 110M parameters |
| 'bert-base-chinese' | Chinese Simplified and Traditional | 12-layer, 768-hidden, 12-heads, 110M parameters |

(see also https://github.com/huggingface/pytorch-pretrained-BERT#loading-google-ais-pre-trained-weigths-and-pytorch-dump)

I'll add a check when loading the BertEmbeddings that the provided model is available.
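Such a check could look roughly like the sketch below (just an illustration of the idea, not the actual flair implementation; the function name is made up):

# known pre-trained BERT models at the time of writing (see the table above)
ALLOWED_BERT_MODELS = {
    'bert-base-uncased', 'bert-large-uncased',
    'bert-base-cased', 'bert-large-cased',
    'bert-base-multilingual-cased', 'bert-base-chinese',
}

def check_bert_model(bert_model: str) -> str:
    # raise a descriptive error if the requested BERT model is not available
    if bert_model not in ALLOWED_BERT_MODELS:
        raise ValueError(
            f"Unknown BERT model '{bert_model}'. "
            f"Choose one of: {', '.join(sorted(ALLOWED_BERT_MODELS))}")
    return bert_model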

Oh, 'bert-base-multilingual' existed in version 0.2 of pytorch-pretrained-BERT; the cased and uncased multilingual models were introduced in version 0.3 (the version flair uses). Thanks for the hint! It is working now :)

We are going to share any testing results here: https://github.com/zalandoresearch/flair/issues/308. Feel free to also share your results in that thread. Thanks!

