Flair: Multilingual NER

Created on 8 Apr 2020 · 36 Comments · Source: flairNLP/flair

Until now, when using multilingual embeddings for NER, I have trained only on English text, and the model gave me good inference results in Spanish. However, I now also have a small number of tagged samples in Spanish (fewer than 1000 sentences), which is too few to build a standalone Spanish NER model. My question is: can I combine the English and Spanish samples for training? Is this possible, and do you have any thoughts on the accuracy of this kind of mixed-language training data?

Remark: the entities I use are not the standard ones; I train the NER model with custom entities.

Thanks in advance,
Igor

Labels: question, wontfix

All 36 comments

Hello @igormis, that can work if the entities are similar across languages. For instance, our multilingual NER models ('ner-multi') are trained with multilingual Flair embeddings using all four CoNLL-03 corpora. You can use the MultiCorpus object for this, as in the tutorial here.
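For reference, a minimal sketch of such a setup (the column format and data paths are placeholders; this assumes CoNLL-style column files for each language):

from flair.data import MultiCorpus
from flair.datasets import ColumnCorpus

# CoNLL-style column format: token in column 0, NER tag in column 1
columns = {0: 'text', 1: 'ner'}

# hypothetical folders holding the English and Spanish annotations
corpus_en = ColumnCorpus('data/en', columns, train_file='train.txt')
corpus_es = ColumnCorpus('data/es', columns, train_file='train.txt')

# a MultiCorpus simply concatenates the train/dev/test splits
corpus = MultiCorpus([corpus_en, corpus_es])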

In our case, training an NER model with training data in 4 languages worked well and we found that it even works for languages that were not part of the training data, like French. That's likely because we had a related language (Spanish) in the training data, and person names, location names etc. are often preserved across languages. So depending on your entity types it might work.

OK, perfect. The entities are the same, so I can add the Spanish samples to the English ones. Thanks a lot.

@alanakbik Concerning multilingual NER, I saw the tutorial. I would like to know: if my dataset is composed of English, Russian, Spanish, and Ukrainian tagged data, is this a good approach:

embedding_types: List[TokenEmbeddings] = [
    FastTextEmbeddings('wiki.ru.align.vec'),  # there are no WordEmbeddings for RU
    FastTextEmbeddings('wiki.uk.align.vec'),  # there are no WordEmbeddings for UK
    FastTextEmbeddings('wiki.es.align.vec'),  # or use WordEmbeddings('es')
    WordEmbeddings('glove'),  # this one is for the English corpora
    # we use multilingual Flair embeddings in this task
    FlairEmbeddings('multi-forward'),   # the contextual embeddings are multilingual
    FlairEmbeddings('multi-backward'),  # the contextual embeddings are multilingual
]

I know that the model will be big in this case, but in my opinion this seems like the best approach :)

Of course, instead of WordEmbeddings('glove'), FastTextEmbeddings('wiki.en.align.vec') can be used.

Yes, makes sense, though the resulting model would be gigantic with all those word embeddings! You could instead use either multilingual BERT (smaller model but slower prediction):

embeddings = StackedEmbeddings([
    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
    TransformerWordEmbeddings('bert-base-multilingual-uncased', fine_tune=False, layers='all', use_scalar_mix=True),
])

Or multilingual byte pair embeddings (smaller model, but potentially some problems with distributing trained models since there are some bugs in BytePairEmbeddings serialization):

embeddings = StackedEmbeddings([
    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
    BytePairEmbeddings('multi'),
])

@alanakbik thanks a lot, I will play with both options and see the results. One question: if I use TransformerWordEmbeddings, will the input text size still be a problem at inference, as with the BERT-based models, or is that avoided because I am using only the word embeddings?

Yes, that limitation still exists until we find a workaround.
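A possible workaround at prediction time (a sketch, not an official fix; tagger and long_text stand in for your model and input): split long inputs into sentences before predicting, so that no single Sentence exceeds BERT's 512-subtoken limit:

from segtok.segmenter import split_single
from flair.data import Sentence

# split the long input into sentence-sized chunks before tagging
sentences = [Sentence(part) for part in split_single(long_text) if part]
tagger.predict(sentences)

Very long individual sentences can of course still hit the limit.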

I tried the second approach (BytePairEmbeddings) and the results are quite good. However, when I try the first approach (TransformerWordEmbeddings), I receive the following error:
Using bos_token, but it is not set yet.
Any help on this, @alanakbik?

How do I add this beginning-of-sentence token?

I am creating the tag_dictionary as follows:

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

Also, when I use the deprecated BertEmbeddings("bert-base-multilingual-uncased"), the bos_token is not needed and the training works.

The BOS token message is just a warning; it does not affect anything. But maybe we should turn the warning off in this case.
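In the meantime, one way to silence it from user code (a sketch; the exact logger name may vary across transformers versions) is to raise the log level of the tokenizer module that emits it:

import logging

# 'Using bos_token, but it is not set yet.' comes from the transformers
# tokenizer utilities; raising the log level hides the message
logging.getLogger('transformers.tokenization_utils').setLevel(logging.ERROR)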

OK, I will test it this way and report the results for 1 and 2. However, when using my approach (the gigantic model), there is the problem that the aligned fastText embeddings come only in text (.vec) format, whereas Flair needs the .bin format. Any suggestion on this?

With BertEmbeddings("bert-base-multilingual-uncased") the training works, while with TransformerWordEmbeddings('bert-base-multilingual-uncased', fine_tune=False, layers='all', use_scalar_mix=True) the training throws the following error (using the same code):

/usr/local/lib/python3.6/dist-packages/flair/models/sequence_tagger_model.py in forward(self, sentences)
    530                 len(sentences),
    531                 longest_token_sequence_in_batch,
--> 532                 self.embeddings.embedding_length,
    533             ]
    534         )

RuntimeError: shape '[32, 45, 4864]' is invalid for input of size 6942720

Is this error thrown immediately or does it occur sometime in the middle of an epoch?

Immediately, right after loading the models. It is strange that the deprecated BertEmbeddings class does not show any error.

Sorry to bombard you with questions, but I also noticed that the FastText embeddings are hard-coded to the .bin format (a .bin suffix is appended to the embeddings path). Is there any possibility to use the .vec format of the fastText embeddings?

Hm, the following code runs on my machine:

from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

embeddings = TransformerWordEmbeddings(
    'bert-base-multilingual-uncased',
    fine_tune=False,
    layers='all',
    use_scalar_mix=True,
)

corpus = CONLL_03(in_memory=False)

# make tag dictionary
tag_dictionary = corpus.make_tag_dictionary('ner')
print(tag_dictionary)

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=True,
    use_rnn=True,
)

trainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/ner-quickrun',
                max_epochs=150,
                mini_batch_size=32,
                embeddings_storage_mode='cpu',
                )

Yes, it is better to use the .vec file, because the .bin files are models that also compute embeddings for out-of-vocabulary words, and lots of people have found that this does not work so well.

For loading .vec files, can you try the approach linked in #1290?
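The approach there is roughly the following (a sketch; the file name is a placeholder): load the .vec file with gensim, save it in gensim's native format, and point WordEmbeddings at the saved file:

import gensim
from flair.embeddings import WordEmbeddings

# load the plain-text fastText vectors (.vec)
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(
    'wiki.es.align.vec', binary=False
)

# save in gensim's native format, which WordEmbeddings accepts as a path
word_vectors.save('wiki.es.align.vec.gensim')
embeddings = WordEmbeddings('wiki.es.align.vec.gensim')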

I tried the same code you provided with the WNUT_17().downsample(0.1) dataset and the same error occurred (I am just omitting the in_memory flag). Note that I am using !pip install --upgrade git+https://github.com/zalandoresearch/flair.git because of the GPU issues with the release.

2020-05-08 12:24:01,568 ----------------------------------------------------------------------------------------------------
2020-05-08 12:24:02,216 epoch 1 - iter 1/11 - loss 79.27043152 - samples/sec: 49.52
2020-05-08 12:24:02,936 epoch 1 - iter 2/11 - loss 66.78250122 - samples/sec: 54.96
2020-05-08 12:24:03,644 epoch 1 - iter 3/11 - loss 53.94743983 - samples/sec: 54.05
2020-05-08 12:24:04,351 epoch 1 - iter 4/11 - loss 44.00874758 - samples/sec: 53.64
2020-05-08 12:24:05,064 epoch 1 - iter 5/11 - loss 37.12425022 - samples/sec: 54.15
2020-05-08 12:24:05,753 epoch 1 - iter 6/11 - loss 32.29027907 - samples/sec: 56.36
2020-05-08 12:24:06,485 epoch 1 - iter 7/11 - loss 28.55213983 - samples/sec: 52.00
2020-05-08 12:24:07,197 epoch 1 - iter 8/11 - loss 25.60082382 - samples/sec: 53.72
2020-05-08 12:24:07,921 epoch 1 - iter 9/11 - loss 23.54193216 - samples/sec: 52.86
2020-05-08 12:24:08,657 epoch 1 - iter 10/11 - loss 22.31630187 - samples/sec: 52.93
2020-05-08 12:24:09,166 epoch 1 - iter 11/11 - loss 20.95466362 - samples/sec: 81.52
2020-05-08 12:24:09,305 ----------------------------------------------------------------------------------------------------
2020-05-08 12:24:09,305 EPOCH 1 done: loss 20.9547 - lr 0.1000000

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-14-f679cfb52410> in <module>()
     36                 max_epochs=150,
     37                 mini_batch_size=32,
---> 38                 embeddings_storage_mode='cpu',
     39                 )

2 frames

/usr/local/lib/python3.6/dist-packages/flair/models/sequence_tagger_model.py in forward(self, sentences)
    530                 len(sentences),
    531                 longest_token_sequence_in_batch,
--> 532                 self.embeddings.embedding_length,
    533             ]
    534         )

RuntimeError: shape '[32, 41, 768]' is invalid for input of size 1006080

Also, when I use this code, I get the same error:

from flair.data import Corpus
from flair.datasets import WNUT_17
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# init embedding
embeddings = TransformerWordEmbeddings(
    'bert-base-multilingual-uncased',
    fine_tune=False,
    layers='all',
    use_scalar_mix=True,
)

corpus: Corpus = WNUT_17(in_memory=False).downsample(0.1)
print(corpus)
# 2. what tag do we want to predict?

# make tag dictionary
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary)

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=True,
    use_rnn=True,
)

trainer = ModelTrainer(tagger, corpus)

trainer.train(f'resources/taggers/ner-quickrun',
                max_epochs=150,
                mini_batch_size=32,
                embeddings_storage_mode='cpu',
                )

Ah, but then the error is not thrown immediately after loading the model, right? It runs through the entire first epoch.

Yes, sorry for that.

Thanks, getting the same error now :) I'll debug.

Another thing is that when I use the gigantic model:

embedding_types: List[TokenEmbeddings] = [
    FastTextEmbeddings("cc.uk.50.bin"),
    WordEmbeddings('ru'),
    WordEmbeddings('es'),   
    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
    BytePairEmbeddings('multi')
]

and train the model:

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
              train_with_dev=True,
              mini_batch_size=64,
              patience=2,
              max_epochs=185)

the training finishes, as shown here:

2020-05-08 23:06:08,424 epoch 185 - iter 23/232 - loss 3.39433322 - samples/sec: 150.50
2020-05-08 23:06:17,596 epoch 185 - iter 46/232 - loss 3.33427280 - samples/sec: 160.85
2020-05-08 23:06:26,660 epoch 185 - iter 69/232 - loss 3.30985034 - samples/sec: 162.77
2020-05-08 23:06:35,990 epoch 185 - iter 92/232 - loss 3.32857562 - samples/sec: 158.12
2020-05-08 23:06:45,164 epoch 185 - iter 115/232 - loss 3.33359058 - samples/sec: 160.82
2020-05-08 23:06:54,427 epoch 185 - iter 138/232 - loss 3.32781486 - samples/sec: 159.30
2020-05-08 23:07:03,691 epoch 185 - iter 161/232 - loss 3.32972354 - samples/sec: 159.24
2020-05-08 23:07:14,302 epoch 185 - iter 184/232 - loss 3.32314946 - samples/sec: 139.00
2020-05-08 23:07:23,555 epoch 185 - iter 207/232 - loss 3.34216449 - samples/sec: 159.45
2020-05-08 23:07:32,677 epoch 185 - iter 230/232 - loss 3.33873231 - samples/sec: 161.78
2020-05-08 23:07:33,199 ----------------------------------------------------------------------------------------------------
2020-05-08 23:07:33,200 EPOCH 185 done: loss 3.3349 - lr 0.0062500
Epoch   185: reducing learning rate of group 0 to 3.1250e-03.
2020-05-08 23:07:33,201 BAD EPOCHS (no improvement): 3

However, no evaluation is done and the model file (final-model.pt) is zero bytes. Any info on this, @alanakbik? Maybe the model is too big to be saved?
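Note that with train_with_dev=True there is, by design, no per-epoch dev evaluation, so the missing evaluation is expected; the empty final-model.pt is not. If the automatic save keeps producing an empty file (disk space is one thing to rule out for a model this size), a stopgap is to save the tagger manually after train() returns and check the file size (a sketch; the path is a placeholder):

import os

# save the trained tagger manually and verify the file is non-empty
path = 'resources/taggers/example-ner/manual-final-model.pt'
tagger.save(path)
print(f'saved model: {os.path.getsize(path) / 1024 ** 2:.1f} MB')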

I tried the approach for the .vec file linked in #1290. In addition, I am trying to reduce the dimensionality of the vectors from 300 to 64 using PCA, like this:

import gensim
from sklearn.decomposition import PCA
from flair.embeddings import WordEmbeddings

language = "es"

word_vectors = gensim.models.KeyedVectors.load_word2vec_format(
    'https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.es.align.vec',
    binary=False,
)

# reduce the 300d vectors to 64d with PCA
pca = PCA(n_components=64)
X = word_vectors.wv[word_vectors.wv.vocab]
pca_res = pca.fit_transform(X)

word_vectors.wv.vectors = pca_res
word_vectors.save(f'wiki.{language}.vec.gensim')
embeddings = WordEmbeddings(f'wiki.{language}.vec.gensim')

and this works, i.e. the embeddings are read. But when doing the training with WNUT_17:

embeddings = WordEmbeddings(f"/content/drive/My Drive/wiki.en.vec.gensim")
print(embeddings.embedding_length)
#keys = frozenset(word_vectors.wv.vocab.items())


print('loaded')
corpus: Corpus = WNUT_17(in_memory=False).downsample(0.1)
print(corpus)
# 2. what tag do we want to predict?

# make tag dictionary
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary)

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=True,
    use_rnn=True,
)

trainer = ModelTrainer(tagger, corpus)

trainer.train(f'resources/taggers/ner-quickrun',
                max_epochs=150,
                mini_batch_size=32,
                embeddings_storage_mode='cpu',
                )

it prints that the embeddings are 300d and throws the following error:
/usr/local/lib/python3.6/dist-packages/flair/models/sequence_tagger_model.py in forward(self, sentences)
    530                 len(sentences),
    531                 longest_token_sequence_in_batch,
--> 532                 self.embeddings.embedding_length,
    533             ]
    534         )

RuntimeError: shape '[32, 29, 300]' is invalid for input of size 152140
It looks like the embeddings are expected to be 300d. Is there a way to work with 64d vectors in order to save memory and speed up inference? (Again, I need this for the gigantic model.)
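One plausible cause (an assumption, not verified against this setup): Flair's WordEmbeddings takes its embedding_length from gensim's vector_size attribute, and overwriting .vectors alone leaves that attribute at 300. A sketch of the tail of the PCA snippet with the attribute updated to match:

# replace the vectors and also update the stored dimensionality,
# which Flair's WordEmbeddings reads as its embedding_length
word_vectors.wv.vectors = pca_res
word_vectors.wv.vector_size = 64  # assumption: otherwise this stays 300

word_vectors.save(f'wiki.{language}.vec.gensim')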

@igormis we just merged a PR that should fix these issues. Could you update your local version and try the problematic code again?

@alanakbik should I clone the master branch, i.e.
!pip install --upgrade git+https://github.com/zalandoresearch/flair.git

Yes, best use a fresh environment and do:

!pip install --upgrade git+https://github.com/flairNLP/flair.git

OK, thanks, will check and report :)

When using

    FlairEmbeddings('multi-forward'),
    FlairEmbeddings('multi-backward'),
    TransformerWordEmbeddings('bert-base-multilingual-uncased', fine_tune=False, layers='all', use_scalar_mix=True,),

the following error is thrown:

2020-05-11 17:52:27,496 epoch 1 - iter 46/463 - loss 44.93550226 - samples/sec: 31.23

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-7-e9a519e04622> in <module>()
     20               train_with_dev=True,
     21               patience=2,
---> 22               max_epochs=10)

6 frames

/usr/local/lib/python3.6/dist-packages/flair/embeddings/token.py in _add_embeddings_to_sentences(self, sentences)
   1007 
   1008                         if self.pooling_operation == "first":
-> 1009                             final_embedding: torch.FloatTensor = current_embeddings[0]
   1010 
   1011                         if self.pooling_operation == "last":

IndexError: index 0 is out of bounds for dimension 0 with size 0

@alanakbik any info regarding these issues/questions:

  1. Problem with the TransformerWordEmbeddings (previous message)
  2. Using word embeddings of a different length (after PCA I reduced the length to 64), which throws an error (using the approach from #1290)
  3. Input size when using BERT-like models
  • The BOS token warning is solved

Hm, I cannot reproduce this error in my setup. When I use these three embeddings it works ok.

OK, I will check my dataset; maybe there are problems with empty tags. As for 2 and 3, is there still no fix?
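One quick sanity check before training (a sketch; filter_empty_sentences exists on Flair's Corpus, though it only catches fully empty sentences):

# remove empty sentences from the train/dev/test splits, which can
# trigger index errors during embedding
corpus.filter_empty_sentences()
print(corpus)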

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I get the same error when trying to use BERT models:
File ".../lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 638, in forward
self.embeddings.embedding_length,
RuntimeError: shape '[32, 46, 3072]' is invalid for input of size 4515840

Did anyone ever find out what causes this issue?

