Hello,
I would like to know if it's possible to extend the set of entities recognized by a NER SequenceTagger. For instance, I trained a sequence tagger with CamemBERT and it currently detects the following entities:
tagger.tag_dictionary.get_items()
output:
['<unk>', 'O', 'B-LOC', 'B-DATE', 'I-DATE', 'B-TIME', 'B-ORG', 'I-ORG', 'B-PRODUCT', 'I-PRODUCT', 'B-PER', 'I-PER', 'I-LOC', 'I-TIME', 'B-EVENT', 'I-EVENT', '-', '', '<START>', '<STOP>']
So it can recognize: LOC, ORG, PER, TIME, EVENT, PRODUCT and DATE.
Now I would like to extend this tagger so it can also recognize, say, "MOLECULE". I tried to add "B-MOLECULE" and "I-MOLECULE" to the tag dictionary like this:
tagger.tag_dictionary.add_item("B-MOLECULE")
tagger.tag_dictionary.add_item("I-MOLECULE")
and then I started training my tagger on a CoNLL-formatted corpus where some tokens are tagged with B-MOLECULE or I-MOLECULE, but I get the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-11-666d914e1eca> in <module>
10 embeddings_storage_mode="cpu",
11 monitor_test=True,
---> 12 checkpoint=True)
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/trainers/trainer.py in train(self, base_path, learning_rate, mini_batch_size, mini_batch_chunk_size, max_epochs, scheduler, anneal_factor, patience, initial_extra_patience, min_learning_rate, train_with_dev, monitor_train, monitor_test, embeddings_storage_mode, checkpoint, save_final_model, anneal_with_restarts, anneal_with_prestarts, batch_growth_annealing, shuffle, param_selection_mode, num_workers, sampler, use_amp, amp_opt_level, eval_on_train_fraction, eval_on_train_shuffle, **kwargs)
347
348 # forward pass
--> 349 loss = self.model.forward_loss(batch_step)
350
351 # Backward
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in forward_loss(self, data_points, sort)
598 ) -> torch.tensor:
599 features = self.forward(data_points)
--> 600 return self._calculate_loss(features, data_points)
601
602 def forward(self, sentences: List[Sentence]):
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in _calculate_loss(self, features, sentences)
735
736 forward_score = self._forward_alg(features, lengths)
--> 737 gold_score = self._score_sentence(features, tags, lengths)
738
739 score = forward_score - gold_score
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in _score_sentence(self, feats, tags, lens_)
707 score[i] = torch.sum(
708 self.transitions[
--> 709 pad_stop_tags[i, : lens_[i] + 1], pad_start_tags[i, : lens_[i] + 1]
710 ]
711 ) + torch.sum(feats[i, r, tags[i, : lens_[i]]])
IndexError: index 20 is out of bounds for dimension 0 with size 20
So my question is: what is the correct way to do this, if it is possible at all?
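For reference, the extra training sentences are in the usual CoNLL column format (token, then tag); a made-up excerpt (the sentence and molecule are invented for illustration) looks like:

```
La O
caféine B-MOLECULE
est O
un O
stimulant O
. O
```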
Hello @Nighthyst, the problem is that the final layers of the model are trained to output predictions over the previous tag space, so adding a tag to the dictionary alone is not enough. You would need to initialize new upper layers as well.
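To make the mismatch concrete, here is a minimal, dependency-free sketch (plain Python lists standing in for the CRF transition tensor; the sizes are taken from the traceback above):

```python
# the old tagger was built for 20 tags, so its CRF transition
# table is conceptually a 20 x 20 matrix (plain lists here to keep
# the sketch dependency-free; in flair it is a torch tensor)
old_tagset_size = 20
transitions = [[0.0] * old_tagset_size for _ in range(old_tagset_size)]

# adding "B-MOLECULE" / "I-MOLECULE" to the dictionary yields new
# tag indices >= 20, but the transition table was never resized
new_tag_index = 20
try:
    transitions[new_tag_index][0]
except IndexError:
    print("index", new_tag_index, "is out of bounds for size", old_tagset_size)
```

This is exactly the shape mismatch behind the `IndexError: index 20 is out of bounds for dimension 0 with size 20` above.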
Here's a code example that should do the trick:
from flair.data import Corpus, Dictionary
from flair.datasets import WNUT_17
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
# load your previous tagger
previous_tagger: SequenceTagger = SequenceTagger.load('ner')
# get previous tag dictionary
tag_dictionary: Dictionary = previous_tagger.tag_dictionary
# new corpus and tag dictionary
corpus: Corpus = WNUT_17()
# new tags of new corpus
new_tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
# consolidate tag dictionary (add new tags to old one)
for new_tag in new_tag_dictionary.item2idx.keys():
    tag_dictionary.add_item(str(new_tag))
# initialize new tagger with extended dictionary (re-use same embeddings as before)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=previous_tagger.embeddings,
    tag_dictionary=new_tag_dictionary,
    tag_type='ner',
)
# reuse internal layers
tagger.embedding2nn = previous_tagger.embedding2nn
tagger.rnn = previous_tagger.rnn
# train as always
trainer = ModelTrainer(tagger, corpus)
trainer.train( ... )
Hello @alanakbik, thank you for your code snippet, it works. However, I now face a "catastrophic forgetting" problem. For a bit of context, I trained a CamemBERT model on a dataset called Winer with the entities LOC, ORG, PER, DATE, TIME, PRODUCT and EVENT. Trained on this dataset, my model recognizes these entities pretty well (its micro F1-score is 87.42%).
I want to add a new entity to my model, let's say WORK OF ART. For that I have several sentences (approximately 60 sentences, which I know may not be enough) that I use to train the model to recognize WORK OF ART. I think it's important to mention that I labelled these sentences trying to identify all the entities my model already knows, not only WORK OF ART. So, if there is a person in a sentence, I put the label PER on it, and so on.
At the end of the training on these sentences, my new model is much worse than before. For instance, for the sentence (translated into English so you can understand, but of course the sentences are in French since I used CamemBERT) "Apple and Microsoft are two firms whose headquarters are in The US", the past model predicts:
"Apple [B-ORG] and Microsoft [B-ORG] are two firms whose headquarters are in The [B-LOC] US [I-LOC]"
So, everything is fine. However, the new model's prediction is:
"Apple [B-EVENT] and Microsoft are two firms whose headquarters are in The [B-PER] US [I-PER]"
So, it really seems to have forgotten what it previously learnt on Winer. Any tips to fix that? I still have my Winer dataset, so I was thinking of adding the sentences with the new entities to the Winer dataset and training on that instead, but I don't think it's the fastest way to deal with this problem.
Actually, I think merging two datasets like this isn't a good idea at all: if the entity WORK OF ART appeared somewhere in the sentences of the first dataset, where it wasn't one of the tags, then the model will have a hard time understanding why so many works of art are not labelled WORK OF ART in dataset 1 when they are in dataset 2.
Hello @Nighthyst, interesting problem - thanks for sharing the details! Yes, I can imagine that the model now forgets old information, since it is only trained to detect the new class. Some things to try:
Try setting a much lower learning rate than our standard 0.1. For instance, you could use a tiny learning rate, the Adam optimizer and only a few epochs:
from torch.optim import Adam
trainer = ModelTrainer(tagger, corpus, optimizer=Adam)
trainer.train('resources/taggers/finetune',
              learning_rate=3e-5,  # use very small learning rate
              max_epochs=5,  # terminate after 5 epochs
              )
That's how we finetune the transformer models, so maybe this works here as well. Essentially this way you lessen the chance of overfitting.
Another thing to try would be to randomly select some 100-200 sentences from the original data that contain the old entity types. Add these sentences to your new training data and train over the combined set. Maybe this will help the model not forget the old information.
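A minimal sketch of that mixing step (the variable names and sentence counts are placeholders; in practice each item would be a flair Sentence loaded from your corpora):

```python
import random

# stand-ins for sentences loaded from the two corpora; in practice
# these would be flair Sentence objects from a ColumnCorpus
original_sentences = [f"winer_sentence_{i}" for i in range(10000)]
new_sentences = [f"new_entity_sentence_{i}" for i in range(60)]

# randomly pick 100-200 sentences from the original data so the model
# keeps seeing the old entity types while learning the new one
random.seed(42)
replay = random.sample(original_sentences, 150)

# train over the combined, shuffled set
combined_train = new_sentences + replay
random.shuffle(combined_train)
print(len(combined_train))  # 210
```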
So I did different tests varying the share of the original dataset: 2%, 4%, 8%, 10%. With 8% of the original dataset plus the sentences from the dataset with the new entity, in 5 epochs I get an F1-score of 77% on the new entity and 84% on all entities, which is really similar to what I got before trying to add an entity, when training only on the original dataset.
One detail that may be important: when trying to fine-tune the model, regardless of the learning rate, the scores stayed close to 0, so I chose fine_tune=False. This is probably because my base model had been created with fine_tune=False, that is, without modifying the weights of the pre-trained embedding model (here CamemBERT).
Just a small detail: compared to the code you proposed to extend the tag_dictionary of the old model, I proceeded as follows instead:
tag_dictionary = tagger.tag_dictionary
new_tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type).get_items()
for new_tag in new_tag_dictionary:
    tag_dictionary.add_item(str(new_tag))
new_tagger = SequenceTagger(
    hidden_size=256,
    embeddings=tagger.embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
)
It seems to work better. Overall, everything works now, thank you 👍
Cool, thanks for sharing the details!