Flair: The order of stacked embeddings

Created on 23 Mar 2021 · 10 Comments · Source: flairNLP/flair

Hello,
I have noticed that the model for NER depends on the order of the stacked embeddings. That is, model performance is far better when the Flair embeddings are concatenated with BERT as follows:
````
# 4. initialize embeddings
embedding_types = [
    # starting with flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    # now add bert
    TransformerWordEmbeddings('bert-base-uncased')
]
````
On the other hand, when BERT is stacked before Flair, the model is not able to detect the labels at all:

````
# 4. initialize embeddings
embedding_types = [
    # starting with bert
    TransformerWordEmbeddings('bert-base-uncased'),
    # now add flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]
````
My question is: how can the order of the embeddings affect the quality of the classification?

Label: question

All 10 comments

Strange, there should be no difference at all. Could you share your training script?

@alanakbik Thank you for your reply.
The dataset for training and testing has the following structure:
````
great O
music B_A
, O
long O
story B_A
with O
lots O
of O
colorful O
character B_A
details I_A
and O
twist B_A
at O
the O
end O
. O
````

I am using a Kaggle notebook to train a BiLSTM-based model for aspect-term extraction:

````
# import libraries
import flair
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.data import MultiCorpus  # use multiple corpora to train the model
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# import the embeddings
from flair.embeddings import WordEmbeddings, CharacterEmbeddings, StackedEmbeddings, \
    FlairEmbeddings, ELMoEmbeddings, \
    TransformerWordEmbeddings, PooledFlairEmbeddings
import torch

# set_seed sets the seeds for random, numpy and torch (release 0.7),
# for ensuring training reproducibility
flair.set_seed(42)

import sklearn
from sklearn.model_selection import train_test_split

# split the raw data (sentences_train and labels_train are defined earlier in the notebook);
# by creating the corpus, evaluation data would be defined implicitly
# even if we didn't provide test_file and dev_file
x_train, x_test, y_train, y_test = train_test_split(sentences_train, labels_train, test_size=0.001, random_state=42)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.08, random_state=42)

# convert to txt files
with open("train_aspects.txt", "w") as fp:
    for sentence, label in zip(x_train, y_train):
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')

with open("test_aspects.txt", "w") as fp:
    for sentence, label in zip(x_test, y_test):
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')

with open("dev_aspects.txt", "w") as fp:
    for sentence, label in zip(x_dev, y_dev):
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')

# define columns
columns = {0: 'text', 1: 'ner'}

# data folder
data_folder = './'

# get the corpus for aspect extraction,
# using the column format, the data folder and the names of the train, test and dev files
corpus_ASP: Corpus = ColumnCorpus(data_folder, columns,
                                  train_file='train_aspects.txt',
                                  test_file='test_aspects.txt',
                                  dev_file='dev_aspects.txt')

# obtain_statistics returns a python dictionary with useful statistics about the dataset
stats = corpus_ASP.obtain_statistics()
print(stats)
````

````
{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 2221,
        "number_of_documents_per_class": {},
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 37740,
            "min": 1,
            "max": 162,
            "avg": 16.9923457901846
        }
    },
    "TEST": {
        "dataset": "TEST",
        "total_number_of_documents": 3,
        "number_of_documents_per_class": {},
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 29,
            "min": 5,
            "max": 13,
            "avg": 9.666666666666666
        }
    },
    "DEV": {
        "dataset": "DEV",
        "total_number_of_documents": 194,
        "number_of_documents_per_class": {},
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 3563,
            "min": 1,
            "max": 95,
            "avg": 18.3659793814433
        }
    }
}
````

````
# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary_ASP = corpus_ASP.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary_ASP)
# Dictionary with 6 tags: <unk>, O, B_A, I_A, <START>, <STOP>
````

````
# 4. initialize embeddings
embedding_types = [
    # starting with bert
    TransformerWordEmbeddings('bert-base-uncased'),
    # now add flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward')
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
````

````
# 5. initialize sequence tagger
tagger_ASP: SequenceTagger = SequenceTagger(hidden_size=256,  # tried 256 and 128
                                            embeddings=embeddings,
                                            tag_dictionary=tag_dictionary_ASP,
                                            tag_type=tag_type,
                                            rnn_type="LSTM",
                                            rnn_layers=2,  # 1 or 2
                                            use_crf=False)  # True or False
````

````
# 6. initialize trainer
from flair.trainers import ModelTrainer
trainer_ASP: ModelTrainer = ModelTrainer(tagger_ASP, corpus_ASP)
````

````
# 7. start training
trainer_ASP.train('sequence-labeling/ASP',
                  learning_rate=0.1,
                  mini_batch_size=32,
                  max_epochs=150,
                  train_with_dev=False,
                  train_with_test=True,  # also train on the 3 sentences defined as test data in the corpus
                  patience=3,
                  embeddings_storage_mode='gpu')
````

````
# create a txt file from our test data
with open("test_ASP.txt", "w") as fp:
    for sentence, label in zip(sentences_test, labels_test):  # ONLY test data
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')

# define columns
columns = {0: 'text', 1: 'ner'}

# data folder
data_folder = './'

# get the corpus from the test data,
# using the column format, the data folder and the name of the test file
corpus_test_ASP: Corpus = ColumnCorpus(data_folder, columns, test_file='test_ASP.txt')

# evaluation on the test part of the corpus
result_test_ASP, score = model_ASP.evaluate(corpus_test_ASP.test, mini_batch_size=1, out_path="predictions_test_ASP.txt")
print(result_test_ASP.detailed_results)
````

Results:

````
F-score (micro): 0.9267
F-score (macro): 0.3265
Accuracy (incl. no class): 0.9267

By class:
              precision    recall  f1-score   support

           O     0.9293    0.9969    0.9619     37953
         B_A     0.1259    0.0095    0.0176      1792
         I_A     1.0000    0.0000    0.0000      1103

    accuracy                         0.9267     40848
   macro avg     0.6851    0.3355    0.3265     40848
weighted avg     0.8960    0.9267    0.8945     40848
````
As shown, the model was not able to detect the B_A and I_A labels. On the other hand, using the same script after modifying only the stacked-embeddings part improves the quality of the classification significantly:
````
# 4. initialize embeddings
embedding_types = [
    # starting with flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    # now add bert
    TransformerWordEmbeddings('bert-base-uncased')
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
````

The new results:
Results:

````
F-score (micro): 0.9416
F-score (macro): 0.5512
Accuracy (incl. no class): 0.9416

By class:
              precision    recall  f1-score   support

           O     0.9526    0.9886    0.9703     37953
         B_A     0.6809    0.4821    0.5645      1792
         I_A     0.4010    0.0698    0.1189      1103

    accuracy                         0.9416     40848
   macro avg     0.6782    0.5135    0.5512     40848
weighted avg     0.9258    0.9416    0.9295     40848
````
Thank you in advance!

Hi @afi1289,

My guess would be that the behavior is due to the seed that you set:

````
# for ensuring training reproducibility
flair.set_seed(42)
````

The initialization determines which local minimum the training converges to, and changing the order of the embeddings effectively swaps the initialization of the weights. My guess would be that you will get different results for different seeds, so there is no reason to assume that one order is inherently better than the other.
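One way to test this would be to repeat the run with a few different seeds and compare the spread of the scores. A minimal sketch, reusing the names from your script above and assuming the corpus and embedding list are still in memory:

````
# re-train with several seeds to estimate the variance caused by weight initialization
for seed in [1, 7, 42]:
    flair.set_seed(seed)

    # re-create the tagger so that its BiLSTM weights are re-initialized under this seed
    tagger = SequenceTagger(hidden_size=256,
                            embeddings=StackedEmbeddings(embeddings=embedding_types),
                            tag_dictionary=tag_dictionary_ASP,
                            tag_type=tag_type,
                            rnn_type="LSTM",
                            rnn_layers=2,
                            use_crf=False)

    # train into a separate folder per seed, then compare the dev/test scores
    trainer = ModelTrainer(tagger, corpus_ASP)
    trainer.train(f'sequence-labeling/ASP-seed-{seed}',
                  learning_rate=0.1,
                  mini_batch_size=32,
                  max_epochs=150)
````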

That is strange. A small question: in the training corpus, the test split only has 3 documents, but the evaluation statistics printed at the end look as if they come from a much larger corpus?

@alanakbik That is right, because I am using a separate corpus for testing (corpus_test_ASP) to ensure that the model never sees the test data during the training process. So I have corpus_ASP for training and corpus_test_ASP for testing purposes.

Hello @afi1289, a small issue in your code (unrelated to the original issue): you call model_ASP.evaluate after training, but there is a bug in Flair such that after training, the model holds the final model weights, not the best model weights. So you need to explicitly call SequenceTagger.load('best-model.pt') to make sure that you evaluate the best model at the end.

This was just now fixed in the branch tars_tagger but it will take a while before it is merged into the main branch.
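So the evaluation at the end would look roughly like this (a sketch; the path assumes the base_path 'sequence-labeling/ASP' used in the training call above):

````
# load the checkpoint with the best validation score instead of the final weights
best_model = SequenceTagger.load('sequence-labeling/ASP/best-model.pt')

# evaluate the best model on the separate test corpus
result, score = best_model.evaluate(corpus_test_ASP.test,
                                    mini_batch_size=1,
                                    out_path="predictions_test_ASP.txt")
print(result.detailed_results)
````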

Oh, and another issue: your tags should be "B-A" and "I-A" - this way, the correct evaluation routine will trigger (i.e. the span-F1 routine). It will evaluate each "A" span as a whole instead of each component tag, and treat the evaluation of O correctly.
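A one-off conversion of the column files could look like this (a hypothetical helper, assuming the token-per-line "token tag" files written above):

````
# rewrite B_A/I_A as B-A/I-A in a "token tag" column file (hypothetical helper)
def fix_tags(path):
    with open(path) as fp:
        lines = fp.readlines()
    with open(path, "w") as fp:
        for line in lines:
            fp.write(line.replace(' B_A', ' B-A').replace(' I_A', ' I-A'))

for name in ["train_aspects.txt", "dev_aspects.txt", "test_aspects.txt", "test_ASP.txt"]:
    fix_tags(name)
````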

@afi1289 I am not able to reproduce this issue. If I permute the order of the embeddings, I get pretty much the same results on my data.

@alanakbik Sorry, I have a little question: when I call SequenceTagger.load('best-model.pt') to evaluate the best model at the end, does that mean I will evaluate the model that has the best accuracy on the validation set?
Thank you in advance!

@afi1289 The best model is selected using the score on the validation set. The score is accuracy if every word always gets a tag (as in POS tagging), and F1 if there is an "out" tag, as in NER, where not every word is an entity.
