Hello,
I have noticed that NER model performance depends on the order of the stacked embeddings. That is, the model performs far better when the Flair embeddings are concatenated before BERT, as follows:
````
embedding_types = [
    # starting with Flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    # now add BERT
    TransformerWordEmbeddings('bert-base-uncased')
]
````
On the other hand, when BERT is stacked before Flair, the model was not able to detect the labels at all:
````
embedding_types = [
    # starting with BERT
    TransformerWordEmbeddings('bert-base-uncased'),
    # now add Flair
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward')
]
````
My question is: how can the order of the embeddings affect the quality of the classification?
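For context, my understanding is that StackedEmbeddings simply concatenates the per-token vectors of each embedding in the order they are listed. A toy illustration (plain Python, not Flair code; the vectors and dimensions are made up):

```python
# Hypothetical 2-dim vectors standing in for the real embeddings.
flair_fwd = [0.1, 0.2]   # Flair forward embedding of one token
flair_bwd = [0.3, 0.4]   # Flair backward embedding of the same token
bert      = [0.5, 0.6]   # BERT embedding of the same token

# Stacking concatenates in list order, so the two orders produce
# the same values in a different arrangement.
flair_first = flair_fwd + flair_bwd + bert
bert_first  = bert + flair_fwd + flair_bwd

assert flair_first == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
assert bert_first  == [0.5, 0.6, 0.1, 0.2, 0.3, 0.4]

# Both vectors carry exactly the same information, only permuted;
# a linear layer on top can absorb any fixed permutation, which is
# why the order should not matter in principle.
assert sorted(flair_first) == sorted(bert_first)
```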
Strange, there should be no difference at all. Could you share your training script?
@alanakbik Thank you for your reply.
The dataset for training and testing has the following structure:
````
great O
music B_A
, O
long O
story B_A
with O
lots O
of O
colorful O
character B_A
details I_A
and O
twist B_A
at O
the O
end O
. O
````
I am using a Kaggle notebook to train a BiLSTM-based model for aspect-term extraction:
````
import flair
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.data import MultiCorpus # use multiple corpus to train the model
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.embeddings import WordEmbeddings, CharacterEmbeddings, StackedEmbeddings, \
    FlairEmbeddings, ELMoEmbeddings, \
    TransformerWordEmbeddings, PooledFlairEmbeddings
import torch
flair.set_seed(42)
import sklearn
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(sentences_train, labels_train, test_size = 0.001, random_state = 42)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size = 0.08, random_state = 42)
with open("train_aspects.txt", "w") as fp:
    for sentence, label in zip(x_train, y_train):
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')
with open("test_aspects.txt", "w") as fp:
    for sentence, label in zip(x_test, y_test):
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')
with open("dev_aspects.txt", "w") as fp:
    for sentence, label in zip(x_dev, y_dev):
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')
# define columns
columns = {0: 'text', 1:'ner'}
data_folder = './'
corpus_ASP: Corpus = ColumnCorpus(data_folder, columns,
                                  train_file='train_aspects.txt',
                                  test_file='test_aspects.txt',
                                  dev_file='dev_aspects.txt')
stats = corpus_ASP.obtain_statistics()
print(stats)
````
The corpus statistics:
````
{
  "TRAIN": {
    "dataset": "TRAIN",
    "total_number_of_documents": 2221,
    "number_of_documents_per_class": {},
    "number_of_tokens_per_tag": {},
    "number_of_tokens": {
      "total": 37740,
      "min": 1,
      "max": 162,
      "avg": 16.9923457901846
    }
  },
  "TEST": {
    "dataset": "TEST",
    "total_number_of_documents": 3,
    "number_of_documents_per_class": {},
    "number_of_tokens_per_tag": {},
    "number_of_tokens": {
      "total": 29,
      "min": 5,
      "max": 13,
      "avg": 9.666666666666666
    }
  },
  "DEV": {
    "dataset": "DEV",
    "total_number_of_documents": 194,
    "number_of_documents_per_class": {},
    "number_of_tokens_per_tag": {},
    "number_of_tokens": {
      "total": 3563,
      "min": 1,
      "max": 95,
      "avg": 18.3659793814433
    }
  }
}
````
````
tag_type = 'ner'
tag_dictionary_ASP = corpus_ASP.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary_ASP)
````
The tag dictionary:
````
Dictionary with 6 tags:
<unk>, O, B_A, I_A, <START>, <STOP>
````
````
embedding_types = [
    TransformerWordEmbeddings('bert-base-uncased'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward')
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
tagger_ASP: SequenceTagger = SequenceTagger(hidden_size=256,
                                            embeddings=embeddings,
                                            tag_dictionary=tag_dictionary_ASP,
                                            tag_type=tag_type,
                                            rnn_type="LSTM",
                                            rnn_layers=2,
                                            use_crf=False)
trainer_ASP: ModelTrainer = ModelTrainer(tagger_ASP, corpus_ASP)
trainer_ASP.train('sequence-labeling/ASP',
                  learning_rate=0.1,
                  mini_batch_size=32,
                  max_epochs=150,
                  train_with_dev=False,
                  train_with_test=True,  # the 3 sentences defined as test data in the corpus are also used for training
                  patience=3,
                  embeddings_storage_mode='gpu')
with open("test_ASP.txt", "w") as fp:
    for sentence, label in zip(sentences_test, labels_test):  # ONLY test data
        for i, j in zip(sentence, label):
            fp.write(f'{i} {j}\n')
        fp.write('\n')
columns = {0: 'text', 1:'ner'}
data_folder = './'
corpus_test_ASP: Corpus = ColumnCorpus(data_folder, columns, test_file = 'test_ASP.txt')
result_test_ASP, score = model_ASP.evaluate(corpus_test_ASP.test, mini_batch_size=1, out_path=f"predictions_test_ASP.txt")
print(result_test_ASP.detailed_results)
````
````
Results:
By class:
              precision    recall  f1-score   support
O                0.9293    0.9969    0.9619     37953
B_A              0.1259    0.0095    0.0176      1792
I_A              1.0000    0.0000    0.0000      1103
accuracy                             0.9267     40848
macro avg        0.6851    0.3355    0.3265     40848
weighted avg     0.8960    0.9267    0.8945     40848
````
As shown, the model was not able to detect the B_A and I_A labels.
On the other hand, using the same script after modifying only the stacked-embeddings part improves the classification quality significantly:
````
embedding_types = [
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    TransformerWordEmbeddings('bert-base-uncased')
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
````
The new results:
````
Results:
By class:
              precision    recall  f1-score   support
O                0.9526    0.9886    0.9703     37953
B_A              0.6809    0.4821    0.5645      1792
I_A              0.4010    0.0698    0.1189      1103
accuracy                             0.9416     40848
macro avg        0.6782    0.5135    0.5512     40848
weighted avg     0.9258    0.9416    0.9295     40848
````
Thank you in advance!
Hi @afi1289,
My guess would be that the behavior is due to the seed that you set:
````
# for ensuring training reproducibility
flair.set_seed(42)
````
The initialization determines which local minimum the training converges to, and changing the embedding order basically swaps the initialization of the weights. My guess is that you would get different results with different seeds, so there is no reason to assume that one order is better than the other.
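The seed effect can be illustrated with a toy sketch (plain Python, not Flair; `init_weights` is a hypothetical stand-in for the model's weight initialization):

```python
import random

def init_weights(seed, n=4):
    # Draw n small random weights from a seeded generator, mimicking
    # how a fixed seed pins down the model's starting point.
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

w42 = init_weights(42)
w43 = init_weights(43)

assert init_weights(42) == w42   # same seed -> identical initialization
assert w42 != w43                # different seed -> different initialization
```

With a fixed seed, each embedding order corresponds to one particular initialization; comparing the two orders under a single seed therefore compares two arbitrary starting points rather than the orders themselves.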
That is strange. A small question: in the training corpus, the test split has only 3 documents, but the evaluation statistics printed at the end look like they come from a much larger corpus?
@alanakbik That is right, because I am using a separate corpus for testing (corpus_test_ASP) to ensure that the model will not see the test data during the training process. So I have corpus_ASP for training and corpus_test_ASP for testing purposes.
Hello @afi1289, a small issue in your code (unrelated to the original issue): you call model_ASP.evaluate after training, but due to a bug in Flair the model holds the final weights after training, not the best weights. So you need to explicitly call SequenceTagger.load('best-model.pt') in order to make sure that you evaluate the best model at the end.
This was just fixed in the tars_tagger branch, but it will take a while before it is merged into the main branch.
Oh, and another issue: your tags should be "B-A" and "I-A". This way, the correct evaluation routine (the span-F1 routine) will trigger: it evaluates each "A" span as a whole instead of each component tag, and treats the evaluation of O correctly.
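A minimal sketch of the tag rename on the column-format data (`fix_tag` is a hypothetical helper, not part of Flair):

```python
# Rewrite the underscore tags to the BIO scheme that triggers
# Flair's span-F1 evaluation: B-A / I-A instead of B_A / I_A.
def fix_tag(tag: str) -> str:
    return {'B_A': 'B-A', 'I_A': 'I-A'}.get(tag, tag)

# Example line from the corpus file: "<token> <tag>".
line = "music B_A"
token, tag = line.split()
assert f"{token} {fix_tag(tag)}" == "music B-A"
assert fix_tag('O') == 'O'   # the O tag is left unchanged
```

The same mapping would be applied once to each line of the train, dev, and test files before building the ColumnCorpus.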
@afi1289 I am not able to reproduce this issue. If I permute the order of embeddings I get pretty much the same results on my data.
@alanakbik Sorry, I have one small question: when I call SequenceTagger.load('best-model.pt') to evaluate the best model at the end, does that mean I evaluate the model that had the best score on the validation set?
Thank you in advance!
@afi1289 The best model is selected using the score on the validation set. The score is accuracy if each word always gets a tag (like in POS tagging) and F1 if there is an "out" tag, such as in NER where not each word is an entity.
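A toy sketch of why the two metrics differ when an "O" tag exists (made-up four-token example, not Flair's actual evaluation code):

```python
# Gold vs. predicted tag sequences for one sentence.
gold = ['O', 'B-A', 'I-A', 'O']
pred = ['O', 'O',   'O',   'O']   # the model predicts no entity at all

# Token accuracy looks decent because most tokens are O anyway.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
assert accuracy == 0.5

# Entity-level recall is computed over spans: the single gold "A"
# span covers token indices 1-2, and none was predicted.
gold_entities = {(1, 2, 'A')}
pred_entities = set()
recall = len(gold_entities & pred_entities) / len(gold_entities)
assert recall == 0.0
```

This is why F1 over entity spans is the more meaningful selection criterion for tasks like NER or aspect-term extraction, where most tokens carry the "out" tag.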