Flair: difference in feeding a dataset to text classifier via .txt and .tsv

Created on 3 Oct 2019 · 20 comments · Source: flairNLP/flair

I have followed the docs and trained a text classifier on TREC6. There the input files are "new_filenames = ["train.txt", "test.txt"]" and everything works just fine.

Since I wanted to do 10-fold cross-validation over all labeled questions, I played with the files and prepared proper .tsv files.

Using these .tsv files I succeeded in training a model based on GloVe word embeddings, but RoBERTaEmbeddings() didn't train properly. The error was:
embeddings.py", line 965, in _extract_embeddings
first_embedding: torch.FloatTensor = current_embeddings[0]
IndexError: index 0 is out of bounds for dimension 0 with size 0

After some time I figured out that the problem was with 6 sentences that contained "n't". After changing this to "not", the model based on RoBERTaEmbeddings() worked fine.

My question is: how is this possible? What is the difference between feeding the model a .txt file and a .tsv file, such that "n't" failed with RoBERTaEmbeddings() while causing no problems with GloVe (and also Flair) embeddings? I have used numerous ' characters in other questions without any problem.

p.s.

The example sentence causing training to fail with the above error was:
"LOC:country What two South American countries do n't border Brazil ?"
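For illustration only (this is not Flair's internal loader): the TREC-style line above carries its label as a prefix ("LOC:country") followed by an already whitespace-tokenized question, so the clitic "n't" arrives as a standalone token, which a subword tokenizer must map onto at least one subtoken for the embedding alignment to succeed:

```python
# Parse the problematic line: label prefix, then whitespace-tokenized text.
line = "LOC:country What two South American countries do n't border Brazil ?"

label, text = line.split(" ", 1)
tokens = text.split(" ")  # whitespace pre-tokenization, as in the dataset

assert label == "LOC:country"
assert "n't" in tokens  # "n't" is its own token, detached from "do"
```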


Most helpful comment

@myeghaneh the latter, i.e. I first prepared the data for each fold, and then trained a model on that data with Flair for each fold separately.

All 20 comments

Hello @krzysztoffsiuwa, that's difficult to say - could you try to isolate the error in a minimal code example, for instance loading a single Sentence that causes the error?

Thanks for the quick response.

Generally speaking, this issue is not blocking my work, and if it appears again with a different dataset, I'll start by changing "n't" to "not" or change my code to prepare .txt files. However, if you wish to investigate:

Please find attached the train.tsv file with a single example sentence causing the problem, and train_.tsv with "n't" replaced by "not", which works fine (zipped).
train.zip

I feed the corpus to the model with:

corpus: flair.data.Corpus = flair.datasets.ClassificationCorpus(Path(os.path.join(path2[i])),
                                                                test_file='test_.tsv',
                                                                dev_file='dev.tsv',
                                                                train_file='train.tsv')

word_embeddings = [
RoBERTaEmbeddings()
]

document_embeddings = DocumentRNNEmbeddings(word_embeddings,
                                            hidden_size=256,
                                            reproject_words=True,
                                            reproject_words_dimension=256,
                                            rnn_type="gru",
                                            bidirectional=False,
                                            rnn_layers=1)

classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)

trainer = ModelTrainer(classifier, corpus)
trainer.train(base_path="{}".format(path2[i]),
              max_epochs=epochs,
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=20,
              embeddings_storage_mode='gpu',
              shuffle=False,
              )

I'm working on Ubuntu 18.04 LTS, Python 3, flair 0.4.3.

I am having similar issues with RoBERTa. On a dataset where other embeddings work, this one can still fail. Still investigating the exact cause in my case.

I reproduced this problem with the following code.
As @sanja7s said, it seems there is something wrong with RoBERTa.
(macOS 10.14, Python3.7, flair MASTER)

from pathlib import Path
from flair.training_utils import EvaluationMetric
from flair.trainers import ModelTrainer
from flair.models import SequenceTagger
from flair.data import Corpus
from flair.datasets import WNUT_17
from flair.embeddings import TokenEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import RoBERTaEmbeddings
from typing import List

# 1. get the corpus
corpus: Corpus = WNUT_17().downsample(0.1)

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    RoBERTaEmbeddings(),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

# 6. initialize trainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)

# 8. stop training at any point

# 9. continue trainer at later point

checkpoint = tagger.load_checkpoint(
    Path('resources/taggers/example-ner/checkpoint.pt'))
trainer = ModelTrainer.load_from_checkpoint(checkpoint, corpus)
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)

I'm facing the same issue with RoBERTa and XLNet, even with the .txt files.

Yes, I can reproduce the error with the above code. @stefan-it - perhaps something changed with the new transformers version? RoBERTa was working for me before.

@krzysztoffsiuwa Sorry, I totally missed that issue here!

There's one problem in the code where I added a special case for GPT2 and RoBERTa:

if "roberta" in name or "gpt2" in name:

This worked some time ago, but with https://github.com/zalandoresearch/flair/commit/e1393f65188fe86e687b68ddd8441cea04cc893b#diff-9b198b0f5b0c7ad771c9239623c9ad8c an integer prefix was added, so the model name is no longer roberta-base; it is 0-roberta-base now.
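Purely to illustrate how such a prefix can break name-based dispatch (the exact check in Flair at the time may have differed from this sketch): a loose substring test still matches the prefixed name, but any exact or prefix comparison silently stops matching:

```python
old_name = "roberta-base"
new_name = "0-roberta-base"  # integer prefix added by the linked commit

# A substring test matches both spellings...
assert "roberta" in old_name and "roberta" in new_name

# ...but exact or prefix comparisons only match the old one.
assert old_name == "roberta-base" and new_name != "roberta-base"
assert old_name.startswith("roberta")
assert not new_name.startswith("roberta")
```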

I'll add a fix for that soon!

PR is ready.

I trained a model on the complete WNUT-17 dataset with RoBERTa:

MICRO_AVG: acc 0.3576 - f1-score 0.5269
MACRO_AVG: acc 0.2889 - f1-score 0.4315833333333334
corporation   tp: 22 - fp: 27 - fn: 44 - tn: 22 - precision: 0.4490 - recall: 0.3333 - accuracy: 0.2366 - f1-score: 0.3826
creative-work tp: 33 - fp: 19 - fn: 109 - tn: 33 - precision: 0.6346 - recall: 0.2324 - accuracy: 0.2050 - f1-score: 0.3402
group         tp: 42 - fp: 23 - fn: 123 - tn: 42 - precision: 0.6462 - recall: 0.2545 - accuracy: 0.2234 - f1-score: 0.3652
location      tp: 82 - fp: 35 - fn: 68 - tn: 82 - precision: 0.7009 - recall: 0.5467 - accuracy: 0.4432 - f1-score: 0.6143
person        tp: 270 - fp: 108 - fn: 159 - tn: 270 - precision: 0.7143 - recall: 0.6294 - accuracy: 0.5028 - f1-score: 0.6692
product       tp: 17 - fp: 12 - fn: 110 - tn: 17 - precision: 0.5862 - recall: 0.1339 - accuracy: 0.1223 - f1-score: 0.2180

@stefan-it thanks for fixing this!

:+1:

:+1:

Thanks a lot, it seems it also helped for me with XLNet on another dataset.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Was fixed a while back

@krzysztoffiok can you let me know how you managed to do cross-validation? Did you do it inside the Flair pipeline, or with an external approach?

@myeghaneh the latter, i.e. I first prepared the data for each fold, and then trained a model on that data with Flair for each fold separately.

Thank you for your quick response. I see: you choose 9 folds for training and 1 for testing, then train; then you choose another fold as the test set and train again, so you train 10 times, correct? Then you either average or aggregate the results? Can you share the part of your code where you choose the folds? @krzysztoffiok

@myeghaneh I think here I was using https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

Nothing special, a standard procedure.
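A minimal sketch of that external fold preparation, in pure Python mirroring what sklearn.model_selection.KFold does with the indices (the variable names and the 5-fold toy data here are illustrative, not the commenter's actual code):

```python
def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs over n examples, like sklearn's KFold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Toy data: 25 labeled lines split into 5 folds; each fold's train/test
# subsets would then be written out as the per-fold corpus files for Flair.
lines = [f"question {i}" for i in range(25)]
folds = list(kfold_indices(len(lines), k=5))

assert len(folds) == 5
# every example lands in exactly one test fold
all_test = sorted(i for _, test in folds for i in test)
assert all_test == list(range(25))
```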

Thank you, I see. Since a single training run took a long time for me... may I ask how long the whole cross-validation took?

It usually takes a lot of time, that's for sure. You have to try on your computing machine/dataset/model/parameters to see and decide whether you can afford it.
