Flair: trainer.train yields ZeroDivisionError

Created on 12 Feb 2019 · 11Comments · Source: flairNLP/flair

Describe the bug
Following along the tutorial @ https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f works fine! But when changing dataset to another csv-file with same format I get ZeroDivisionError in trainer.train('./', max_epochs=10) seem to be something in the evaluation

--> 169 train_loss /= len(train_data)

To Reproduce
The full code example:

import pandas as pd
data = pd.read_csv("./genders.csv", sep='\t', encoding='latin-1').sample(frac=1)
data['label'] = '__label__' + data['label'].astype(str)
data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False)
data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False)
data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False);

from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from pathlib import Path



md5-bee5a00b9be18ea174477cc6411e1ff2



corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'), test_file='test.csv', dev_file='dev.csv', train_file='train.csv')
word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]
document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)



md5-bee5a00b9be18ea174477cc6411e1ff2



trainer = ModelTrainer(classifier, corpus)
trainer.train('./', max_epochs=10) """Crashes here!"""

Expected behavior
Same behavior as the spam.csv dataset in the tutorial.

Screenshots
https://imgur.com/a/UcDeIFL

Environment (please complete the following information):

Ubuntu 18.04
latest version of flair (2019/02/12):

Additional context
dataset

bug

Source

timpal0l

Most helpful comment

Just correct this with:

data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False, columns=['label', 'text'])

This reorders the columns :)

stefan-it on 12 Feb 2019

👍2

All 11 comments

One problem is, that the data format is not flair compatible:

-Z of me, I hope you enjoyed reading about me. What's on your A-Z?  __label__female

So, the label is located at the end of a line. but it has to be on the first position :)

stefan-it on 12 Feb 2019

Just correct this with:

data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False, columns=['label', 'text'])

This reorders the columns :)

stefan-it on 12 Feb 2019

👍2

@stefan-it Thanks alot man :) !!

timpal0l on 12 Feb 2019

Hi there I am having the same problem at the exact same line: "trainer.train"
I put the label first and sentence after the tab for all datasets: train,test and dev.

from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

"""Loading the (uploaded) Switchboard Dialog Act Corpus into Flair to make a label dictionary"""

from flair.data import TaggedCorpus
from flair.data_fetcher import  NLPTaskDataFetcher, NLPTask
# 1. get the corpus
corpus = NLPTaskDataFetcher.load_classification_corpus('/content/',train_file='sdac_train.csv',test_file='sdac_test.csv',dev_file='sdac_dev.csv')
label_dict = corpus.make_label_dictionary()

from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
word_embeddings = [WordEmbeddings('glove')]
document_embeddings = DocumentRNNEmbeddings(word_embeddings,
                                                                     hidden_size=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=256,
                                                                     )

"""Initializing Text Classifier"""

from flair.models import TextClassifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=True)

"""Initializing Training"""

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/taggers/sdac',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=10)

"""Plotting training curves"""

from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/sdac/loss.tsv')
plotter.plot_weights('resources/taggers/sdac/weights.txt')

"""Predicting"""

classifier = TextClassifier.load_from_file('resources/taggers/sdac/final-model.pt')

# create example sentence
sentence = Sentence('France is the current world cup winner.')

# predict tags and print
classifier.predict(sentence)

print(sentence.labels)

Error:
```2019-03-15 06:20:48,409 ----------------------------------------------------------------------------------------------------
2019-03-15 06:20:48,412 Evaluation method: MICRO_F1_SCORE

2019-03-15 06:20:48,416 ----------------------------------------------------------------------------------------------------

ZeroDivisionError Traceback (most recent call last)
in ()
5 anneal_factor=0.5,
6 patience=5,
----> 7 max_epochs=10)

/usr/local/lib/python3.6/dist-packages/flair/trainers/trainer.py in train(self, base_path, evaluation_metric, learning_rate, mini_batch_size, eval_mini_batch_size, max_epochs, anneal_factor, patience, anneal_against_train_loss, train_with_dev, monitor_train, embeddings_in_memory, checkpoint, save_final_model, anneal_with_restarts, test_mode, param_selection_mode, **kwargs)
167 weight_extractor.extract_weights(self.model.state_dict(), iteration)
168
--> 169 train_loss /= len(train_data)
170
171 self.model.eval()

ZeroDivisionError: division by zero```

Zoher15 on 15 Mar 2019

@Zoher15 this looks like the training data may not have been read correctly. Could you print the corpus statistics to see if it was loaded correctly, i.e.:

print(corpus)
print(corpus.obtain_statistics())

and paste the results here? Also, could you share a few lines of the training data file that you are reading in?

alanakbik on 18 Mar 2019

Hi @alanakbik I figured it out. After diving deep into your code. I found that I needed to add __"____label____"__ before every label. I don't know if this is already in the documentation, but a useful tutorial will be: the different corpus loading functions. :) awesome work by you and your team!

Best,
Zoher

Zoher15 on 18 Mar 2019

👍1

Ah great, glad it works! We'll try to clarify in the tutorial!

alanakbik on 18 Mar 2019

thanks @Zoher15 ....you saved my day!

gopalkalpande on 2 May 2019

👍1

Still don't have any clue about the adding "label" part. Can you guys help please?
I am working with CONLL_2000 dataset using NLPTask and still getting this error.

saklain31 on 9 May 2019

The __label__ format is only for text classification. If you are using the CoNLL 2000 data then you are working on sequence labeling, right? Could you paste a code example that yields this error so we can reproduce?

alanakbik on 9 May 2019

I am trying to pass a data frame in the pre-trained Flair model. Is this possible? If so would I need to convert my data frame into string?? Please some advice will be appreciated as I am struggling with it a little bit.
Thanks @Alan Akbik