Describe the bug
Following the tutorial at https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f works fine! But when I change the dataset to another CSV file with the same format, I get a ZeroDivisionError in trainer.train('./', max_epochs=10). It seems to be something in the evaluation:
--> 169 train_loss /= len(train_data)
To Reproduce
The full code example:
import pandas as pd
data = pd.read_csv("./genders.csv", sep='\t', encoding='latin-1').sample(frac=1)
data['label'] = '__label__' + data['label'].astype(str)
data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False)
data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False)
data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False)
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from pathlib import Path
corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'), test_file='test.csv', dev_file='dev.csv', train_file='train.csv')
word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]
document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)
trainer = ModelTrainer(classifier, corpus)
trainer.train('./', max_epochs=10)  # Crashes here!
Expected behavior
Same behavior as the spam.csv dataset in the tutorial.
Screenshots
https://imgur.com/a/UcDeIFL
Environment (please complete the following information):
Additional context
dataset
One problem is that the data format is not flair-compatible:
-Z of me, I hope you enjoyed reading about me. What's on your A-Z? __label__female
So the label is located at the end of the line, but it has to be in the first position :)
Just correct this with:
data.iloc[0:int(len(data)*0.8)].to_csv('train.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
data.iloc[int(len(data)*0.8):int(len(data)*0.9)].to_csv('test.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
data.iloc[int(len(data)*0.9):].to_csv('dev.csv', sep='\t', index = False, header = False, columns=['label', 'text'])
This reorders the columns :)
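For anyone checking their own files, here is a minimal sketch of the line layout described above: the `__label__`-prefixed label first, then a tab, then the text. The helper name `to_flair_line` is made up for illustration and is not part of flair; the tab separator simply follows the `sep='\t'` used in the snippets in this thread.

```python
def to_flair_line(label, text, prefix="__label__"):
    """Build one data line: prefixed label first, then a tab, then the text."""
    label = str(label)
    # Add the prefix only if it is not already there.
    if not label.startswith(prefix):
        label = prefix + label
    # Collapse tabs and newlines inside the text so each example stays on one line.
    clean_text = " ".join(str(text).split())
    return f"{label}\t{clean_text}"


# Hypothetical example row, loosely based on the sample line quoted above.
line = to_flair_line("female", "What's on your A-Z?")
print(line)
```

Writing the files with `columns=['label', 'text']` as shown above should produce the same ordering, assuming the label column already carries the prefix.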
@stefan-it Thanks a lot man :) !!
Hi there, I am having the same problem at the exact same line: "trainer.train".
I put the label first and the sentence after the tab for all datasets: train, test and dev.
from flair.data import Sentence, TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
"""Loading the (uploaded) Switchboard Dialog Act Corpus into Flair to make a label dictionary"""
# 1. get the corpus
corpus = NLPTaskDataFetcher.load_classification_corpus('/content/',train_file='sdac_train.csv',test_file='sdac_test.csv',dev_file='sdac_dev.csv')
label_dict = corpus.make_label_dictionary()
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
word_embeddings = [WordEmbeddings('glove')]
document_embeddings = DocumentRNNEmbeddings(word_embeddings,
hidden_size=512,
reproject_words=True,
reproject_words_dimension=256,
)
"""Initializing Text Classifier"""
from flair.models import TextClassifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=True)
"""Initializing Training"""
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/taggers/sdac',
learning_rate=0.1,
mini_batch_size=32,
anneal_factor=0.5,
patience=5,
max_epochs=10)
"""Plotting training curves"""
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/taggers/sdac/loss.tsv')
plotter.plot_weights('resources/taggers/sdac/weights.txt')
"""Predicting"""
classifier = TextClassifier.load_from_file('resources/taggers/sdac/final-model.pt')
# create example sentence
sentence = Sentence('France is the current world cup winner.')
# predict tags and print
classifier.predict(sentence)
print(sentence.labels)
Error:
```
2019-03-15 06:20:48,409 ----------------------------------------------------------------------------------------------------
2019-03-15 06:20:48,412 Evaluation method: MICRO_F1_SCORE
ZeroDivisionError Traceback (most recent call last)
5 anneal_factor=0.5,
6 patience=5,
----> 7 max_epochs=10)
/usr/local/lib/python3.6/dist-packages/flair/trainers/trainer.py in train(self, base_path, evaluation_metric, learning_rate, mini_batch_size, eval_mini_batch_size, max_epochs, anneal_factor, patience, anneal_against_train_loss, train_with_dev, monitor_train, embeddings_in_memory, checkpoint, save_final_model, anneal_with_restarts, test_mode, param_selection_mode, **kwargs)
167 weight_extractor.extract_weights(self.model.state_dict(), iteration)
168
--> 169 train_loss /= len(train_data)
170
171 self.model.eval()
ZeroDivisionError: division by zero
```
@Zoher15 this looks like the training data may not have been read correctly. Could you print the corpus statistics to see if it was loaded correctly, i.e.:
print(corpus)
print(corpus.obtain_statistics())
and paste the results here? Also, could you share a few lines of the training data file that you are reading in?
Hi @alanakbik, I figured it out after diving deep into your code: I needed to add "__label__" before every label. I don't know if this is already in the documentation, but a tutorial on the different corpus loading functions would be useful. :) Awesome work by you and your team!
Best,
Zoher
Ah great, glad it works! We'll try to clarify in the tutorial!
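A quick pre-flight check along these lines might catch the problem before training: count how many non-empty lines in a data file start with `__label__`. This is a stdlib-only sketch, and `count_labeled_lines` is a hypothetical helper, not a flair function. The assumption that a file with no such lines ends up as an empty training set (and so triggers the division by zero) is based only on the reports in this thread.

```python
import tempfile
from pathlib import Path


def count_labeled_lines(path, prefix="__label__"):
    """Return (labeled, total) counts over the non-empty lines of a data file."""
    labeled = 0
    total = 0
    with Path(path).open(encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            total += 1
            if line.startswith(prefix):
                labeled += 1
    return labeled, total


# Demo on two throwaway files: one with the label first, one without.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.csv"
    good.write_text(
        "__label__female\tsome text\n__label__male\tother text\n",
        encoding="utf-8",
    )
    bad = Path(tmp) / "bad.csv"
    bad.write_text(
        "some text\t__label__female\nfemale\tother text\n",
        encoding="utf-8",
    )
    good_counts = count_labeled_lines(good)
    bad_counts = count_labeled_lines(bad)

print("good:", good_counts)
print("bad:", bad_counts)
```

If the labeled count is zero or well below the total for `train.csv`, the file layout is worth fixing before calling `trainer.train`. Printing `corpus` and `corpus.obtain_statistics()`, as suggested above, remains the more direct check once the corpus is loaded.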
thanks @Zoher15 ....you saved my day!
I still don't have any clue about the part on adding "__label__". Can you guys help, please?
I am working with CONLL_2000 dataset using NLPTask and still getting this error.
The __label__ format is only for text classification. If you are using the CoNLL 2000 data then you are working on sequence labeling, right? Could you paste a code example that yields this error so we can reproduce?
I am trying to pass a data frame to the pre-trained Flair model. Is this possible? If so, would I need to convert my data frame into strings? Any advice would be appreciated, as I am struggling with it a bit.
Thanks @Alan Akbik