Flair: Language modelling train set directory

Created on 10 Apr 2019 · 5 Comments · Source: flairNLP/flair

Language modelling
When I try to train a language model with the training set split across several files, I get the following error:

Sinas-MacBook-Pro:finnlp sina$ python3 train_LM_flair.py 
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2019-04-10 19:27:29,037 read text file with 2 lines
2019-04-10 19:27:29,176 read text file with 3 lines
Traceback (most recent call last):
  File "train_LM_flair.py", line 34, in <module>
    trainer.train('language_model', mini_batch_size=10, sequence_length=10, max_epochs=10)
  File "/usr/local/lib/python3.7/site-packages/flair/trainers/language_model_trainer.py", line 256, in train
    for curr_split, train_slice in enumerate(training_generator, self.split):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
2019-04-10 19:27:29,308 read text file with 1 lines
    raise batch.exc_type(batch.exc_msg)
TypeError: function takes exactly 5 arguments (1 given)

Following the traceback, I found that the error originates in this part of the language_model_trainer.py script:

self.train = TextDataset(path / 'train', dictionary, False, self.forward, self.split_on_char,
                         self.random_case_flip, shuffle_lines=self.shuffle_lines)

To Reproduce
Steps to reproduce the behavior:

from pathlib import Path

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# are you training a forward or backward LM?
is_forward_lm = True

# load the default character dictionary
dictionary: Dictionary = Dictionary.load('chars')

# get your corpus, process forward and at the character level
corpus = TextCorpus(Path('corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=128,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10)

Directory structure
This is how my training, validation and testing data sets are organized in my working directory:

Sinas-MacBook-Pro:corpus sina$ tree
.
├── test.txt
├── train
│   ├── train_split_1.txt
│   ├── train_split_2.txt
│   ├── train_split_3.txt
│   ├── train_split_4.txt
│   └── train_split_5.txt
└── valid.txt

Environment (please complete the following information):

OS: macOS Mojave
Latest version (as of April 10, 2019)

bug

sinaahmadi

Most helpful comment

Well, it turned out to be a silly problem with macOS!
The error was caused by the hidden .DS_Store file, in which macOS stores directory details. Restricting the accepted file extensions (for example, to .txt) may be a good idea for future development.
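
As a rough illustration of the check suggested above, the following sketch lists files in the training directory that are hidden or do not end in .txt. The helper name find_unexpected_files and its allowed_suffix parameter are hypothetical and not part of flair; the corpus/train path is taken from the directory structure in the issue.

```python
from pathlib import Path


def find_unexpected_files(train_dir, allowed_suffix=".txt"):
    """Return files in train_dir that are hidden or lack the allowed suffix.

    Hypothetical helper for illustration; flair itself does not provide it.
    """
    train_dir = Path(train_dir)
    return sorted(
        p for p in train_dir.iterdir()
        if p.is_file() and (p.name.startswith(".") or p.suffix != allowed_suffix)
    )


# Only run the check if the directory from the issue exists locally.
if Path("corpus/train").is_dir():
    for path in find_unexpected_files("corpus/train"):
        print(f"unexpected file: {path}")
```

Running this before training would flag a stray .DS_Store so it can be deleted by hand. The sketch only reports files and removes nothing.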

sinaahmadi on 16 Apr 2019

👍2

All 5 comments

Hi @sinaahmadi,

is your training script located at the same (file system) level as test.txt and valid.txt? If so, you have to adjust the following call:

from

TextCorpus(Path('corpus')

to just:

TextCorpus(Path('.')

Normally, I would use the following folder structure:

-> corpus/train/*splits
-> corpus/test.txt
-> corpus/valid.txt
-> train.py
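
The layout above could be verified before training with a small sketch like the one below. The function check_corpus_layout is a hypothetical helper, not a flair API; it only assumes the train/ directory with .txt splits plus valid.txt and test.txt as listed.

```python
from pathlib import Path


def check_corpus_layout(corpus_dir):
    """Return a list of problems found in the corpus folder (empty if none).

    Hypothetical helper; it checks only the layout described in this thread.
    """
    corpus_dir = Path(corpus_dir)
    problems = []

    train_dir = corpus_dir / "train"
    if not train_dir.is_dir():
        problems.append(f"missing directory: {train_dir}")
    elif not any(train_dir.glob("*.txt")):
        problems.append(f"no .txt splits in: {train_dir}")

    for name in ("valid.txt", "test.txt"):
        if not (corpus_dir / name).is_file():
            problems.append(f"missing file: {corpus_dir / name}")

    return problems


# 'corpus' is the folder name used in the issue; adjust it to your own path.
for problem in check_corpus_layout("corpus"):
    print(problem)
```
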

I hope that helps :)

stefan-it on 13 Apr 2019

Hello @sinaahmadi - were you able to fix the problem?

alanakbik on 16 Apr 2019

sinaahmadi on 16 Apr 2019

👍2

Ah ok - then I'll close the issue for now!

alanakbik on 16 Apr 2019

👍1

How did you find the problem? Really cool. I think this problem is really hard to find, @sinaahmadi.
I would really like to learn some insights about the debugging.
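
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the data loader re-raises the worker's exception with `raise batch.exc_type(batch.exc_msg)`, and if the original exception was a UnicodeDecodeError (for example from reading a binary file such as .DS_Store as UTF-8 text), rebuilding it from a single message argument fails with a TypeError, which hides the real cause. The sketch below reproduces that effect in plain Python; reraise_from_worker is a hypothetical stand-in for the data loader line, and the bytes are dummy data, not real .DS_Store content.

```python
def reraise_from_worker(exc_type, exc_msg):
    """Mimic `raise batch.exc_type(batch.exc_msg)` and report what is raised.

    Hypothetical stand-in for the data loader line shown in the traceback.
    """
    try:
        raise exc_type(exc_msg)
    except Exception as err:
        return type(err).__name__, str(err)


# Decoding bytes that are not valid UTF-8 fails with UnicodeDecodeError.
try:
    b"\xff\xfe".decode("utf-8")  # dummy bytes, not the real file content
except UnicodeDecodeError as err:
    original = err

# UnicodeDecodeError needs five constructor arguments, so rebuilding it
# from its message alone raises a TypeError instead of the original error.
name, message = reraise_from_worker(type(original), str(original))
print(name, "-", message)
```

If this reading is right, the confusing TypeError is a hint to look at what the workers are actually reading, for instance by listing the training directory with hidden files included (`ls -a corpus/train`).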