Language modelling
When I try to train a language model on a training set that is split into several files, I get the following error:
Sinas-MacBook-Pro:finnlp sina$ python3 train_LM_flair.py
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
2019-04-10 19:27:29,037 read text file with 2 lines
2019-04-10 19:27:29,176 read text file with 3 lines
Traceback (most recent call last):
  File "train_LM_flair.py", line 34, in <module>
    trainer.train('language_model', mini_batch_size=10, sequence_length=10, max_epochs=10)
  File "/usr/local/lib/python3.7/site-packages/flair/trainers/language_model_trainer.py", line 256, in train
    for curr_split, train_slice in enumerate(training_generator, self.split):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
2019-04-10 19:27:29,308 read text file with 1 lines
    raise batch.exc_type(batch.exc_msg)
TypeError: function takes exactly 5 arguments (1 given)
Following the errors, I found that the problem comes from this part of the language_model_trainer.py script:
self.train = TextDataset(path / 'train', dictionary, False, self.forward, self.split_on_char,
                         self.random_case_flip, shuffle_lines=self.shuffle_lines)
To Reproduce
Steps to reproduce the behavior:
from pathlib import Path
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
# are you training a forward or backward LM?
is_forward_lm = True
# load the default character dictionary
dictionary: Dictionary = Dictionary.load('chars')
# get your corpus, process forward and at the character level
corpus = TextCorpus(Path('corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)
# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=128,
                               nlayers=1)
# train your language model
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10)
Directory structure
This is how my training, validation and testing data sets are organized in my working directory:
Sinas-MacBook-Pro:corpus sina$ tree
.
├── test.txt
├── train
│   ├── train_split_1.txt
│   ├── train_split_2.txt
│   ├── train_split_3.txt
│   ├── train_split_4.txt
│   └── train_split_5.txt
└── valid.txt
Hi @sinaahmadi,
is your training script located at the same (file system) level as test.txt and valid.txt? If so, you have to change the following call:
from
TextCorpus(Path('corpus')
to just:
TextCorpus(Path('.')
Normally, I would use the following folder structure:
-> corpus/train/*splits
-> corpus/test.txt
-> corpus/valid.txt
-> train.py
I hope that helps :)
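The folder structure above can be checked before training with a small pre-flight script. This is only a sketch: `check_corpus_layout` is a hypothetical helper, not part of Flair, and it assumes nothing beyond the layout described in this thread (a `train` directory plus `valid.txt` and `test.txt` under the corpus folder).

```python
from pathlib import Path


def check_corpus_layout(corpus_dir):
    """Return a list of problems with the expected corpus layout.

    Hypothetical helper (not a Flair function). It only checks for the
    layout discussed in this thread: train/, valid.txt and test.txt.
    """
    root = Path(corpus_dir)
    problems = []
    if not (root / "train").is_dir():
        problems.append("missing directory: train")
    for name in ("valid.txt", "test.txt"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    return problems


if __name__ == "__main__":
    # Use the same path that is passed to TextCorpus(Path(...)).
    for problem in check_corpus_layout("corpus"):
        print(problem)
```

If the script prints nothing, the path handed to TextCorpus points at a folder with the expected entries.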
Hello @sinaahmadi - were you able to fix the problem?
Well, it was a silly problem with the Mac!
The error was caused by the hidden .DS_Store file, in which macOS saves details about the directory. Restricting the training files to certain extensions, such as .txt, may be a good idea for future development.
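Until such a restriction exists, stray files like .DS_Store can be spotted by hand before training. The following is a minimal sketch, assuming the train folder should contain only plain .txt splits; `find_stray_files` is a hypothetical helper, not something Flair provides.

```python
from pathlib import Path


def find_stray_files(train_dir, allowed_suffix=".txt"):
    """Return sorted names of entries that do not look like training splits.

    Hypothetical helper (not a Flair function). An entry is flagged if it is
    hidden (name starts with a dot), is not a regular file, or does not end
    with the allowed suffix.
    """
    stray = []
    for entry in Path(train_dir).iterdir():
        is_split = (
            entry.is_file()
            and not entry.name.startswith(".")
            and entry.suffix == allowed_suffix
        )
        if not is_split:
            stray.append(entry.name)
    return sorted(stray)


if __name__ == "__main__":
    # The path is an assumption based on the layout shown in this thread.
    for name in find_stray_files("corpus/train"):
        print("unexpected entry in train folder:", name)
```

Anything the script reports can then be deleted or moved out of the train folder before starting the trainer.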
Ah ok - then I'll close the issue for now!
How did you find the problem? Really cool. I think this problem is really hard to find, @sinaahmadi.
I would really like to learn some insights about the debugging.