transformers version: 3.4.0
@sgugger @patrickvonplaten
Model I am using: T5
The problem arises when using my own modified scripts:
I load my dataset this way:
def tokenize(batch):
    tokenized_input = tokenizer(batch[text_column], padding=True, truncation=True, max_length=153)
    tokenized_label = tokenizer(batch[generated_column], padding=True, truncation=True, max_length=274)
    tokenized_input['labels'] = tokenized_label['input_ids']
    return tokenized_input
dataset = load_dataset('csv', data_files=dataset_file, split='train')
dataset = dataset.train_test_split(test_size=0.05, seed=SEED)
train_dataset = dataset['train']
val_dataset = dataset['test']
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
train_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
And then I use Trainer to train my T5 model like this:
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_accumulation_steps=1,
    learning_rate=0.001,
    evaluation_strategy='steps',
    save_steps=1000000,
    save_total_limit=1,
    remove_unused_columns=True,
    run_name=now,
    logging_steps=100,
    eval_steps=100,
    logging_first_step=True
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
trainer.train()
The task I am working on is my own task/dataset:
I am using a custom dataset for machine translation, about 12MB in size with 18,000 examples. The maximum sequence lengths are 153 tokens for the input and 274 tokens for the output. I have also added 68 special tokens, as the dataset contains many symbols.
Steps to reproduce the behavior:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 281882432 bytes. Error code 12 (Cannot allocate memory)
The machine I am using has 60GB of RAM. The RAM used by evaluation should be freed after every step; instead, it looks like something accumulates during evaluation and the RAM is never freed. I get the same behavior if I skip training and only run evaluation: after many evaluation steps the RAM blows up.
Additional info:
As a workaround, I am now using a smaller validation set, but it is not ideal. If the memory issue can't be solved, a better solution could be an option to evaluate on a random subset of the validation set during training, for example as sketched below.
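A minimal sketch of that workaround with the datasets API, assuming the val_dataset and SEED from the snippet above; the subset size of 1000 is an arbitrary choice:
small_val_dataset = val_dataset.shuffle(seed=SEED).select(range(1000))  # random 1000-example subset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=small_val_dataset
)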
If the problem is just that the RAM is not freed after evaluation, we can try to work around that (though Python garbage collector can be tricky to trigger).
If the validation set gives predictions that do not fit in RAM, we can't do much in the generic Trainer directly. You can subclass Trainer and override its evaluate function to use the datasets library's Metric objects, which store the predictions with Apache Arrow and therefore use less RAM.
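A minimal sketch of such a subclass, assuming the seq2seq setup and global tokenizer from the snippet above; the metric name ("sacrebleu"), the generate call, and the decoding step are illustrative choices, not part of the original report:
import torch
from datasets import load_metric
from transformers import Trainer

class MetricStreamingTrainer(Trainer):
    def evaluate(self, eval_dataset=None, **kwargs):
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        dataloader = self.get_eval_dataloader(eval_dataset)
        metric = load_metric("sacrebleu")  # any datasets Metric works here
        self.model.eval()
        for batch in dataloader:
            batch = {k: v.to(self.args.device) for k, v in batch.items()}
            with torch.no_grad():
                generated = self.model.generate(batch["input_ids"], attention_mask=batch["attention_mask"])
            preds = tokenizer.batch_decode(generated, skip_special_tokens=True)
            refs = tokenizer.batch_decode(batch["labels"], skip_special_tokens=True)
            # add_batch writes to an Arrow-backed table, so predictions do not pile up in RAM
            metric.add_batch(predictions=preds, references=[[r] for r in refs])
        return metric.compute()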
If the problem is just that the RAM is not freed after evaluation, we can try to work around that (though Python garbage collector can be tricky to trigger).
I think the problem is not this one. The RAM is freed after evaluation (after a few seconds), but it is not freed between one evaluation step and the next. Correct me if I am wrong, but after a step the only thing that needs to stay in RAM should be the loss, so it can be averaged at the end of evaluation; RAM usage should therefore not increase as the steps go on, which is instead what happens.
During evaluation, we need to store the predictions and labels too, for the metric computation. If you want to store only the loss, pass the flag prediction_loss_only=True to your training arguments, which will use much less RAM (and you can then probably remove eval_accumulation_steps=1 to speed up evaluation).
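In code, that suggestion would look something like this (the TrainingArguments from the original snippet, with prediction_loss_only=True added and eval_accumulation_steps removed):
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    prediction_loss_only=True,  # only the loss is gathered during evaluation
    learning_rate=0.001,
    evaluation_strategy='steps',
    save_steps=1000000,
    save_total_limit=1,
    remove_unused_columns=True,
    run_name=now,
    logging_steps=100,
    eval_steps=100,
    logging_first_step=True
)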
I didn't know that; it solved my problem, thank you!
This should even be automatic now, as I just merged a PR on master where the Trainer does not bother saving the predictions when there is no compute_metrics (which is your case here).