Describe the bug
When training a SequenceTagger model the RAM usage keeps gradually increasing, untill it runs out of memory after a few epochs, and crashes when saving checkpoint.
2020-05-08 12:14:29,965 epoch 4 - iter 480/602 - loss 0.81027193 - samples/sec: 140.87
2020-05-08 12:14:57,488 epoch 4 - iter 540/602 - loss 0.81292692 - samples/sec: 139.73
2020-05-08 12:15:28,748 epoch 4 - iter 600/602 - loss 0.81350967 - samples/sec: 123.00
2020-05-08 12:15:29,467 ----------------------------------------------------------------------------------------------------
2020-05-08 12:15:29,467 EPOCH 4 done: loss 0.8129 - lr 0.1000
2020-05-08 12:15:29,467 BAD EPOCHS (no improvement): 0
Traceback (most recent call last):
File "./config/flair_train_script.py", line 104, in <module>
checkpoint=True)
File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/flair/trainers/trainer.py", line 525, in train
self.save_checkpoint(base_path / "checkpoint.pt")
File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/flair/trainers/trainer.py", line 573, in save_checkpoint
torch.save(self, str(model_file), pickle_protocol=4)
File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/torch/serialization.py", line 328, in save
_legacy_save(obj, opened_file, pickle_module, pickle_protocol)
File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/torch/serialization.py", line 401, in _legacy_save
pickler.dump(obj)
MemoryError
To Reproduce
SequenceTagger with
embedding_types = [
WordEmbeddings('nl'),
PooledFlairEmbeddings('dutch-forward', pooling='mean'),
PooledFlairEmbeddings('dutch-backward', pooling='mean'),
]
Expected behavior
Don't use all the RAM, and don't crash.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
I'm training using the CONLL_03_DUTCH dataset + another dataset that has overlap in the tags.
pytorch 1.4.0 is used because 1.5.0 currently has this issue (https://github.com/pytorch/pytorch/issues/37703).
Could it have something to do with the pooled flair embeddings not being properly cleared after each epoch?
The problem does not occur when using
embedding_types = [
WordEmbeddings('nl'),
FlairEmbeddings('nl-forward'),
FlairEmbeddings('nl-backward'),
]
I think this might be the same issue as https://github.com/flairNLP/flair/issues/1270.
Could it be possible that the "fix" of that issue just moved the problem from GPU RAM to normal RAM?
Could it be related to this?: https://discuss.pytorch.org/t/memory-ram-usage-keep-going-up-every-step/12109/8
@schelv thanks for reporting this. I have to test. So essentially the RAM increases with each epoch, so we should see this behavior also in a test on a small dataset, right?
Yes It happens also when using a smaller dataset.
What is strange is that the first few epochs the RAM increase is largest.
for example start of epoch 3 is 21GB of RAM, at the start of epoch 7 25.9GB of RAM, and at the start of 8 it is 26.2GB of RAM.
After that the increase is even slower.
Yes, I notice the same thing. On smaller data it seems there is no RAM increase after 3-4 epochs. I can't find a memory leak on our side. It could be that this is the way PyTorch reserves more memory when it encounters a large mini-batch which is why there is almost no increase after a few epochs.
If that is the case, then you would expect to see this effect also for other types of embeddings, right?
Why would this be specific for the PooledEmbedding?
Hi @alanakbik
Suddenly I have the same issue. however, the same code was working fine before.
I appreciate your help
2020-06-14 16:53:49,857 epoch 1 - iter 12/128 - loss 46.14641762 - samples/sec: 7.96
2020-06-14 16:54:14,401 epoch 1 - iter 24/128 - loss 30.61114464 - samples/sec: 8.36
2020-06-14 16:54:34,429 epoch 1 - iter 36/128 - loss 24.42451382 - samples/sec: 10.35
2020-06-14 16:54:57,682 epoch 1 - iter 48/128 - loss 21.07016906 - samples/sec: 8.85
2020-06-14 16:55:25,035 epoch 1 - iter 60/128 - loss 19.32886293 - samples/sec: 7.51
2020-06-14 16:55:52,091 epoch 1 - iter 72/128 - loss 17.65958128 - samples/sec: 7.54
2020-06-14 16:56:16,569 epoch 1 - iter 84/128 - loss 16.43847208 - samples/sec: 8.38
2020-06-14 16:56:38,924 epoch 1 - iter 96/128 - loss 15.39204485 - samples/sec: 9.33
2020-06-14 16:56:57,066 epoch 1 - iter 108/128 - loss 14.41638838 - samples/sec: 11.51
2020-06-14 16:57:27,448 epoch 1 - iter 120/128 - loss 13.77307765 - samples/sec: 6.66
2020-06-14 16:57:44,326 ----------------------------------------------------------------------------------------------------
2020-06-14 16:57:44,329 EPOCH 1 done: loss 13.5623 - lr 0.1000000
PicklingError Traceback (most recent call last)
101 mini_batch_size=16,
102 max_epochs=150,
--> 103 checkpoint=True)
104
3 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
441 pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
442 pickler.persistent_id = persistent_id
--> 443 pickler.dump(obj)
444
445 serialized_storage_keys = sorted(serialized_storages.keys())
PicklingError: Can't pickle
@abeermohamed1 thanks for reporting this - I now get the same error. This is likely because somehow the torch aggregation functions like torch.add can no longer be serialized. Maybe something changed in their implementation in torch.
I'll put in a PR to fix this shortly!
Thank you @alanakbik
Could you please advice how to apply the fix now as I need it urgently?
Just merged the PR! If you install from master it should work now!
Thank you @alanakbik
I am getting the below error
TypeError Traceback (most recent call last)
103 mini_batch_size=16,
104 max_epochs=150,
--> 105 checkpoint=True)
106
5 frames
/usr/local/lib/python3.6/dist-packages/flair/embeddings/token.py in _add_embeddings_internal(self, sentences)
762 # set aggregation operation
763 if self.pooling == "mean":
--> 764 aggregated_embedding = torch.mean(self.word_embeddings[token.text], local_embedding)
765 elif self.pooling == "fade":
766 aggregated_embedding = torch.add(self.word_embeddings[token.text], local_embedding)
TypeError: mean() received an invalid combination of arguments - got (Tensor, Tensor), but expected one of:
@alanakbik 'min' and 'max' is working fine. the problem is in 'mean' which I need to work on.
Should be fixed now, can you try again?
Should be fixed now, can you try again?
Yes, it is working now. Appreciated @alanakbik Thank you :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Most helpful comment
Should be fixed now, can you try again?