Flair: MemoryError after a few epochs using PooledFlairEmbeddings

Created on 8 May 2020 · 13 Comments · Source: flairNLP/flair

Describe the bug
When training a SequenceTagger model, RAM usage gradually increases until the process runs out of memory after a few epochs and crashes while saving the checkpoint.

2020-05-08 12:14:29,965 epoch 4 - iter 480/602 - loss 0.81027193 - samples/sec: 140.87
2020-05-08 12:14:57,488 epoch 4 - iter 540/602 - loss 0.81292692 - samples/sec: 139.73
2020-05-08 12:15:28,748 epoch 4 - iter 600/602 - loss 0.81350967 - samples/sec: 123.00
2020-05-08 12:15:29,467 ----------------------------------------------------------------------------------------------------
2020-05-08 12:15:29,467 EPOCH 4 done: loss 0.8129 - lr 0.1000
2020-05-08 12:15:29,467 BAD EPOCHS (no improvement): 0

 Traceback (most recent call last):
  File "./config/flair_train_script.py", line 104, in <module>
    checkpoint=True)
  File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/flair/trainers/trainer.py", line 525, in train
    self.save_checkpoint(base_path / "checkpoint.pt")
  File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/flair/trainers/trainer.py", line 573, in save_checkpoint
    torch.save(self, str(model_file), pickle_protocol=4)
  File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/torch/serialization.py", line 328, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/home/ubuntu/miniconda/envs/aws_env/lib/python3.6/site-packages/torch/serialization.py", line 401, in _legacy_save
    pickler.dump(obj)
MemoryError

To Reproduce
SequenceTagger with

from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings

embedding_types = [
    WordEmbeddings('nl'),
    PooledFlairEmbeddings('dutch-forward', pooling='mean'),
    PooledFlairEmbeddings('dutch-backward', pooling='mean'),
]

Expected behavior
Don't use all the RAM, and don't crash.

Environment (please complete the following information):

V100 GPU
61 GiB RAM
OS: Ubuntu 18.04
Version:
flair-0.4.5
pytorch-1.4.0-py3.6_cuda10.1.243_cudnn7.6.3_0

Additional context

I'm training using the CONLL_03_DUTCH dataset + another dataset that has overlap in the tags.
pytorch 1.4.0 is used because 1.5.0 currently has this issue (https://github.com/pytorch/pytorch/issues/37703).

Could it have something to do with the pooled flair embeddings not being properly cleared after each epoch?

The problem does not occur when using

    from flair.embeddings import WordEmbeddings, FlairEmbeddings

    embedding_types = [
        WordEmbeddings('nl'),
        FlairEmbeddings('nl-forward'),
        FlairEmbeddings('nl-backward'),
    ]

I think this might be the same issue as https://github.com/flairNLP/flair/issues/1270.
Could it be possible that the "fix" of that issue just moved the problem from GPU RAM to normal RAM?

Labels: bug, wontfix

schelv

👍1

All 13 comments

@schelv thanks for reporting this. I have to test. So essentially the RAM increases with each epoch, so we should see this behavior also in a test on a small dataset, right?

alanakbik on 11 May 2020

Yes, it also happens when using a smaller dataset.
What is strange is that the RAM increase is largest during the first few epochs.

For example, RAM usage is 21 GB at the start of epoch 3, 25.9 GB at the start of epoch 7, and 26.2 GB at the start of epoch 8.

After that the increase is even slower.

schelv on 12 May 2020
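
Editor's note: per-epoch figures like the ones above can be collected programmatically. The following is a minimal sketch using only the standard library. `run_epochs` and `leaky_epoch` are hypothetical names, and the leaking epoch is a toy stand-in, not Flair code. Note that `tracemalloc` only sees allocations made through Python's allocator, so for tensor memory a process-level measure such as resident set size would be needed instead.

```python
import tracemalloc


def run_epochs(num_epochs, train_one_epoch):
    """Call train_one_epoch(epoch) repeatedly and record traced memory (bytes) after each epoch."""
    tracemalloc.start()
    usage = []
    try:
        for epoch in range(1, num_epochs + 1):
            train_one_epoch(epoch)
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


# Hypothetical stand-in for a training epoch that leaks:
# every call keeps another 100 kB buffer alive in a module-level list.
leaked = []


def leaky_epoch(epoch):
    leaked.append(bytearray(100_000))


usage = run_epochs(5, leaky_epoch)
for epoch, nbytes in enumerate(usage, start=1):
    print(f"after epoch {epoch}: {nbytes / 1e6:.2f} MB traced")
```

A steadily rising series points at objects that stay referenced between epochs, while a series that flattens suggests a cache or allocator that has reached its working size.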

Yes, I notice the same thing. On smaller data there seems to be no RAM increase after 3-4 epochs. I can't find a memory leak on our side. It could be that PyTorch reserves more memory whenever it encounters a larger mini-batch, which would explain why there is almost no increase after a few epochs.

alanakbik on 14 May 2020

If that is the case, then you would expect to see this effect also for other types of embeddings, right?
Why would this be specific for the PooledEmbedding?

schelv on 14 May 2020
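
Editor's note: one property that sets pooled embeddings apart is visible in a traceback further down this thread, where `_add_embeddings_internal` reads `self.word_embeddings[token.text]`. In other words, a value is kept per distinct token text. The toy simulation below is a sketch under that assumption only. It is not a diagnosis of the reported behaviour, and `simulate_pooled_memory` is a hypothetical name. It shows how such a per-word memory grows quickly at first and then levels off as fewer unseen words arrive.

```python
import random


def simulate_pooled_memory(vocab_size, tokens_per_epoch, num_epochs, seed=0):
    """Return the number of entries in a per-word memory after each epoch."""
    rng = random.Random(seed)
    memory = {}  # token text -> pooled value (a float stands in for an embedding vector)
    sizes = []
    for _ in range(num_epochs):
        for _ in range(tokens_per_epoch):
            token = f"w{rng.randrange(vocab_size)}"
            local = rng.random()
            # Keep the largest value seen so far, as a stand-in for a pooling operation.
            memory[token] = max(memory.get(token, local), local)
        sizes.append(len(memory))
    return sizes


sizes = simulate_pooled_memory(vocab_size=1000, tokens_per_epoch=500, num_epochs=8)
print(sizes)
```

Tokens are drawn at random here purely to keep the example short; real training data revisits the same sentences every epoch, so the shape of the curve in practice may differ.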

Hi @alanakbik,
Suddenly I have the same issue, although the same code was working fine before.
I would appreciate your help.

2020-06-14 16:53:49,857 epoch 1 - iter 12/128 - loss 46.14641762 - samples/sec: 7.96
2020-06-14 16:54:14,401 epoch 1 - iter 24/128 - loss 30.61114464 - samples/sec: 8.36
2020-06-14 16:54:34,429 epoch 1 - iter 36/128 - loss 24.42451382 - samples/sec: 10.35
2020-06-14 16:54:57,682 epoch 1 - iter 48/128 - loss 21.07016906 - samples/sec: 8.85
2020-06-14 16:55:25,035 epoch 1 - iter 60/128 - loss 19.32886293 - samples/sec: 7.51
2020-06-14 16:55:52,091 epoch 1 - iter 72/128 - loss 17.65958128 - samples/sec: 7.54
2020-06-14 16:56:16,569 epoch 1 - iter 84/128 - loss 16.43847208 - samples/sec: 8.38
2020-06-14 16:56:38,924 epoch 1 - iter 96/128 - loss 15.39204485 - samples/sec: 9.33
2020-06-14 16:56:57,066 epoch 1 - iter 108/128 - loss 14.41638838 - samples/sec: 11.51
2020-06-14 16:57:27,448 epoch 1 - iter 120/128 - loss 13.77307765 - samples/sec: 6.66
2020-06-14 16:57:44,326 ----------------------------------------------------------------------------------------------------
2020-06-14 16:57:44,329 EPOCH 1 done: loss 13.5623 - lr 0.1000000
2020-06-14 16:57:44,331 BAD EPOCHS (no improvement): 0

PicklingError Traceback (most recent call last)
in ()
101 mini_batch_size=16,
102 max_epochs=150,
--> 103 checkpoint=True)
104

3 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
441 pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
442 pickler.persistent_id = persistent_id
--> 443 pickler.dump(obj)
444
445 serialized_storage_keys = sorted(serialized_storages.keys())

PicklingError: Can't pickle : it's not the same object as torch._C._VariableFunctions

abeermohamed1 on 14 Jun 2020

@abeermohamed1 thanks for reporting this - I now get the same error. This is likely because somehow the torch aggregation functions like torch.add can no longer be serialized. Maybe something changed in their implementation in torch.

I'll put in a PR to fix this shortly!

alanakbik on 14 Jun 2020
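
Editor's note: the comment above attributes the failure to function references that the pickler cannot serialize. A common way to avoid this class of problem is to store only the name of the operation and look the function up when it is needed. The sketch below illustrates that pattern with the standard library. It is not the change that was merged into Flair. `POOLING_OPS` and both state dictionaries are hypothetical, and a lambda is used as a stand-in for a function object that cannot be pickled by reference.

```python
import pickle

# Module-level lookup table: operation name -> aggregation function.
# The "sum" entry is a toy lambda chosen because lambdas cannot be pickled.
POOLING_OPS = {
    "min": min,
    "max": max,
    "sum": lambda old, new: old + new,
}

# State that stores the function object itself: pickling fails.
bad_state = {"pooling": "sum", "aggregate_op": POOLING_OPS["sum"]}
try:
    pickle.dumps(bad_state)
    bad_ok = True
except (pickle.PicklingError, AttributeError) as err:
    bad_ok = False
    print("could not pickle:", type(err).__name__)

# State that stores only the name: pickling works, and the function
# is resolved from the table after loading.
good_state = {"pooling": "sum"}
restored = pickle.loads(pickle.dumps(good_state))
aggregate = POOLING_OPS[restored["pooling"]]
print(aggregate(1.0, 3.0))  # 4.0
```

Keeping the checkpointed state free of callables also makes it less sensitive to changes in how a library exposes its functions between versions.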

Thank you @alanakbik
Could you please advise how to apply the fix now, as I need it urgently?

abeermohamed1 on 15 Jun 2020

Just merged the PR! If you install from master it should work now!

alanakbik on 15 Jun 2020

Thank you @alanakbik
I am now getting the error below:

TypeError Traceback (most recent call last)
in ()
103 mini_batch_size=16,
104 max_epochs=150,
--> 105 checkpoint=True)
106

5 frames
/usr/local/lib/python3.6/dist-packages/flair/embeddings/token.py in _add_embeddings_internal(self, sentences)
762 # set aggregation operation
763 if self.pooling == "mean":
--> 764 aggregated_embedding = torch.mean(self.word_embeddings[token.text], local_embedding)
765 elif self.pooling == "fade":
766 aggregated_embedding = torch.add(self.word_embeddings[token.text], local_embedding)

TypeError: mean() received an invalid combination of arguments - got (Tensor, Tensor), but expected one of:

(Tensor input, *, torch.dtype dtype)
(Tensor input, tuple of names dim, bool keepdim, *, torch.dtype dtype, Tensor out)
(Tensor input, tuple of ints dim, bool keepdim, *, torch.dtype dtype, Tensor out)

abeermohamed1 on 16 Jun 2020
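
Editor's note: the TypeError above says that `torch.mean` accepts a single tensor (plus optional arguments such as `dim`), so it cannot be used to average a stored embedding with a new one. Mean pooling over a stream of observations needs a count per word as well. The sketch below shows an incremental mean in plain Python lists. It is an illustration of the arithmetic under that assumption, not the fix that was applied in Flair, and `update_running_mean`, `observe`, and `memory` are hypothetical names.

```python
def update_running_mean(mean, count, new_vector):
    """Fold new_vector into a running mean; return (new_mean, new_count)."""
    count += 1
    new_mean = [m + (x - m) / count for m, x in zip(mean, new_vector)]
    return new_mean, count


memory = {}  # token text -> (mean vector, number of observations)


def observe(token, vector):
    """Update the stored mean for token with vector and return the current mean."""
    if token not in memory:
        memory[token] = (list(vector), 1)
    else:
        mean, count = memory[token]
        memory[token] = update_running_mean(mean, count, vector)
    return memory[token][0]


observe("Amsterdam", [1.0, 2.0])
observe("Amsterdam", [3.0, 4.0])
result = observe("Amsterdam", [5.0, 6.0])
print(result)  # [3.0, 4.0]
```

The update `mean + (x - mean) / count` gives the same result as summing all vectors and dividing by the count, without having to keep the individual vectors around.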

@alanakbik 'min' and 'max' are working fine. The problem is in 'mean', which I need to get working.

abeermohamed1 on 16 Jun 2020

Should be fixed now, can you try again?

alanakbik on 16 Jun 2020

🚀1 🎉1

> Should be fixed now, can you try again?

Yes, it is working now. Much appreciated, @alanakbik. Thank you :)

abeermohamed1 on 17 Jun 2020

👍1

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.