Transformers: DistilBERT training is killed because of OOM

Created on 2 Sep 2019 · 14 Comments · Source: huggingface/transformers

โ“ Questions & Help

I am trying DistilBERT training. The training script (train.py) gradually consumed CPU memory, and the training was killed because of OOM after about one day (the available CPU memory is 96 GB).
I used one GPU for the training.

Do you have any idea? Thanks in advance.

Most helpful comment

I believe we found the bug.
It was related to some internal bug in PyTorch: see pytorch/pytorch#24200.

I installed PyTorch from source (it is a pretty recent fix, so it's not in the latest release yet), tracked the RAM while distilling, and the memory usage is more or less constant.
I am launching a bigger training right now just to make sure this is really what was causing the memory leak; if so (I'll get back to you here), it seems you'll have to compile PyTorch from source for now.

Victor

So I trained a model for ~16 hours and observed no increase in RAM over the training.

I will update the README to point out this special setup (compiling from source for now) and leave the issue open until the next PyTorch release.

All 14 comments

@tomohideshibata Could you paste the complete error message here? 🤗

I'm currently distilling a model and my RAM usage (system) is ~20 GB. GPU usage is ~8 GB on a V100. If the OOM was caused by your GPU, then I would recommend decreasing the batch size (which is 5 by default) :)

I'm trying to train DistilBERT, but I cannot find dump.txt, which I assume is the preprocessed Wikipedia and Toronto Book Corpus dataset. Could someone help? Thanks.

@stefan-it The error message was just "Killed".

GPU memory is not a problem; I can make the batch size larger (16). The problem is CPU memory.

I now suspect tensorboard.add_scalar. I will try to reduce the volume of logged output. If I find something, I will let you know.
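
For anyone who wants to check this on their side, here is a minimal sketch (assuming psutil is installed; log_rss is a made-up helper, not part of train.py) that logs the process's resident memory from inside the training loop, so a CPU-side leak shows up as a steadily growing RSS.

```python
# Minimal sketch (not part of the actual train.py): track the training
# process's resident memory over time to confirm whether CPU RAM really grows.
# Assumes psutil is installed (pip install psutil).
import os
import psutil

_process = psutil.Process(os.getpid())

def log_rss(step, every=1000):
    """Print the resident set size in GB every `every` steps."""
    if step % every == 0:
        rss_gb = _process.memory_info().rss / 1024 ** 3
        print(f"step {step}: RSS = {rss_gb:.2f} GB")
```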

After 40h of training I can also confirm an increase from 20 GB (at ~20 hours of training) to 42 GB 🤔

@forjiuzhou When calling the binarized_data.py script, you have to specify your input corpus via --file_path. It points to data/dump.txt by default, so just pass your preprocessed training corpus to the --file_path option.

Yes, I can confirm this bug (?). I am actually scratching my head over this strange behaviour too... so if you do find the reason, I would be more than happy to push an update.

@forjiuzhou, indeed @stefan-it is correct! Please replace the file dump.txt with your own text dataset. Then, I also recommend that you call token_counts.py before training (so that you only do it once).
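
For reference, a hedged sketch of that preprocessing step: only --file_path and the data/dump.txt default come from this thread, the corpus file name is a placeholder, and each script's --help lists the remaining required arguments.

```python
# Hedged sketch: run the two preprocessing scripts against your own corpus
# instead of the default data/dump.txt. The corpus path is a placeholder, and
# extra_args must be filled in from each script's --help (tokenizer settings,
# output paths, etc.).
import subprocess

corpus = "data/my_corpus.txt"  # your own text dataset, replacing dump.txt
extra_args = []  # remaining required options from binarized_data.py --help

# Binarize the corpus (the --file_path option mentioned above).
subprocess.run(
    ["python", "binarized_data.py", "--file_path", corpus] + extra_args,
    check=True,
)

# Compute token counts once, before training, as recommended above
# (see token_counts.py --help for its required options).
subprocess.run(["python", "token_counts.py"], check=True)
```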

I believe we found the bug.
It was related to some internal bug in PyTorch: see https://github.com/pytorch/pytorch/issues/24200.

I installed PyTorch from source (it is a pretty recent fix, so it's not in the latest release yet), tracked the RAM while distilling, and the memory usage is more or less constant.
I am launching a bigger training right now just to make sure this is really what was causing the memory leak; if so (I'll get back to you here), it seems you'll have to compile PyTorch from source for now.

Victor

Hey guys, I think the reason is that there are too many TensorBoard logs @tomohideshibata @stefan-it. I stopped saving the logs, and now I get more training time.

I have suppressed the TensorBoard logs (the for param_name, param in self.student.named_parameters(): block was commented out in the log_tensorboard function), but the CPU memory consumption seemed unchanged.
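
To make that concrete, here is a standalone sketch (not the actual Distiller.log_tensorboard code; the writer, the stand-in model, and the log directory are illustrative, and it assumes the tensorboard package is installed) of the kind of per-parameter add_scalar calls that were commented out:

```python
# Standalone illustration (not the real Distiller code) of per-parameter
# scalar logging: one add_scalar call per parameter per logging step, which
# adds up quickly over a long training run.
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/distillation_debug")  # illustrative log dir
student = torch.nn.Linear(4, 2)  # stand-in for the student model

for step in range(3):
    for param_name, param in student.named_parameters():
        writer.add_scalar(f"parameter_mean/{param_name}", param.data.mean().item(), step)

writer.close()
```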

So, I will try the latest PyTorch.

After 40h of training I can also confirm an increase from 20 GB (at ~20 hours of training) to 42 GB 🤔

@forjiuzhou When calling the binarized_data.py script, you have to specify your input corpus via --file_path. It points to data/dump.txt by default, so just pass your preprocessed training corpus to the --file_path option.

Sorry, I seem to have asked the wrong question in this issue. But I actually don't have access to the Wikipedia and Toronto corpora, and they seem to be unavailable on the internet.

I believe we found the bug.
It was related to some internal bug in PyTorch: see pytorch/pytorch#24200.

I installed PyTorch from source (it is a pretty recent fix, so it's not in the latest release yet), tracked the RAM while distilling, and the memory usage is more or less constant.
I am launching a bigger training right now just to make sure this is really what was causing the memory leak; if so (I'll get back to you here), it seems you'll have to compile PyTorch from source for now.

Victor

So I trained a model for ~16 hours and observed no increase in RAM over the training.

I will update the README to point out this special setup (compiling from source for now) and leave the issue open until the next PyTorch release.

@VictorSanh I have installed PyTorch from source, and the training is fine. Thanks!

So PyTorch 1.3 was released yesterday 🔥🎉 (and it includes new features I am extremely excited about)!
The release includes the bug fix, so you should be able to use the stable version available on pip!
(Of course, if you prefer, you can still compile PyTorch from source!)
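
A quick sanity check (a minimal sketch, nothing distillation-specific) of which build actually ended up in the environment, whether a pip wheel or a source build:

```python
# Check that the installed build is 1.3.0 or newer (or a recent source build),
# i.e. a release that should already contain the upstream fix mentioned above.
import torch

print(torch.__version__)  # e.g. "1.3.0" from pip, or "1.4.0a0+<sha>" from a source build
```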

So PyTorch 1.3 was released yesterday 🔥🎉 (and it includes new features I am extremely excited about)!
The release includes the bug fix, so you should be able to use the stable version available on pip!
(Of course, if you prefer, you can still compile PyTorch from source!)

I tried installing PyTorch 1.3, but it's still leaking.

@iamlxb3 Do you mind sharing your exact PyTorch configuration? I re-launched the scripts a few days ago with torch==1.4.0 and didn't see a memory leak.
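
If it helps, a minimal sketch for sharing the exact configuration, assuming the torch.utils.collect_env helper is available in your build:

```python
# Dump the full environment details (PyTorch version, CUDA/cuDNN, OS, relevant
# pip packages) so the exact configuration can be pasted into the issue.
from torch.utils import collect_env

collect_env.main()  # same output as `python -m torch.utils.collect_env`
```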
