Transformers: DistilBERT training is killed because of OOM

Created on 2 Sep 2019 · 14 Comments · Source: huggingface/transformers

โ“ Questions & Help

I am trying DistilBERT training. The training script (train.py) gradually consumed CPU memory, and the training was killed because of OOM after about one day (the available CPU memory is 96 GB).
I used one GPU for the training.

Do you have any idea? Thanks in advance.

Most helpful comment

I believe we found the bug.
It was related to some internal bug in PyTorch: see pytorch/pytorch#24200.

I installed PyTorch from source (it is a pretty recent fix, so it's not in the latest release yet), tracked the RAM while distilling, and the memory usage is more or less constant.
I am launching a bigger training right now just to make sure this is really what was causing the memory leak; if so (I'll get back to you here), it seems you'll have to compile PyTorch from source for now.

Victor

So I trained a model for ~16 hours and observed no increase in RAM over the training.

I will update the README to point out this special setup (compiling from source for now) and leave the issue open until the next PyTorch release.

All 14 comments

@tomohideshibata Could you paste the complete error message here? 🤗

I'm currently distilling a model and my RAM usage (system) is ~20 GB. GPU usage is ~8 GB on a V100. If the OOM was caused by your GPU, then I would recommend decreasing the batch size (which is 5 by default) :)

I'm trying to train DistilBERT, but I cannot find dump.txt, which I assume is the preprocessed Wikipedia and Toronto Book Corpus dataset. Could someone help? Thanks.

@stefan-it The error message was just "Killed".

GPU memory is not a problem; I can make the batch size larger (16). The problem is CPU memory.

I now suspect tensorboard.add_scalar. I will try to reduce the volume of logged output. If I find something, I will let you know.
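
For anyone who wants to check this on their side, here is a minimal sketch (assuming psutil is installed; log_rss is a made-up helper, not part of train.py) that logs the process's resident memory from inside the training loop, so a CPU-side leak shows up as a steadily growing RSS.

```python
# Minimal sketch (not part of the actual train.py): track the training
# process's resident memory over time to confirm whether CPU RAM really grows.
# Assumes psutil is installed (pip install psutil).
import os
import psutil

_process = psutil.Process(os.getpid())

def log_rss(step, every=1000):
    """Print the resident set size in GB every `every` steps."""
    if step % every == 0:
        rss_gb = _process.memory_info().rss / 1024 ** 3
        print(f"step {step}: RSS = {rss_gb:.2f} GB")
```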

After 40h of training I can also confirm an increase from 20 GB (at ~20 hours of training) to 42 GB 🤔

@forjiuzhou When calling the binarized_data.py script, you have to specify your input corpus via --file_path. It points to data/dump.txt by default, so just pass your preprocessed training corpus to the --file_path option.

Yes, I can confirm this bug (?). I am actually scratching my head over this strange behaviour too... so if you do find the reason, I would be more than happy to push an update.

@forjiuzhou, indeed @stefan-it is correct! Please replace the file dump.txt with your own text dataset. Then, I also recommend that you call token_counts.py before training (so that you only do it once).
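
For reference, a hedged sketch of that preprocessing step: only --file_path and the data/dump.txt default come from this thread, the corpus file name is a placeholder, and each script's --help lists the remaining required arguments.

```python
# Hedged sketch: run the two preprocessing scripts against your own corpus
# instead of the default data/dump.txt. The corpus path is a placeholder, and
# extra_args must be filled in from each script's --help (tokenizer settings,
# output paths, etc.).
import subprocess

corpus = "data/my_corpus.txt"  # your own text dataset, replacing dump.txt
extra_args = []  # remaining required options from binarized_data.py --help

# Binarize the corpus (the --file_path option mentioned above).
subprocess.run(
    ["python", "binarized_data.py", "--file_path", corpus] + extra_args,
    check=True,
)

# Compute token counts once, before training, as recommended above
# (see token_counts.py --help for its required options).
subprocess.run(["python", "token_counts.py"], check=True)
```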

I believe we found the bug.
It was related to some internal bug in PyTorch: see https://github.com/pytorch/pytorch/issues/24200.

I installed PyTorch from source (it is a pretty recent fix, so it's not in the latest release yet), tracked the RAM while distilling, and the memory usage is more or less constant.
I am launching a bigger training right now just to make sure this is really what was causing the memory leak; if so (I'll get back to you here), it seems you'll have to compile PyTorch from source for now.

Victor

Hey guys, I think the reason is that there are too many TensorBoard logs @tomohideshibata @stefan-it. I stopped saving the logs, and now I get more training time.

I have suppressed the TensorBoard logs (the for param_name, param in self.student.named_parameters(): block was commented out in the log_tensorboard function), but the CPU memory consumption seemed unchanged.
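
To make that concrete, here is a standalone sketch (not the actual Distiller.log_tensorboard code; the writer, the stand-in model, and the log directory are illustrative, and it assumes the tensorboard package is installed) of the kind of per-parameter add_scalar calls that were commented out:

```python
# Standalone illustration (not the real Distiller code) of per-parameter
# scalar logging: one add_scalar call per parameter per logging step, which
# adds up quickly over a long training run.
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/distillation_debug")  # illustrative log dir
student = torch.nn.Linear(4, 2)  # stand-in for the student model

for step in range(3):
    for param_name, param in student.named_parameters():
        writer.add_scalar(f"parameter_mean/{param_name}", param.data.mean().item(), step)

writer.close()
```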

So, I will try the latest PyTorch.

After 40h of training I can also confirm an increase from 20 GB (at ~20 hours of training) to 42 GB 🤔

@forjiuzhou When calling the binarized_data.py script, you have to specify your input corpus via --file_path. It points to data/dump.txt by default, so just pass your preprocessed training corpus to the --file_path option.

Sorry, I seem to have asked the wrong question in this issue. But I actually don't have access to the Wikipedia and Toronto corpora, and they seem to be unavailable on the internet.

I believe we found the bug.
It was related to some internal bug in PyTorch: see pytorch/pytorch#24200.

I installed PyTorch from source (it is a pretty recent fix, so it's not in the latest release yet), tracked the RAM while distilling, and the memory usage is more or less constant.
I am launching a bigger training right now just to make sure this is really what was causing the memory leak; if so (I'll get back to you here), it seems you'll have to compile PyTorch from source for now.

Victor

So I trained a model for ~16 hours and observed no increase in RAM over the training.

I will update the README to point out this special setup (compiling from source for now) and leave the issue open until the next PyTorch release.

@VictorSanh I have installed PyTorch from source, and the training is fine. Thanks!

So PyTorch 1.3 was released yesterday 🔥🎉 (and it includes new features I am extremely excited about)!
The release includes the bug fix, so you should be able to use the stable version available on pip!
(Of course, if you prefer, you can still compile PyTorch from source!)
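
A quick sanity check (a minimal sketch, nothing distillation-specific) of which build actually ended up in the environment, whether a pip wheel or a source build:

```python
# Check that the installed build is 1.3.0 or newer (or a recent source build),
# i.e. a release that should already contain the upstream fix mentioned above.
import torch

print(torch.__version__)  # e.g. "1.3.0" from pip, or "1.4.0a0+<sha>" from a source build
```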

So PyTorch 1.3 was released yesterday 🔥🎉 (and it includes new features I am extremely excited about)!
The release includes the bug fix, so you should be able to use the stable version available on pip!
(Of course, if you prefer, you can still compile PyTorch from source!)

I tried installing PyTorch 1.3, but it's still leaking.

@iamlxb3 Do you mind sharing your exact PyTorch configuration? I re-launched the scripts a few days ago with torch==1.4.0 and didn't see a memory leak.
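
If it helps, a minimal sketch for sharing the exact configuration, assuming the torch.utils.collect_env helper is available in your build:

```python
# Dump the full environment details (PyTorch version, CUDA/cuDNN, OS, relevant
# pip packages) so the exact configuration can be pasted into the issue.
from torch.utils import collect_env

collect_env.main()  # same output as `python -m torch.utils.collect_env`
```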
