Pytorch: Reliably repeating pytorch system crash/reboot when using imagenet examples

Created on 8 Oct 2017 · 67 comments · Source: pytorch/pytorch

So I have a 100% repeatable system crash (reboot) when trying to run the imagenet example (2012 dataset) with the resnet18 defaults. The crash seems to happen in Variable.py at torch.autograd.backward(...) (line 158).

I am able to run the basic mnist example successfully.

Setup: Ubuntu 16.04, 4.10.0-35-generic #39~16.04.1-Ubuntu SMP Wed Sep 13 09:02:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

python --version Python 3.6.2 :: Anaconda, Inc.

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvidia-smi output.
Sat Oct 7 23:51:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:03:00.0  On |                  N/A |
| 14%   51C    P8    18W / 250W |    650MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1335      G   /usr/lib/xorg/Xorg                           499MiB |
|    0      2231      G   cinnamon                                      55MiB |
|    0      3390      G   ...-token=C6DE372B6D9D4FCD6453869AF4C6B4E5    93MiB |
+-----------------------------------------------------------------------------+

torch/vision was built locally on the machine from master. No issues at compile or install time, other than the normal compile time warnings...

Happy to help get further information..


All 67 comments

I have experienced random system reboots before due to a motherboard/GPU incompatibility. This manifested during long training runs. Do other frameworks (e.g. Caffe) succeed in training on ImageNet?

Haven't tried that yet. However, I ran some long-running graphics benchmarks ;) with no problems. I could probably look at giving other frameworks a shot; what's your recommendation, Caffe?

Do keep in mind, the crash I reported happens practically immediately (mnist-cuda example runs to completion many times without an issue). So I doubt that it's a h/w incompatibility issue.

Can you try triggering the crash once again and see if anything relevant is printed in /var/log/dmesg.0 or /var/log/kern.log?

Zero entries related to this in either dmesg or kern.log. The machine does an audible click and resets, so I think it's the h/w registers or memory being twiddled in a way it doesn't like. No real notice to the kernel to log anything. It reboots at the same line of code each time, at least the few times I stepped through it.

That's weird. To be honest I don't have any good ideas for debugging such issues. My guess would be that it's some kind of a hardware problem, but I don't really know.

It's definitely a hardware issue as well, whether it's at the nvidia driver level or a BIOS/hardware failure.
I'm closing the issue, as there's no action to be taken on the pytorch project side.

For future reference, the issue was due to the steep power ramp of the 1080 Ti triggering the server power supply's over-voltage protection. Only some pytorch examples caused it to show up.

@castleguarders Have you figured out how to solve this issue? It seems that even 1200W "platinum" power supply is not enough for just 2X 1080Ti, it reboots from time to time.

@castleguarders I am having similar issues; how did you find out that that was the problem?

@pmcrodrigues There was an audible click whenever the issue happened. I used nvidia-smi to soft-control the power draw; this allowed the tests to run a bit longer, but they tripped it anyway. I switched to an 825W Delta power supply and it took care of the issue fully. FurMark makes easy work of testing this if you run Windows. I ran it fully pegged for a couple of days, while driving the CPUs to 100% with a different script. It's been zero issues since then.

@yurymalkov I only have 1x 1080ti, didn't dare to put a second one.

@pmcrodrigues @castleguarders
I've also "solved" the problem by feeding the second GPU from a separate PSU (1000W+1200W for 2X 1080Ti). Reducing the power draw by 0.5X via nvidia-smi -pl also helped, but it killed the performance. Also tried different motherboards/GPUs but it didn't help.

@castleguarders @yurymalkov Thank you both. I have also tried reducing the power draw via nvidia-smi and it stopped crashing the system. But stress tests at full power draw simultaneously on my 2 Xeons (with http://people.seas.harvard.edu/~apw/stress/) and the 4 1080 Tis (with https://github.com/wilicc/gpu-burn) didn't make it crash. So for now I have only seen this problem with pytorch. Maybe I need other stress tests?

@pmcrodrigues gpu-burn seems to be a bad test for this, as it does not create steep power ramps.
I.e. a machine could pass gpu-burn with 4 GPUs, but fail with 2 GPUs and a pytorch script.

The problem reproduces on some other frameworks (e.g. tensorflow), but it seems that pytorch scripts are the best test, probably because of the highly synchronous nature.

I am having the same issue. Has anybody found any soft solution to this?
I have a 4 GPU system with one CPU and a 1500W power supply. Using 3 out of 4 or 4/4 causes the reboot.
@castleguarders @yurymalkov @pmcrodrigues How to reduce power draw via nvidia-smi?

@gurkirt For now, I am only using 2 GPUs with my 1500W PSU. If you want to test reducing the power draw you can use "nvidia-smi -pl X", where X is the new power limit. For my GTX 1080 Ti I used "nvidia-smi -pl 150", whereas the standard draw is 250W. I am waiting on a more potent PSU to test whether it solves the problem. Currently I have a measuring device to measure the power coming directly from the wall, but even when I am using 4 GPUs it does not pass 1000W. It could still be some weird peaks that are not being registered, but something is off. Either way, we probably need to go with dual 1500W PSUs.
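
If you want to script the power cap and watch the draw while training, a minimal sketch along these lines should work (assuming nvidia-smi is on PATH; the -pl call needs root, and the 150 W value is just the figure I used for my card, not a recommendation):

    import subprocess
    import time

    GPU = "0"  # index of the card to watch

    def set_power_limit(watts):
        # Same as running: sudo nvidia-smi -i 0 -pl 150
        subprocess.run(["nvidia-smi", "-i", GPU, "-pl", str(watts)], check=True)

    def read_power_draw():
        # Current board power draw in watts.
        out = subprocess.run(
            ["nvidia-smi", "-i", GPU,
             "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True)
        return float(out.stdout.strip())

    if __name__ == "__main__":
        set_power_limit(150)          # cap the draw before training starts
        while True:                   # leave this running in a second terminal
            print(f"power draw: {read_power_draw():.1f} W")
            time.sleep(0.5)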

@pmcrodrigues thanks a lot for the quick response. I have another system which has 2000W with 4 1080 Tis. That one works just fine. I will try plugging that power supply into this machine and see if 2000W is enough.

@pmcrodrigues did you find any log/warning/crash report anywhere?

@gurkirt None.

I’m having a similar problem- audible click, complete system shutdown.

It seems that it only occurs with BatchNorm layers in place. Does that match with your experience?

I was using resnet at that time. It is a problem of an inadequate power supply; it is a hardware problem. I needed to upgrade the power supply. According to my searches online, the power surge is a problem with pytorch. I upgraded the power supply from 1500W to 1600W. The problem still appears now and then, but only when the room temperature is a bit higher. I think there are two factors at play: room temperature, and the other, major one being the power supply.

I have the same problem with a 550W power supply and a GTX 1070 graphics card. I start the training and about a second later the power cuts.

But this got me thinking that perhaps it would be possible to trick/convince the PSU that everything is ok by creating a ramp-up function that e.g. mixes sleeps and GPU activity and gradually increases the load. Has anyone tried this? Does someone have minimal code that reliably triggers the power cut?
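
A rough, untested sketch of what that ramp-up could look like with pytorch (the matrix size, cycle length and step count are guesses and would need tuning per GPU):

    import time
    import torch

    def ramp_up(steps=30, size=4096, device="cuda"):
        # Start almost idle and end at continuous matmuls, so the power draw
        # climbs gradually instead of jumping from idle to full load.
        a = torch.randn(size, size, device=device)
        b = torch.randn(size, size, device=device)
        for step in range(1, steps + 1):
            duty = step / steps                     # fraction of the cycle spent busy
            for _ in range(max(1, int(20 * duty))):
                _ = a @ b                           # throwaway matmul to load the GPU
            torch.cuda.synchronize()
            time.sleep(0.2 * (1.0 - duty))          # shrink the idle gap each cycle

    if __name__ == "__main__":
        ramp_up()
        # ...then start the real training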

Had the same issue with a _GTX1070_, but the reboots were not random.
I had code that was able to make my PC reboot every time I ran it, after at most 1 epoch.
At first I thought it could be the PSU, since mine has only 500W. However, after closer investigation, and even after setting the max power consumption to lower values with nvidia-smi, I realized the issue was somewhere else.
It was not an overheating problem either, so I started to think it might be because of the _i7-7820X_ Turbo mode. After disabling Turbo mode in the BIOS settings of my _Asus X299-A_ and changing Ubuntu's configuration as stated here, the issue seems to be gone.

What did NOT work:

  • Changing pin_memory for dataloaders.
  • Playing with batch size.
  • Increasing system shared memory limits.
  • Setting nvidia-smi -pl 150 out of 195 possible for my system.

Not sure if this is related to native BIOS issues. I am running BIOS version 1203 while the latest is 3 releases ahead (1503), and they put

improved stability

into the description of each of those 3 Asus X299-A BIOS versions. One of those releases also had

Updated Intel CPU microcode.

So there is a chance this is fixed.

For the record, my problem was a broken power supply. I diagnosed this by running https://github.com/wilicc/gpu-burn on Linux and then FurMark on Windows, under the assumption that, unless I could reproduce the crash on Windows, they wouldn't talk to me at my computer shop. Both these tests failed for me, whereupon I took the computer in for repair and got a new power supply. Since then, I have been running pytorch for hours without any crashes.

Has anyone found a way to fix this? I have a similar error where my computer restarts shortly after I start training. I have a 750W PSU and only 1 GPU (1080 Ti), so I don't think it is a power problem. Also, I did not see increased wattage going to my GPU before it restarts.

If I can add some more information to vwvolodya's great comment: our motherboard/CPU configuration was an ASUS TUF X299 MARK 2 with an i9-7920X. The BIOS version was 1401. The only thing that could prevent the system from rebooting/shutting down was to turn off Turbo Mode.

For now, after updating to 1503, the problem seems to be solved with Turbo Mode activated.

Have a great day guys!

@yaynouche @vwvolodya Similar issues happened on an ASUS WS-X299 SAGE with an i9-9920X. Turning off Turbo Mode is the only solution right now, with the latest BIOS (version 0905, which officially supports the i9-9920X).

UPDATE: it turns out I must enable turbo mode in the BIOS and use commands like echo "1" > /sys/devices/system/cpu/intel_pstate/no_turbo as in https://github.com/pytorch/pytorch/issues/3022#issuecomment-419093454 to disable turbo via software. If I disable turbo mode in the BIOS, the machine will still reboot.

UPDATE 2: I think turning off Turbo Mode can only lower the chance of my issue, not eliminate it.
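
For reference, a small sketch of checking/toggling that intel_pstate knob from Python before launching training (writing the file needs root, and the path only exists when the intel_pstate driver is in use):

    from pathlib import Path

    NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

    def turbo_disabled():
        return NO_TURBO.read_text().strip() == "1"

    def disable_turbo():
        # Same effect as: echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
        NO_TURBO.write_text("1\n")

    if __name__ == "__main__":
        if not turbo_disabled():
            disable_turbo()           # needs root
        print("turbo disabled:", turbo_disabled())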


Facing the same problem: 4 GTX 1080 Tis with a 1600W PSU (with redundancy). Tried gpu-burn to test it and it's stable as a rock.

@Suley personally I think this is more of a CPU problem; basically, pytorch invokes the CPU to execute a series of instructions which draws too much power from the motherboard.

Thanks for your reply. I will test the CPUs to identify the problem

I ran a CPU stress test and a GPU stress test at the same time; no problem found.
My mobo supports a 150W TDP; my CPUs' TDP is 115W each.
So my max power consumption would be: 115W * 2 (CPU) + 250W * 4 (1080 Ti) + 200W (disk and other components) = 1430W.
It seems that 1600W is enough. Besides, there are two 1600W redundant power supplies which both output power, so each PSU only carries half of the load.

2 GPUs work OK.
3 GPUs: unstable, reboots after a few minutes.
4 GPUs: crashes immediately; the system reboots and no logs are recorded.

I also tried running stress tests for CPU and GPU simultaneously; no issue at all. Maybe it's due to the type of instructions... not sure.

Can you try disabling some CPU cores or underclocking them? In my case, this decreased the probability/frequency of reboots but did not fix the problem.

It's based on the fact that reducing CPU load can make programs more stable (at least on my machine) that I think this is a CPU issue.


Thanks. Currently there's a task running on the server. I will try it after the task is finished and share my test results.
But I still can't explain why stressing the GPU and CPU works, but pytorch doesn't. Hope someone can dig into this and come up with a solution.


Seems like you are right; it's a CPU-related bug. After I disabled all CPU cores except cpu0, it worked.
But only with one core working. Enabling half of the cores still crashed.

@Suley do you use X299 chipset? Seems that many builds with X299 have this problem.

1600W PSU with 4x 2080 Tis, facing the same problem. I attached a second 750W PSU with an ADD2PSU, and now I am running 1600W PSU = 3x 2080 Ti + system, and 750W PSU = 1x 2080 Ti, and everything seems stable. As commented by others, pytorch is the only application stressing the GPUs so much that they run into current protection. Miners, renderers, stress tests are all comfortable with one 1600W PSU. So this was a hardware issue, and from now on pytorch will be my GPU stress test :-) BTW: I have an X399 build.

Yes, pytorch causes a power surge at the time of network initialisation. A 1600W PSU is enough if your PSU is platinum grade and up; silver or gold grade PSUs are not robust enough to handle the sudden change in power requirement. Your PSU is able to supply enough, but it cannot handle the sudden change from ~250W usage to 1000+W required within seconds. Check the grade of your power supply. Also, turn off overclocking in the BIOS settings.

@gurkirt I had a "platinum grade" 1200W PSU which couldn't handle two 1080 Ti GPUs. It did work better than other PSUs that I had (1000W, different brands, not cheap), though.

I have corsair 1600W platinum with 4x1080Ti and it works fine.

My PSU is a platinum grade PSU, a Supermicro 7047GR barebone, and it has two 1600W supplies, 3200W combined.

Strange! I have two platinum grade PSUs (1600W). They can't handle 4 1080 Tis!

@Suley do you use X299 chipset? Seems that many builds with X299 have this problem.

No. I use X79, which is quite old. My X99 server works well.

I had the same issue with 4x 2080 Ti + Asus X299 Sage + a Rosewill Hercules 1600W PSU (or a Corsair 1500i); disabling CPU turbo did not help. After switching to a Corsair 1600i Titanium, it works perfectly.

@ZhengRui My machine also has 4x 2080 Ti + X299 Sage, but with a 2000W PSU; still failing... (maybe due to a CPU difference? Mine is a 12-core i9-9920X).

@zym1010 my cpu is 10core i9-9820

I had a similar case; after upgrading to a 1600i, it worked.

In my case my machine has a 1080 and a 550W PSU. Running my libtorch program in Rust once is fine. However, if I repeatedly kill and restart my program every 30 seconds, the system either reliably shuts down or the GPU goes offline. Eventually, the motherboard broke and cannot boot at all.

I think it is clear from the above discussion that mostly it is the fault of the PSU: the PSU not only has to have enough power output but should also be robust enough to withstand the power surge. My advice: if you have this problem, try changing to a better PSU and keep the machine in a cool and dry place.

It turned out that the main problem for me was not the PSU, but the lack of cables. Apparently connecting a GPU which has 2 PCIe power sockets to a single socket on the PSU draws too much power through that single PSU socket, and the overvoltage protection turns everything off.

Upgrading the PSU in my case seemed to worsen the problem, as the new PSU was not turning on at all. The reason was that the new (and better) PSU was doing cable checks before turning on, and they were failing.

Using either a cable with 2 heads on both sides or two distinct cables solved the issue for me.

I am not sure whether what I face is the same problem. My computer uses a 1080 Ti, and if the GPU memory usage is close to 100%, i.e. it uses almost all 11GB of memory, it will reboot. But if I reduce the batch size of the network to decrease the memory usage, the reboot problem does not happen, without upgrading the power supply. If someone meets the reboot problem, I hope my situation might help you.

I face the same problem with a 1080 Ti and a 450W PSU and tried to reduce power consumption by typing "sudo nvidia-smi -pl X" as a temporary solution. However, this did not work on the first try. After that, I noticed that if you limit the power consumption first and type "nvidia-smi -lms 50" in another terminal to check the power and memory usage of the GPU just before starting the training, then I can train the network without problems. I'm waiting for a new PSU right now for a permanent solution.

I too had this issue and was able to reproduce it with a Pytorch script without using any GPUs (only CPU). So I agree with @zym1010: for me it's a CPU issue. I updated my BIOS (ASUS WS X299 SAGE LGA 2066 Intel X299) and it seems to have stopped the issue from happening. However, considering the comments in this thread, I'm not entirely sure the issue is fixed...

@soumith Don't you think Pytorch contributors should look into this issue rather than just closing it? Pytorch seems to stress the GPU/CPU in a way GPU/CPU stress tests do not. This is not expected behaviour, and the problem affects many people. It seems like a rather interesting issue as well!

@Caselles are you referring to BIOS version 1001? I saw it some time ago on ASUS website but seems that it has been taken down somehow.

The BIOS I installed is this one: "WS X299 SAGE Formal BIOS 0905 Release".

In my experience, this issue comes with different Thermaltake PSUs. In the last case, changing the PSU from Thermaltake platinum 1500W to Corsair HX1200 solved the problem on a two-2080Ti setup.

I have this issue with both CPU and GPU, which means rebooting happens even when I physically uninstall the GPU and only train the network on the CPU without using a dataloader.

My power supply is an EVGA 850W Gold power supply, with CPU: i7-8700K, GPU: GTX 1080 Ti (just 1 piece).

And I have an ECO switch on my power supply; if I switch it to "on", it happens more often.

Just like what others said, the stress tests on both CPU and GPU pass.

So, a conclusion here:

  1. The reboot happens even when training only on the CPU, even after I removed the GPU physically.
  2. Turning on the ECO switch on the PSU results in more frequent reboots.
  3. i7-8700K + GTX 1080 Ti on an 850W power supply.
  4. It only appears while using Pytorch, even without a Dataloader.


My hardware details:

Motherboard: Asus WS X299 SAGE/10G 
CPU: Intel Core i9-9900X
GPU: Geforce RTX2080 TI - 11GB (4 of them)
Power supply: Masterwatt Maker - 1500Watts

Bios Version: 0905. Then updated to 1201.
Turbo enabled from bios and then set 1 in /sys/devices/system/cpu/intel_pstate/no_turbo
Tried other combinations.

Tested using https://github.com/wilicc/gpu-burn. All gpus are ok.

Whenever I am training maskrcnn_resnet50_fpn on the COCO dataset using 4 GPUs with batch size 4, the system reboots immediately. But when I am using 3 GPUs with batch size 4, or 4 GPUs with batch size 2, it trains.

What could be the reason? Power supply?
I am dying to solve this. I appreciate your comments.
Thanks in advance
Zulfi

I also have this issue using 4 x Geforce RTX 2080 Ti - 11GB and a 1600W EVGA SuperNOVA Platinum PSU (I also tried swapping the PSU with a 1600W EVGA SuperNOVA Gold PSU), and the issue still occurs when using PyTorch with the 4 GPUs.

From my experience, reboots occur often when nvidia-persistenced is not installed and running.
Link: https://docs.nvidia.com/deploy/driver-persistence/index.html

Updating the BIOS is also a crucial part of the solution. Hope it helps.

Best regards,

Yassine
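
If you just want a quick check from a script, a sketch like this should do it (the daemon from the link above is the preferred mechanism; "nvidia-smi -pm 1" is the legacy fallback and needs root, and the persistence_mode query field assumes a reasonably recent driver):

    import subprocess

    def persistence_enabled(gpu="0"):
        out = subprocess.run(
            ["nvidia-smi", "-i", gpu,
             "--query-gpu=persistence_mode", "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        return out.stdout.strip().lower() == "enabled"

    def enable_persistence(gpu="0"):
        # Legacy fallback; the nvidia-persistenced daemon is the preferred route.
        subprocess.run(["nvidia-smi", "-i", gpu, "-pm", "1"], check=True)  # needs root

    if __name__ == "__main__":
        if not persistence_enabled():
            enable_persistence()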

@gurkirt what are your other system specs?

I also have 4 x RTX 2080 Tis and a Corsair 1600i PSU, but my PC still shuts down after a while when using all 4 GPUs.

Hey, just FYI: I was experiencing this issue on multiple machines (all X299 with multiple 2080 Tis), and after trying 4 different PSUs, the Corsair AX1600i is the only one with which I did not encounter reboots.

I have the same issue.
Machine config - Lenovo y540, RTX 2060, Ubuntu 18.04. I tried training a simple binary image classification model (4 conv layers with batchnorm). The model trained for 20 epochs (batch size = 8) and then my laptop shut down.

Output of nvidia-smi:

| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8     3W /  N/A |     10MiB /  5934MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Following is the log from before the system crashed, I think. I found it with cat /var/log/kern.log.

Mar 10 17:05:01 maverick kernel: [    9.279289] audit: type=1400 audit(1583840101.525:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=837 comm="apparmor_parser"
Mar 10 17:05:01 maverick kernel: [    9.280042] audit: type=1400 audit(1583840101.529:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=828 comm="apparmor_parser"
Mar 10 17:05:01 maverick kernel: [    9.325087] intel_rapl_common: Found RAPL domain package
Mar 10 17:05:01 maverick kernel: [    9.325092] intel_rapl_common: Found RAPL domain core
Mar 10 17:05:01 maverick kernel: [    9.325096] intel_rapl_common: Found RAPL domain uncore
Mar 10 17:05:01 maverick kernel: [    9.325100] intel_rapl_common: Found RAPL domain dram
Mar 10 17:05:01 maverick kernel: [    9.355748] input: HDA Intel PCH Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
Mar 10 17:05:01 maverick kernel: [    9.355987] input: HDA Intel PCH Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
Mar 10 17:05:01 maverick kernel: [    9.356199] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15
Mar 10 17:05:01 maverick kernel: [    9.356895] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16
Mar 10 17:05:01 maverick kernel: [    9.357074] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17
Mar 10 17:05:01 maverick kernel: [    9.357296] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input18
Mar 10 17:05:01 maverick kernel: [    9.357497] input: HDA Intel PCH HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input19
Mar 10 17:05:01 maverick kernel: [    9.432866] dw-apb-uart.2: ttyS4 at MMIO 0x8f802000 (irq = 20, base_baud = 115200) is a 16550A
Mar 10 17:05:01 maverick kernel: [    9.434397] iwlwifi 0000:00:14.3 wlp0s20f3: renamed from wlan0
Mar 10 17:05:01 maverick kernel: [    9.445610] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  430.50  Thu Sep  5 22:39:50 CDT 2019
Mar 10 17:05:01 maverick kernel: [    9.575171] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Mar 10 17:05:01 maverick kernel: [    9.623512] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Mar 10 17:05:01 maverick kernel: [    9.623516] Bluetooth: BNEP filters: protocol multicast
Mar 10 17:05:01 maverick kernel: [    9.623525] Bluetooth: BNEP socket layer initialized
Mar 10 17:05:01 maverick kernel: [    9.664785] input: MSFT0001:01 06CB:CD5F Touchpad as /devices/pci0000:00/0000:00:15.1/i2c_designware.1/i2c-2/i2c-MSFT0001:01/0018:06CB:CD5F.0003/input/input24
Mar 10 17:05:01 maverick kernel: [    9.665154] hid-multitouch 0018:06CB:CD5F.0003: input,hidraw2: I2C HID v1.00 Mouse [MSFT0001:01 06CB:CD5F] on i2c-MSFT0001:01
Mar 10 17:05:01 maverick kernel: [    9.669632] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input20
Mar 10 17:05:01 maverick kernel: [    9.669880] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input21
Mar 10 17:05:01 maverick kernel: [    9.669932] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input22
Mar 10 17:05:02 maverick kernel: [    9.767641] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190703/nsarguments-66)
Mar 10 17:05:02 maverick kernel: [   10.035982] Generic Realtek PHY r8169-700:00: attached PHY driver [Generic Realtek PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
Mar 10 17:05:02 maverick kernel: [   10.149333] r8169 0000:07:00.0 enp7s0: Link is Down
Mar 10 17:05:02 maverick kernel: [   10.179246] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
Mar 10 17:05:02 maverick kernel: [   10.296096] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
Mar 10 17:05:02 maverick kernel: [   10.361833] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
Mar 10 17:05:02 maverick kernel: [   10.374304] iwlwifi 0000:00:14.3: BIOS contains WGDS but no WRDS
Mar 10 17:05:02 maverick kernel: [   10.378535] Bluetooth: hci0: Waiting for firmware download to complete
Mar 10 17:05:02 maverick kernel: [   10.379322] Bluetooth: hci0: Firmware loaded in 1598306 usecs
Mar 10 17:05:02 maverick kernel: [   10.379451] Bluetooth: hci0: Waiting for device to boot
Mar 10 17:05:02 maverick kernel: [   10.392359] Bluetooth: hci0: Device booted in 12671 usecs
Mar 10 17:05:02 maverick kernel: [   10.395240] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-17-16-1.ddc
Mar 10 17:05:02 maverick kernel: [   10.398388] Bluetooth: hci0: Applying Intel DDC parameters completed
Mar 10 17:05:03 maverick kernel: [   11.148057] nvidia-uvm: Unloaded the UVM driver in 8 mode
Mar 10 17:05:03 maverick kernel: [   11.171826] nvidia-modeset: Unloading
Mar 10 17:05:03 maverick kernel: [   11.219065] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
Mar 10 17:05:04 maverick kernel: [   12.125832] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Mar 10 17:05:04 maverick kernel: [   12.127484] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Mar 10 17:05:04 maverick kernel: [   12.175644] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  430.50  Thu Sep  5 22:36:31 CDT 2019
Mar 10 17:05:05 maverick kernel: [   13.205291] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  430.50  Thu Sep  5 22:39:50 CDT 2019
Mar 10 17:05:05 maverick kernel: [   13.250663] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Mar 10 17:05:06 maverick kernel: [   13.986003] wlp0s20f3: authenticate with 58:c1:7a:1b:bd:d0
Mar 10 17:05:06 maverick kernel: [   13.994385] wlp0s20f3: send auth to 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:05:06 maverick kernel: [   14.047103] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:05:06 maverick kernel: [   14.063692] wlp0s20f3: authenticated
Mar 10 17:05:06 maverick kernel: [   14.068040] wlp0s20f3: associate with 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:05:06 maverick kernel: [   14.097924] wlp0s20f3: RX AssocResp from 58:c1:7a:1b:bd:d0 (capab=0x431 status=0 aid=4)
Mar 10 17:05:06 maverick kernel: [   14.143288] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:05:06 maverick kernel: [   14.177499] wlp0s20f3: associated
Mar 10 17:05:06 maverick kernel: [   14.296025] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready
Mar 10 17:05:08 maverick kernel: [   16.376337] bpfilter: Loaded bpfilter_umh pid 1511
Mar 10 17:05:18 maverick kernel: [   26.325876] Bluetooth: RFCOMM TTY layer initialized
Mar 10 17:05:18 maverick kernel: [   26.325884] Bluetooth: RFCOMM socket layer initialized
Mar 10 17:05:18 maverick kernel: [   26.325892] Bluetooth: RFCOMM ver 1.11
Mar 10 17:05:19 maverick kernel: [   27.169380] rfkill: input handler disabled
Mar 10 17:08:10 maverick kernel: [  198.039283] ucsi_ccg 0-0008: failed to reset PPM!
Mar 10 17:08:10 maverick kernel: [  198.039292] ucsi_ccg 0-0008: PPM init failed (-110)
Mar 10 17:10:11 maverick kernel: [  319.690728] mce: CPU11: Core temperature above threshold, cpu clock throttled (total events = 75)
Mar 10 17:10:11 maverick kernel: [  319.690729] mce: CPU5: Core temperature above threshold, cpu clock throttled (total events = 75)
Mar 10 17:10:11 maverick kernel: [  319.690730] mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690730] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690772] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690773] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690774] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690775] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690776] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690777] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690778] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690779] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690780] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.690781] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [  319.691710] mce: CPU5: Core temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691713] mce: CPU11: Core temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691716] mce: CPU11: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691717] mce: CPU5: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691777] mce: CPU0: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691781] mce: CPU7: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691783] mce: CPU6: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691787] mce: CPU2: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691790] mce: CPU1: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691793] mce: CPU8: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691798] mce: CPU10: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691800] mce: CPU4: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691804] mce: CPU3: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [  319.691807] mce: CPU9: Package temperature/speed normal
Mar 10 17:13:35 maverick kernel: [  523.048575] wlp0s20f3: authenticate with 58:c1:7a:1b:bd:d0
Mar 10 17:13:35 maverick kernel: [  523.055288] wlp0s20f3: send auth to 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:13:35 maverick kernel: [  523.097819] wlp0s20f3: authenticated
Mar 10 17:13:35 maverick kernel: [  523.099819] wlp0s20f3: associate with 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:13:35 maverick kernel: [  523.107873] wlp0s20f3: RX AssocResp from 58:c1:7a:1b:bd:d0 (capab=0x431 status=0 aid=1)
Mar 10 17:13:35 maverick kernel: [  523.109523] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:13:35 maverick kernel: [  523.110798] wlp0s20f3: associated
Mar 10 17:13:35 maverick kernel: [  523.119975] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready

How can I stop this from happening again, i.e. run pytorch training without crashing my system?

@theairbend3r I'm not sure if you're having the same issue as the one here. As I understand it, when starting training with torch, the GPUs and CPU(s) ramp up so quickly that it can exceed normal power draw and trigger overload protection on the PSU. I was always experiencing this before the first epoch ended.

Sorry I don't have any more useful suggestions for you.

Several possible solutions (not sure if any one of them could fix the problem independently):

  • BIOS version: I followed the discussion above and updated my BIOS from 3501 to 4001 (Asus X99-E WS/USB3.1); problem solved.
  • Setting the Nvidia GPU fan: I changed the GPU fan speed to reduce the risk of high temperatures that could cause an emergency shutdown/reboot.
  • Lowering num_workers from 12 to 4 (the max number of cores on my server is 12), as in the sketch below.
  • Insufficient power from the power supply. My situation: changing HDD to SSD sped up the whole pipeline of my task, which adds too much pressure on the power supply.
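
A minimal sketch of the num_workers change above (the dataset path, transforms and batch size are placeholders, not recommendations):

    import torch
    from torchvision import datasets, transforms

    train_set = datasets.ImageFolder(
        "/data/imagenet/train",                      # placeholder path
        transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ]))

    train_loader = torch.utils.data.DataLoader(
        train_set,
        batch_size=64,
        shuffle=True,
        num_workers=4,       # down from 12, per the list above
        pin_memory=False,    # another knob mentioned earlier in the thread
    )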

It seems that even 1200W "platinum" power supply is not enough for just 2X 1080Ti, it reboots from time to time.

Faced this issue with 2x 2080 Ti on multiple PCs with platinum 1000W and 1200W PSUs. They worked fine when using only 1 GPU, but not 2. Solved by upgrading the PSU to 1600W.

Had the same issue with 2080 Ti on 750W G2 Gold PSU. Solved after changing the PSU to 1600W P2.

From my experience, reboots occur often when nvidia-persistenced is not installed and running.
Link: https://docs.nvidia.com/deploy/driver-persistence/index.html

It worked when I used nvidia-persistenced, but the computer still gets rebooted after a while.
