I'm training semantic-segmentation networks, along the lines of #288.
The code works great on GeForce GTX products, but with new RTX hardware training gets stuck randomly. Looks like a race condition or similar, because the freeze happens only after multiple iterations, and the number of iterations before it freezes differs from run to run.
Using CUDA 10 and cuDNN 7.3 on 64-bit Windows (but same issue with CUDA 8 and cuDNN 5). Latest master from GitHub. MSVC debugger shows the code is waiting on this line.
Because steps to reproduce include acquisition of RTX hardware, I'd rather spend time trying to debug the issue than writing a complete set of steps – at least at this point.
That's weird. cudaStreamSynchronize should never hang forever. It might be a bug in CUDA. What happens if you comment out the call to cudaStreamSynchronize? Does it stop hanging?
I agree that this might be a bug in CUDA.
The process consumes 100% of one CPU core while it's in cudaStreamSynchronize.
If I change cudaStreamSynchronize to a loop that polls cudaStreamQuery until cudaSuccess is returned, it seems that the problem goes away. But need more testing to be sure about it.
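For anyone following along, the polling idea could look roughly like the sketch below. It is only an illustration, not the code that ended up in dlib: `stream_status` and `wait_by_polling` are hypothetical names, the query is passed in as a callable so the snippet compiles without the CUDA toolkit, and the short sleep and the poll limit are my own assumptions rather than anything described above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Stand-in for the outcomes of cudaStreamQuery that matter here:
// cudaSuccess -> ready, cudaErrorNotReady -> not_ready, anything else -> failed.
enum class stream_status { ready, not_ready, failed };

// Hypothetical helper (not dlib's API): call `query` repeatedly until it
// reports ready. Returns the number of polls that were needed. Throws if the
// query fails or if max_polls is exhausted (the limit is an assumption added
// so that a stuck stream surfaces as an error instead of a silent hang).
inline long wait_by_polling(
    const std::function<stream_status()>& query,
    long max_polls = 1000000,
    std::chrono::microseconds pause = std::chrono::microseconds(50))
{
    assert(max_polls > 0);
    for (long polls = 1; polls <= max_polls; ++polls)
    {
        switch (query())
        {
            case stream_status::ready:     return polls;
            case stream_status::failed:    throw std::runtime_error("stream query failed");
            case stream_status::not_ready: break;  // leave the switch, keep polling
        }
        // Sleep briefly between polls rather than spinning on one CPU core.
        std::this_thread::sleep_for(pause);
    }
    throw std::runtime_error("stream did not become ready within max_polls");
}

// With the CUDA runtime available, the callable could wrap the real call:
//
//   auto query = [stream] {
//       const cudaError_t err = cudaStreamQuery(stream);
//       if (err == cudaSuccess)       return stream_status::ready;
//       if (err == cudaErrorNotReady) return stream_status::not_ready;
//       return stream_status::failed;
//   };
//   wait_by_polling(query);
```

Passing the query in as a callable is just a convenience for trying the loop out with a fake; in real code the loop would call cudaStreamQuery directly.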
Interestingly, I haven't been able to reproduce this problem using the semantic segmentation example. But I guess this is normal if it's a race condition or similar – given our application and data, the bottlenecks are surely in somewhat different places than with the example program/data.
@reunanen I've started seeing the same thing, with the call to cudaStreamSynchronize(0) causing a hang. One thing that I've noticed is that it goes away when I switch from release mode to debug mode in VS2017 (v15.7.5). This has happened on two different computers, both running Win10. One machine has a GTX-1080Ti and the other has a Quadro M6000. Both are up to date on the drivers and CudaToolKit10.0 / cudnn v7.3.0.29.
Similarly, I've only seen this happen on one set of dnn architectures. Not sure if it is a size issue but the nets only have 43 layers, 8 of which are tag1/add_prev1 pairs.
Maybe it's just a Windows issue then, instead of something specific to that card. @reunanen, have you seen the issue on Linux? If not, maybe the PR should have an #ifdef that only applies it to Windows.
Yeah, it could well be a timing issue made visible by the new card (and/or a specific dnn architecture).
Haven't seen it on Linux, but haven't tried either.
Would be a little surprised though, because we've run identical code on single and dual GTX 1080 Ti rigs for countless hours already, without issues. Whereas on RTX 2080 Ti, the occurrence of the hang is a matter of minutes, not hours. But yeah, I know – if it's a race condition or such, then even this is not too unexpected.
@davemers0160 if you cherry-pick #1514, does your release-mode build still hang?
I won't have access to the two computers until Monday. I will try then.
I also just noticed that there is a new driver for the GTX series. For the systems that I'm using the drivers are both version 411.63. It might be worth checking to see if an update to the driver might fix the problem.
Also I haven't been brave enough to upgrade my Ubuntu 16.04 system from Cuda Toolkit 9.1 to 10.0.
I was able to incorporate #1514 on my Quadro machine and was able to complete a training session (~5hrs).
Before making the change on my GTX-1080 machine I upgraded the driver to 416.34, but this did not help, it still stalled within the first few minutes of training. I then made the 1514 change and was able to complete the training (~5hrs).
@davemers0160, did the error occur only on windows?
I've only run across the error on Windows 10. I haven't tested any Win7 or Win8.1 machines. I also haven't upgraded any Ubuntu machines to Cuda Toolkit v10.0.
Someone here recently came across this issue and wasn't sure what was causing it.
Some interesting finds regarding this hang:
Works with CUDA 9.1 (CuDNN 7.1) + Old nvidia driver
Didn't work with CUDA 9.2 (CuDNN 7.2) + Old or new nvidia driver. This is how we were experiencing this issue!
Works with CUDA 10 (CuDNN 7.4) + New nvidia driver. This is what we now use for training.
This is happening with a Titan V on Windows 10.
The hang was experienced after ~10 minutes of training. We noticed it then hits 100% CPU stuck in synchronisation. Also, it does not occur in Debug mode or if optimisations are turned off.
We were using CUDA 9.2 because it's used for the inferencing application where we need CUDA 9.2 and CUDA 10 requires drivers that are too new. We have no such issues in inferencing, only training.
Considering everyone here says the issue still occurs on CUDA 10, I'm not sure why CUDA 10 is working for us.
@xsacha Did you test with #1514?
I think this continues to sound like a timing issue – race condition or so. Whether your application code is Debug mode or has optimizations turned off shouldn't change the CUDA code that is executed, but it does play a role in how quickly different things are completed, in particular relative to each other (and same for using different dnn architectures, as in @davemers0160's case; or different GPU hardware, as in mine).
So to me it sounds like this is a latent bug that pretty much every version has, but most combinations of application code / dnn architecture / GPU hardware just don't make the bug surface.
Ok, just an update. I reported before that I had it working with CUDA 10. It definitely is not working with CUDA 10, it just happens less!
CUDA 9.1 is the only release where I have no issues (without #1514 applied).
I haven't tried #1514 yet. Will get it when the next release occurs.
I think you're right about the timing issue. However, I've never seen this issue in CUDA 9.1 (which is our most tested release). So I'm happy to continue using it until I get this patch.
Just wanted to give you all an update on this. I just ran into this problem 3 times out of about 57 different trials. Same architecture for each trial. The only difference now is that I'm seeing it on CUDA 9.0, for an IBM Power8 machine running RHEL7 with 4 P100's. This was using dlib-19.15. I'm going to try the patch, but with ~5% chance of the stall happening I might not catch it.
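To put a rough number on the "might not catch it" worry, here is a back-of-the-envelope sketch. It assumes every trial stalls independently with the same probability (3 out of 57 here), which may well not hold for a timing bug; `prob_no_stall` and `trials_needed` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Probability of seeing no stall at all in n independent trials, when each
// trial stalls with probability p (assumption: constant, independent p).
inline double prob_no_stall(double p, int n)
{
    return std::pow(1.0 - p, n);
}

// Smallest number of clean trials needed before one can say, with the given
// confidence, that a bug stalling at rate p would have shown up by now.
inline int trials_needed(double p, double confidence)
{
    assert(p > 0.0 && p < 1.0 && confidence > 0.0 && confidence < 1.0);
    return static_cast<int>(
        std::ceil(std::log(1.0 - confidence) / std::log(1.0 - p)));
}
```

Under those assumptions, trials_needed(3.0 / 57.0, 0.95) comes out at 56, so a handful of clean runs after patching says very little; it would take on the order of the original 57 trials again to be reasonably sure.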
Ok. So after several experiments I am still seeing the training stall every once in a while, even after making the changes in #1514. It seems that if I don't load the GPUs down as much as possible then the stall happens more often. I still have now seen the stall occur with the changes implemented on a Win 10/Cuda 10 machine.
Has anyone worked out if this only happens on WDDM2 driver or it happens on TCC too?
@davemers0160 : So is it that cudaStreamQuery never returns cudaSuccess, or simply doesn't return at all?
Pre-#1514, cudaStreamSynchronize didn't return at all (at least in the cases I saw).
@reunanen : On the Windows side it was always cudaStreamSynchronize(0) that would never return. However, on the RHEL 7 side I can only guess that the hang is occurring here: cudaError_t err = cudaStreamQuery(stream); because I get no error message returned. This system is an HPC cluster, so I don't have much access or ability to debug; it's pretty much submit the job and then peek in every once in a while to make sure that it is still running.
@davemers0160 So after #1514 it's still occurring on Windows as well? And still freezing specifically in cudaStreamSynchronize(0)? If so, then I guess the remaining calls need to be replaced also. Can do.
However, do note that (the merged version of) #1514 affects Windows only. So if you are seeing a similar freeze on RHEL also, then maybe the #ifdefs need to be removed after all.
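As a rough illustration of what "affects Windows only" means in practice, a compile-time guard along these lines would select the workaround on Windows builds and leave other platforms on the blocking call. The names `sync_strategy`, `built_for_windows` and `chosen_sync_strategy` are hypothetical and are not the identifiers used in #1514; `_WIN32` is the macro that compilers targeting Windows predefine.

```cpp
#include <cassert>

// Hypothetical names for illustration only.
enum class sync_strategy { blocking_call, polling_loop };

// _WIN32 is predefined by MSVC and other compilers when targeting Windows.
#ifdef _WIN32
constexpr bool built_for_windows = true;
#else
constexpr bool built_for_windows = false;
#endif

// Chosen at compile time: poll cudaStreamQuery on Windows, keep the plain
// cudaStreamSynchronize call everywhere else.
constexpr sync_strategy chosen_sync_strategy()
{
    return built_for_windows ? sync_strategy::polling_loop
                             : sync_strategy::blocking_call;
}
```

Dropping the guard, so that the polling path is used on every platform, then amounts to returning sync_strategy::polling_loop unconditionally.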
@reunanen : So far I have not seen the freeze happen on Windows after applying #1514.
@davemers0160 Ok, thanks for the update. Could you please try #1596 on your RHEL setup? I simply made the #1514 fix apply also when not on Windows.