Pytorch-cyclegan-and-pix2pix: Training stops randomly around epoch 77

Created on 15 Apr 2019 · 5 Comments · Source: junyanz/pytorch-CycleGAN-and-pix2pix

Hi there,

I have been using both CycleGAN and Pix2Pix for my thesis, and they have been very helpful. I have already trained quite a few models with both.
Sometimes, however, the training process stops without giving any error or warning.
(My terminal just states "end of epoch 77 ...", and then nothing happens anymore.)
If I check the state of my graphics card (using nvidia-smi), I can see that the GPU memory is still allocated by python, but the GPU usage is 0% (normally during training it sits between about 95% and 100%).
If I then stop the script and restart training with the --continue_train option, it works and finishes after 200 epochs.

For training I only used the train.py script as provided with all the default settings.
trainA and trainB contained exactly 1000 images. (pix2pix training also happened on 1000 pairs)

I have had this issue on my machine:

Nvidia GeForce GTX 1060 6GB
Ubuntu 18.04 LTS (Linux 4.15.0-47-generic)
visdom 0.1.8.5 (pip)
torch 0.4.1 (pip)
torchvision 0.2.1 (pip)
dominate 2.3.4 (pip)
Python 3.6.7

I have also had this exact issue on our university's Nvidia DGX-1 server.
This was inside a docker container based on Nvidia's own docker image; I just pip-installed dominate and visdom.
On the server the issue was very annoying: when I checked in today, it turned out the server had not done any training past epoch 77 of my first experiment (I had 4 more planned) for 3 days.

I am very sorry for the vague description. I really have no clue what causes this issue, and I have not been able to reproduce it intentionally. The issue seems to come up randomly. It has now happened at least 3 times (twice on my machine and once on the DGX), in both CycleGAN and Pix2Pix trainings, always around epoch 7x. I had hoped it had something to do with my machine and its chaotic configuration, but apparently it also happens on state-of-the-art machines inside a docker container.

Any thoughts? Comments? I would be very happy to provide more information about my setup/situation if this post is not clear enough.
I am also going to try to run this through PyCharm's debugger and see if I can get any wiser about it myself.

Thanks in advance,

Josse

JosseVanDelm

👍2

All 5 comments

Not sure what the reason is. Maybe @SsnL @taesungp have a clue.

junyanz on 15 Apr 2019

Hi there, after running the train.py script through the debugger (and being lucky enough that it got stuck again), I noticed that the program does not get past this line in the train.py script.

It gets stuck in these lines of pytorch code (comments added by myself):

    while True:  # This loop takes forever
        try:
            r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL) # r: <class 'tuple'>: (709,[538])
        except queue.Empty:
            if watchdog.is_alive(): # and for some reason watchdog is always alive
                continue                  # so this loop keeps going forever :(
            else:
                break
        if r is None:
            break
        idx, batch_indices = r
        try:
            samples = collate_fn([dataset[i] for i in batch_indices])
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
        else:
            data_queue.put((idx, samples))
            del samples

This is the stacktrace I get:

_worker_loop, dataloader.py:97
run, process.py:93
_bootstrap, process.py:258
_launch, popen_fork.py:73
__init__, popen_fork.py:19
_Popen, context.py:277
_Popen, context.py:223
start, process.py:105
__init__, dataloader.py:289
__iter__, dataloader.py:501
__iter__, __init__.py:90
<module>, train.py:43

followed by the debugger that calls the train script.

execfile, _pydev_execfile.py:18
run, pydevd.py:1135
main, pydevd.py:1735
<module>, pydevd.py:1741

I still have no clue as to what makes this happen.
Any thoughts? Is it possible that this has something to do with the fact that I did not explicitly start the visdom server myself or something?
I'll try to keep looking whilst debugging, but this "low-level" code is way out of my comfort zone, so the help of anyone who knows more about this kind of issue is very much appreciated. Thanks!

JosseVanDelm on 17 Apr 2019

I have the same issue, always stopping at 15 epochs.
Same here: I was not starting visdom manually.
So I disabled visdom and it is now working.
I use TensorBoard instead.

olivier-gillet on 23 Apr 2019

👍3

@JosseVanDelm The pytorch code you linked is running in the worker process, and it is supposed to be an infinite loop until the main process sends a signal or dies. The hang could very well be in the main process.
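To make the point about the worker loop concrete, here is a minimal sketch of the same pattern. It is not the PyTorch implementation: it uses a thread instead of a process so that it stays self-contained, and the names `worker_loop`, `run_demo`, `parent_alive` and `POLL_INTERVAL` are made up for illustration. It only shows that an idle worker polling an empty queue is the expected state, and that the loop ends when the main side sends `None`.

```python
import queue
import threading

POLL_INTERVAL = 0.1  # seconds; stands in for MANAGER_STATUS_CHECK_INTERVAL


def worker_loop(dataset, index_queue, data_queue, parent_alive):
    """Serve batches until a None sentinel arrives or the parent is gone."""
    while True:
        try:
            r = index_queue.get(timeout=POLL_INTERVAL)
        except queue.Empty:
            # Nothing to do right now. This is the normal idle state,
            # so keep waiting for as long as the parent is alive.
            if parent_alive.is_set():
                continue
            break
        if r is None:  # shutdown signal from the main side
            break
        idx, batch_indices = r
        data_queue.put((idx, [dataset[i] for i in batch_indices]))


def run_demo():
    dataset = [x * x for x in range(10)]
    index_queue, data_queue = queue.Queue(), queue.Queue()
    parent_alive = threading.Event()
    parent_alive.set()
    worker = threading.Thread(
        target=worker_loop,
        args=(dataset, index_queue, data_queue, parent_alive),
    )
    worker.start()

    # Ask for one batch and wait for the result.
    index_queue.put((0, [1, 2, 3]))
    result = data_queue.get(timeout=5)

    # The worker is now idle and polling; it only stops when told to.
    still_running = worker.is_alive()
    index_queue.put(None)
    worker.join(timeout=5)
    return result, still_running, worker.is_alive()


if __name__ == "__main__":
    print(run_demo())
```

If the main side stops putting indices on the queue (for example because it is blocked somewhere else), a debugger attached to the worker will show exactly this polling loop, even though the worker is not the cause.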

That said, between 0.4.1 and 1.0.0, a lot of improvements were made to the data loader. If the hang is indeed related to the dataloader, upgrading may resolve it.
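One way to check whether the main process is the one that is stuck, without waiting for a debugger session to get lucky, is Python's standard `faulthandler` module. The sketch below is only a suggestion: the helper names `enable_hang_dumps` and `disable_hang_dumps` and the 10-minute interval are assumptions, and where to call them in train.py is up to you.

```python
import faulthandler
import sys


def enable_hang_dumps(interval_seconds=600, stream=None):
    """Dump the stack of every thread in this process every interval_seconds.

    `stream` must be a real file with a file descriptor (stderr by default).
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=stream)


def disable_hang_dumps():
    """Stop the periodic dumps, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dumps()` once near the start of the training script would mean that, when the run stalls, the log shows which line the main process is sitting on each time the interval elapses. The dumps cover only the process that called it, not the dataloader workers.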

SsnL on 1 May 2019

👍1

Thank you for your comments, @SsnL and @olivier-gillet.
This weekend I did 26 consecutive trainings without a hitch on the DGX server.
This time I used a slightly newer version of the docker container and passed the display_id=0 option to every training to disable visdom.
I still have no clue what causes the issue. Maybe the reason for the hang is indeed that I didn't start visdom manually beforehand, as @olivier-gillet pointed out as well?
I suspect this rather than the PyTorch version, because the last time I trained on the DGX I used this container, which uses PyTorch commit 81e025d (which is past version 1.0.0, if I am correct), and I had the same issue there.
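Since the visdom theory is not confirmed, a cheap precaution is to check that something is listening on the visdom port before launching a long run. The sketch below is an assumption-laden helper, not part of the repository: `visdom_reachable` is a made-up name, it assumes visdom's default port 8097 on localhost, and it only tests that a TCP connection can be opened, not that the listener is actually visdom.

```python
import socket


def visdom_reachable(host="localhost", port=8097, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


if __name__ == "__main__":
    if visdom_reachable():
        print("A server is listening on the visdom port.")
    else:
        print("No visdom server found; consider training with display_id=0.")
```

If the check fails, either start the visdom server first or disable the display with display_id=0 as in the runs above.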