Keras: flaky test, 10-minute timeout on Travis: inputs = self.queue.get(block=True).get() hangs

Created on 23 Oct 2018 · 13 comments · Source: keras-team/keras

This issue is a discussion to track the 10-minute timeout bug on Travis.

On Travis, the test suite sometimes hangs and Travis reports that no output has been received for the last 10 minutes. By adding pytest-timeout to our test suite, we were able to get a traceback of the problem. It happens most of the time with TensorFlow, but occasionally with Theano and CNTK as well. Here is the traceback:

=================================== FAILURES ===================================
________________________________ test_warnings _________________________________
[gw0] linux -- Python 3.6.6 /home/travis/miniconda/envs/test-environment/bin/python
    @pytest.mark.skipif(sys.version_info < (3,),
                        reason='Cannot catch warnings in python 2')
    def test_warnings():
        a = Input(shape=(3,), name='input_a')
        b = Input(shape=(3,), name='input_b')

        a_2 = Dense(4, name='dense_1')(a)
        dp = Dropout(0.5, name='dropout')
        b_2 = dp(b)

        model = Model([a, b], [a_2, b_2])

        optimizer = 'rmsprop'
        loss = 'mse'
        loss_weights = [1., 0.5]
        model.compile(optimizer, loss, metrics=[], loss_weights=loss_weights,
                      sample_weight_mode=None)

        @threadsafe_generator
        def gen_data(batch_sz):
            while True:
                yield ([np.random.random((batch_sz, 3)),
                        np.random.random((batch_sz, 3))],
                       [np.random.random((batch_sz, 4)),
                        np.random.random((batch_sz, 3))])

        with pytest.warns(Warning) as w:
            out = model.fit_generator(gen_data(4),
                                      steps_per_epoch=10,
                                      use_multiprocessing=True,
>                                     workers=2)
tests/keras/engine/test_training.py:604: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
keras/legacy/interfaces.py:91: in wrapper
    return func(*args, **kwargs)
keras/engine/training.py:1418: in fit_generator
    initial_epoch=initial_epoch)
keras/engine/training_generator.py:179: in fit_generator
    generator_output = next(output_generator)
keras/utils/data_utils.py:685: in get
    inputs = self.queue.get(block=True).get()
../../../miniconda/envs/test-environment/lib/python3.6/multiprocessing/pool.py:638: in get
    self.wait(timeout)
../../../miniconda/envs/test-environment/lib/python3.6/multiprocessing/pool.py:635: in wait
    self._event.wait(timeout)
../../../miniconda/envs/test-environment/lib/python3.6/threading.py:551: in wait
    signaled = self._cond.wait(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <Condition(<unlocked _thread.lock object at 0x7fd35883ce18>, 0)>
timeout = None
    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
>               waiter.acquire()
E               Failed: Timeout >720.0s
../../../miniconda/envs/test-environment/lib/python3.6/threading.py:295: Failed
----------------------------- Captured stdout call -----------------------------
Epoch 1/1
----------------------------- Captured stderr call -----------------------------
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
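For reference, the timeout that produced this traceback comes from pytest-timeout. A minimal sketch of how such a limit can be attached to a test (the 720 s value matches the failure above; the test name and exact CI configuration here are illustrative only):

    import pytest

    # Illustration only: pytest-timeout aborts a hanging test and reports
    # where it was stuck, instead of letting Travis kill the whole job
    # after 10 silent minutes.  The limit can also be set globally with
    # the --timeout command-line option.
    @pytest.mark.timeout(720)  # matches the "Timeout >720.0s" above
    def test_generator_does_not_hang():
        pass  # the real tests live in tests/keras/engine/test_training.py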

To see the full logs, here are two builds which failed with this error:

https://travis-ci.org/keras-team/keras/jobs/444740251
https://travis-ci.org/keras-team/keras/jobs/441817913

This is likely linked to #11288 as well.

@Dref360 @fchollet

Labels: flaky test, bug, performance

All 13 comments

This is related to: https://bugs.python.org/issue22393

Basically, a worker dies (for whatever reason) and the pool is not notified.

Using 'spawn' instead of 'fork' would mitigate this issue, but it's only available on Python 3 and only for Sequence, not for generators.
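A minimal, hypothetical repro of that bug (plain multiprocessing, no Keras involved): a pool worker that dies abruptly never delivers its result, so a bare .get() blocks forever, which is exactly the hang in the traceback above.

    import multiprocessing as mp
    import os

    def crash(_):
        # Simulate a worker killed from the outside (e.g. by the OOM killer):
        # it exits without ever sending its result back to the pool.
        os._exit(1)

    if __name__ == '__main__':
        with mp.Pool(2) as pool:
            result = pool.apply_async(crash, (0,))
            # The pool replaces the dead worker but never marks this task
            # as failed, so this call blocks forever -- the same hang as
            # self.queue.get(block=True).get() in data_utils.py.
            result.get()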

If I understand correctly, it's likely that a worker dies because of an out-of-memory error? Maybe we could try to work around that while waiting for the fix in the Python standard library?

The issue is that beginning with TF 1.8, TF is somewhat forkable, meaning that a lot of objects are copied on fork. Using a spawn scheme instead of fork would not copy these objects, but spawn is only available on Python 3 and only for Sequence.

It's not trivial to fix this issue as of now.

OK, thank you for the explanation. I'll leave this issue open since the bug is still present; let's hope the fix makes it into Python soon. I made #11491, though I don't know whether it will be useful.

We could add a mechanism such that after 5 seconds we drop the promise if it's not done. We would lose a sample, but with proper warnings it should be easy to understand. Or maybe requeue them at the end of the epoch?

I'll try to propose something this weekend.
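A minimal sketch of that idea (a hypothetical helper, not the actual data_utils.py change): wait on the pool's AsyncResult with a bounded timeout instead of forever, warn, and let the caller skip or requeue the batch.

    import warnings
    import multiprocessing

    def get_batch_or_give_up(async_result, timeout=5.0):
        # async_result is the multiprocessing.pool.AsyncResult that
        # data_utils.py currently waits on with a bare .get().
        try:
            return async_result.get(timeout=timeout)
        except multiprocessing.TimeoutError:
            warnings.warn('A generator worker did not return a batch within '
                          '%.0f seconds; it may have died. Skipping this batch.'
                          % timeout)
            return None  # caller can skip the batch or requeue it for later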

One cool thing is that the pool is restarted at each epoch, so if we lose a worker in the middle of an epoch, we will use N-1 workers for the rest of that epoch and then jump back up to N workers afterwards.

> Or maybe requeue them at the end of the epoch?

We could do this, and we should also add a warning to avoid user confusion. After all, a process dying silently isn't supposed to be common.

Another question: if I can ensure that the users of my program are using Python 3 (and I'm using Sequence), could I simply use

   import multiprocessing as mp
   [...]
   mp.set_start_method('spawn')

to solve the problem?
https://docs.python.org/3/library/multiprocessing.html

Maybe?

In our case, Travis fails because when we fork the main process, TF is also forked, which causes a tremendous amount of memory to be allocated and leads to an OOM in the worker. The worker dies silently, so we wait forever.

Using spawn will only load the library if we need it.

Right now it's not trivial to retrieve the exception from the worker.
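For anyone wanting to try the workaround discussed above, here is a minimal, hypothetical sketch (class and variable names made up; whether it avoids the hang depends on the Keras and Python versions in use): force the 'spawn' start method before any worker pool is created and feed the model with a Sequence rather than a generator.

    import multiprocessing as mp
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.utils import Sequence

    class RandomBatches(Sequence):
        # Toy Sequence yielding random (x, y) batches; defined at module
        # level so spawned workers can unpickle it.
        def __init__(self, n_batches=10, batch_size=4):
            self.n_batches = n_batches
            self.batch_size = batch_size

        def __len__(self):
            return self.n_batches

        def __getitem__(self, idx):
            x = np.random.random((self.batch_size, 3))
            y = np.random.random((self.batch_size, 1))
            return x, y

    if __name__ == '__main__':
        # Ask for 'spawn' before any pool exists, so workers start from a
        # fresh interpreter instead of inheriting the parent's TF state
        # through fork (Python 3 only).
        mp.set_start_method('spawn')

        model = Sequential([Dense(1, input_shape=(3,))])
        model.compile(optimizer='rmsprop', loss='mse')
        model.fit_generator(RandomBatches(),
                            use_multiprocessing=True,
                            workers=2)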

I'm running the training right now! At least it did not freeze during the last 50 epochs; typically it happened after only a few epochs. However, I noticed that after each epoch I get multiple "Using TensorFlow backend" messages. Any idea why that happens? Otherwise it seems to work.

Thanks for your hint regarding 'spawn'. I will see whether it fixes my problems in the long run.

Yeah, you see "Using TensorFlow backend" because the Keras library is being loaded at every epoch for every worker.

Ah, because the workers are restarted after every epoch in spawn mode, right?

They are always restarted after each epoch, but when we use fork the library is already loaded.
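A small illustration of that difference (toy script, unrelated to the Keras internals; 'fork' is not available on Windows): under spawn each worker re-imports the main module, so a module-level import keras runs again in every worker and prints the backend banner, while under fork the already-loaded parent process is copied and the banner appears only once.

    import multiprocessing as mp
    import keras  # prints "Using TensorFlow backend." when the module loads

    def task(i):
        return i

    if __name__ == '__main__':
        for method in ('fork', 'spawn'):
            # fork: workers are copies of the parent, Keras is already loaded.
            # spawn: each worker re-imports this module, so the banner above
            # is printed once per worker (and since Keras restarts the pool
            # every epoch, once per worker per epoch during training).
            print('start method:', method)
            with mp.get_context(method).Pool(2) as pool:
                pool.map(task, range(4))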
