Keras: How to parallelize fit_generator? (PicklingError)

Created on 4 Oct 2016 · 15 comments · Source: keras-team/keras

I tried several ways but cannot get parallelization of sample/data generation to work. Below is a gist. Am I doing something wrong, or is there a bug?

https://gist.github.com/stmax82/283ef735c8e2601ef841de8b37243ee1

I suppose my fourth try should be the correct one, but when I set pickle_safe=True, I get the error:

PicklingError: Can't pickle <function generator_queue.<locals>.data_generator_task at 0x000000001B042EA0>: attribute lookup data_generator_task on keras.engine.training failed


All 15 comments

You can read this

@parag2489 I did, but it doesn't seem to solve my problem. Could you explain? Is this for an older version? In 1.1.0 you cannot set nb_worker=2 (as they did in #1638) without also setting pickle_safe=True, and when you set pickle_safe=True it uses multiprocess parallelization, which should make the thread-safe wrapper unnecessary. But setting pickle_safe=True results in another error, as in my fourth case (PicklingError).

Same problem here. I also tried the solution proposed in #1638, but it didn't solve my problem. I'm trying to read my dataset from an HDF5 file, if it matters.

Same problem here: my code works perfectly with threads, but fails with processes (pickle_safe=True). Any solution?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

TL;DR: Do NOT set pickle_safe=True. You're bound for trouble.

Extensive explanation:
I've been investigating the way workers are used in the *_generator family of functions (see issues #5071 and #6745).
So far, my conclusion is that the way pickle_safe=True is implemented is, at best, flawed beyond recovery and should be avoided completely.
Here's what I've gathered:
1) Generators are not picklable, and it seems they won't be any time soon.
2) If you're a Windows user, you're actually luckier, because the code won't run at all. Since Windows doesn't have fork(), a multiprocessing.Process is created (simplifying heavily) by spawning a whole new application process, pickling the data the new process needs, and sending it over a pipe together with other data required to simulate a fork(). (See this article for a more detailed and precise explanation of why that is necessary.)
3) If you're a Linux user, your problem is sneakier. The code will run just fine because, thanks to fork()'s magic, there is no need to pickle and unpickle the generator. However, an identical, independent clone of the original generator is created and used separately in each child process!
This means your data is enqueued once per worker: every worker that calls next() to obtain a new batch gets the same data from its own copy.

Take this example:

import multiprocessing
import time

if __name__ == '__main__':
    def my_generator():
        # Yields 1, 2, 3, ... on successive next() calls.
        i = 0
        while True:
            i += 1
            yield i

    gen = my_generator()
    queue = multiprocessing.Queue()

    def target_f():
        import os
        import sys
        import time
        # Redirect the child's output to per-pid files so it isn't lost.
        sys.stdout = open(str(os.getpid()) + ".out", "w")
        sys.stderr = open(str(os.getpid()) + ".err", "w")
        time.sleep(1)
        # Each forked child advances its *own* copy of gen.
        queue.put(next(gen))

    p1 = multiprocessing.Process(target=target_f)
    p2 = multiprocessing.Process(target=target_f)
    p1.start()
    time.sleep(0.3)
    p2.start()
    p1.join()
    p2.join()
    print(queue.qsize())  # how many items the workers enqueued in total
    print(queue.get())
    print(queue.get())

The output of this code under Linux is:

2
1
1

while under Windows it breaks with the error:

TypeError: can't pickle generator objects

Considering all this, I believe the pickle_safe argument is misleading, wrong, and potentially harmful, and IMO it should be removed altogether.
Until then, stick to pickle_safe=False in your code to avoid headaches.
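
For reference, the lock-based wrapper from #1638 is the usual way to keep pickle_safe=False while still using several workers: serialize next() calls so all worker threads can safely share one generator. A minimal sketch, where load_batch is a hypothetical placeholder for your own data loading:

import threading

class ThreadSafeIterator(object):
    """Serializes next() so several threads can share one generator."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only one thread at a time may advance the underlying generator.
        with self.lock:
            return next(self.it)

    next = __next__  # Python 2 name, matching the Keras 1.x era

def threadsafe_generator(f):
    """Decorator: wrap a generator function's result in ThreadSafeIterator."""
    def g(*args, **kwargs):
        return ThreadSafeIterator(f(*args, **kwargs))
    return g

@threadsafe_generator
def batches():
    while True:
        yield load_batch()  # hypothetical: load and return one (X, y) batch

Assuming a Keras version in which pickle_safe=False runs nb_worker threads, fit_generator(batches(), samples_per_epoch, nb_epoch, nb_worker=4, pickle_safe=False) then lets all workers draw from the same generator instead of from independent clones.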

Is it possible to clone a generator? If you just have multiple references to a generator, calling next() on one will also advance the state of the others, which means they aren't really separate copies of the original generator. If the generator supports concurrent use (i.e., it won't throw ValueError: generator already executing), multiple processes accessing the generator shouldn't be a problem.

Now, this may be different for an iterable object, which it is possible to clone. But not for a generator function, I don't think.

As far as I can tell, it's not possible to clone a generator (unless it's done implicitly by fork() when a new process is spawned from the process holding the original generator, and even then the two copies are entirely independent from each other).
What do you mean by "calling next() on one will also advance the state of its clones, which means they aren't really separate copies of the original generator"? Consider:

def gen():
    i = 0
    while True:
        yield i
        i += 1

g = gen()
h = g           # h is just another reference to the same generator
print(next(g))  # 0
print(next(h))  # 1

That is not "cloning" the generator, you're just making a second reference to it. When you move an object to a new process, however, you must create a real copy of it, not just a reference. The object is either copied when the process forks (on linux) or pickled, sent to the new process and unpickled (on windows).
The problem is that generators are not picklable, so they can't be sent to the second process.
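
The unpicklability is easy to check directly (the exact message differs between Python versions):

import pickle

def counter():
    i = 0
    while True:
        i += 1
        yield i

try:
    pickle.dumps(counter())
except TypeError as e:
    # Python 2: "can't pickle generator objects"
    # Python 3: "cannot pickle 'generator' object"
    print(e)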

I've been using generators with pickle_safe=True. It indeed creates an independent process for each worker. This is not a problem if you have a large dataset and each generator shuffles the sample indexes independently.

Just make sure not to initialize the random seed to a constant within the generator; otherwise all of the copies will have the same sample order, and your mini-batches will have K copies of each sample (if you have K workers). A sketch of per-worker seeding follows below.
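
Here is a sketch of that idea, with hypothetical names (shuffled_batches, sample_indexes): derive the seed from something process-specific inside the generator body, which only runs at the first next() call, i.e., inside the forked worker:

import os
import random

def shuffled_batches(sample_indexes, batch_size):
    # A generator body does not execute until the first next() call, which
    # under pickle_safe=True happens inside the worker process; os.getpid()
    # therefore differs per forked copy. A constant seed here would make
    # every worker yield identical batches.
    rng = random.Random(os.getpid())
    while True:
        order = list(sample_indexes)
        rng.shuffle(order)
        for i in range(0, len(order) - batch_size + 1, batch_size):
            yield order[i:i + batch_size]  # one mini-batch of sample indexes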


@jonilaserson Are you working on Windows or Linux? Also, this works in your case because the order in which the generator produces the data is not important. If, however, you use something like the generator created by ImageDataGenerator.flow_from_directory(), the order in which batches are generated matters, because it links each result to the filename it comes from (useful, for example, if you're trying to predict the class of unlabelled data); see the sketch below.
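
To make the ordering problem concrete, here is a sketch against the Keras 1.x-era API used in this thread (the model file and image directory are hypothetical). It is only correct because a single consumer advances the generator in directory order:

from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator

model = load_model('model.h5')  # hypothetical trained model

# shuffle=False keeps batches in directory order, so prediction row k
# corresponds to gen.filenames[k]. Independent generator clones in
# several processes would destroy this one-to-one mapping.
gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'data/unlabelled',   # hypothetical directory of images
    target_size=(224, 224),
    class_mode=None,     # images only, no labels
    shuffle=False)

preds = model.predict_generator(gen, val_samples=gen.nb_sample)
for filename, pred in zip(gen.filenames, preds):
    print(filename, pred)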

I think flow_from_directory is not one of Keras's more flexible constructs, and I recommend not using it.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

no solution yet

