Keras: How to shuffle after each epoch using a custom generator?

Created on 19 Mar 2018  ·  9 comments  ·  Source: keras-team/keras

Hey,

I'm feeding my data using a custom generator:
model.fit_generator(train_generator,steps_per_epoch=steps_per_epoch, epochs=100, shuffle = False,
callbacks=[checkpointer,accCallBack,tbCallBack])

I initialize my custom generator like this:
train_generator = p.pairLoader(files, batch_size)  (files contains the paths to the images)

I'm wondering if I could manually shuffle the files list after each epoch via a callback (depending on how Keras handles generators internally, I guess)? Or is there something more convenient?

(The data is large and cannot be read into memory.)

All 9 comments

I will show you how I would do this.
This is a generic generator:

import math

from keras.utils import Sequence

class Generator(Sequence):
    # Class is a dataset wrapper for better training performance
    def __init__(self, x_set, y_set, batch_size=256):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(self.x.shape[0] / self.batch_size)

    def __getitem__(self, idx):
        # Slice the data directly by batch index
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

1) Create a new generator which gives indices to every file in your set.
2) Slice those indices by batch size instead of slicing the files directly.
3) Use the indices to slice the files.
4) Override the on_epoch_end method to shuffle the indices.

This is the result:

import math

import numpy as np
from keras.utils import Sequence

class Generator(Sequence):
    # Class is a dataset wrapper for better training performance
    def __init__(self, x_set, y_set, batch_size=256):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(self.x.shape[0] / self.batch_size)

    def __getitem__(self, idx):
        # Slice the index array by batch, then use it to gather the samples
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return batch_x, batch_y

    def on_epoch_end(self):
        # Reshuffle the indices so samples land in different batches next epoch
        np.random.shuffle(self.indices)

Note: np.random.shuffle shuffles the indices in place.
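
As a usage sketch (x_train, y_train, model, and the callbacks are just placeholders taken from the question, not part of the example above): because the class is a Sequence, fit_generator can take the number of steps per epoch from __len__, so steps_per_epoch does not need to be passed.

# Hypothetical usage; x_train / y_train stand in for your own arrays.
train_generator = Generator(x_train, y_train, batch_size=256)

model.fit_generator(train_generator,
                    epochs=100,
                    callbacks=[checkpointer, accCallBack, tbCallBack])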

@fculinovic If we consider Keras callbacks, there seem to be callbacks executing on_epoch_end at the same time as on_epoch_end is called on the sequences. This led to very odd behavior for us. There is a built-in "shuffle" parameter to fit_generator; why not just set that to True?

@jontejj
If you check the documentation of the fit_generator function (https://keras.io/models/sequential/) you can see the following:

shuffle: Boolean. Whether to shuffle the order of the batches at the beginning of each epoch

Shuffling the order of batches is different from shuffling the samples themselves: with the shuffle argument only the batch ordering changes, whereas shuffling the samples lets them move between batches.
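
A small standalone NumPy sketch of that difference (illustration only, not Keras internals):

import numpy as np

data = np.arange(8)                      # 8 samples, batch_size = 4
batches = [data[0:4], data[4:8]]

# shuffle=True in fit_generator: only the batch order is permuted,
# each batch keeps exactly the same samples
batch_order = np.random.permutation(len(batches))
epoch_with_batch_shuffle = [batches[i] for i in batch_order]

# Index shuffling inside the Sequence: the samples themselves are permuted,
# so a sample can land in a different batch every epoch
indices = np.random.permutation(len(data))
epoch_with_sample_shuffle = [data[indices[0:4]], data[indices[4:8]]]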

I have two different datasets and I want to keep track of them, i.e. I want to shuffle dataset 1 and dataset 2 separately and then concatenate them. Can anyone give me some clues about how to do that?
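
One possible way to do that (a sketch under my own assumptions, not something from the thread; the class name TwoSetGenerator is made up): keep a separate index array per dataset, shuffle each array on its own in on_epoch_end, and read the concatenated data through the concatenated indices.

import math

import numpy as np
from keras.utils import Sequence

class TwoSetGenerator(Sequence):
    # Hypothetical sketch: shuffle two datasets independently of each other
    def __init__(self, x1, y1, x2, y2, batch_size=256):
        self.x = np.concatenate([x1, x2])
        self.y = np.concatenate([y1, y2])
        self.batch_size = batch_size
        self.idx1 = np.arange(len(x1))                      # indices into dataset 1
        self.idx2 = np.arange(len(x1), len(x1) + len(x2))   # indices into dataset 2

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        order = np.concatenate([self.idx1, self.idx2])      # dataset 1 first, then dataset 2
        inds = order[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[inds], self.y[inds]

    def on_epoch_end(self):
        # Each dataset is shuffled within itself; samples never cross between the two
        np.random.shuffle(self.idx1)
        np.random.shuffle(self.idx2)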

I'm just wondering how this 'idx' gets updated through __getitem__; the source code seems to set a 'batch_step' parameter, but how it is passed to the generator is implicit.
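
Roughly speaking, the training loop treats a Sequence like an indexable list: it asks for len(sequence) batches per epoch, calls sequence[idx] (i.e. __getitem__) for each batch index, and calls on_epoch_end once the epoch is over. A simplified sketch of that flow (not the actual Keras source):

# Simplified sketch of how a Sequence is consumed; not the real Keras code.
def run_epochs(sequence, epochs):
    for epoch in range(epochs):
        for idx in range(len(sequence)):       # len() comes from __len__
            batch_x, batch_y = sequence[idx]   # this is the __getitem__(idx) call
            # ... one training step on (batch_x, batch_y) ...
        sequence.on_epoch_end()                # hook used above for shuffling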

I use the following script to test:

train_datagen = Generator(x_train, x_train, batch_size)
test_datagen = Generator(x_test, x_test, batch_size)

vae.fit_generator(train_datagen,
    steps_per_epoch=len(x_train)//batch_size,
    validation_data=test_datagen,
    validation_steps=len(x_test)//batch_size,
    epochs=epochs)

But I needed to change return math.ceil(self.x.shape[0] / self.batch_size) to return math.floor(self.x.shape[0] / self.batch_size) for it to run successfully.
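
A plausible explanation (my assumption, not confirmed in the thread): with math.ceil the last batch is smaller than batch_size whenever the dataset size is not a multiple of it, and models or loss setups that expect a fixed batch size fail on that partial batch; math.floor simply drops it. When passing a Sequence you can also leave out steps_per_epoch, since the number of steps is taken from __len__. A drop-remainder variant of __len__:

def __len__(self):
    # Drop the final partial batch so every batch has exactly batch_size samples
    return self.x.shape[0] // self.batch_size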

Where is the Sequence class, and how do I import it?

@nantha42 It's the keras.utils.Sequence class: https://keras.io/utils/#sequence
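
For example (with standalone Keras; under tf.keras the equivalent would be tensorflow.keras.utils.Sequence):

from keras.utils import Sequence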

The question about the shuffle argument of fit_generator has been answered, but how do we avoid or deal with the overriding of on_epoch_end when we want to use both this generator and callbacks at the same time?
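
For what it's worth, Sequence.on_epoch_end and Callback.on_epoch_end are separate hooks on separate objects, so overriding one does not replace the other; both can be used together. A minimal sketch (EpochLogger is a made-up callback, and model / train_generator are assumed from earlier in the thread):

from keras.callbacks import Callback

class EpochLogger(Callback):
    # Hypothetical callback running alongside the Sequence's own on_epoch_end
    def on_epoch_end(self, epoch, logs=None):
        print('finished epoch', epoch)

# The Sequence reshuffles itself via its own on_epoch_end;
# the callback is invoked separately through the callbacks list.
model.fit_generator(train_generator, epochs=100, callbacks=[EpochLogger()])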

