Keras: How to shuffle after each epoch using a custom generator?

Created on 19 Mar 2018  ·  9 comments  ·  Source: keras-team/keras

Hey,

I'm feeding my data using a custom generator:
model.fit_generator(train_generator,steps_per_epoch=steps_per_epoch, epochs=100, shuffle = False,
callbacks=[checkpointer,accCallBack,tbCallBack])

I initialize my custom generator like this:
train_generator = p.pairLoader(files, batch_size)  (files contains the paths to the images)

I'm wondering if I could manually shuffle the files list after each epoch via a callback (depending on how Keras handles generators internally, I guess)? Or is there something more convenient?

(The data is large and cannot be read into memory.)

All 9 comments

I will show you how I would do this.
This is a generic generator:

import math

from keras.utils import Sequence

class Generator(Sequence):
    # Class is a dataset wrapper for better training performance
    def __init__(self, x_set, y_set, batch_size=256):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(self.x.shape[0] / self.batch_size)

    def __getitem__(self, idx):
        # Slice the data directly by batch index
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

1) Create a new generator which gives indices to every file in your set.
2) Slice those indices by batch size instead of slicing the files directly.
3) Use the indices to slice the files.
4) Override the on_epoch_end method to shuffle the indices.

This is the result:

import math

import numpy as np
from keras.utils import Sequence

class Generator(Sequence):
    # Class is a dataset wrapper for better training performance
    def __init__(self, x_set, y_set, batch_size=256):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(self.x.shape[0] / self.batch_size)

    def __getitem__(self, idx):
        # Slice the index array by batch, then use it to gather the samples
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return batch_x, batch_y

    def on_epoch_end(self):
        # Reshuffle the indices so samples land in different batches next epoch
        np.random.shuffle(self.indices)

Note: np.random.shuffle shuffles the indices in place.
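
As a usage sketch (x_train, y_train, model, and the callbacks are just placeholders taken from the question, not part of the example above): because the class is a Sequence, fit_generator can take the number of steps per epoch from __len__, so steps_per_epoch does not need to be passed.

# Hypothetical usage; x_train / y_train stand in for your own arrays.
train_generator = Generator(x_train, y_train, batch_size=256)

model.fit_generator(train_generator,
                    epochs=100,
                    callbacks=[checkpointer, accCallBack, tbCallBack])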

@fculinovic If we consider Keras callbacks, there seem to be callbacks executing on_epoch_end at the same time as on_epoch_end is called on the sequences. This led to very odd behavior for us. There is a built-in "shuffle" parameter to fit_generator; why not just set that to True?

@jontejj
If you check the documentation of the fit_generator function (https://keras.io/models/sequential/) you can see the following:

shuffle: Boolean. Whether to shuffle the order of the batches at the beginning of each epoch

Shuffling the order of batches is different from shuffling the samples themselves: with the shuffle argument only the batch ordering changes, whereas shuffling the samples lets them move between batches.
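
A small standalone NumPy sketch of that difference (illustration only, not Keras internals):

import numpy as np

data = np.arange(8)                      # 8 samples, batch_size = 4
batches = [data[0:4], data[4:8]]

# shuffle=True in fit_generator: only the batch order is permuted,
# each batch keeps exactly the same samples
batch_order = np.random.permutation(len(batches))
epoch_with_batch_shuffle = [batches[i] for i in batch_order]

# Index shuffling inside the Sequence: the samples themselves are permuted,
# so a sample can land in a different batch every epoch
indices = np.random.permutation(len(data))
epoch_with_sample_shuffle = [data[indices[0:4]], data[indices[4:8]]]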

I have two different datasets and I want to keep track of them, i.e. I want to shuffle dataset 1 and dataset 2 separately and then concatenate them. Can anyone give me some clues about how to do that?
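
One possible way to do that (a sketch under my own assumptions, not something from the thread; the class name TwoSetGenerator is made up): keep a separate index array per dataset, shuffle each array on its own in on_epoch_end, and read the concatenated data through the concatenated indices.

import math

import numpy as np
from keras.utils import Sequence

class TwoSetGenerator(Sequence):
    # Hypothetical sketch: shuffle two datasets independently of each other
    def __init__(self, x1, y1, x2, y2, batch_size=256):
        self.x = np.concatenate([x1, x2])
        self.y = np.concatenate([y1, y2])
        self.batch_size = batch_size
        self.idx1 = np.arange(len(x1))                      # indices into dataset 1
        self.idx2 = np.arange(len(x1), len(x1) + len(x2))   # indices into dataset 2

    def __len__(self):
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        order = np.concatenate([self.idx1, self.idx2])      # dataset 1 first, then dataset 2
        inds = order[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[inds], self.y[inds]

    def on_epoch_end(self):
        # Each dataset is shuffled within itself; samples never cross between the two
        np.random.shuffle(self.idx1)
        np.random.shuffle(self.idx2)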

I'm just wondering how this 'idx' gets updated through __getitem__; the source code seems to set a 'batch_step' parameter, but how it is passed to the generator is implicit.
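
Roughly speaking, the training loop treats a Sequence like an indexable list: it asks for len(sequence) batches per epoch, calls sequence[idx] (i.e. __getitem__) for each batch index, and calls on_epoch_end once the epoch is over. A simplified sketch of that flow (not the actual Keras source):

# Simplified sketch of how a Sequence is consumed; not the real Keras code.
def run_epochs(sequence, epochs):
    for epoch in range(epochs):
        for idx in range(len(sequence)):       # len() comes from __len__
            batch_x, batch_y = sequence[idx]   # this is the __getitem__(idx) call
            # ... one training step on (batch_x, batch_y) ...
        sequence.on_epoch_end()                # hook used above for shuffling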

I use the following script to test:

train_datagen = Generator(x_train, x_train, batch_size)
test_datagen = Generator(x_test, x_test, batch_size)

vae.fit_generator(train_datagen,
    steps_per_epoch=len(x_train)//batch_size,
    validation_data=test_datagen,
    validation_steps=len(x_test)//batch_size,
    epochs=epochs)

But I needed to change return math.ceil(self.x.shape[0] / self.batch_size) to return math.floor(self.x.shape[0] / self.batch_size) for it to run successfully.
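
A plausible explanation (my assumption, not confirmed in the thread): with math.ceil the last batch is smaller than batch_size whenever the dataset size is not a multiple of it, and models or loss setups that expect a fixed batch size fail on that partial batch; math.floor simply drops it. When passing a Sequence you can also leave out steps_per_epoch, since the number of steps is taken from __len__. A drop-remainder variant of __len__:

def __len__(self):
    # Drop the final partial batch so every batch has exactly batch_size samples
    return self.x.shape[0] // self.batch_size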

Where is the Sequence class, and how do I import it?

@nantha42 It's the keras.utils.Sequence class: https://keras.io/utils/#sequence
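
For example (with standalone Keras; under tf.keras the equivalent would be tensorflow.keras.utils.Sequence):

from keras.utils import Sequence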

The question about the shuffle argument of fit_generator has been answered, but how do we avoid or deal with the overriding of on_epoch_end when we want to use both this generator and callbacks at the same time?
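
For what it's worth, Sequence.on_epoch_end and Callback.on_epoch_end are separate hooks on separate objects, so overriding one does not replace the other; both can be used together. A minimal sketch (EpochLogger is a made-up callback, and model / train_generator are assumed from earlier in the thread):

from keras.callbacks import Callback

class EpochLogger(Callback):
    # Hypothetical callback running alongside the Sequence's own on_epoch_end
    def on_epoch_end(self, epoch, logs=None):
        print('finished epoch', epoch)

# The Sequence reshuffles itself via its own on_epoch_end;
# the callback is invoked separately through the callbacks list.
model.fit_generator(train_generator, epochs=100, callbacks=[EpochLogger()])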

