Hey,
I'm feeding my data using a custom generator:
model.fit_generator(train_generator, steps_per_epoch=steps_per_epoch, epochs=100, shuffle=False,
                    callbacks=[checkpointer, accCallBack, tbCallBack])
I init my custom generator like this:
train_generator = p.pairLoader(files, batch_size)
(files contains the paths to the images)
I'm wondering if I could manually shuffle the files list in an end-of-epoch callback (depending on how Keras works with generators internally, I guess)? Or is there something more convenient?
(The data is big and can't be read into memory.)
I will show you how I would do this.
This is a generic generator:
import math

from keras.utils import Sequence


class Generator(Sequence):
    # A Sequence wrapper around the dataset for better training performance
    def __init__(self, x_set, y_set, batch_size=256):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return math.ceil(self.x.shape[0] / self.batch_size)

    def __getitem__(self, idx):
        # Slice the data directly by batch index
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y
1) Create a new generator which gives indices to every file in your set.
2) Slice those indices by batch size instead of slicing the files directly.
3) Use the indices to slice the files.
4) Override the on_epoch_end method to shuffle the indices.
This is the result:
import math

import numpy as np
from keras.utils import Sequence


class Generator(Sequence):
    # A Sequence wrapper around the dataset for better training performance
    def __init__(self, x_set, y_set, batch_size=256):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.indices = np.arange(self.x.shape[0])

    def __len__(self):
        return math.ceil(self.x.shape[0] / self.batch_size)

    def __getitem__(self, idx):
        # Slice the (possibly shuffled) indices, then gather the batch with them
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = self.x[inds]
        batch_y = self.y[inds]
        return batch_x, batch_y

    def on_epoch_end(self):
        # Reshuffle the indices after every epoch
        np.random.shuffle(self.indices)
Note: np.random.shuffle shuffles the indices in place.
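Since the original question is about image paths that don't fit into memory, here is a minimal sketch of the same idea applied to a list of file paths. FileGenerator and load_image are hypothetical names, and the loading code is only a placeholder for whatever pairLoader already does:

import math

import numpy as np
from keras.preprocessing import image
from keras.utils import Sequence


def load_image(path, target_size=(224, 224)):
    # Hypothetical helper: replace with however pairLoader reads its images
    img = image.load_img(path, target_size=target_size)
    return image.img_to_array(img) / 255.0


class FileGenerator(Sequence):
    # Same index-shuffling idea, but the data stays on disk and is only
    # loaded batch by batch inside __getitem__
    def __init__(self, paths, labels, batch_size=32):
        self.paths, self.labels = paths, labels
        self.batch_size = batch_size
        self.indices = np.arange(len(self.paths))

    def __len__(self):
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = np.array([load_image(self.paths[i]) for i in inds])
        batch_y = np.array([self.labels[i] for i in inds])
        return batch_x, batch_y

    def on_epoch_end(self):
        np.random.shuffle(self.indices)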
@fculinovic if we consider Keras callbacks, there seem to be Keras callbacks executing on_epoch_end at the same time as on_epoch_end is called on the sequences. This led to very odd behavior for us. There is a built-in "shuffle" parameter to fit_generator, so why not set that to True?
@jontejj
If you check the documentation of the fit_generator function (https://keras.io/models/sequential/), you can see the following:
shuffle: Boolean. Whether to shuffle the order of the batches at the beginning of each epoch
Shuffling the order of the batches is different from shuffling the samples themselves: when the samples are shuffled, a sample can end up in a different batch each epoch; when only the batch order is shuffled, each batch keeps the same samples and only the order in which the batches are seen changes.
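A toy sketch of the difference (plain numpy, not Keras internals; the orderings in the comments are just illustrative):

import numpy as np

indices = np.arange(8)
batch_size = 2
batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

# shuffle=True in fit_generator: only the batch order changes, each batch
# keeps the same samples, e.g. [2 3], [0 1], [6 7], [4 5]
np.random.shuffle(batches)

# on_epoch_end index shuffling: samples get mixed across batches,
# e.g. [5 0], [3 6], [1 7], [2 4]
np.random.shuffle(indices)
mixed = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]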
I have two different datasets and I want to keep track of them, i.e. I want to shuffle dataset 1 and dataset 2 separately and then concatenate them. Can any of you give me some clues about how to do that?
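One possible approach, sketched as a plain numpy helper (shuffled_concat is a hypothetical name, not part of Keras): give each dataset its own permutation and then concatenate, e.g. inside on_epoch_end or before rebuilding the generator each epoch.

import numpy as np


def shuffled_concat(x1, y1, x2, y2):
    # Shuffle each dataset with its own permutation so samples never cross
    # from one dataset to the other, then concatenate them.
    p1 = np.random.permutation(len(x1))
    p2 = np.random.permutation(len(x2))
    x = np.concatenate([x1[p1], x2[p2]])
    y = np.concatenate([y1[p1], y2[p2]])
    return x, y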
I'm just wondering how this 'idx' is updated through __getitem__; the source code seems to set a 'batch_step' parameter, but how it is passed to the generator is implicit.
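As far as I understand, Keras treats a Sequence as an indexable collection: during an epoch it calls __getitem__ with batch indices from 0 up to len(generator) - 1, then calls on_epoch_end. You can mimic this yourself (assuming x_train/y_train as elsewhere in this thread):

gen = Generator(x_train, y_train, batch_size=256)

# Roughly what Keras does with a Sequence during one epoch:
for idx in range(len(gen)):
    batch_x, batch_y = gen[idx]   # gen[idx] calls Generator.__getitem__(idx)
gen.on_epoch_end()                # the indices get reshuffled here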
I use the following script to test:
train_datagen = Generator(x_train, x_train, batch_size)
test_datagen = Generator(x_test, x_test, batch_size)

vae.fit_generator(train_datagen,
                  steps_per_epoch=len(x_train) // batch_size,
                  validation_data=test_datagen,
                  validation_steps=len(x_test) // batch_size,
                  epochs=epochs)
But I needed to change return math.ceil(self.x.shape[0] / self.batch_size) to return math.floor(self.x.shape[0] / self.batch_size) for it to run successfully.
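For what it's worth, my guess (not verified for this exact setup) is that math.ceil yields a final batch with fewer than batch_size samples, and a model that assumes a fixed batch size, such as the sampling layer in the standard VAE example, fails on that partial batch. math.floor simply drops it:

def __len__(self):
    # floor drops the final partial batch: every batch then has exactly
    # batch_size samples, at the cost of skipping up to batch_size - 1
    # samples each epoch.
    return math.floor(self.x.shape[0] / self.batch_size)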
Where is the Sequence class? How do I import it?
@nantha42 It's the keras.utils.Sequence class https://keras.io/utils/#sequence
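That is:

from keras.utils import Sequence

class Generator(Sequence):
    ...  # implement __len__, __getitem__ and optionally on_epoch_end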
The question about the shuffle argument of fit_generator is answered, but what about how to avoid or deal with the overriding of the on_epoch_end function when we want to use both this generator and callbacks at the same time?