Keras: Implement a Data Generator that outputs sample_weight

Created on 5 Dec 2018 · 13 Comments · Source: keras-team/keras

I'm using a data generator to feed fit_generator. My generator outputs the tuple (x_val, y_val, val_sample_weights), so that it also provides the sample weights. It looks like this:

import numpy as np
import keras
import librosa
from time import time
import random
from config import *

class DataGenerator(keras.utils.Sequence):

    'Generates data for Keras'

    def __init__(self, dataframe, batch_size=None, dim=None, labels_dim=None,
                 n_classes=None, shuffle=True, samples=None, duration=None, sample_weights=None):
        'Initialization'
        self.dim = dim
        self.n_classes = n_classes  # needed by __data_generation to size y
        self.batch_size = batch_size

        self.dataframe = dataframe
        self.dataframe = self.dataframe.sample(n=len(self.dataframe))
        self.samples = samples
        self.on_epoch_end()
        self.shuffle = shuffle
        self.sample_weights = sample_weights

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.dataframe) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        random_pd = self.dataframe.iloc[self.batch_size*index : (index+1)*self.batch_size]
        # Generate data
        X, y = self.__data_generation(random_pd)
        return X, y

    def __data_generation(self, random_pd):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, 1, self.samples))
        y = np.empty((self.batch_size, self.n_classes))
        i = 0
        while i < self.batch_size:
            for index, row in random_pd.iterrows():
                # generate
                y[i,] = label
                X[i,] = ...
                i += 1
        return X, y, self.sample_weights

so __data_generation returns X, y, self.sample_weights.

The problem is that I get StopIteration: too many values to unpack, because two values are expected but I am returning three.

Traceback (most recent call last):
  File "train.py", line 438, in <module>
    train()
  File "train.py", line 422, in train
    callbacks=callbacks
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 1315, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2250, in fit_generator
    max_queue_size=max_queue_size)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2383, in evaluate_generator
    generator_output = next(output_generator)
  File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 584, in get
    six.raise_from(StopIteration(e), e)
  File "/usr/local/lib/python2.7/dist-packages/six.py", line 737, in raise_from
    raise value
StopIteration: too many values to unpack

I then call fit_generator as usual, passing my training_generator:

history = model.fit_generator(generator=training_generator,
                                class_weight=class_weights,
                                verbose=1,
                                use_multiprocessing=True,
                                workers=24, 
                                steps_per_epoch=training_steps_per_epoch,
                                epochs=epochs,
                                validation_data=validation_generator,
                                validation_steps = validation_steps_per_epoch,
                                callbacks=callbacks
                                )

I do this because fit_generator does not accept a sample_weight argument: its signature only supports class_weight. See https://github.com/keras-team/keras/issues/11800

To investigate

loretoparisi

👍2

All 13 comments

[UPDATE]
I have tried computing the sample weights inside the generator so that they match the X, y batch size, like so:

def generate_sample_weights(training_data, class_weight_dictionary): 
    sample_weights = [class_weight_dictionary[np.where(one_hot_row==1)[0][0]] for one_hot_row in training_data]
    return np.asarray(sample_weights)
#...
generate_sample_weights(y, class_weights_dict)

but I'm still getting the too many values to unpack error.
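
For reference, a vectorized way to derive one weight per sample from one-hot labels and a class-weight dictionary could look like the following. This is only a sketch: it assumes y is a (batch_size, n_classes) one-hot array, and the function name sample_weights_from_one_hot and the example weights are made up for illustration.

```python
import numpy as np

def sample_weights_from_one_hot(y_one_hot, class_weight_dictionary):
    """Return one weight per row of a one-hot label array (hypothetical helper)."""
    # argmax along axis 1 recovers the class index of every row at once
    class_indices = np.argmax(y_one_hot, axis=1)
    return np.asarray([class_weight_dictionary[int(c)] for c in class_indices])

# Illustrative batch of 3 samples with 4 classes
y_batch = np.array([[1, 0, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 1, 0]])
weights = sample_weights_from_one_hot(y_batch, {0: 2.0, 1: 1.0, 2: 0.5, 3: 1.0})
print(weights.shape)  # (3,): one weight per sample in the batch
```

The result has the same first dimension as the batch, which is what the (X, y, sample_weight) tuple needs.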

loretoparisi on 5 Dec 2018

We'll fuse fit_generator and fit. So it's likely this issue will go away at the same time. Please monitor #11772. Also I'm not sure that's what you want, but __getitem__ can also return three values: x, y, w.

gabrieldemarmiesse on 5 Dec 2018

👍1

If the documentation is not saying that returning a tuple of three arrays is possible, it might be worth a pull request to add an example.

gabrieldemarmiesse on 5 Dec 2018

@gabrieldemarmiesse so what I did now was to make __getitem__ return the tuple (X, y, sample_weight), like so:

def __getitem__(self, index):
    'Generate one batch of data'
    random_pd = self.dataframe.iloc[self.batch_size*index : (index+1)*self.batch_size]
    # Generate data
    X, y, sample_weight = self.__data_generation(random_pd)
    return X, y, sample_weight

so that each returned tuple matches the batch size. Unexpectedly, I now get ValueError: Error when checking target: expected dense_1 to have shape (4,) but got array with shape (1,), even though I have 4 labels.

loretoparisi on 5 Dec 2018

@gabrieldemarmiesse No need for a PR. It has been explicitly mentioned in the documentation:

generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either

a tuple (inputs, targets)

a tuple (inputs, targets, sample_weights).
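
To make the three-value contract concrete, here is a minimal sketch of a Sequence-style batch provider. It uses plain numpy so it runs without Keras; in a real project the class would presumably subclass keras.utils.Sequence. The class name WeightedBatches and the array sizes are assumptions chosen for illustration.

```python
import numpy as np

class WeightedBatches(object):
    """Sequence-style batch provider (hypothetical; would subclass keras.utils.Sequence)."""

    def __init__(self, x, y, sample_weights, batch_size):
        self.x, self.y, self.w = x, y, sample_weights
        self.batch_size = batch_size

    def __len__(self):
        # Number of full batches per epoch
        return len(self.x) // self.batch_size

    def __getitem__(self, index):
        rows = slice(index * self.batch_size, (index + 1) * self.batch_size)
        # Inputs, targets and per-sample weights are all sliced with the same rows
        return self.x[rows], self.y[rows], self.w[rows]

x = np.zeros((10, 1, 8))
y = np.eye(4)[np.arange(10) % 4]   # one-hot labels, shape (10, 4)
w = np.ones(10)                    # one weight per sample
batches = WeightedBatches(x, y, w, batch_size=5)
xb, yb, wb = batches[0]
print(xb.shape, yb.shape, wb.shape)
```

The point of the sketch is that all three arrays share the batch dimension, because they are cut from the same row range.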

mkaze on 6 Dec 2018

@mkaze yes it is. In fact I started from that (documented) tuple and modified __getitem__ as shown above to return X, y, sample_weight.

Problem is that there is something wrong in the y output shape when doing so.

loretoparisi on 6 Dec 2018

@loretoparisi That's another issue. Probably you are using categorical_crossentropy as the loss but you are passing the labels in sparse format instead of one-hot encoded format.

mkaze on 6 Dec 2018

Hi @mkaze, many thanks for your help.
In fact, we are using categorical_crossentropy and our labels y are one-hot encoded.
Shapes are:

X = np.empty((self.batch_size, 1, self.n_samples))
y = np.empty((self.batch_size, self.n_classes))

where n_classes = 4
Assuming a batch_size = 5 for simplicity, I then compute my sample weights like this:

labs = [y.argmax() for y in y]
sample_weights = class_weight.compute_sample_weight('balanced', labs)

and the result is:

[1.25       0.83333333 0.83333333 1.25       0.83333333]
('sample_weights shape', (5,))
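
As far as the scikit-learn documentation describes it, the 'balanced' mode weights each class by n_samples / (n_classes * count_of_that_class), where n_classes is the number of distinct labels present. A numpy-only sketch of that formula is below. The function name balanced_sample_weights is hypothetical, and the labels are invented so that the pattern matches the values printed above (the real labels are not shown in this thread).

```python
import numpy as np

def balanced_sample_weights(labels):
    """Per-sample weights inversely proportional to class frequency (sketch)."""
    labels = np.asarray(labels)
    # inverse maps every sample back to the position of its class in `classes`
    classes, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    class_weights = len(labels) / (len(classes) * counts.astype(float))
    return class_weights[inverse]

# Assumed labels: two samples of class 0, three samples of class 1
labs = [0, 1, 1, 0, 1]
sample_weights = balanced_sample_weights(labs)
print(sample_weights)
print(sample_weights.shape)
```

With two samples in one class and three in the other, the weights come out as 5 / (2 * 2) = 1.25 and 5 / (2 * 3) ≈ 0.833.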

I tried to reshape it to (5, 1), but that does not work.
The error is:

  File "train.py", line 474, in train
    callbacks=callbacks
  File "/anaconda2/envs/s/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/anaconda2/envs/s/lib/python2.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/anaconda2/envs/s/lib/python2.7/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/anaconda2/envs/s/lib/python2.7/site-packages/keras/engine/training.py", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/anaconda2/envs/s/lib/python2.7/site-packages/keras/engine/training.py", line 789, in _standardize_user_data
    exception_prefix='target')
  File "/anaconda2/envs/s/lib/python2.7/site-packages/keras/engine/training_utils.py", line 138, in standardize_input_data
    str(data_shape))
ValueError: Error when checking target: expected dense_1 to have shape (4,) but got array with shape (1,)

where dense_1 should be the last dense layer of my CNN, which uses a softmax activation. It seems to me that the network is somehow taking my sample_weights as the y labels.

shoegazerstella on 6 Dec 2018

👍1

@shoegazerstella The error says there is a mismatch between the output shape of the model and the labels you provide. To make sure, test your generator like this:

x, y, w = training_generator[0]
print(x.shape, y.shape, w.shape)

mkaze on 6 Dec 2018

The output of your code is:

((5, 1, 348000), (4,), (5,))

should it be something like this instead?
((5, 1, 348000), (4,), (1,))

shoegazerstella on 6 Dec 2018

@shoegazerstella Ok, then it should be like this:

x: (5, 1, 348000)  <--- 5 input samples, each with shape (1, 348000)
y: (5, 4)          <--- 5 labels for the 5 input samples, each label is a vector of length 4
w: (5, 1) or (5,)  <--- 5 weights for the 5 input samples

Actually 5 is the batch size here.
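
The shapes listed above can also be checked in code before training. The following is a small sketch, not part of Keras: the helper name check_batch is hypothetical, and it simply encodes the three rules from the list.

```python
import numpy as np

def check_batch(x, y, w, n_classes):
    """Raise ValueError if a batch does not have the shapes listed above (sketch)."""
    batch_size = x.shape[0]
    if y.shape != (batch_size, n_classes):
        raise ValueError("y has shape {}, expected {}".format(
            y.shape, (batch_size, n_classes)))
    if w.shape not in [(batch_size,), (batch_size, 1)]:
        raise ValueError("w has shape {}, expected {} or {}".format(
            w.shape, (batch_size,), (batch_size, 1)))

# A well-formed batch passes silently
check_batch(np.zeros((5, 1, 8)), np.zeros((5, 4)), np.ones(5), n_classes=4)

# The shapes reported earlier in the thread are rejected: y is (4,) instead of (5, 4)
try:
    check_batch(np.zeros((5, 1, 8)), np.zeros(4), np.ones(5), n_classes=4)
except ValueError as err:
    print(err)
```

Running such a check on training_generator[0] would flag the label array before fit_generator does.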

mkaze on 6 Dec 2018

👍1

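Then my error was in this line when computing the sample weights:
labs = [y.argmax() for y in y]
It should be
labs = [j.argmax() for j in y]
Basically I was corrupting my labels by doing this: in Python 2 the loop variable of a list comprehension leaks into the enclosing scope, so y was overwritten with its last row, of shape (4,).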

Thanks a lot! I solved the issue, it is working correctly now.
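
The name clash can be reproduced without Keras. In Python 2 a list comprehension rebinds its loop variable in the enclosing scope; Python 3 no longer does this, so the sketch below uses a plain for loop, which rebinds the name in both versions. The nested lists are a stand-in for the one-hot label array and are made up for illustration.

```python
y = [[0, 1, 0, 0], [1, 0, 0, 0]]   # stand-in for a (2, 4) one-hot label array

# Reusing the name `y` as the loop variable overwrites the label array,
# just like the Python 2 comprehension `[y.argmax() for y in y]` did.
labels = []
for y in y:
    labels.append(y.index(max(y)))

print(labels)  # [1, 0]
print(y)       # [1, 0, 0, 0]: y is now only the last row

# Using a different loop variable leaves the labels intact
y_fixed = [[0, 1, 0, 0], [1, 0, 0, 0]]
labels_fixed = [row.index(max(row)) for row in y_fixed]
print(len(y_fixed))  # 2
```

This matches the (4,) label shape printed by the generator test earlier: only the last one-hot row was left.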

shoegazerstella on 6 Dec 2018

❤2

So to sum up, the code above works with small fixes, and __getitem__ can return X, y, sample_weight to provide the sample weights!

Thanks a lot guys!

loretoparisi on 6 Dec 2018
