Keras: scikit-learn API: using fit_generator() with cross validation

Created on 28 Nov 2016 · 7 comments · Source: keras-team/keras

Is it possible to use Keras's scikit-learn API together with the fit_generator() method, or is there another way to yield batches for training? I'm using SciPy sparse matrices, which must be converted to NumPy arrays before being fed to Keras, but I can't convert them all at once because of high memory consumption. Here is my function to yield batches:

import numpy as np

def batch_generator(X, y, batch_size):
    n_splits = max(1, len(X) // batch_size)  # guard against zero splits when len(X) < batch_size
    X = np.array_split(X, n_splits)
    y = np.array_split(y, n_splits)

    while True:
        for i in range(len(X)):
            X_batch = []
            y_batch = []
            for ii in range(len(X[i])):
                X_batch.append(X[i][ii].toarray().astype(np.int8))  # conversion sparse matrix -> np.array
                y_batch.append(y[i][ii])
            yield (np.array(X_batch), np.array(y_batch))
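
For context, this generator is what fit_generator() consumes. A minimal usage sketch, assuming an already-compiled Keras model and the Keras 1.x-era signature used elsewhere in this thread (model, X_train, and y_train are placeholders):

model.fit_generator(
    batch_generator(X_train, y_train, batch_size=128),
    samples_per_epoch=X_train.shape[0],  # Keras 1.x counts samples per epoch, not steps
    nb_epoch=10)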

And here is the example code with cross-validation:

from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn import datasets

from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.wrappers.scikit_learn import KerasClassifier

import numpy as np


def build_model(n_hidden=32):
    model = Sequential([
        Dense(n_hidden, input_dim=4),
        Activation("relu"),
        Dense(n_hidden),
        Activation("relu"),
        Dense(3),
        Activation("sigmoid")
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


iris = datasets.load_iris()
X = iris["data"]
y = iris["target"].flatten()

param_grid = {
    "n_hidden": np.array([4, 8, 16]),
    "nb_epoch": np.array(range(50, 61, 5))
}

model = KerasClassifier(build_fn=build_model, verbose=0)
skf = StratifiedKFold(n_splits=5).split(X, y) # this yields (train_indices, test_indices)

grid = GridSearchCV(model, param_grid, cv=skf, verbose=2, n_jobs=4)
grid.fit(X, y)

print(grid.best_score_)
print(grid.cv_results_["params"][grid.best_index_])

To explain further: GridSearchCV builds one model for every combination of the hyper-parameters in param_grid (here 3 values of n_hidden × 3 values of nb_epoch = 9 candidates). Each candidate is trained and tested on every train/test split (fold) yielded by the StratifiedKFold generator, for 9 × 5 = 45 fits in total, and the final score of a candidate is the mean score across its folds.

So is it somehow possible to insert a preprocessing substep into the code above that converts the data (sparse matrices) to dense arrays before the actual fitting?

I know I can write my own cross-validation generator, but it must yield indices, not the actual data!
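
For reference, scikit-learn's splitters already follow that contract: split() yields index arrays, and SciPy sparse matrices support row indexing with them, so GridSearchCV can slice the sparse X per fold while the dense conversion waits until fit time. A minimal sketch, assuming X is a scipy.sparse CSR matrix:

from sklearn.model_selection import StratifiedKFold

for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]  # fancy row indexing works on CSR matrices
    y_train, y_test = y[train_idx], y[test_idx]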

All 7 comments

So I have slightly modified the solution from here: http://stackoverflow.com/a/40866543/1928742

import copy
import types

import numpy as np
from scipy.sparse import issparse

from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import Sequential
from keras.utils.np_utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier


class KerasBatchClassifier(KerasClassifier):
    """
    Extends KerasClassifier with fit_generator() so that per-sample sparse
    matrices are converted to NumPy arrays batch by batch during fitting.
    """

    def fit(self, X, y, **kwargs):
        if not issparse(X[0]):
            return super().fit(X, y, **kwargs)

        # taken from keras.wrappers.scikit_learn.KerasClassifier.fit ###################################################
        if self.build_fn is None:
            self.model = self.__call__(**self.filter_sk_params(self.__call__))
        elif not isinstance(self.build_fn, types.FunctionType) and not isinstance(self.build_fn, types.MethodType):
            self.model = self.build_fn(**self.filter_sk_params(self.build_fn.__call__))
        else:
            self.model = self.build_fn(**self.filter_sk_params(self.build_fn))

        loss_name = self.model.loss
        if hasattr(loss_name, '__name__'):
            loss_name = loss_name.__name__
        if loss_name == 'categorical_crossentropy' and len(y.shape) != 2:
            y = to_categorical(y)

        fit_args = copy.deepcopy(self.filter_sk_params(Sequential.fit_generator))
        fit_args.update(kwargs)
        ################################################################################################################

        early_stopping = EarlyStopping(monitor="val_loss", patience=3, verbose=5, mode="auto")
        model_checkpoint = ModelCheckpoint("results/best_weights.{epoch:02d}-{val_loss:.5f}.hdf5", monitor="val_loss", verbose=5, save_best_only=True, mode="auto")
        callbacks = [early_stopping, model_checkpoint]
        fit_args.update({"callbacks": callbacks})

        self.__history = self.model.fit_generator(
            self.batch_generator(X, y, batch_size=self.sk_params["batch_size"]),
            samples_per_epoch=X.shape[0],
            **fit_args)

        return self.__history

    def score(self, X, y, **kwargs):
        kwargs = self.filter_sk_params(Sequential.evaluate, kwargs)

        # sparse to numpy array
        X = KerasBatchClassifier.sparse_to_array(X)

        loss_name = self.model.loss
        if hasattr(loss_name, '__name__'):
            loss_name = loss_name.__name__
        if loss_name == 'categorical_crossentropy' and len(y.shape) != 2:
            y = to_categorical(y)
        outputs = self.model.evaluate(X, y, **kwargs)
        if type(outputs) is not list:
            outputs = [outputs]
        for name, output in zip(self.model.metrics_names, outputs):
            if name == 'acc':
                return output
        raise Exception('The model is not configured to compute accuracy. '
                        'You should pass `metrics=["accuracy"]` to '
                        'the `model.compile()` method.')

    @staticmethod
    def batch_generator(X, y, batch_size=128):
        n_splits = max(1, len(X) // batch_size)  # guard against zero splits when len(X) < batch_size
        X = np.array_split(X, n_splits)
        y = np.array_split(y, n_splits)

        while True:
            for X_batch, y_batch in zip(X, y):
                yield (KerasBatchClassifier.sparse_to_array(X_batch), y_batch)

    @staticmethod
    def sparse_to_array(sparse_list):
        array_list = []
        for s in sparse_list:
            array_list.append(s.toarray().astype(np.int8))
        return np.array(array_list)

    @property
    def history(self):
        return self.__history
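
For completeness, a hypothetical usage sketch of this wrapper: it drops in where KerasClassifier was used in the original example (build_model and param_grid come from there), with X_rows assumed to be a NumPy object array holding one SciPy sparse matrix per sample, which is the layout sparse_to_array() expects. Note that batch_size must go to the constructor, because fit() reads it from self.sk_params:

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# X_rows: placeholder object array, one scipy.sparse matrix per sample
model = KerasBatchClassifier(build_fn=build_model, batch_size=32, verbose=0)
skf = StratifiedKFold(n_splits=5).split(X_rows, y)
# n_jobs left at its default of 1: the hard-coded ModelCheckpoint path is not multiprocess-safe
grid = GridSearchCV(model, param_grid, cv=skf, verbose=2)
grid.fit(X_rows, y)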

P.S. To run this with callbacks (see EarlyStopping and ModelCheckpoint in KerasBatchClassifier.fit()), you must edit the GridSearchCV source in scikit-learn. More here: https://github.com/fchollet/keras/issues/4278#issuecomment-264665803

For those of you who want to do a grid search with fit_generator, there's another option FYI:
https://github.com/keras-team/keras/issues/1591

Update: it's strange that the link doesn't redirect to the page, but you can copy and paste the same URL into your browser.

@PeterPanUnderhill: Do you have the code? The link is empty.

Personally, I would reopen this issue, to give a chance to the Keras project to validate this approach and move the code into the package, or propose a different solution.

Can this issue be re-opened?

How can I use KerasClassifier with fit_generator() in Keras, and run it through GridSearchCV? @gorgitko @PeterPanUnderhill @jonathanrocher

How can we use KerasBatchClassifier with a flow_from_directory generator?
