Having checked that everything is as it should be (latest versions of both Keras and TensorFlow installed), I have found that running a model with a ModelCheckpoint callback that saves the best model so far causes an issue with serialisation of the model.
Here's a script which, when run, shows the issue.
The output during imports and initialisation of the TensorFlow backend is:
~~~
2018-10-02 12:34:47.868073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.09GiB
2018-10-02 12:34:47.868102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:48.075527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:48.075556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-10-02 12:34:48.075562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-10-02 12:34:48.075728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
Using TensorFlow backend.
2018-10-02 12:34:51.635814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:51.635853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:51.635859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-10-02 12:34:51.635863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-10-02 12:34:51.636042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
~~~
The full error traceback is:
~~~
Traceback (most recent call last):
  File "selfcontained.py", line 107, in <module>
    print("75th percentile of test predictions is: {:.2e}".format(main(**CNN_params)))
  File "selfcontained.py", line 92, in main
    raise e
  File "selfcontained.py", line 76, in main
    shuffle=True, verbose=0, callbacks=[early_stopping_cb, model_saver_cb, test_csv_cb])
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 217, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 446, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 78, in _serialize_model
    f['keras_version'] = str(keras_version).encode('utf8')
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/utils/io_utils.py", line 214, in __setitem__
    'Group with name "{}" exists.'.format(attr))
KeyError: 'Cannot set attribute. Group with name "keras_version" exists.'
~~~
The problem seems to arise because the mode flag for opening an h5py file is not propagated through the h5dict class in keras/utils/io_utils.py. As a result, the h5py file is opened with the default mode, which does not truncate an existing file, so the attributes written by an earlier save are still present and the new save cannot overwrite them.
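To illustrate the mode behaviour with plain h5py (a standalone sketch, not Keras code; Keras goes through its h5dict wrapper, but the underlying effect is the same): opening an existing file without an explicit mode keeps its contents, whereas mode='w' truncates it.
~~~
import h5py

path = "demo.h5"

# First save: the file is created/truncated and an attribute is written.
with h5py.File(path, "w") as f:
    f.attrs["keras_version"] = b"2.2.3"

# Reopening without an explicit mode does not truncate the file, so the old
# attribute is still present (this is what the un-propagated mode leads to
# inside h5dict, which then refuses to overwrite it).
with h5py.File(path) as f:
    print("keras_version" in f.attrs)   # True

# Reopening with mode='w' truncates the file, so a fresh save can proceed.
with h5py.File(path, "w") as f:
    print("keras_version" in f.attrs)   # False
~~~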
The solution is simple (unless I am missing a key aspect of file management when it comes to serialisation): line 186 in keras/utils/io_utils.py needs to be changed from
~~~
185 elif isinstance(path, str):
>>> 186     self.data = h5py.File(path,)
187     self._is_file = True
~~~
to
~~~
185 elif isinstance(path, str):
>>> 186     self.data = h5py.File(path, mode)
187     self._is_file = True
~~~
Doing this propagates the mode parameter in the __init__ call to the underlying h5py.File object.
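For what it's worth, the smallest reproduction I can think of is simply saving the same model to the same path twice (a sketch of the reasoning above, not a tested unit test): the second save should hit the same 'Group with name "keras_version" exists' error on Keras 2.2.3 and succeed with the one-line change.
~~~
import keras

# Minimal sketch: two consecutive saves to the same file. The second save is
# effectively what ModelCheckpoint does on every improving epoch.
model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss='mse', optimizer='adam')

model.save('saved_twice.hdf5')   # creates the file
model.save('saved_twice.hdf5')   # fails on keras 2.2.3 without the fix
~~~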
As I'm not sure what the best way to submit a code patch is, I thought it would be best to create an issue outlining the problem and a potential solution.
Thank you for this. Can you provide a script to reproduce the issue?
Also thanks for the workaround.
Here's the script:
~~~
import numpy as np
import os
import pandas as pd
import sys
import uuid

import tensorflow as tf

# Limit GPU memory growth before Keras creates its session.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

import keras
import keras.backend as K
import gc


def main(**kwargs):
    # Save each run's artefacts to a unique temporary directory.
    model_dir = os.path.join("temp/", str(uuid.uuid1()) + "/")
    os.makedirs(model_dir, exist_ok=True)
    model_name = os.path.join(model_dir, "model.best.hdf5")

    # Random data standing in for the real dataset.
    trainX = np.random.rand(1024, 24, 10)
    trainY = np.random.rand(1024, 1)
    testX = np.random.rand(256, 24, 10)
    testY = np.random.rand(256, 1)

    try:
        # Stack of dilated causal Conv1D blocks with skip connections.
        input_layer = keras.layers.Input(shape=trainX.shape[1:])
        conv_layers = []
        conv_net_out = []
        conv_skip_out = []
        conv_layers.append(keras.layers.Conv1D(
            kwargs['conv_layer_filter_num'],
            kwargs['conv_layer_size'],
            activation=kwargs['conv_layer_activation'],
            padding='causal',
            dilation_rate=1,
            activity_regularizer=keras.regularizers.l1(kwargs['conv_l1_reg_parameter']),
            kernel_regularizer=keras.regularizers.l2(kwargs['conv_l2_reg_parameter']))(input_layer))
        intermediate_sum = keras.layers.BatchNormalization()(keras.layers.Activation("selu")(conv_layers[-1]))
        conv_skip_out.append(keras.layers.Conv1D(1, 1)(intermediate_sum))
        intermediate_sum = keras.layers.Conv1D(1, 1)(intermediate_sum)
        conv_net_out.append(keras.layers.Add()([intermediate_sum, input_layer]))
        for conv_idx in range(kwargs['conv_layer_count'] - 1):
            conv_layers.append(keras.layers.Conv1D(
                kwargs['conv_layer_filter_num'],
                kwargs['conv_layer_size'],
                activation=kwargs['conv_layer_activation'],
                padding='causal',
                dilation_rate=kwargs['conv_layer_dilation_rate'] * (conv_idx + 1),
                activity_regularizer=keras.regularizers.l1(kwargs['conv_l1_reg_parameter']),
                kernel_regularizer=keras.regularizers.l2(kwargs['conv_l2_reg_parameter']))(conv_net_out[-1]))
            intermediate_sum = keras.layers.BatchNormalization()(keras.layers.Activation("selu")(conv_layers[-1]))
            conv_skip_out.append(keras.layers.Conv1D(1, 1)(intermediate_sum))
            intermediate_sum = keras.layers.Conv1D(1, 1)(intermediate_sum)
            conv_net_out.append(keras.layers.Add()([intermediate_sum, conv_net_out[-1]]))

        conv_sum_layer = keras.layers.Add()(conv_net_out)
        csum_out = keras.layers.Activation('selu')(conv_sum_layer)
        oxo_layer1 = keras.layers.Conv1D(1, 1)(csum_out)
        relu_layer = keras.layers.Activation('selu')(oxo_layer1)
        oxo_layer2 = keras.layers.Conv1D(1, 1)(relu_layer)
        flat_oxo_layer = keras.layers.Flatten()(oxo_layer2)
        output_layer = keras.layers.Dense(trainY.shape[-1], activation="linear")(flat_oxo_layer)

        model = keras.models.Model(inputs=input_layer, outputs=output_layer)
        model.compile(loss=kwargs['loss'], optimizer=kwargs['optimizer'])

        # The ModelCheckpoint callback is what triggers the serialisation error
        # the second time it tries to save the best model to the same file.
        test_csv_cb = keras.callbacks.CSVLogger(os.path.join(model_dir, 'progress.csv'), separator=',', append=False)
        early_stopping_cb = keras.callbacks.EarlyStopping(patience=15)
        model_saver_cb = keras.callbacks.ModelCheckpoint(model_name, monitor='val_loss', verbose=0,
                                                         save_best_only=True, save_weights_only=False,
                                                         mode='auto', period=1)

        model.fit(trainX, trainY, validation_split=0.1,
                  epochs=kwargs['training_epochs'], batch_size=kwargs['batch_size'],
                  shuffle=True, verbose=0, callbacks=[early_stopping_cb, model_saver_cb, test_csv_cb])
        model.save(model_name)
        try:
            model = keras.models.load_model(model_name)
            predictions = model.predict(testX) - testY
            gc.collect()
            return np.percentile(np.square(predictions), 75)
        except Exception as e:
            raise e
    except Exception as e:
        raise e


if __name__ == "__main__":
    CNN_params = {'optimizer': 'adam',
                  'loss': 'mse',
                  'conv_layer_count': 8,
                  'conv_layer_filter_num': 16,
                  'conv_layer_size': 2,
                  'conv_layer_dilation_rate': 2,
                  'conv_layer_activation': 'selu',
                  'conv_l2_reg_parameter': 5e-4,
                  'conv_l1_reg_parameter': 5e-5,
                  'training_epochs': 100,
                  'batch_size': 256}
    print("75th percentile of test predictions is: {:.2e}".format(main(**CNN_params)))
~~~
Thanks a lot!
Are there any workarounds other than changing the io_utils file?
Worked for me after reverting to keras==2.2.2
That doesn't solve the issue with the latest version, though.
What would be a minimal unit test to reproduce this issue?
I should add, my Keras version is 2.2.3 and my Tensorflow version is 1.11.0 (custom compilation).
I can provide other configuration details if necessary.
For a minimal unit test, possibly a cut-down version of what I have, since the issue occurs when the ModelCheckpoint callback attempts to update the saved model. Put all that in a try/except block inside a function and have it pass if there are no exceptions and fail if there are. Save the model to a temporary directory that is cleared upon completion.
Alternatively, just force ModelCheckpoint to run twice on the same model and see if it fails (but I've no idea how to do that).
I can attempt to write one up, but it may require some work to make it useful.
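Something along these lines might work as a starting point (a rough sketch; the test name and structure are mine, not whatever test ends up in Keras). It forces ModelCheckpoint to save to the same path twice by calling its on_epoch_end hook directly:
~~~
import os
import tempfile

import numpy as np
import keras


def test_modelcheckpoint_overwrites_existing_file():
    with tempfile.TemporaryDirectory() as tmp_dir:
        filepath = os.path.join(tmp_dir, "model.best.hdf5")

        # Tiny model, just enough to exercise save/overwrite.
        model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(4,))])
        model.compile(loss="mse", optimizer="adam")

        checkpoint = keras.callbacks.ModelCheckpoint(filepath, save_best_only=False)
        checkpoint.set_model(model)

        x = np.random.rand(8, 4)
        y = np.random.rand(8, 1)
        model.fit(x, y, epochs=1, verbose=0)

        # Saving twice to the same path is what the callback does on every
        # improving epoch; the second call failed before the fix.
        checkpoint.on_epoch_end(0, logs={"loss": 1.0})
        checkpoint.on_epoch_end(1, logs={"loss": 0.5})

        assert os.path.isfile(filepath)
~~~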
Your fix is easy to apply, but we should have a simple unit test to go with it.
Got the same problem here. Reverting to keras==2.2.2 worked.
Can somebody confirm that this bug has been fixed in keras 2.2.4?
I can confirm that the bug has been fixed in Keras 2.2.4. I tested the script that I posted initially, and it no longer produces an error.
Thanks @Microno95 for the feedback!
I have the same error; I have opened a Stack Overflow question about it as well:
https://stackoverflow.com/questions/52513973/cannot-set-attribute-group-with-name-keras-version-exists
Has anyone solved this issue? I have the same one.
I have Keras 2.2.4, and even after reverting to 2.2.2 I get the same problem.
ModelCheckpoint saves nothing: I just get the message that val_acc has improved and the model has been saved to the given path, but nothing is actually saved at that location.
I also had this problem in Keras 2.2.4. The message changed a little in this version, but the problem persists, so I suggest reopening this issue. I'm training with multiple GPUs and using two TimeDistributed VGGs at the beginning of the net as feature extractors (one for optical flow and one for the RGB image), with the VGG layers frozen. I suspect something in this part is causing the problem: I debugged a little and saw that it breaks when it tries to create an h5py group with the same name as an existing layer (I think it was kernel/Conv1 or something similar).
I was able to bypass this problem using alt-model-checkpoint (https://pypi.org/project/alt-model-checkpoint/): I checkpoint the original model (before the multi-GPU wrapping) instead of passing ModelCheckpoint to fit_generator on the multi-GPU model.
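Roughly, the workaround looks like this (a sketch based on the alt-model-checkpoint 1.x README; check the import path and signature against the version you install, and note that multi_gpu_model needs at least two GPUs):
~~~
import numpy as np
import keras
from keras.utils import multi_gpu_model
from alt_model_checkpoint import AltModelCheckpoint  # pip install alt-model-checkpoint

# Single-GPU "base" model: this is the object we actually want to serialise.
base_model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(4,))])

# Wrapper used only for training across GPUs.
parallel_model = multi_gpu_model(base_model, gpus=2)
parallel_model.compile(loss='mse', optimizer='adam')

# AltModelCheckpoint saves base_model, not the multi-GPU wrapper.
checkpoint = AltModelCheckpoint('model.best.hdf5', base_model)

x = np.random.rand(64, 4)
y = np.random.rand(64, 1)
parallel_model.fit(x, y, epochs=2, callbacks=[checkpoint])
~~~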