Having checked that everything is as it should be (latest versions of both Keras and TensorFlow installed), I have found that running a model with a ModelCheckpoint callback that saves the best model so far causes an issue with serialisation of the model.
Here's a script which, when run, shows the issue.
The output during imports and initialisation of the TensorFlow backend is:
~~~
2018-10-02 12:34:47.868073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.09GiB
2018-10-02 12:34:47.868102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:48.075527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:48.075556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-10-02 12:34:48.075562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-10-02 12:34:48.075728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
Using TensorFlow backend.
2018-10-02 12:34:51.635814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:51.635853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:51.635859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-10-02 12:34:51.635863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-10-02 12:34:51.636042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
~~~
The full error traceback is:
~~~
Traceback (most recent call last):
  File "selfcontained.py", line 107, in <module>
    print("75th percentile of test predictions is: {:.2e}".format(main(**CNN_params)))
  File "selfcontained.py", line 92, in main
    raise e
  File "selfcontained.py", line 76, in main
    shuffle=True, verbose=0, callbacks=[early_stopping_cb, model_saver_cb, test_csv_cb])
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 217, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 446, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 78, in _serialize_model
    f['keras_version'] = str(keras_version).encode('utf8')
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/utils/io_utils.py", line 214, in __setitem__
    'Group with name "{}" exists.'.format(attr))
KeyError: 'Cannot set attribute. Group with name "keras_version" exists.'
~~~
The problem seems to arise because the mode flag for opening an h5py file is not propagated through the h5dict class in keras/utils/io_utils.py. As a result, the h5py file is opened with the default mode, which does not truncate an existing file, so the attributes written by an earlier save are still present and the new save cannot overwrite them.
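To illustrate the mode behaviour with plain h5py (a standalone sketch, not Keras code; Keras goes through its h5dict wrapper, but the underlying effect is the same): opening an existing file without an explicit mode keeps its contents, whereas mode='w' truncates it.
~~~
import h5py

path = "demo.h5"

# First save: the file is created/truncated and an attribute is written.
with h5py.File(path, "w") as f:
    f.attrs["keras_version"] = b"2.2.3"

# Reopening without an explicit mode does not truncate the file, so the old
# attribute is still present (this is what the un-propagated mode leads to
# inside h5dict, which then refuses to overwrite it).
with h5py.File(path) as f:
    print("keras_version" in f.attrs)   # True

# Reopening with mode='w' truncates the file, so a fresh save can proceed.
with h5py.File(path, "w") as f:
    print("keras_version" in f.attrs)   # False
~~~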
The solution is simple (unless I am missing a key aspect of file management when it comes to serialisation): line 186 in keras/utils/io_utils.py needs to be changed from
~~~
185 elif isinstance(path, str):
>>> 186     self.data = h5py.File(path,)
187     self._is_file = True
~~~
to
~~~
185 elif isinstance(path, str):
>>> 186     self.data = h5py.File(path, mode)
187     self._is_file = True
~~~
Doing this propagates the mode parameter in the __init__ call to the underlying h5py.File object.
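For what it's worth, the smallest reproduction I can think of is simply saving the same model to the same path twice (a sketch of the reasoning above, not a tested unit test): the second save should hit the same 'Group with name "keras_version" exists' error on Keras 2.2.3 and succeed with the one-line change.
~~~
import keras

# Minimal sketch: two consecutive saves to the same file. The second save is
# effectively what ModelCheckpoint does on every improving epoch.
model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss='mse', optimizer='adam')

model.save('saved_twice.hdf5')   # creates the file
model.save('saved_twice.hdf5')   # fails on keras 2.2.3 without the fix
~~~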
As I'm not sure what the best way to submit a code patch is, I thought it would be best to create an issue outlining the problem and a potential solution.
Thank you for this. Can you provide a script to reproduce the issue?
Also thanks for the workaround.
Here's the script:
~~~
import numpy as np
import os
import pandas as pd
import sys
import uuid

import tensorflow as tf

# Limit GPU memory growth before Keras creates its session.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

import keras
import keras.backend as K
import gc


def main(**kwargs):
    # Save each run's artefacts to a unique temporary directory.
    model_dir = os.path.join("temp/", str(uuid.uuid1()) + "/")
    os.makedirs(model_dir, exist_ok=True)
    model_name = os.path.join(model_dir, "model.best.hdf5")

    # Random data standing in for the real dataset.
    trainX = np.random.rand(1024, 24, 10)
    trainY = np.random.rand(1024, 1)
    testX = np.random.rand(256, 24, 10)
    testY = np.random.rand(256, 1)

    try:
        # Stack of dilated causal Conv1D blocks with skip connections.
        input_layer = keras.layers.Input(shape=trainX.shape[1:])
        conv_layers = []
        conv_net_out = []
        conv_skip_out = []
        conv_layers.append(keras.layers.Conv1D(
            kwargs['conv_layer_filter_num'],
            kwargs['conv_layer_size'],
            activation=kwargs['conv_layer_activation'],
            padding='causal',
            dilation_rate=1,
            activity_regularizer=keras.regularizers.l1(kwargs['conv_l1_reg_parameter']),
            kernel_regularizer=keras.regularizers.l2(kwargs['conv_l2_reg_parameter']))(input_layer))
        intermediate_sum = keras.layers.BatchNormalization()(keras.layers.Activation("selu")(conv_layers[-1]))
        conv_skip_out.append(keras.layers.Conv1D(1, 1)(intermediate_sum))
        intermediate_sum = keras.layers.Conv1D(1, 1)(intermediate_sum)
        conv_net_out.append(keras.layers.Add()([intermediate_sum, input_layer]))
        for conv_idx in range(kwargs['conv_layer_count'] - 1):
            conv_layers.append(keras.layers.Conv1D(
                kwargs['conv_layer_filter_num'],
                kwargs['conv_layer_size'],
                activation=kwargs['conv_layer_activation'],
                padding='causal',
                dilation_rate=kwargs['conv_layer_dilation_rate'] * (conv_idx + 1),
                activity_regularizer=keras.regularizers.l1(kwargs['conv_l1_reg_parameter']),
                kernel_regularizer=keras.regularizers.l2(kwargs['conv_l2_reg_parameter']))(conv_net_out[-1]))
            intermediate_sum = keras.layers.BatchNormalization()(keras.layers.Activation("selu")(conv_layers[-1]))
            conv_skip_out.append(keras.layers.Conv1D(1, 1)(intermediate_sum))
            intermediate_sum = keras.layers.Conv1D(1, 1)(intermediate_sum)
            conv_net_out.append(keras.layers.Add()([intermediate_sum, conv_net_out[-1]]))

        conv_sum_layer = keras.layers.Add()(conv_net_out)
        csum_out = keras.layers.Activation('selu')(conv_sum_layer)
        oxo_layer1 = keras.layers.Conv1D(1, 1)(csum_out)
        relu_layer = keras.layers.Activation('selu')(oxo_layer1)
        oxo_layer2 = keras.layers.Conv1D(1, 1)(relu_layer)
        flat_oxo_layer = keras.layers.Flatten()(oxo_layer2)
        output_layer = keras.layers.Dense(trainY.shape[-1], activation="linear")(flat_oxo_layer)

        model = keras.models.Model(inputs=input_layer, outputs=output_layer)
        model.compile(loss=kwargs['loss'], optimizer=kwargs['optimizer'])

        # The ModelCheckpoint callback is what triggers the serialisation error
        # the second time it tries to save the best model to the same file.
        test_csv_cb = keras.callbacks.CSVLogger(os.path.join(model_dir, 'progress.csv'), separator=',', append=False)
        early_stopping_cb = keras.callbacks.EarlyStopping(patience=15)
        model_saver_cb = keras.callbacks.ModelCheckpoint(model_name, monitor='val_loss', verbose=0,
                                                         save_best_only=True, save_weights_only=False,
                                                         mode='auto', period=1)

        model.fit(trainX, trainY, validation_split=0.1,
                  epochs=kwargs['training_epochs'], batch_size=kwargs['batch_size'],
                  shuffle=True, verbose=0, callbacks=[early_stopping_cb, model_saver_cb, test_csv_cb])
        model.save(model_name)
        try:
            model = keras.models.load_model(model_name)
            predictions = model.predict(testX) - testY
            gc.collect()
            return np.percentile(np.square(predictions), 75)
        except Exception as e:
            raise e
    except Exception as e:
        raise e


if __name__ == "__main__":
    CNN_params = {'optimizer': 'adam',
                  'loss': 'mse',
                  'conv_layer_count': 8,
                  'conv_layer_filter_num': 16,
                  'conv_layer_size': 2,
                  'conv_layer_dilation_rate': 2,
                  'conv_layer_activation': 'selu',
                  'conv_l2_reg_parameter': 5e-4,
                  'conv_l1_reg_parameter': 5e-5,
                  'training_epochs': 100,
                  'batch_size': 256}
    print("75th percentile of test predictions is: {:.2e}".format(main(**CNN_params)))
~~~
Thanks a lot!
Are there any workarounds other than changing the io_utils file?
Worked for me after reverting to keras==2.2.2
That doesn't solve the issue with the latest version, though.
What would be a minimal unit test to reproduce this issue?
I should add, my Keras version is 2.2.3 and my Tensorflow version is 1.11.0 (custom compilation).
I can provide other configuration details if necessary.
For a minimal unit test, possibly a cut-down version of what I have, since the issue occurs when the ModelCheckpoint callback attempts to update the saved model. Put all that in a try/except block inside a function and have it pass if there are no exceptions and fail if there are. Save the model to a temporary directory that is cleared upon completion.
Alternatively, just force ModelCheckpoint to run twice on the same model and see if it fails (but I've no idea how to do that).
I can attempt to write one up, but it may require some work to make it useful.
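Something along these lines might work as a starting point (a rough sketch; the test name and structure are mine, not whatever test ends up in Keras). It forces ModelCheckpoint to save to the same path twice by calling its on_epoch_end hook directly:
~~~
import os
import tempfile

import numpy as np
import keras


def test_modelcheckpoint_overwrites_existing_file():
    with tempfile.TemporaryDirectory() as tmp_dir:
        filepath = os.path.join(tmp_dir, "model.best.hdf5")

        # Tiny model, just enough to exercise save/overwrite.
        model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(4,))])
        model.compile(loss="mse", optimizer="adam")

        checkpoint = keras.callbacks.ModelCheckpoint(filepath, save_best_only=False)
        checkpoint.set_model(model)

        x = np.random.rand(8, 4)
        y = np.random.rand(8, 1)
        model.fit(x, y, epochs=1, verbose=0)

        # Saving twice to the same path is what the callback does on every
        # improving epoch; the second call failed before the fix.
        checkpoint.on_epoch_end(0, logs={"loss": 1.0})
        checkpoint.on_epoch_end(1, logs={"loss": 0.5})

        assert os.path.isfile(filepath)
~~~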
Your fix is easy to apply, but we should have a simple unit test to go with it.
Got the same problem here. Reverting to keras==2.2.2 worked.
Can somebody confirm that this bug has been fixed in keras 2.2.4?
I can confirm that the bug has been fixed in Keras 2.2.4. I tested the script that I posted initially, and it no longer produces an error.
Thanks @Microno95 for the feedback!
I have the same error; I have opened a Stack Overflow question about it as well:
https://stackoverflow.com/questions/52513973/cannot-set-attribute-group-with-name-keras-version-exists
Has anyone solved this issue? I have the same one.
I have Keras 2.2.4, and even after reverting to 2.2.2 I get the same problem.
ModelCheckpoint saves nothing: I just get the message that val_acc has improved and the model has been saved to the given path, but nothing is actually saved at that location.
I also had this problem in Keras 2.2.4. The message changed a little in this version, but the problem persists, so I suggest reopening this issue. I'm training with multiple GPUs and using two TimeDistributed VGGs at the beginning of the net as feature extractors (one for optical flow and one for the RGB image), with the VGG layers frozen. I suspect something in this part is causing the problem: I debugged a little and saw that it breaks when it tries to create an h5py group with the same name as an existing layer (I think it was kernel/Conv1 or something similar).
I was able to bypass this problem using alt-model-checkpoint (https://pypi.org/project/alt-model-checkpoint/): I checkpoint the original model (before the multi-GPU wrapping) instead of passing ModelCheckpoint to fit_generator on the multi-GPU model.
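Roughly, the workaround looks like this (a sketch based on the alt-model-checkpoint 1.x README; check the import path and signature against the version you install, and note that multi_gpu_model needs at least two GPUs):
~~~
import numpy as np
import keras
from keras.utils import multi_gpu_model
from alt_model_checkpoint import AltModelCheckpoint  # pip install alt-model-checkpoint

# Single-GPU "base" model: this is the object we actually want to serialise.
base_model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(4,))])

# Wrapper used only for training across GPUs.
parallel_model = multi_gpu_model(base_model, gpus=2)
parallel_model.compile(loss='mse', optimizer='adam')

# AltModelCheckpoint saves base_model, not the multi-GPU wrapper.
checkpoint = AltModelCheckpoint('model.best.hdf5', base_model)

x = np.random.rand(64, 4)
y = np.random.rand(64, 1)
parallel_model.fit(x, y, epochs=2, callbacks=[checkpoint])
~~~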