Keras: WARNING:tensorflow:Early stopping conditioned on metric `val_binary_accuracy` which is not available. Available metrics are:

Created on 12 Jan 2020 · 17 comments · Source: keras-team/keras

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 18.04):
  • TensorFlow backend (yes / no): yes
  • TensorFlow version: 2.0.0
  • Keras version: 2.2.4
  • Python version: 3.7.6
  • CUDA/cuDNN version: 10.1/7.6.5
  • GPU model and memory: nvidia 2080ti

Train on 58271 samples, validate on 10284 samples
Epoch 1/50
32/58271 [..............................] - ETA: 1:15:08
WARNING:tensorflow:Early stopping conditioned on metric `val_binary_accuracy` which is not available. Available metrics are:


UnknownError Traceback (most recent call last)
in
3 earlystop = EarlyStopping(monitor = 'val_binary_accuracy',patience =4,mode = 'max')
4 #history = model.fit(X_train, Y_train,batch_size=15,validation_data=(X_val,Y_val),class_weight=train_weights,epochs=50,callbacks=[earlystop])
----> 5 history = model.fit(X_train, Y_train,batch_size=32,validation_split=0.15,class_weight=train_weights,epochs=50,callbacks=[earlystop])

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
726 max_queue_size=max_queue_size,
727 workers=workers,
--> 728 use_multiprocessing=use_multiprocessing)
729
730 def evaluate(self,

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, **kwargs)
322 mode=ModeKeys.TRAIN,
323 training_context=training_context,
--> 324 total_epochs=epochs)
325 cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
326

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
121 step=step, mode=mode, size=current_batch_size) as batch_logs:
122 try:
--> 123 batch_outs = execution_function(iterator)
124 except (StopIteration, errors.OutOfRangeError):
125 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in execution_function(input_fn)
84 # numpy translates Tensors to values in Eager mode.
85 return nest.map_structure(_non_none_constant_value,
---> 86 distributed_function(input_fn))
87
88 return execution_function

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py in __call__(self, *args, **kwds)
455
456 tracing_count = self._get_tracing_count()
--> 457 result = self._call(*args, **kwds)
458 if tracing_count == self._get_tracing_count():
459 self._call_counter.called_without_tracing()

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py in _call(self, *args, **kwds)
518 # Lifting succeeded, so variables are initialized and we can run the
519 # stateless function.
--> 520 return self._stateless_fn(*args, **kwds)
521 else:
522 canon_args, canon_kwds =

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in __call__(self, *args, **kwargs)
1821 """Calls a graph function specialized to the inputs."""
1822 graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 1823 return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
1824
1825 @property

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in _filtered_call(self, args, kwargs)
1139 if isinstance(t, (ops.Tensor,
1140 resource_variable_ops.BaseResourceVariable))),
-> 1141 self.captured_inputs)
1142
1143 def _call_flat(self, args, captured_inputs, cancellation_manager=None):

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1222 if executing_eagerly:
1223 flat_outputs = forward_function.call(
-> 1224 ctx, args, cancellation_manager=cancellation_manager)
1225 else:
1226 gradient_name = self._delayed_rewrite_functions.register()

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py in call(self, ctx, args, cancellation_manager)
509 inputs=args,
510 attrs=("executor_type", executor_type, "config_proto", config),
--> 511 ctx=ctx)
512 else:
513 outputs = execute.execute_with_cancellation(

~/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
65 else:
66 message = e.message
---> 67 six.raise_from(core._status_to_exception(e.code, message), None)
68 except TypeError as e:
69 keras_symbolic_tensors = [

~/anaconda3/envs/tf/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node model/conv1d/conv1d (defined at /home/subhashnerella/anaconda3/envs/tf/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_distributed_function_3729]

Function call stack:
distributed_function

I am not using Keras as tensorflow.keras.
My code is a ConvLSTM model that was working before, but suddenly it is not.
I am getting the warning and the error shown above. I tried reducing the batch size, but that did not help. What is the reason for this issue and how can I fix it?

tensorflow support

All 17 comments

I have the same problem.

I think the issue is somewhere around here:

keras/engine/training_generator.py:
    callbacks.on_epoch_end(epoch, epoch_logs)
    generator.on_epoch_end()
keras/utils/data_utils.py:
    self.sequence.on_epoch_end()

If you call .on_epoch_end() on EarlyStopping without specifying logs=, you will always get an error, but that should be something about NoneType not having a .get() method.

So it's probably not that; somewhere an empty logs dict is being passed to EarlyStopping.
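
To see that an incomplete logs dict is all it takes to trigger the message, here is a minimal sketch (assuming TF 2.x and tf.keras rather than standalone Keras; the logs values are made up) that drives the callback by hand with logs that lack the monitored key:

import tensorflow as tf

# Minimal sketch: call the callback directly with logs that have no 'val_loss' key.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
early_stop.on_train_begin()

# Expected to print something like:
# WARNING:tensorflow:Early stopping conditioned on metric `val_loss`
# which is not available. Available metrics are: loss
early_stop.on_epoch_end(0, logs={'loss': 0.5})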

Errors

Epoch 1/200
WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: 
Traceback (most recent call last):
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/backprop.py", line 616, in _num_elements
    return functools.reduce(operator.mul, grad.values._shape_tuple(), 1)  # pylint: disable=protected-access
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "individual_train_model_fasttext.py", line 274, in <module>
    verbose=2)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 503, in _call
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 408, in _initialize
    *args, **kwds))
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1848, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2150, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2041, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 358, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(model, x, y, sample_weights))
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2132, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 264, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 311, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 268, in _process_single_batch
    grads = tape.gradient(scaled_total_loss, trainable_weights)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/backprop.py", line 1014, in gradient
    unconnected_gradients=unconnected_gradients)
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/imperative_grad.py", line 76, in imperative_grad
    compat.as_str(unconnected_gradients.value))
  File "/home/morten/anaconda3/envs/preacc/lib/python3.6/site-packages/tensorflow_core/python/eager/backprop.py", line 598, in _aggregate_grads
    if len(gradients) == 1:
SystemError: <built-in function len> returned a result with an error set

EDIT: It might be good to know that if my inputs and outputs are lists of Pandas objects I get a warning but it runs (TF 2.0). If I try to run it in TF 2.1 it crashes whether I use lists of Pandas or NumPy objects.

tf.compat.v1.disable_eager_execution() lets it run.

I am encountering this error when using fit_generator with scant data. I have tried tf.compat.v1.disable_eager_execution() but I continue to get the same error. I have also tried replicating my dataset within the generator by a factor of 100, without success.

This is an outstanding problem for me.

Still struggling with this. Here are the highlights:

def f1(y_true, y_pred):
    y_pred = K.round(y_pred)
    tp = K.sum(K.cast(y_true * y_pred, 'float'), axis=0)
    fp = K.sum(K.cast((1 - y_true) * y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true * (1 - y_pred), 'float'), axis=0)
    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())
    f1 = 2 * p * r / (p + r + K.epsilon())
    f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1)
    return K.mean(f1)

earlystopping = keras.callbacks.EarlyStopping(monitor='val_f1',
                                              min_delta=0,
                                              patience=1,
                                              verbose=0,
                                              mode='auto',
                                              restore_best_weights=True)

steps_per_epoch = batch_size // len(training_generator)
validation_steps = len(training_generator) // batch_size

history = training_model.fit_generator(generator=training_generator,
                                       epochs=epochs,
                                       validation_data=validating_generator,
                                       callbacks=[earlystopping],
                                       steps_per_epoch=steps_per_epoch,
                                       validation_steps=validation_steps,
                                       validation_freq=2,
                                       use_multiprocessing=True,
                                       verbose=1)

@grofte I totally thought that was going to be the fix, but it didn't seem to work for me.

For those trying to figure this out: it seems like what is happening in my case is that the model is trying to save (or test whether the early stopping criterion has been met) before the val_loss has been calculated. For example, it seems like the validation loss is calculated here:
https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/keras/engine/training_v2.py#L384

You can print the eval_result right after it is returned, and it shows up in the console AFTER you receive the warning that val_loss doesn't exist so early stopping won't work. Shortly after the warning, val_loss also shows up in the traditional progress bar.

How the save/early stopping callback ends up running before the validation loss is computed is beyond me.... I assumed it had to do with eager execution, but that doesn't seem to be it.

I figured it out for my case. It actually wasn't EarlyStopping, it was ModelCheckpoint.

In ModelCheckpoint, if you are using the save_freq parameter to set when the checkpoint saves and you align your save_freq so that it falls at the end of an epoch, then when the final on_batch_end() is called for the given epoch it will call _save_model() (because the number of elapsed batches equals the save_freq) and try to determine whether it should save the model (based on val_loss or another metric). But this is premature, because the prep work done by on_epoch_end() to create/calculate the validation metrics and append them to the epoch_logs hasn't happened yet.
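
If that diagnosis applies to you, one possible workaround (just a sketch, not an official fix; the filepath is a placeholder) is to let ModelCheckpoint save per epoch rather than per batch count, so _save_model() only runs from on_epoch_end(), after the validation metrics have been added to the logs:

import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_model.h5',          # placeholder path
    monitor='val_loss',
    save_best_only=True,
    save_freq='epoch')        # per-epoch saving instead of an integer batch count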

Alas, I wasn't using ModelCheckpoint. I reported earlier that this problem went away after I got additional data, but I was mistaken. The problem is still outstanding for me.

I have tried tinkering with validation_steps, validation_freq, and steps_per_epoch. None of that voodoo worked.

Could be something weird with validation_steps or validation_freq in your .fit() or .fit_generator() call.

Essentially, as long as you are monitoring the right metric name (val_loss, or whatever shows up in your console if you use verbose while training) in your early stopping, then getting this error means that val_loss (or whatever you are looking for) isn't being calculated, or hasn't been calculated yet at the moment the callback looks for it... My problem was that I was specifying a save_freq, and it was aligned in such a way that val_loss was calculated after early stopping went looking for it.
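
One quick way to confirm which names are actually available to monitor (a sketch that reuses the model, X_train, and Y_train from the original post; the printed keys are only an example) is to run a short fit without the callbacks and inspect the history:

# Run a single epoch without EarlyStopping/ModelCheckpoint and list the recorded
# metric names; the monitor argument must match one of these keys exactly.
history = model.fit(X_train, Y_train, batch_size=32,
                    validation_split=0.15, epochs=1)
print(history.history.keys())
# e.g. dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])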

I tried rolling back from Keras 2.2.4 to Keras 2.2.3 but it didn't solve the problem.

Oh, this is too much. I gave up on monitoring F1 in EarlyStopping (custom metrics, who needs 'em, right?), and now I find I can't even use basic metrics.

"Early stopping conditioned on metric val_loss which is not available. Available metrics are: loss,accuracy,f1,precision,recall"

I tried rolling back from Keras 2.2.4 to Keras 2.2.3 but it didn't solve the problem.

Are you using Keras with tf2? Oh, I see that you are. Don't. Use tf.keras instead. Disable eager execution. That should do it. The standalone Keras is only for tf1.x or the other backends.

I have tried this (in Azure ML, which I thought had started using TensorFlow 2 months ago, but maybe I was wrong)...

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import Callback, ModelCheckpoint
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, Flatten
from tensorflow.keras.layers import GaussianNoise, Input, MaxPooling1D
from tensorflow.keras.models import Model #, load_model, model_from_json, Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
import tensorflow.keras.backend as K
tf.compat.v1.disable_eager_execution()

... but now my training seems to hang at the very first epoch. "Epoch 1/3" is the last line of output.

Same problem here. I then tried changing the batch_size so that the number of samples (or the number of sequences, in my case of LSTM modelling) is not divisible by it, and the problem was solved. For example: I had the problem with numSequences=300 and batch_size=6, and it went away with numSequences=300 and batch_size=16. So maybe there is a bug in the 'tf.keras.callbacks.EarlyStopping()' code.

I figured it out for my case. It actually wasn't EarlyStopping, it was ModelCheckpoint.

In ModelCheckpoint, if you are using the save_freq parameter to set when the checkpoint saves and you align your save_freq so that it falls at the end of an epoch, then when the final on_batch_end() is called for the given epoch it will call _save_model() (because the number of elapsed batches equals the save_freq) and try to determine whether it should save the model (based on val_loss or another metric). But this is premature, because the prep work done by on_epoch_end() to create/calculate the validation metrics and append them to the epoch_logs hasn't happened yet.

@gattia has it figured out, I believe. When using ModelCheckpoint(save_freq=<int>... combined with model.fit(validation_freq=<int>..., then "WARNING:tensorflow:Can save best model only with val_loss available, skipping." is emitted, even when you've carefully calculated the two to line up on the same epochs. Is there a workaround? Thanks.

Yes, seems like others might have different issues (maybe?) but this seems to be a bug for sure.

Just to reiterate, when .fit() is called, in pseudocode, we do something like:

for epoch in range(n_epochs):
    callbacks.on_epoch_begin(epoch)
    for step in range(n_steps):
        callbacks.on_train_batch_begin(step)
        callbacks.on_train_batch_end(step)
    # run validation
    callbacks.on_epoch_end(epoch, epoch_logs)

The real code for the above is found at:
https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/engine/training.py#L836-L876

The problem is that ModelCheckpoint.on_train_batch_end(), found here:
https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/callbacks.py#L1157-L1163

tries to save the model via self._save_model(epoch=self._current_epoch, logs=logs). If we are only saving the best model, and that best model is judged on a validation metric, it looks to see whether the validation metric exists.

If it doesn't find val_loss (or whatever is being monitored), it emits the aforementioned warning:
https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/callbacks.py#L1198-L1203

Really, it all boils down to the fact that, in the pseudocode above, callbacks.on_train_batch_end() is called before the # run validation code is executed:
https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/engine/training.py#L858-L874

This is because ModelCheckpoint does its saving based on the number of batches, so it makes sense that saving (or the check of whether saving should be attempted) happens at the end of a batch. The problem arises when save_freq is less than or equal to the number of batches in an epoch, so the check lands on a batch where val_loss doesn't exist yet.

Hope this helps; happy to do whatever else I can to aid.
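
For reference, a sketch of what a minimal example of this timing issue might look like (toy data and model, all names and sizes made up; the integer save_freq is deliberately set to the number of batches in one epoch, as counted by the TF 2.2 code linked above, so the save check fires on the last on_train_batch_end() before validation runs):

import numpy as np
import tensorflow as tf

# Toy binary-classification data: 100 samples, 8 features.
x = np.random.rand(100, 8).astype('float32')
y = np.random.randint(0, 2, size=(100, 1)).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

batch_size = 10
steps_per_epoch = 80 // batch_size  # 80 training samples after a 0.2 validation split

ckpt = tf.keras.callbacks.ModelCheckpoint(
    'mwe_best.h5', monitor='val_loss', save_best_only=True,
    save_freq=steps_per_epoch)  # integer save_freq aligned with the epoch boundary

# Expected to emit "WARNING:tensorflow:Can save best model only with val_loss
# available, skipping." because the save check runs before validation.
model.fit(x, y, batch_size=batch_size, validation_split=0.2,
          epochs=2, callbacks=[ckpt])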

@gattia can you please file an issue with TensorFlow? This Keras repo is discontinued. I tried, but they wanted an MWE, which I think you are in the best position to supply. Thanks!

Will pull something together when I have some time.
