Describe the bug
The tests for FBeta and F1 both use a softmax function with just one output.
The effect is that the output (prediction) is always 1. The right activation function for binary classification with just one output is sigmoid. But this is not the only problem: this bug in the tests hides another, worse bug.
The other bug is that F1 and FBeta both cannot handle predicted values other than 1 (1.0) and 0 (0.0). But the predictions of a binary classification (with sigmoid) and of a multi-class classification (with softmax and one-hot encoded labels) are always somewhere between 0 and 1. The result is that this implementation of F1 and FBeta always returns 0.0 when the predicted values are not exactly 0 (0.0) or 1 (1.0), which is not realistic.
Also, all other tests of F1 and FBeta use values of 0 or 1 as the predicted results. This does not reflect reality.
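To illustrate why softmax over a single output is always 1 while sigmoid is not, here is a minimal sketch in plain Python. The helper functions `softmax` and `sigmoid` are written out by hand for illustration and are not the Keras implementations; the logit values are arbitrary.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    # Logistic function, the usual activation for a single binary output.
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary example logits for a model with one output unit.
single_logits = [-3.0, 0.0, 2.5]

# Softmax normalises over the output units; with one unit it divides a value by itself.
softmax_outputs = [softmax([z])[0] for z in single_logits]
sigmoid_outputs = [sigmoid(z) for z in single_logits]

print(softmax_outputs)
print(sigmoid_outputs)
```

With one output unit the softmax result carries no information about the logit, so a test built on it can only ever see a prediction of 1.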
Code to reproduce the issue
Just change softmax to sigmoid in the tests, set verbose=1, and look at the results.
I guess something like this needs to be added:
https://github.com/tensorflow/tensorflow/blob/2ff39d00faf8f7e433ddcae0aa278f6e573b0c55/tensorflow/python/keras/metrics.py#L2767
Thanks, @PhilipMay. You are right that the result will not be an exact number.
In the initial implementation we had, the predictions do not represent probabilities.
preds = tf.constant([[0, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=tf.int32)
Here, [0, 0, 1] does not indicate a probability. Suppose we have 3 classes and the predicted class is class 3, which is encoded as [0, 0, 1].
We may need to add argmax to change that in case we are feeding predictions directly.
I will take this one up and post updates.
We may need to add argmax to change that in case we are feeding predictions directly.
Yes, if we want to be able to add the metric to Keras at training time, we need that. Maybe you could have a look at the metrics implementations in TensorFlow.
Thank you very much. :-)
@seanpmorgan Can you please assign this issue to me?
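This seems to work for me. I am not sure whether it still has bugs in edge cases or whether it can be implemented more efficiently:
def update_state(self, y_true, y_pred, sample_weight=None):
    # Mark the largest value in each row (axis=-1 keeps the comparison per sample
    # instead of against the maximum of the whole batch).
    cond = tf.equal(y_pred, tf.reduce_max(y_pred, axis=-1, keepdims=True))
    # Zero out everything that is not the row maximum ...
    y_pred = tf.where(cond, y_pred, tf.zeros_like(y_pred))
    # ... and set the row maximum itself to one.
    not_cond = tf.math.logical_not(cond)
    y_pred = tf.where(not_cond, y_pred, tf.ones_like(y_pred))
    y_true = tf.cast(y_true, tf.int32)
    y_pred = tf.cast(y_pred, tf.int32)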
For further tests, I would suggest comparing your results with those from sklearn:
import numpy as np
from sklearn.metrics import f1_score
# Convert one-hot labels and softmax outputs to integer class ids.
ytrue = np.argmax(y_test, axis=-1)
ypred = np.argmax(y_hat_test, axis=-1)
print(f1_score(ytrue, ypred, average='macro'))
print(f1_score(ytrue, ypred, average='micro'))
print(f1_score(ytrue, ypred, average='weighted'))
print(f1_score(ytrue, ypred, average=None))
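For a dependency-free cross-check of what those averages mean, the following sketch computes per-class, macro and micro F1 by hand from integer class ids (as obtained after argmax). The function names and the tiny label lists are made up for illustration, and treating an undefined F1 (no true or predicted samples for a class) as 0.0 is an assumption of this sketch.

```python
def per_class_f1(y_true, y_pred, num_classes):
    # F1 per class from true positives, false positives and false negatives.
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: an undefined score is reported as 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred, num_classes):
    # Unweighted mean of the per-class scores.
    scores = per_class_f1(y_true, y_pred, num_classes)
    return sum(scores) / len(scores)

def micro_f1(y_true, y_pred, num_classes):
    # Pool the counts over all classes before computing F1.
    tp = fp = fn = 0
    for c in range(num_classes):
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: class ids after argmax.
y_true_ids = [0, 1, 2, 2]
y_pred_ids = [0, 2, 2, 2]

class_scores = per_class_f1(y_true_ids, y_pred_ids, 3)
macro = macro_f1(y_true_ids, y_pred_ids, 3)
micro = micro_f1(y_true_ids, y_pred_ids, 3)
print(class_scores, macro, micro)
```

For single-label data the micro average equals plain accuracy, which makes it an easy sanity check against any metric implementation.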
Thanks @PhilipMay
I have working code ready now that uses a threshold option. That will help in both the multi-class and multi-label cases. I will create a PR today.
That is great. Thanks. I need that fix for my project. :-)
Maybe you can add a comparison with f1_score from sklearn to the tests.
@SSaishruthi I do not think that this implementation with threshold fixes the problem. Here is why:
When you have a "single-label categorical classification", where one sample belongs to exactly one of many possible classes, you apply a softmax function. Let me give an example:
You have 3 classes (dog, cat and bike) and use one-hot encoding.
dog = [1, 0, 0]
cat = [0, 1, 0]
bike = [0, 0, 1]
Now when you put a dog into the model you might get the following from the softmax function:
result = [0.5, 0.25, 0.25]
This means dog, because 0.5 is the highest value. This is why a threshold is not valid here. You should apply something like reduce_max or argmax. Something like this: https://github.com/tensorflow/addons/issues/490#issuecomment-529882338
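The difference between the two approaches can be sketched in plain Python. The helpers `threshold_labels` and `argmax_one_hot` are hypothetical names for illustration, and the strict greater-than comparison against the threshold is an assumption of this sketch; the second probability vector is an invented example.

```python
def threshold_labels(probs, threshold):
    # Assumption: a class counts as predicted only if its probability exceeds the threshold.
    return [1 if p > threshold else 0 for p in probs]

def argmax_one_hot(probs):
    # Pick the single most probable class (first one wins on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == best else 0 for i in range(len(probs))]

# Softmax output for the dog sample from the example above.
result = [0.5, 0.25, 0.25]
at_half = threshold_labels(result, 0.5)
at_049 = threshold_labels(result, 0.49)
by_argmax = argmax_one_hot(result)

# A flatter, invented softmax output where no class reaches 0.49.
flat = [0.4, 0.35, 0.25]
flat_at_049 = threshold_labels(flat, 0.49)
flat_by_argmax = argmax_one_hot(flat)

print(at_half, at_049, by_argmax)
print(flat_at_049, flat_by_argmax)
```

A fixed threshold suits multi-label outputs, where several classes can be active at once, whereas argmax always yields exactly one class per sample, which is what single-label softmax outputs need.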
@PhilipMay
I am taking binary accuracy as a reference here: https://github.com/tensorflow/tensorflow/blob/2ff39d00faf8f7e433ddcae0aa278f6e573b0c55/tensorflow/python/keras/metrics.py#L630
For this to work in the example above, you need to set the threshold to 0.49. I can probably set the default threshold to 0.5; I think a value below 0.5 may not be considered a good prediction.
I have tested this against scikit-learn as well (that's the usual procedure I follow): https://colab.research.google.com/drive/1qSq0SsYkPqjdKUgM1RM4kKM67X75ocFj
The goal here is to make it compatible with both multi-class and multi-label classification.