Addons: Bug in F1 and FBeta implementation and test

Created on 9 Sep 2019 · 10 Comments · Source: tensorflow/addons

Describe the bug
The tests for FBeta and F1 both use a softmax function with just one output.

The effect is that the output (prediction) is always 1. The right activation function for binary classification with just one output is sigmoid. But this is not the only problem: this bug in the tests hides another, worse bug.
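Why a single-output softmax is always 1 can be checked with a small sketch. It uses NumPy stand-ins for the activations rather than the TensorFlow ones, and the logits are made up for illustration:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits from a model with a single output unit (batch of 3).
logits = np.array([[-2.0], [0.0], [3.0]])

# With one unit per row, softmax is exp(x) / exp(x), so every value is 1.
softmax_out = softmax(logits)
# Sigmoid maps the same logits to different values inside (0, 1).
sigmoid_out = sigmoid(logits)

print(softmax_out.ravel())
print(sigmoid_out.ravel())
```

With softmax, every prediction is exactly 1 regardless of the logit, which is why the tests never exercise fractional predictions.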

The other bug is that neither F1 nor FBeta can handle predicted values other than 1 (1.0) and 0 (0.0). But the predictions of a binary classification (with sigmoid) and of a multi-class classification (with softmax) are always somewhere between 0 and 1 (with one-hot encoded labels in the multi-class case). As a result, this implementation of F1 and FBeta always returns 0.0 when the predicted results are not exactly 0 (0.0) or 1 (1.0), which is not realistic.

Also, all other tests of F1 and FBeta use values of 0 or 1 as the predicted results. This does not reflect reality.
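The following sketch illustrates how fractional predictions can end up as an F1 of 0.0. It assumes that the metric casts predictions to integers before counting (as the casts to tf.int32 quoted later in this thread suggest). The helper f1_from_hard_labels and the sample values are hypothetical, and NumPy is used in place of TensorFlow:

```python
import numpy as np

def f1_from_hard_labels(y_true, y_pred):
    """Binary F1 from 0/1 integer arrays; returns 0.0 when undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.8, 0.4])  # hypothetical sigmoid outputs

# Assumption: an integer cast truncates toward zero, so 0.9 becomes 0.
truncated = y_prob.astype(np.int32)
# Converting to hard labels first keeps the confident predictions.
thresholded = (y_prob > 0.5).astype(np.int32)

f1_truncated = f1_from_hard_labels(y_true, truncated)
f1_thresholded = f1_from_hard_labels(y_true, thresholded)
print(f1_truncated, f1_thresholded)
```

Under that assumption, no prediction below 1.0 ever counts as a positive, so the number of true positives and therefore F1 is 0.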

Code to reproduce the issue
Just change softmax to sigmoid in the tests, set verbose=1, and look at the results.

bug metrics

PhilipMay

All 10 comments

I guess something like this needs to be added:
https://github.com/tensorflow/tensorflow/blob/2ff39d00faf8f7e433ddcae0aa278f6e573b0c55/tensorflow/python/keras/metrics.py#L2767

PhilipMay on 10 Sep 2019

Thanks, @PhilipMay. You are right that the result will not be an exact number.
In the initial implementation we had, the predictions do not represent probabilities.

preds = tf.constant([[0, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=tf.int32)

Here [0, 0, 1] does not indicate a probability. Suppose we have 3 classes and the predicted class is class 3, which is encoded as [0, 0, 1].

We may need to add argmax to convert the predictions in case the model outputs are fed in directly.

I will take this one up and post updates.

SSaishruthi on 10 Sep 2019

> We may need to add argmax to convert the predictions in case the model outputs are fed in directly.

Yes, if we want to be able to add the metric to Keras at training time, we need that. Maybe you could have a look at the metrics implementations in TensorFlow.

Thank you very much. :-)

PhilipMay on 10 Sep 2019

@seanpmorgan Can you please assign this issue to me?

SSaishruthi on 10 Sep 2019

This seems to work for me. I am not sure whether it still has bugs in edge cases or whether it can be implemented more efficiently:

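    def update_state(self, y_true, y_pred, sample_weight=None):
        # Mark the entries equal to the maximum of their own row; reducing
        # without an axis would compare against the maximum of the whole batch.
        cond = tf.equal(y_pred, tf.reduce_max(y_pred, axis=-1, keepdims=True))
        # Zero out every entry that is not a row maximum.
        y_pred = tf.where(cond, y_pred, tf.zeros_like(y_pred))
        not_cond = tf.math.logical_not(cond)
        # Set the row maxima to one.
        y_pred = tf.where(not_cond, y_pred, tf.ones_like(y_pred))

        y_true = tf.cast(y_true, tf.int32)
        y_pred = tf.cast(y_pred, tf.int32)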
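The conversion to hard labels has to happen per sample, because a maximum taken over the whole batch would leave most rows without any predicted class. The sketch below shows the difference with NumPy instead of TensorFlow. The helper one_hot_argmax and the probabilities are hypothetical, and the helper assumes a 2-D array of shape (batch, classes):

```python
import numpy as np

def one_hot_argmax(y_pred):
    """Return a 0/1 array with a single 1 per row at that row's maximum."""
    winners = np.argmax(y_pred, axis=-1)
    hard = np.zeros(y_pred.shape, dtype=np.int32)
    hard[np.arange(y_pred.shape[0]), winners] = 1
    return hard

# Hypothetical softmax outputs for a batch of two samples.
y_prob = np.array([[0.5, 0.25, 0.25],
                   [0.1, 0.3, 0.6]])

# Per-row maximum: every sample gets exactly one predicted class.
hard = one_hot_argmax(y_prob)

# Maximum over the whole batch: only the row containing 0.6 gets a class.
global_max_mask = (y_prob == y_prob.max()).astype(np.int32)

print(hard)
print(global_max_mask)
```

Note that np.argmax picks the first index on ties, whereas an equality test against the row maximum marks every tied entry.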

PhilipMay on 10 Sep 2019

For further tests, I suggest comparing your results with those from sklearn:

import numpy as np
from sklearn.metrics import f1_score

# Assumed: y_test holds one-hot labels, y_hat_test the predicted probabilities.
ytrue = np.argmax(y_test, axis=-1)
ypred = np.argmax(y_hat_test, axis=-1)

print(f1_score(ytrue, ypred, average='macro'))
print(f1_score(ytrue, ypred, average='micro'))
print(f1_score(ytrue, ypred, average='weighted'))
print(f1_score(ytrue, ypred, average=None))

PhilipMay on 10 Sep 2019

Thanks, @PhilipMay.
I have working code ready now that uses a threshold option. That will help in both the multiclass and the multilabel case. I will create a PR today.

SSaishruthi on 10 Sep 2019

> Thanks, @PhilipMay.
> I have working code ready now that uses a threshold option. That will help in both the multiclass and the multilabel case. I will create a PR today.

That is great. Thanks. I need that fix for my project. :-)
Maybe you can add a comparison with f1_score from sklearn to the tests.

PhilipMay on 10 Sep 2019

@SSaishruthi I do not think that this implementation with threshold fixes the problem. Here is why:

When you have a "single-label categorical classification", where each sample belongs to exactly one of many possible classes, you apply a softmax function. Let me give an example:

You have 3 classes (dog, cat, and bike) and use one-hot encoding.

dog = [1, 0, 0]
cat = [0, 1, 0]
bike = [0, 0, 1]

Now when you put a dog into the model, you might get the following from the softmax function:
result = [0.5, 0.25, 0.25]
This means dog, because 0.5 is the highest value. And this is the reason why a threshold is not valid. You should apply something like reduce_max or argmax, something like this: https://github.com/tensorflow/addons/issues/490#issuecomment-529882338
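The dog example can be worked through as a small sketch. It assumes a strict "greater than" comparison against the threshold, which is an assumption about the proposed implementation, and it uses NumPy rather than TensorFlow. The second probability vector is a made-up, less confident prediction:

```python
import numpy as np

classes = ["dog", "cat", "bike"]
result = np.array([0.5, 0.25, 0.25])

# Assumed strict comparison: 0.5 > 0.5 is False, so no class is selected.
above_threshold = result > 0.5
# argmax always selects exactly one class, here index 0.
predicted = classes[int(np.argmax(result))]

# Hypothetical less confident output: still "dog" by argmax, but below 0.49 too.
unsure = np.array([0.4, 0.35, 0.25])
unsure_above = unsure > 0.49
unsure_predicted = classes[int(np.argmax(unsure))]

print(above_threshold, predicted)
print(unsure_above, unsure_predicted)
```

With more classes the winning softmax probability can drop well below any fixed threshold, while argmax still returns a single class.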

PhilipMay on 11 Sep 2019

@PhilipMay

I am taking binary accuracy as a reference here: https://github.com/tensorflow/tensorflow/blob/2ff39d00faf8f7e433ddcae0aa278f6e573b0c55/tensorflow/python/keras/metrics.py#L630

For this to work in the example above, you need to set the threshold to 0.49. I can probably set the default threshold to 0.5. I think a prediction below 0.5 may not be considered a good prediction.

I have also tested this against scikit-learn (that's the usual procedure I follow): https://colab.research.google.com/drive/1qSq0SsYkPqjdKUgM1RM4kKM67X75ocFj

The goal here is to make it compatible with both multi-class and multi-label.
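The multi-label side of that goal can be illustrated with a sketch. Here several labels per sample may be correct at once, so a per-label threshold keeps information that argmax would discard. The probabilities and the 0.5 threshold are hypothetical, and NumPy is used in place of TensorFlow:

```python
import numpy as np

# Hypothetical per-label sigmoid outputs for two samples and three labels.
y_prob = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.7, 0.6]])
threshold = 0.5

# Thresholding decides each label independently, so rows can hold several 1s.
multi_label = (y_prob > threshold).astype(np.int32)

# argmax keeps only the single strongest label per row.
single_label = np.zeros_like(multi_label)
single_label[np.arange(len(y_prob)), np.argmax(y_prob, axis=-1)] = 1

print(multi_label)
print(single_label)
```

This is the trade-off in the discussion: a threshold suits multi-label outputs, while single-label softmax outputs are better served by a per-sample argmax.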

SSaishruthi on 11 Sep 2019
