Addons: make F1-score usable with keras

Created on 3 Jan 2020 · 18 Comments · Source: tensorflow/addons

Describe the feature and the current behavior/state.
Currently, F1-score cannot be used meaningfully as a metric in keras neural network models, because keras evaluates it on each batch during validation, which results in values that are too small. F1-score was therefore removed from keras; see https://github.com/keras-team/keras/issues/5794, where a quick workaround is also proposed. Tfa's F1-score exhibits exactly the same problem when used with keras.
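To make the batch effect concrete, here is a minimal plain-Python sketch (not taken from the thread; the labels and predictions are made up for illustration). It compares the mean of per-batch F1 scores with the F1 score computed once over all samples. The helper names `counts` and `f1_from_counts` are hypothetical.

```python
def counts(y_true, y_pred):
    """Return (TP, FP, FN) for binary labels and hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels and predictions, split into two batches of four samples.
batches = [
    ([1, 1, 1, 0], [1, 1, 1, 0]),  # batch 1: every prediction is correct
    ([1, 0, 0, 0], [0, 1, 0, 0]),  # batch 2: one FN, one FP, no TP
]

# Batch-wise approach: compute F1 per batch, then average the scores.
per_batch = [f1_from_counts(*counts(t, p)) for t, p in batches]
batch_mean_f1 = sum(per_batch) / len(per_batch)

# Global approach: pool all samples, then compute F1 once.
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
global_f1 = f1_from_counts(*counts(all_true, all_pred))

print(per_batch, batch_mean_f1, global_f1)
```

With these made-up batches the per-batch mean is 0.5 while the global F1 is 0.75. In general the two differ because F1 is a ratio of counts, not an average over samples.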
Relevant information

Are you willing to contribute it (yes/no):
Are you willing to maintain it going forward? (yes/no):
Is there a relevant academic paper? (if so, where):
Is there already an implementation in another framework? (if so, where):
The right way to do this is to use a custom callback function in a way like this: https://github.com/PhilipMay/mltb/blob/7fce1f77294dccf94f6d4c65b2edd058a654617b/mltb/keras.py. See also https://medium.com/@thongonary/how-to-compute-f1-score-for-each-epoch-in-keras-a1acd17715a2
Now @PhilipMay has removed the code from his repo https://github.com/PhilipMay/mltb and refers to tfa instead. But tfa does not provide a solution!
Was it part of tf.contrib? (if so, where):

Which API type would this fall under (layer, metric, optimizer, etc.)
metric

Who will benefit with this feature?
keras users

Any other info.
Here is some code showing the problem. In the last epoch, fbeta_score is still well below 1.0 (0.8495 in the run shown below), although the prediction is 100% accurate. The code needs tfa 0.7.0 with the threshold feature for F1-score.

import random

import numpy as np
import tensorflow_addons as tfa
from keras import layers, models

threshold = 0.9
random.seed(1)

# Toy data: the first feature equals the label, so the task is trivially learnable.
train_labels = np.array([random.randint(0, 1) for _ in range(150000)])
train_data = np.array([[x, 0] for x in train_labels])

network = models.Sequential()
network.add(layers.Dense(1, activation='sigmoid', input_shape=(train_data.shape[1],)))
network.compile(
    optimizer='rmsprop',
    loss='mae',
    metrics=['acc', tfa.metrics.FBetaScore(num_classes=2, average="micro", threshold=threshold)],
)
network.fit(train_data, train_labels, epochs=10, batch_size=128)

# Binarize the predictions with the same threshold and compare them with the labels.
predicted_data = network.predict(train_data)
predicted_data_boolean = np.array([0 if x < threshold else 1 for x in predicted_data])
print(all(predicted_data_boolean == train_labels))

Here is the output, showing an F1 score that is too low (it should be 1.0, because the predicted labels are equal to the training labels):

Epoch 1/10
150000/150000 [==============================] - 2s 11us/step - loss: 0.2835 - acc: 1.0000 - fbeta_score: 0.0000e+00
Epoch 2/10
150000/150000 [==============================] - 2s 10us/step - loss: 0.1757 - acc: 1.0000 - fbeta_score: 0.0000e+00
Epoch 3/10
150000/150000 [==============================] - 1s 10us/step - loss: 0.1063 - acc: 1.0000 - fbeta_score: 0.0848
Epoch 4/10
150000/150000 [==============================] - 1s 10us/step - loss: 0.0624 - acc: 1.0000 - fbeta_score: 0.4408
Epoch 5/10
150000/150000 [==============================] - 1s 10us/step - loss: 0.0359 - acc: 1.0000 - fbeta_score: 0.6159
Epoch 6/10
150000/150000 [==============================] - 1s 10us/step - loss: 0.0204 - acc: 1.0000 - fbeta_score: 0.7071
Epoch 7/10
150000/150000 [==============================] - 1s 10us/step - loss: 0.0115 - acc: 1.0000 - fbeta_score: 0.7632
Epoch 8/10
150000/150000 [==============================] - 2s 11us/step - loss: 0.0065 - acc: 1.0000 - fbeta_score: 0.8012
Epoch 9/10
150000/150000 [==============================] - 2s 13us/step - loss: 0.0036 - acc: 1.0000 - fbeta_score: 0.8286
Epoch 10/10
150000/150000 [==============================] - 2s 12us/step - loss: 0.0020 - acc: 1.0000 - fbeta_score: 0.8495
True

callbacks metrics

tillmo

❤3 👍3

Most helpful comment

@tillmo Well, then I should bring the code back to my small tool lib...

@all: It's really a shame that we (the addons, the keras and the tensorflow teams) have not managed to implement a proper f1 function. This is so basic that I would refuse to call any tool complete without it. Sorry for these self-critical words.

PhilipMay on 26 Jan 2020

👍5

All 18 comments

I just found here that there is a way of directly computing precision, recall and related metrics (but not F1 score, it seems) in keras, without running into the mentioned batch problem, with:

METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
]
model.compile(...,metrics=METRICS)

tillmo on 12 Jan 2020

Hi,

Thanks for opening this issue! It looks like the Keras team removed some global metrics starting with Keras 2.0.0, because those metrics do not provide good information when approximated batch-wise. I'll take a look at the linked callback workaround and help contribute when I have time :)

shun-lin on 26 Jan 2020

It seems that keras.metrics.Precision(name='precision') and keras.metrics.Recall(name='recall') already solve the batch problem. So Keras would only need to add the obvious F1 computation from these values.
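A rough sketch of what such an F1 computation could look like, assuming the metric accumulates true positives, false positives and false negatives across batches and only combines them when the result is requested. `StreamingF1` is a hypothetical plain-Python class, not a Keras or tfa API; its method names only mirror the stateful `update_state` / `result` pattern of Keras metrics, and the `>=` threshold convention is an assumption of this sketch.

```python
class StreamingF1:
    """Hypothetical stateful F1: counts are accumulated over batches, F1 is derived at the end."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.reset_state()

    def reset_state(self):
        # Called at the start of every epoch so counts do not leak between epochs.
        self.tp = 0
        self.fp = 0
        self.fn = 0

    def update_state(self, y_true, y_prob):
        # Called once per batch; only raw counts are stored, never a per-batch score.
        for t, p in zip(y_true, y_prob):
            pred = 1 if p >= self.threshold else 0
            if pred == 1 and t == 1:
                self.tp += 1
            elif pred == 1 and t == 0:
                self.fp += 1
            elif pred == 0 and t == 1:
                self.fn += 1

    def result(self):
        # Precision and recall over everything seen so far, then their harmonic mean.
        predicted_pos = self.tp + self.fp
        actual_pos = self.tp + self.fn
        precision = self.tp / predicted_pos if predicted_pos else 0.0
        recall = self.tp / actual_pos if actual_pos else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)


# Usage with two made-up batches of labels and sigmoid-like scores.
metric = StreamingF1(threshold=0.5)
metric.update_state([1, 1, 1, 0], [0.9, 0.8, 0.7, 0.1])
metric.update_state([1, 0, 0, 0], [0.2, 0.6, 0.3, 0.4])
epoch_f1 = metric.result()
print(metric.tp, metric.fp, metric.fn, epoch_f1)
```

Because only counts are carried between batches, the result does not depend on how the samples are split into batches.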

tillmo on 26 Jan 2020

There is an F1 metric implementation for Keras here:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/metrics/f_scores.py

This issue should be closed imo.

PhilipMay on 26 Jan 2020

> There is an F1 metric implementation for Keras here:
> https://github.com/tensorflow/addons/blob/master/tensorflow_addons/metrics/f_scores.py
>
> This issue should be closed imo.

This metric suffers from the batch problem, as demonstrated by my code above.

tillmo on 26 Jan 2020

@tillmo Well, then I should bring the code back to my small tool lib...

PhilipMay on 26 Jan 2020

👍5

@Saishruthi and @squadrick what do you think about this?

PhilipMay on 26 Jan 2020

@PhilipMay I think you have implemented a proper f1 function. It just does not interact well with Keras. And maybe the place to have an f1 function that interacts well with Keras is Keras, and not tfa. After all, Keras already provides precision and recall, so f1 cannot be a big step.

tillmo on 26 Jan 2020

> @tillmo Well, then I should bring the code back to my small tool lib...

I changed my old f1 code to tf.keras. It is back and usable now.

https://github.com/PhilipMay/mltb#module-keras-for-tfkeras
pip Release 0.3: https://pypi.org/project/mltb/

PhilipMay on 2 Feb 2020

@PhilipMay are there any issues you see with adding your implementation into Addons? We have precedent for function specific imports:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/callbacks/tqdm_progress_bar.py#L68

And I would prefer a working implementation with external dependencies vs. a buggy one. If there are no other issues would you be willing to submit a PR?

seanpmorgan on 3 Feb 2020

👍2

Sure. Will do that in the next few days.

PhilipMay on 3 Feb 2020

🎉2 👍1

Thank you @PhilipMay for working on this. Please feel free to send a PR to the tensorflow repo directly and skip the migration step since this is a metric we want in the main repo.

pavithrasv on 9 Feb 2020

@pavithrasv I will do that. Although I am pretty sure that my implementation will need further discussion and fine-tuning.

PhilipMay on 9 Feb 2020

@pavithrasv, @seanpmorgan and @karmel : started a discussion about the implementation here at TF repo: https://github.com/tensorflow/tensorflow/issues/36799

PhilipMay on 16 Feb 2020

👍3 🚀1

@PhilipMay I've been busy and couldn't sync up with this thread in a while. I'm following the discussion. Thanks for taking the time to do this.

Squadrick on 24 Feb 2020

Ok so I took a closer look at the script demonstrating the bug. I believe there are two small mistakes:

1) The code snippet uses multi-backend keras instead of tf.keras. TF addons classes were never intended to be used with multi-backend keras. TF addons subclasses a tf.keras.metrics.Metric object, but keras expects a keras.metrics.Metric object. Overall, I don't even know how this doesn't actually throw an error, but well, I guess it's multi-backend keras' fault for not raising an error here. For future readers: don't use multi-backend keras. It's deprecated.
2) The threshold for the Fbeta score is set to 0.9, while by default the keras accuracy metric uses a threshold of 0.5, which explains the discrepancy between the accuracy numbers and the Fbeta score.
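The second point can be reproduced without any framework. This is a minimal sketch with made-up sigmoid outputs (an assumption, not values from the run above): positives score between 0.6 and 0.8, so accuracy at a 0.5 threshold is perfect while F1 at a 0.9 threshold is zero. The helper names are hypothetical.

```python
def binarize(probs, threshold):
    """Turn scores into hard 0/1 predictions (>= threshold counts as positive)."""
    return [1 if p >= threshold else 0 for p in probs]


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = [1, 0, 1, 0, 1, 0]
# Made-up scores of a partly trained model: correct side of 0.5, but never above 0.9.
y_prob = [0.7, 0.1, 0.8, 0.2, 0.6, 0.3]

acc_at_05 = accuracy(y_true, binarize(y_prob, 0.5))
f1_at_05 = f1(y_true, binarize(y_prob, 0.5))
f1_at_09 = f1(y_true, binarize(y_prob, 0.9))

print(acc_at_05, f1_at_05, f1_at_09)
```

Here accuracy and F1 agree (both 1.0) when they share the 0.5 threshold, and F1 drops to 0.0 at 0.9 because no sample is predicted positive. This matches the pattern of `acc: 1.0000` next to a low `fbeta_score` in the early epochs of the original log.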

Here is the version of the script with the two issues fixed:

import random

import numpy as np
import tensorflow_addons as tfa
from tensorflow.keras import layers, models  # tf.keras instead of multi-backend keras

threshold = 0.5  # same threshold that the 'acc' metric uses by default
random.seed(1)

# Toy data: the first feature equals the label, so the task is trivially learnable.
train_labels = np.array([random.randint(0, 1) for _ in range(150000)])
train_data = np.array([[x, 0] for x in train_labels])

network = models.Sequential()
network.add(layers.Dense(1, activation='sigmoid', input_shape=(train_data.shape[1],)))
network.compile(
    optimizer='rmsprop',
    loss='mae',
    metrics=['acc', tfa.metrics.FBetaScore(num_classes=2, average="micro", threshold=threshold)],
)
network.fit(train_data, train_labels, epochs=10, batch_size=128)

# Binarize the predictions with the same threshold and compare them with the labels.
predicted_data = network.predict(train_data)
predicted_data_boolean = np.array([0 if x < threshold else 1 for x in predicted_data])
print(all(predicted_data_boolean == train_labels))

Here is the output:

2020-03-15 20:34:44.321274: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-15 20:34:44.325879: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499995000 Hz
2020-03-15 20:34:44.326328: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55fa6a91c8b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-15 20:34:44.326373: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Train on 150000 samples
Epoch 1/10
150000/150000 [==============================] - 1s 9us/sample - loss: 0.4313 - acc: 0.8722 - fbeta_score: 0.8848
Epoch 2/10
150000/150000 [==============================] - 1s 8us/sample - loss: 0.2987 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 3/10
150000/150000 [==============================] - 1s 8us/sample - loss: 0.1922 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 4/10
150000/150000 [==============================] - 1s 8us/sample - loss: 0.1175 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 5/10
150000/150000 [==============================] - 1s 8us/sample - loss: 0.0694 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 6/10
150000/150000 [==============================] - 1s 7us/sample - loss: 0.0400 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 7/10
150000/150000 [==============================] - 1s 8us/sample - loss: 0.0228 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 8/10
150000/150000 [==============================] - 1s 8us/sample - loss: 0.0129 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 9/10
150000/150000 [==============================] - 1s 7us/sample - loss: 0.0072 - acc: 1.0000 - fbeta_score: 1.0000
Epoch 10/10
150000/150000 [==============================] - 1s 7us/sample - loss: 0.0041 - acc: 1.0000 - fbeta_score: 1.0000
True

I believe the error we made here is not realizing that @tillmo was talking about multi-backend keras in all his messages (I just realized now). So to answer your question @tillmo:

Indeed F1 and Fbeta of TF addons don't work well with multi-backend keras. They were designed for tf.keras with tensorflow 2.x.
We will not work towards making it work with multi-backend keras because multi-backend keras is deprecated in favor of tf.keras. The keras-team/keras repo will soon be overwritten with the code of tf.keras. See https://github.com/tensorflow/community/blob/master/rfcs/20200205-standalone-keras-repository.md
If you want to use the F1 and Fbeta score of TF Addons, please use tf.keras.
Unless there are some other bugs we're not aware of, our implementation is bug-free and https://github.com/tensorflow/tensorflow/pull/31818 can be merged.

gabrieldemarmiesse on 15 Mar 2020

@gabrieldemarmiesse, thanks for the explanation. I was not aware of the difference between multi-backend keras and tf.keras, and the fact that the former is deprecated. Now your link provides some explanation, but only discusses a reorganisation of Keras in relation to Tensorflow. The end of the multi-backend nature is not discussed. Probably it is an implicit consequence? (A quite severe one...)

tillmo on 21 Mar 2020

You can get a bit more info about it at https://keras.io/

The major reason was this: it is not realistic for the keras maintainers to continue to maintain backends which represent only 2% of the users. Furthermore CNTK and Theano are both deprecated.

gabrieldemarmiesse on 21 Mar 2020
