Addons: Support for new metrics under tf.metric

Created on 30 May 2019 · 63Comments · Source: tensorflow/addons

Describe the feature and the current behavior/state.

Reference issue: https://github.com/tensorflow/tensorflow/issues/28972#event-2376713783
Reviewer asked me to create a PR in addons.

Support for the following metrics:

[x] Cohen's Kappa
[x] R-Square
[x] F1 (micro/macro/weighted)
[x] F-Beta
[x] Multilabel confusion matrix.
[x] Hamming score
[ ] Jaccard score
[ ] Positive likelihood
[ ] Negative likelihood
[ ] Multiclass AUC
[ ] Adjusted R-Square

Feature Request metrics

Source

SSaishruthi

👍1

Most helpful comment

Thanks @facaiy
I will put a hold on F1-Beta to see if anyone is already working on that.
I will start working on the other metrics. I will be working on F1-micro/macro along with the others mentioned so that it does not block others' work.

SSaishruthi on 31 May 2019

👍2

All 63 comments

Thank you, SSaishruthi.

Yeah, both @AakashKumarNain and @shashvatshahi1998 have made a request for F1 score https://github.com/tensorflow/addons/issues/232 . It looks necessary for our community.

related issue:

facaiy on 31 May 2019

SSaishruthi on 31 May 2019

👍2

Great! Feel free to reach us if any help is needed.

facaiy on 31 May 2019

👍1

Thanks for creating this! Just a note we do already have Focal Loss (also not really a metric... could we move the losses to a separate issue?):
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/losses/focal_loss.py

seanpmorgan on 31 May 2019

👍1

@seanpmorgan Sure.
I have a question before starting. Should this be created under tf.keras or tf.metric?
I was thinking about adding under tf.metric. Some of the linked current works are under tf.keras though.
@facaiy

SSaishruthi on 31 May 2019

Hi, @SSaishruthi. Welcome to addons community!

I have a question before starting. Should this be created under tf.keras or tf.metric?
I was thinking about adding under tf.metric. Some of the linked current works are under tf.keras though.

Do you mean whether we should inherit from tf.metrics.Metric or tf.keras.metrics.Metric? If so, they are just aliases in TF 2.0. But for code style unification, let's simply follow https://github.com/tensorflow/addons/pull/267.

https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/metrics

WindQAQ on 3 Jun 2019

👍1

Thank you @WindQAQ

SSaishruthi on 3 Jun 2019

@WindQAQ @facaiy @seanpmorgan
I have R-square first version ready
https://colab.research.google.com/drive/1qteFJCiJlN3ougAoZM-QtnrZeeHlXHB3

Also, I have a first version of F1 (micro/macro/weighted) too, if you are fine with that.

SSaishruthi on 5 Jun 2019

👍1

@SSaishruthi, well done! As with the same question in https://github.com/tensorflow/addons/pull/267#pullrequestreview-244953599, is there any reason that we could only compute the metric on a single batch instead of accumulating states and computing the overall R-square?

WindQAQ on 5 Jun 2019

Hi @WindQAQ
That's a great point. I was looking at that as well.

One option for the user is to use a Callback
https://keras.io/callbacks/#create-a-callback?

We need to save values in a list as we progress and do the calculation at the end.

SSaishruthi on 5 Jun 2019

Hi @SSaishruthi, I have to clarify my example:

import numpy as np
import tensorflow as tf

y_true = np.array([0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1])
m = tf.keras.metrics.BinaryAccuracy()
m.update_state(y_true, y_pred) # 1st acc: 0.8
print(m.result().numpy()) # 0.8

y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1])
m.update_state(y_true, y_pred) # 2nd acc: 1.0
print(m.result().numpy()) # 0.9 = (0.8 + 1) / 2

m.reset_states() # reset if needed

Metrics inherited from tf.keras.metrics.Mean or MeanMetricWrapper take the average over batches instead of over samples. So the second result above is computed as (0.8 + 1) / 2 rather than (4 + 4) / (5 + 4). In this scenario, I think it is much easier to accumulate states (maintain two variables, total and count, with the result equal to total / count, or simply inherit from tf.keras.metrics.Mean).
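To make the two averaging schemes concrete, here is a small framework-free sketch in plain Python. It reuses the two batches from the example above; the helper name batch_accuracy and the variable names are made up for illustration and are not part of tf.keras.

```python
# Plain-Python illustration only; batch_accuracy is a hypothetical helper.
batches = [
    ([0, 1, 0, 1, 0], [0, 1, 0, 1, 1]),  # 4 of 5 predictions match
    ([0, 1, 0, 1], [0, 1, 0, 1]),        # 4 of 4 predictions match
]


def batch_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Scheme 1: average the per-batch accuracies (every batch counts once).
per_batch = [batch_accuracy(t, p) for t, p in batches]
mean_over_batches = sum(per_batch) / len(per_batch)

# Scheme 2: accumulate total matches and sample count, divide at the end.
total = sum(sum(int(a == b) for a, b in zip(t, p)) for t, p in batches)
count = sum(len(t) for t, _ in batches)
mean_over_samples = total / count

print(mean_over_batches)   # (0.8 + 1) / 2
print(mean_over_samples)   # (4 + 4) / (5 + 4)
```

The two numbers differ whenever batches have different sizes or different accuracies, which is why the choice of state matters.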

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/metrics.py#L525

@facaiy @seanpmorgan, hi Facai and Sean, what do you feel about this? Is it acceptable to inherit from tf.keras.metrics.Mean?

WindQAQ on 6 Jun 2019

@WindQAQ Which metric to inherit from Mean, Tzu-Wei?

facaiy on 6 Jun 2019

Not so sure about that... Some metrics in tf.keras.metrics, like BinaryAccuracy, average over each batch's measurement, while others, like Precision, compute the metric per sample and return an overall number. You may understand what I mean from this notebook. So if metrics follow BinaryAccuracy's pattern, one could easily implement them by inheriting from Mean. But I am quite confused about whether metrics should follow this pattern or not...

WindQAQ on 6 Jun 2019

I will wait until we decide on this. Meanwhile, I have an F1 (micro, macro and weighted) draft ready. I am facing some issues when wrapping it inside a class; it works when written as a function. This one also falls under the current behaviour (batch average).
https://colab.research.google.com/drive/1TqY1zyjEjJY2YVVy_I_TtST4RVs3cS6c
I will keep working on this and the others.
Thanks @WindQAQ @facaiy

SSaishruthi on 6 Jun 2019

@WindQAQ Good question, Tzu-Wei. It's indeed confusing. I think all subclasses of MeanMetricWrapper expect 2-dim y_pred and y_true after I read their test cases.

In [31]: acc_obj = tf.keras.metrics.BinaryAccuracy()

In [32]: acc_obj.update_state([[1], [0]], [[1], [0]])
Out[32]: <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=2.0>

In [33]: acc_obj.update_state([[1]], [[0]])
Out[33]: <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=3.0>

In [34]: acc_obj.result()
Out[34]: <tf.Tensor: id=299, shape=(), dtype=float32, numpy=0.6666667>

facaiy on 9 Jun 2019

@facaiy Got it! Thanks for the clarification here!

WindQAQ on 9 Jun 2019

😄1

@WindQAQ @facaiy
Thanks for the pointers. How are we going to proceed with metrics here?
Do you have any suggestion?
I currently have solutions for F1 and R-Square here.

SSaishruthi on 9 Jun 2019

@SSaishruthi Hi, would you mind taking a look at @AakashKumarNain's PR #267 for kappa? His new idea might be useful for you: always update the confusion matrix, and calculate the result only when needed.

facaiy on 9 Jun 2019

Per tf.keras.metrics.Metric:

To be implemented by subclasses:

__init__(): All state variables should be created in this method by
calling self.add_weight() like: self.var = self.add_weight(...)

update_state(): Has all updates to the state variables like:
self.var.assign_add(...).

result(): Computes and returns a value for the metric
from the state variables.

I think we should create variables for intermediate state, rather than for the final result. Those state variables are updated when update_state is invoked, and the metric computes the final result only when the user calls the result method.
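As a rough illustration of that pattern, the sketch below keeps a confusion matrix as intermediate state and derives Cohen's kappa only in result(). It is plain Python rather than a tf.keras.metrics.Metric subclass, and the class name StreamingKappa is hypothetical; it is not the code from PR #267.

```python
class StreamingKappa:
    """Hypothetical stand-in for a stateful metric; not the addons code."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.reset_states()

    def reset_states(self):
        # Intermediate state: counts, rows = true class, cols = predicted class.
        self.conf = [[0] * self.num_classes for _ in range(self.num_classes)]

    def update_state(self, y_true, y_pred):
        # Only counts are updated here; no kappa value is computed per batch.
        for t, p in zip(y_true, y_pred):
            self.conf[t][p] += 1

    def result(self):
        k = self.num_classes
        n = sum(sum(row) for row in self.conf)
        if n == 0:
            return 0.0  # arbitrary choice for an empty state in this sketch
        observed = sum(self.conf[i][i] for i in range(k)) / n
        row_tot = [sum(row) for row in self.conf]
        col_tot = [sum(self.conf[r][c] for r in range(k)) for c in range(k)]
        expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
        if expected == 1.0:
            return 0.0  # degenerate single-class case; arbitrary choice here
        return (observed - expected) / (1.0 - expected)


metric = StreamingKappa(num_classes=3)
metric.update_state([0, 0, 1], [0, 0, 1])  # first batch
metric.update_state([1, 2, 2], [2, 2, 2])  # second batch
kappa_streamed = metric.result()

single = StreamingKappa(num_classes=3)
single.update_state([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
kappa_single = single.result()
print(kappa_streamed, kappa_single)
```

Because only counts are stored, feeding the data in two batches or in one gives the same kappa.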

facaiy on 9 Jun 2019

Thanks @facaiy . I will work on both the metrics and keep you posted.

SSaishruthi on 9 Jun 2019

😄1

Hi @facaiy @WindQAQ
I have updated the R-Square metric. It looks like I am getting a minute difference in the calculation.
Can you please let me know if this is fine?
https://colab.research.google.com/drive/1_s1bFZvFw6WhRoufuTw0Ucd_dtKExMTA

SSaishruthi on 10 Jun 2019

Hi @SSaishruthi, that's because the total variance is not equal to the sum of each batch's variance. Here is the revised notebook. Maintain additional variables like \sum y^2, \sum y and n so that we can compute the total sum of squares correctly. Still need more tests though :-)
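A minimal sketch of that bookkeeping, in plain Python and not taken from the notebook: the total sum of squares is recovered as \sum y^2 - (\sum y)^2 / n, and the residual sum of squares is simply added up across batches. The class name StreamingR2 and the sample numbers are invented for illustration.

```python
class StreamingR2:
    """Hypothetical accumulator for R-square over several batches."""

    def __init__(self):
        self.n = 0             # number of samples seen
        self.sum_y = 0.0       # running sum of y
        self.sum_y_sq = 0.0    # running sum of y^2
        self.sum_res_sq = 0.0  # running sum of (y - y_hat)^2

    def update_state(self, y_true, y_pred):
        for y, y_hat in zip(y_true, y_pred):
            self.n += 1
            self.sum_y += y
            self.sum_y_sq += y * y
            self.sum_res_sq += (y - y_hat) ** 2

    def result(self):
        if self.n == 0:
            return 0.0  # arbitrary choice for an empty state
        # Total sum of squares over everything seen so far.
        ss_tot = self.sum_y_sq - self.sum_y ** 2 / self.n
        if ss_tot == 0.0:
            return 0.0  # constant target; arbitrary choice in this sketch
        return 1.0 - self.sum_res_sq / ss_tot


batches = [([1.0, 2.0], [1.1, 1.9]), ([3.0, 4.0], [3.2, 3.8])]

streamed = StreamingR2()
per_batch_r2 = []
for y_true, y_pred in batches:
    streamed.update_state(y_true, y_pred)
    one_batch = StreamingR2()
    one_batch.update_state(y_true, y_pred)
    per_batch_r2.append(one_batch.result())

r2_overall = streamed.result()
r2_batch_mean = sum(per_batch_r2) / len(per_batch_r2)
print(r2_overall, r2_batch_mean)
```

With these numbers the per-batch values average to about 0.90 while the overall value is about 0.98, which is the kind of gap described above. For large or badly scaled targets the subtraction in ss_tot can lose precision, so a production version may want a more careful update.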

WindQAQ on 11 Jun 2019

Thanks @WindQAQ
I will add more test cases. Also,
What should be my next step with respect to R2? Should I create a PR?
I am also modifying F1 based on this.

SSaishruthi on 11 Jun 2019

Please file the PR after the test cases are done (one PR per metric). For the F1 score part, instead of maintaining a whole n by n confusion matrix, I think a more lightweight method is to maintain true_positives, false_positives and false_negatives, like the implementation of Precision:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/metrics.py#L1089
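A framework-free sketch of that idea for single-label multi-class data: only per-class true-positive, false-positive and false-negative counts are kept, and the micro, macro and weighted F1 values are derived from them at the end. The function names are hypothetical and this is not the code from the PR.

```python
def new_counts(num_classes):
    """Per-class counters; this is the only state that is kept."""
    return {key: [0] * num_classes for key in ("tp", "fp", "fn")}


def update_counts(counts, y_true, y_pred):
    """Add one batch of integer class labels to the counters."""
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts["tp"][t] += 1
        else:
            counts["fp"][p] += 1  # predicted class gains a false positive
            counts["fn"][t] += 1  # true class gains a false negative


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


counts = new_counts(3)
update_counts(counts, [0, 0, 1], [0, 0, 1])  # first batch
update_counts(counts, [1, 2, 2], [2, 2, 2])  # second batch

per_class = [
    f1(tp, fp, fn)
    for tp, fp, fn in zip(counts["tp"], counts["fp"], counts["fn"])
]
# Macro: unweighted mean of the per-class scores.
macro_f1 = sum(per_class) / len(per_class)
# Micro: pool the counts over classes first, then apply the formula once.
micro_f1 = f1(sum(counts["tp"]), sum(counts["fp"]), sum(counts["fn"]))
# Weighted: mean of per-class scores weighted by support (tp + fn).
support = [tp + fn for tp, fn in zip(counts["tp"], counts["fn"])]
weighted_f1 = sum(f * s for f, s in zip(per_class, support)) / sum(support)
print(macro_f1, micro_f1, weighted_f1)
```

The state here is three vectors of length n rather than an n by n matrix, which is the saving the comment above points at.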

WindQAQ on 11 Jun 2019

Sure. Will do
Thanks a lot @WindQAQ

SSaishruthi on 11 Jun 2019

👍1

Hi @WindQAQ @facaiy

I have created a draft version of the F1 macro score (variable names need to be updated).

https://colab.research.google.com/drive/1LJ1yb8cUgisNP3ETfMLRYPQEyJivzDv-

Please let me know if this is fine. Meanwhile, I will start working on the R2 PR.

SSaishruthi on 12 Jun 2019

@SSaishruthi Took a look at the F1 macro draft, it looks good but needs a few changes. Could you open a PR for F1 macro and we can review it there directly.

Squadrick on 12 Jun 2019

👍1

@SSaishruthi I edited your initial post to track these a little more easily, since this will be our metrics tracking issue. Hope you don't mind.

seanpmorgan on 12 Jun 2019

👍1

Thanks, @Squadrick for the review. Will create a PR.

SSaishruthi on 12 Jun 2019

@seanpmorgan @WindQAQ @Squadrick

Hello,

I have a solution for F1-micro as well. Should I combine it with F1-macro or we can have it separately?

SSaishruthi on 12 Jun 2019

Put them in the same module, but make them separate classes.

Squadrick on 13 Jun 2019

👍1

Hello @facaiy @seanpmorgan @WindQAQ @Squadrick

I have two PRs.

F1 micro/macro/weighted
R-square

Colab notebooks:

F1 scores
https://colab.research.google.com/drive/1qSq0SsYkPqjdKUgM1RM4kKM67X75ocFj

R-Square
https://colab.research.google.com/drive/1Qh3zYhpoB_Ln6O2d6YSBqviYsuw0W9VS

Working on test cases now. Also, any suggestion on variable names will be helpful. I feel it needs to be more clear. Thanks @Squadrick for looking into the PRs

Should I use Cohen's Kappa test scripts as a sample?

SSaishruthi on 14 Jun 2019

❤1

@SSaishruthi The variable names are fine, it's clear enough. Yup, you can use Cohen's Kappa test scripts as an example.

Squadrick on 14 Jun 2019

Has anyone implemented hamming score in tensorflow? I am doing multilabel classification of news articles in keras. The test accuracy is 96% which I don't think is a proper measure. Hamming score has been of great interest in multilabel classification. Please add this to the list as well and if anyone has come up with a solution...please share it....:) Thank you

spartian on 14 Jun 2019

Hi @spartian
I have proposed this before and will be working on that as well. Looks like we have updated the issue and missed this metric. Thanks for bringing this up.

SSaishruthi on 14 Jun 2019

Hey @SSaishruthi,

Could you give a rough idea about the implementation? I am kind of lost here...

spartian on 14 Jun 2019

Hello again,

Tried a set of hamming metrics.

Hamming distance
Hamming loss (multiclass/multilabel)

https://colab.research.google.com/drive/1Msuv5xUu7lu5wDH1ei-VOPB-UnBolDfB

If it looks good, I can start wrapping inside metric.
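For readers who have not met the metric, here is a minimal plain-Python sketch of the multilabel Hamming loss as it is usually defined: the fraction of individual label positions that are predicted wrongly. It is an illustration under that assumption, not the code from the notebook, and the function name is made up.

```python
def hamming_loss(y_true, y_pred):
    """y_true and y_pred are lists of equal-length 0/1 label vectors."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += int(t != p)  # one mismatched label position
            total += 1
    return wrong / total if total else 0.0


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
print(loss)  # 2 wrong positions out of 6
```

A streaming version would only need to keep the two counters, wrong and total, between batches.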

SSaishruthi on 15 Jun 2019

👍1

Thank you @SSaishruthi

spartian on 15 Jun 2019

@SSaishruthi

One more thing...look at this thread
https://stats.stackexchange.com/questions/233275/multilabel-classification-metrics-on-scikit/234354#234354

It says something here about hamming_score. Are hamming_score and Hamming distance the same?

spartian on 15 Jun 2019

@SSaishruthi The code in the colab notebook looks alright, but it currently converts to np which isn't desirable. You can open the PR once it uses only tf.

Squadrick on 16 Jun 2019

👍1

@Squadrick Thanks for the review. I have updated to use only tf
https://colab.research.google.com/drive/1Msuv5xUu7lu5wDH1ei-VOPB-UnBolDfB

Will create PR for the same after adding test cases

SSaishruthi on 17 Jun 2019

@Squadrick @facaiy @seanpmorgan @WindQAQ
I am working on hamming metrics PR. Is it ok if I discuss hamming loss in this issue or create a separate one?

Do we need to hold state information for hamming loss?

SSaishruthi on 17 Jun 2019

@SSaishruthi please open a separate issue... thanks for all your work!

seanpmorgan on 17 Jun 2019

👍1

Thanks @seanpmorgan
I have created a new issue and will post my updates there.

SSaishruthi on 17 Jun 2019

For reference, hamming loss discussed here: #305

Squadrick on 18 Jun 2019

Working on other metrics. Will update soon

SSaishruthi on 18 Jun 2019

Tried f1-beta score and looks fine for now.

Most parts of the code match with f1-score except the final formula which adds a beta parameter.
Can we call f1-score inside the f-beta class and add a beta parameter to it?
By default, beta value will be 1 (normal f1) and when using f1-beta it can be modified.
This will match the sklearn API.
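The relationship between the two scores can be sketched from the same counts. The function below is a hypothetical plain-Python helper, not the code from the PR; it uses the standard F-beta formula, which reduces to F1 when beta is 1.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from raw counts; beta=1.0 gives the ordinary F1 score."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


tp, fp, fn = 8, 2, 4  # made-up counts: precision 0.8, recall 8/12
f1_score_value = fbeta_from_counts(tp, fp, fn)        # default beta
f2_score_value = fbeta_from_counts(tp, fp, fn, 2.0)   # favours recall
f05_score_value = fbeta_from_counts(tp, fp, fn, 0.5)  # favours precision
print(f1_score_value, f2_score_value, f05_score_value)
```

With precision above recall, as in these counts, raising beta pulls the score towards recall and lowering it pulls the score towards precision.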

@Squadrick

SSaishruthi on 19 Jun 2019

@SSaishruthi Yup, that sounds good.

Squadrick on 19 Jun 2019

👍1

Thanks @Squadrick
Will create a PR for F1-beta soon. Meanwhile, here is the draft of Multilabel confusion matrix for your reference: https://colab.research.google.com/drive/1fsB4tUo23Ejg0MjrqRl-ND5A1amCiYOj

SSaishruthi on 20 Jun 2019

@SSaishruthi Feel free to open a PR for the multilabel confusion matrix.

Squadrick on 21 Jun 2019

👍1

Created PR for F1-Beta. Will create for confusion matrix soon.

SSaishruthi on 25 Jun 2019

Hi @Squadrick
I have wrapped multiclass confusion matrix and here is the colab link (under implementation)
https://colab.research.google.com/drive/1fsB4tUo23Ejg0MjrqRl-ND5A1amCiYOj#scrollTo=AT70Ux6EB6qk

Does it look good? If yes, I will go ahead with PR creation

SSaishruthi on 26 Jun 2019

Status: F-Beta, Multilabel confusion matrix and Hamming metrics are done. PRs are open for the same.
Working on next set as well as current metrics enhancement.

SSaishruthi on 12 Jul 2019

Hi,
I have a multi-class classification problem and I want to measure AUC on training and test data.

tf.keras has implemented an AUC metric (tf.keras.metrics.AUC), but I'm not able to tell whether this metric can safely be used in multi-class problems. Even the example "Classification on imbalanced data" on the official web page is dedicated to a binary classification problem.

I have implemented a CNN model that predicts six classes, having a softmax layer that gives the probabilities of all the classes. I used this metric as follows

self.model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(hp.get("learning_rate")),
                   metrics=['accuracy', AUC()])

and the code was executed without any problem. However, sometimes I see some results that are quite strange to me. For example, the model reported an accuracy of 0.78333336 and an AUC equal to 0.97327775. Is this possible? Can a model have a low accuracy and such a high AUC?

I wonder whether, although the code does not raise any error, the AUC metric is being computed incorrectly.

Can somebody confirm whether or not this metric supports multi-class classification problems?

Thanks in advance,

Oscar

ogreyesp on 30 Oct 2019

Hi,

@ogreyesp I am also in the same boat, and was wondering if you could find an answer?

I am mainly interested in AUC and 'confusion matrix' for multi-class classification.

Thanks in advance,
Puneet

puneetpandey37 on 31 Dec 2019

I am also waiting for the same. Like @ogreyesp said, it doesn't throw any errors when used. I wondered how I was getting such a high AUC score. IMHO, at least the existing metric should throw an error for multi-class data to avoid the embarrassment of posting wrong results (pun intended).

nayash on 17 Jan 2020

I also have this problem. @DarkKnight1991, @puneetpandey37, and @ogreyesp, did you find any solution for that?

ROAbb on 18 Feb 2020

Metrics seem to have some gaping problems right now. Didn't get the time to go through each, I'll do it soon and post a fix, or at least a warning message for multi-class data. Sorry about the delay.

Squadrick on 24 Feb 2020

I also have the same problem. Adding the Hamming score and multi-class AUC would be nice.

Hosein47 on 25 Mar 2020

@ROAbb

I solved this problem by means of creating a custom class that computes these metrics, and then passing it as a callback when fitting a model.

from sklearn.metrics import matthews_corrcoef, f1_score, balanced_accuracy_score
# assumes tf.keras; with standalone Keras, import Callback from keras.callbacks
from tensorflow.keras.callbacks import Callback

# custom metrics for computing balanced accuracy, MCC and F1
class CustomMetrics(Callback):

    def __init__(self, X, Y):
        super(CustomMetrics, self).__init__()
        self.X = X
        # convert one-hot labels to class indices
        self.Y = Y.argmax(axis=-1)

    def on_epoch_end(self, epoch, logs=None):
        # logs may be None, so fall back to an empty dict before writing to it
        logs = logs if logs is not None else {}
        predictions = self.model.predict(self.X)
        y_pred = predictions.argmax(axis=-1)

        logs['val_bacc'] = balanced_accuracy_score(self.Y, y_pred)
        logs['val_f1'] = f1_score(self.Y, y_pred, average='micro')
        logs['val_mcc'] = matthews_corrcoef(self.Y, y_pred)

......

if X_test is not None:
    metrics = CustomMetrics(X_test, y_test)
else:
    metrics = CustomMetrics(X_train, y_train)

# callbacks is documented as a list of callback instances
model.fit(X_train, y_train, validation_data=validation_data,
          batch_size=8,
          epochs=30,
          callbacks=[metrics])

I hope that this solution can help you. Best

ogreyesp on 30 Mar 2020

May I implement Positive Likelihood and Negative Likelihood?

marload on 20 Jul 2020

May I implement Positive Likelihood and Negative Likelihood?

Sure!

seanpmorgan on 20 Jul 2020

To @ogreyesp and others who are interested in AUC in a multiclass prediction task (i.e. softmax as model final output).
I am also running into this scenario. I am not exactly sure what the tf.keras AUC metric is computing in this case. If someone can read the code and summarize, that would be great. Or just test it out with a simple case. I find that a conceptual extension from the binary case to the multiclass case is not that trivial.

For now, what I use is a “1 vs other” approach, reducing the evaluation problem to many separate binary classifications. E.g. if you are to predict red, green, or blue, you can consider p(red) vs. p(not red) and compute AUC as usual for this binary case. Do this separately for each class. Then these AUCs can be considered separately, or (weighted-)averaged if you want.
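A small plain-Python sketch of this “1 vs other” reduction, with made-up probabilities for three classes. The binary AUC is computed here by pair counting (the share of positive/negative pairs that are ranked correctly, ties counting half), which is one standard way to compute it and not necessarily what tf.keras.metrics.AUC does internally; all function names are hypothetical.

```python
def binary_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, num_classes):
    """One binary AUC per class: p(class k) against 'not class k'."""
    aucs = []
    for k in range(num_classes):
        labels = [1 if t == k else 0 for t in y_true]
        scores = [row[k] for row in probs]
        aucs.append(binary_auc(labels, scores))
    return aucs


# Classes: 0 = red, 1 = green, 2 = blue (illustrative softmax outputs).
y_true = [0, 1, 2, 0, 1, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]

aucs = one_vs_rest_auc(y_true, probs, num_classes=3)
macro_auc = sum(aucs) / len(aucs)
predicted = [row.index(max(row)) for row in probs]
accuracy = sum(int(t == p) for t, p in zip(y_true, predicted)) / len(y_true)
print(aucs, macro_auc, accuracy)
```

In this toy data the averaged AUC comes out higher than the argmax accuracy, so the two numbers need not move together.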

For the binary case, I think it is possible for AUC and accuracy to differ significantly. Accuracy, Precision, Recall and F1 depend on a “threshold” (this is actually a param in tf.keras metrics). By default, it is 0.5, i.e. if the probability of something is higher than this, you interpret it as positive. But you can set this threshold higher, at 0.9 for example. Then you will get fewer positives, and most of the time that is a higher-precision, lower-recall scenario. AUC is independent of this threshold since it is the area under the curve of this sort of trade-off. It is a better summary of the predictive power of your model. But I find AUC is not an “operational” param, since in a real deployment you will need to choose a threshold and get some “definite” precision or recall.
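The threshold effect can be seen on a handful of made-up scores. This is a plain-Python sketch with a hypothetical helper; it treats a score strictly above the threshold as a positive prediction, which is an assumption of the sketch.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores above the threshold count as positive."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

p_default, r_default = precision_recall(labels, scores, threshold=0.5)
p_strict, r_strict = precision_recall(labels, scores, threshold=0.9)
print(p_default, r_default)
print(p_strict, r_strict)
```

Raising the threshold from 0.5 to 0.9 on these scores moves precision up and recall down, while the ranking of the scores (and therefore the AUC) is unchanged.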

I think it would be great if TF 2.0 could include a “best practice” on what an aggregate AUC means for multi-class. I ain’t sure if this is still an active area of research, or too application-specific. If so, writing your own custom metrics may be the way to go.

kechan on 13 Sep 2020
