Describe the feature and the current behavior/state.
Reference issue: https://github.com/tensorflow/tensorflow/issues/28972#event-2376713783
Reviewer asked me to create a PR in addons.
Support for following metrics
Thank you, SSaishruthi.
Yeah, both @AakashKumarNain and @shashvatshahi1998 have made a request for F1 score https://github.com/tensorflow/addons/issues/232 . It looks necessary for our community.
related issue:
Thanks @facaiy
I will put a hold on F1-Beta to see if anyone is already working on that.
I will start working on the other metrics. I will work on F1-micro/macro along with the others mentioned so that I do not block anyone else's work.
Great! Feel free to reach us if any help is needed.
Thanks for creating this! Just a note we do already have Focal Loss (also not really a metric... could we move the losses to a separate issue?):
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/losses/focal_loss.py
@seanpmorgan Sure.
I have a question before starting. Should this be created under tf.keras or tf.metrics?
I was thinking about adding it under tf.metrics. Some of the linked current work is under tf.keras though.
@facaiy
Hi, @SSaishruthi. Welcome to addons community!
Do you mean whether we should inherit from tf.metrics.Metric or tf.keras.metrics.Metric? If so, they are aliases of each other in TF 2.0. For consistent code style, let's simply follow https://github.com/tensorflow/addons/pull/267.
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/metrics
Thank you @WindQAQ
@WindQAQ @facaiy @seanpmorgan
I have R-square first version ready
https://colab.research.google.com/drive/1qteFJCiJlN3ougAoZM-QtnrZeeHlXHB3
Also, I have F1 - Micro/Macro/ weighted first version too if you are fine with that.
@SSaishruthi, well done! The same question as in https://github.com/tensorflow/addons/pull/267#pullrequestreview-244953599 applies here: is there any reason we can only compute the metric on a single batch instead of accumulating state and computing the overall R-square?
Hi @WindQAQ
That's a great point. I was looking at that as well.
Hi @SSaishruthi, I have to clarify my example:
import numpy as np
import tensorflow as tf
y_true = np.array([0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1])
m = tf.keras.metrics.BinaryAccuracy()
m.update_state(y_true, y_pred) # 1st acc: 0.8
print(m.result().numpy()) # 0.8
y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1])
m.update_state(y_true, y_pred) # 2nd acc: 1.0
print(m.result().numpy()) # 0.9 = (0.8 + 1) / 2
m.reset_states() # reset if needed
Metrics inherited from tf.keras.metrics.Mean or MeanMetricWrapper take the average over batches instead of over samples. So the second result above is computed as (0.8 + 1) / 2 rather than (4 + 4) / (5 + 4). In this scenario, I think it is much easier to accumulate state: either maintain two variables, total and count, and return total / count as the result, or simply inherit from tf.keras.metrics.Mean.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/metrics.py#L525
@facaiy @seanpmorgan, hi Facai and Sean, what do you feel about this? Is it acceptable to inherit from tf.keras.metrics.Mean?
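The difference between the two aggregation schemes can be checked with plain arithmetic. The following is a minimal sketch in plain Python (no TensorFlow involved); the helper name `batch_accuracy` is made up for illustration and the data are the two batches from the example above.

```python
# Plain-Python illustration of the two aggregation schemes discussed above.
# Nothing here is a TensorFlow API; batch_accuracy is a hypothetical helper.
batches = [
    ([0, 1, 0, 1, 0], [0, 1, 0, 1, 1]),  # 4 of 5 correct
    ([0, 1, 0, 1], [0, 1, 0, 1]),        # 4 of 4 correct
]


def batch_accuracy(y_true, y_pred):
    """Fraction of matching labels within a single batch."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Scheme 1: average the per-batch accuracies (every batch has equal weight).
per_batch = [batch_accuracy(t, p) for t, p in batches]
batch_average = sum(per_batch) / len(per_batch)

# Scheme 2: pool all samples (every sample has equal weight).
total_correct = sum(sum(t == p for t, p in zip(yt, yp)) for yt, yp in batches)
total_samples = sum(len(yt) for yt, _ in batches)
sample_average = total_correct / total_samples

print(batch_average)   # (0.8 + 1.0) / 2
print(sample_average)  # (4 + 4) / (5 + 4)
```

The two numbers only coincide when all batches have the same size, which is why the choice matters for a last, smaller batch.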
@WindQAQ Which metric to inherit from Mean, Tzu-Wei?
Not so sure about that... Some metrics in tf.keras.metrics, like BinaryAccuracy, average over each batch's measurement, while others, like Precision, compute the metric per sample and return an overall number. You may understand what I mean from this notebook. So if metrics follow BinaryAccuracy's pattern, one could easily implement them by inheriting from Mean. But I am quite confused about whether metrics should follow this pattern or not...
I will wait till we decide on this. Meanwhile, I have an F1 (micro, macro and weighted) draft ready. I am facing some issues when wrapping it inside the class; it works when written as a function. This draft also falls under the current behavior (batch average).
https://colab.research.google.com/drive/1TqY1zyjEjJY2YVVy_I_TtST4RVs3cS6c
will keep working on this and other.
Thanks @WindQAQ @facaiy
@WindQAQ Good question, Tzu-Wei. It's indeed confusing. I think all subclasses of MeanMetricWrapper expect 2-dim y_pred and y_true after I read their test cases.
In [31]: acc_obj = tf.keras.metrics.BinaryAccuracy()
In [32]: acc_obj.update_state([[1], [0]], [[1], [0]])
Out[32]: <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=2.0>
In [33]: acc_obj.update_state([[1]], [[0]])
Out[33]: <tf.Variable 'UnreadVariable' shape=() dtype=float32, numpy=3.0>
In [34]: acc_obj.result()
Out[34]: <tf.Tensor: id=299, shape=(), dtype=float32, numpy=0.6666667>
@facaiy Got it! Thanks for the clarification here!
@WindQAQ @facaiy
Thanks for the pointers. How are we going to proceed with metrics here?
Do you have any suggestion?
I currently have solutions for F1 and R-Square here.
@SSaishruthi Hi, would you mind taking a look at @AakashKumarNain's PR #267 for kappa? His new idea might be useful for you: always update the confusion matrix, and calculate the result only when needed.
To be implemented by subclasses:
__init__(): All state variables should be created in this method by calling self.add_weight() like: self.var = self.add_weight(...)
update_state(): Has all updates to the state variables like: self.var.assign_add(...).
result(): Computes and returns a value for the metric from the state variables.
I think we should create variables for the intermediate state rather than for the final result. Those state variables are updated when update_state is invoked, and the metric computes the final result only when the user calls the result method.
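The pattern described above (keep only intermediate state, derive the result on demand) can be sketched without TensorFlow. The class below is a plain-Python stand-in, not a tf.keras.metrics.Metric subclass; the name `RunningKappa` and the choice to return 0.0 in degenerate cases are assumptions made for illustration. It keeps a confusion matrix as state and computes Cohen's kappa only in `result()`.

```python
class RunningKappa:
    """Plain-Python stand-in for the Metric pattern (hypothetical name).

    State: an n-by-n confusion matrix. The final value (Cohen's kappa)
    is never stored; it is derived from the state in result().
    """

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.matrix = [[0] * num_classes for _ in range(num_classes)]

    def update_state(self, y_true, y_pred):
        # Only the counts change here; labels are integer class indices.
        for t, p in zip(y_true, y_pred):
            self.matrix[t][p] += 1

    def result(self):
        n = sum(sum(row) for row in self.matrix)
        if n == 0:
            return 0.0  # assumption: report 0.0 before any update
        observed = sum(self.matrix[k][k] for k in range(self.num_classes)) / n
        row_totals = [sum(row) for row in self.matrix]
        col_totals = [sum(col) for col in zip(*self.matrix)]
        expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
        if expected == 1.0:
            return 0.0  # assumption: degenerate single-class case
        return (observed - expected) / (1.0 - expected)

    def reset_states(self):
        self.matrix = [[0] * self.num_classes for _ in range(self.num_classes)]


metric = RunningKappa(num_classes=2)
metric.update_state([0, 0], [0, 1])  # first batch
metric.update_state([1, 1], [0, 1])  # second batch
print(metric.result())
```

Because the state is a matrix of counts, the result is the same no matter how the samples are split into batches.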
Thanks @facaiy . I will work on both the metrics and keep you posted.
Hi @facaiy @WindQAQ
I have updated R-Square metric. Looks like I am getting minute difference in the calculation.
Can you please let me know if this is fine?
https://colab.research.google.com/drive/1_s1bFZvFw6WhRoufuTw0Ucd_dtKExMTA
Hi @SSaishruthi, that's because the total variance is not equal to the sum of each batch's variance. Here is the revised notebook. Maintain additional variables such as \sum y^2, \sum y and n so that we can compute the total sum of squares correctly. Still need more tests though :-)
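A minimal sketch of the bookkeeping suggested above, in plain Python rather than TensorFlow. It assumes the usual definition R^2 = 1 - SS_res / SS_tot with SS_tot = \sum y^2 - (\sum y)^2 / n; the class name `RunningR2` and its attribute names are made up for illustration and are not taken from the notebook.

```python
class RunningR2:
    """Streaming R-square from running sums (hypothetical helper)."""

    def __init__(self):
        self.sum_y = 0.0      # running sum of y
        self.sum_y2 = 0.0     # running sum of y^2
        self.sum_res2 = 0.0   # running sum of squared residuals (y - y_hat)^2
        self.n = 0            # number of samples seen

    def update_state(self, y_true, y_pred):
        for y, y_hat in zip(y_true, y_pred):
            self.sum_y += y
            self.sum_y2 += y * y
            self.sum_res2 += (y - y_hat) ** 2
            self.n += 1

    def result(self):
        # Total sum of squares over ALL samples seen so far.
        # Assumes n > 0 and that the targets are not all identical.
        ss_tot = self.sum_y2 - self.sum_y ** 2 / self.n
        return 1.0 - self.sum_res2 / ss_tot


metric = RunningR2()
metric.update_state([1.0, 2.0], [1.1, 1.9])  # first batch
metric.update_state([3.0, 4.0], [3.2, 3.8])  # second batch
print(metric.result())
```

One caveat with this formulation: subtracting two large running sums can lose precision when the targets have a large mean relative to their spread, so a production version may want a more numerically careful update.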
Thanks @WindQAQ
I will add more test cases. Also,
What should be my next step with respect to R2? Should I create a PR?
I am also modifying F1 based on this.
Please file the PR after the test cases are done (one PR per metric). For the F1 score part, instead of maintaining a whole n-by-n confusion matrix, I think a more lightweight method is to maintain true_positives, false_positives and false_negatives, like the implementation of Precision:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/metrics.py#L1089
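As a rough illustration of the lightweight approach, the sketch below keeps three per-class counters and derives micro and macro F1 from them. It is plain Python, assumes single-label multiclass targets given as integer class indices, and uses the hypothetical name `RunningF1`; it is not the code from the draft notebook or from tf.keras.

```python
class RunningF1:
    """F1 from per-class TP/FP/FN counters (hypothetical helper)."""

    def __init__(self, num_classes):
        self.tp = [0] * num_classes
        self.fp = [0] * num_classes
        self.fn = [0] * num_classes

    def update_state(self, y_true, y_pred):
        for t, p in zip(y_true, y_pred):
            if t == p:
                self.tp[t] += 1
            else:
                self.fp[p] += 1  # predicted class gets a false positive
                self.fn[t] += 1  # true class gets a false negative

    @staticmethod
    def _f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Assumption: a class with no support and no predictions scores 0.0.
        return 2 * tp / denom if denom else 0.0

    def result(self, average="macro"):
        if average == "micro":
            # Pool the counters over classes, then compute a single F1.
            return self._f1(sum(self.tp), sum(self.fp), sum(self.fn))
        # Macro: compute F1 per class, then take the unweighted mean.
        per_class = [self._f1(*c) for c in zip(self.tp, self.fp, self.fn)]
        return sum(per_class) / len(per_class)


metric = RunningF1(num_classes=3)
metric.update_state([0, 1], [0, 2])
metric.update_state([2, 2], [2, 2])
print(metric.result("macro"), metric.result("micro"))
```

The state is three vectors of length n instead of an n-by-n matrix, which is the saving referred to above.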
Sure. Will do
Thanks a lot @WindQAQ
Hi @WindQAQ @facaiy
I have created a draft version of the F1 macro score (the variable names need to be updated).
https://colab.research.google.com/drive/1LJ1yb8cUgisNP3ETfMLRYPQEyJivzDv-
Please let me know if this is fine. Meanwhile, I will start working on the R2 PR.
@SSaishruthi I took a look at the F1 macro draft. It looks good but needs a few changes. Could you open a PR for F1 macro so we can review it there directly?
@SSaishruthi I edited your initial post to track these a little more easily, since this will be our metrics tracking issue. Hope you don't mind.
Thanks, @Squadrick for the review. Will create a PR.
@seanpmorgan @WindQAQ @Squadrick
Hello,
I have a solution for F1-micro as well. Should I combine it with F1-macro or we can have it separately?
Put them in the same module, but make them separate classes.
Hello @facaiy @seanpmorgan @WindQAQ @Squadrick
I have two PRs.
Colab notebooks:
F1 scores
https://colab.research.google.com/drive/1qSq0SsYkPqjdKUgM1RM4kKM67X75ocFj
R-Square
https://colab.research.google.com/drive/1Qh3zYhpoB_Ln6O2d6YSBqviYsuw0W9VS
Working on test cases now. Also, any suggestion on variable names will be helpful. I feel it needs to be more clear. Thanks @Squadrick for looking into the PRs
Should I use Cohen's Kappa test scripts as a sample?
@SSaishruthi The variable names are fine, it's clear enough. Yup, you can use Cohen's Kappa test scripts as an example.
Has anyone implemented hamming score in tensorflow? I am doing multilabel classification of news articles in keras. The test accuracy is 96% which I don't think is a proper measure. Hamming score has been of great interest in multilabel classification. Please add this to the list as well and if anyone has come up with a solution...please share it....:) Thank you
Hi @spartian
I have proposed this before and will be working on that as well. Looks like we have updated the issue and missed this metric. Thanks for bringing this up.
Hey @SSaishruthi,
Could you give a rough idea about the implementation? I am kinda lost here...
Hello again,
Tried a set of hamming metrics.
https://colab.research.google.com/drive/1Msuv5xUu7lu5wDH1ei-VOPB-UnBolDfB
If it looks good, I can start wrapping inside metric.
Thank you @SSaishruthi
@SSaishruthi
One more thing...look at this thread
https://stats.stackexchange.com/questions/233275/multilabel-classification-metrics-on-scikit/234354#234354
It says something here about hamming_score. Are hamming_score and Hamming distance the same?
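For reference, here is a minimal sketch of Hamming loss for multilabel targets, taken as the Hamming distance between true and predicted label vectors normalized by the total number of label slots. It is plain Python with made-up function names, and it does not settle whether the "hamming_score" in the linked thread refers to the same quantity.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length label vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly, over all samples."""
    mismatches = sum(hamming_distance(t, p) for t, p in zip(y_true, y_pred))
    total_labels = sum(len(t) for t in y_true)
    return mismatches / total_labels


# Two samples, three binary labels each; one label is wrong in each sample.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(hamming_loss(y_true, y_pred))  # 2 mismatches out of 6 label slots
```

Since the loss is a ratio of two counts, a stateful version would only need to accumulate the mismatch count and the label count across batches.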
@SSaishruthi The code in the colab notebook looks alright, but it currently converts to np which isn't desirable. You can open the PR once it uses only tf.
@Squadrick Thanks for the review. I have updated to use only tf
https://colab.research.google.com/drive/1Msuv5xUu7lu5wDH1ei-VOPB-UnBolDfB
Will create PR for the same after adding test cases
@Squadrick @facaiy @seanpmorgan @WindQAQ
I am working on hamming metrics PR. Is it ok if I discuss hamming loss in this issue or create a separate one?
Do we need to hold state information for hamming loss?
@SSaishruthi please open a separate issue... thanks for all your work!
Thanks @seanpmorgan
I have created a new issue and will post my updates there.
For reference, hamming loss discussed here: #305
Working on other metrics. Will update soon
Tried f1-beta score and looks fine for now.
Most parts of the code match with f1-score except the final formula which adds a beta parameter.
Can we call f1-score inside the f-beta class and add a beta parameter to it?
By default, beta value will be 1 (normal f1) and when using f1-beta it can be modified.
This will match the sklearn API.
@Squadrick
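The relationship between the two scores can be sketched as follows, assuming the standard definition F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP). The function name `fbeta_from_counts` is hypothetical and the code is plain Python, not the PR's implementation.

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from raw counts; beta=1.0 reduces to the ordinary F1 score."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumption: return 0.0 when there are no positives at all.
    return (1 + b2) * tp / denom if denom else 0.0


# Same counts, different beta: beta > 1 weights recall more heavily.
print(fbeta_from_counts(tp=8, fp=2, fn=4))            # F1
print(fbeta_from_counts(tp=8, fp=2, fn=4, beta=2.0))  # F2
```

With this shape, an F1 class can simply fix beta at 1 and reuse the same state and update logic.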
@SSaishruthi Yup, that sounds good.
Thanks @Squadrick
Will create a PR for F1-beta soon. Meanwhile, here is the draft of Multilabel confusion matrix for your reference: https://colab.research.google.com/drive/1fsB4tUo23Ejg0MjrqRl-ND5A1amCiYOj
@SSaishruthi Feel free to open a PR for the multilabel confusion matrix.
Created PR for F1-Beta. Will create for confusion matrix soon.
Hi @Squadrick
I have wrapped multiclass confusion matrix and here is the colab link (under implementation)
https://colab.research.google.com/drive/1fsB4tUo23Ejg0MjrqRl-ND5A1amCiYOj#scrollTo=AT70Ux6EB6qk
Does it look good? If yes, I will go ahead with PR creation
Status: F-Beta, Multilabel confusion matrix and Hamming metrics are done. PRs are open for the same.
Working on next set as well as current metrics enhancement.
Hi,
I have a multi-class classification problem and I want to measure AUC on training and test data.
tf.keras has implemented an AUC metric (tf.keras.metrics.AUC), but I am not able to tell whether this metric can safely be used in multi-class problems. Even the example "Classification on imbalanced data" on the official web page is dedicated to a binary classification problem.
I have implemented a CNN model that predicts six classes, having a softmax layer that gives the probabilities of all the classes. I used this metric as follows
self.model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(hp.get("learning_rate")),
                   metrics=['accuracy', AUC()])
and the code executed without any problem. However, sometimes I see results that seem quite strange to me. For example, the model reported an accuracy of 0.78333336 and an AUC of 0.97327775. Is this possible? Can a model have such a low accuracy and such a high AUC?
I wonder whether, although the code does not raise any error, the AUC metric is being computed incorrectly.
Could somebody confirm whether or not this metric supports multi-class classification problems?
Thanks in advance,
Oscar
Hi,
@ogreyesp I am also in the same boat, and was wondering if you could find an answer?
I am mainly interested in AUC and 'confusion matrix' for multi-class classification.
Thanks in advance,
Puneet
I am also waiting for the same. Like @ogreyesp said, it doesn't throw any errors when used. I wondered how I was getting such a high AUC score. IMHO the existing metric should at least throw an error for multi-class data, to avoid the embarrassment of posting wrong results (pun intended).
I also have this problem. @DarkKnight1991, @puneetpandey37, and @ogreyesp, could you find any solution for it?
Metrics seem to have some gaping problems right now. Didn't get the time to go through each, I'll do it soon and post a fix, or at least a warning message for multi-class data. Sorry about the delay.
I also have the same problem. adding the hamming score and multi-class AUC would be nice.
@ROAbb
I solved this problem by creating a custom class that computes these metrics, and then passing it as a callback when fitting the model.
from sklearn.metrics import matthews_corrcoef, f1_score, balanced_accuracy_score
# assumed import; the original post did not show where Callback comes from
from tensorflow.keras.callbacks import Callback

# custom metrics for computing balanced accuracy, MCC and F1
class CustomMetrics(Callback):
    def __init__(self, X, Y):
        super(CustomMetrics, self).__init__()
        self.X = X
        # Y is one-hot encoded; keep the class indices
        self.Y = Y.argmax(axis=-1)

    def on_epoch_end(self, epoch, logs=None):
        logs = logs if logs is not None else {}
        predictions = self.model.predict(self.X)
        y_pred = predictions.argmax(axis=-1)
        logs['val_bacc'] = balanced_accuracy_score(self.Y, y_pred)
        logs['val_f1'] = f1_score(self.Y, y_pred, average='micro')
        logs['val_mcc'] = matthews_corrcoef(self.Y, y_pred)

......

if X_test is not None:
    metrics = CustomMetrics(X_test, y_test)
else:
    metrics = CustomMetrics(X_train, y_train)

model.fit(X_train, y_train, validation_data=validation_data,
          batch_size=8,
          epochs=30,
          callbacks=[metrics])
I hope that this solution can help you. Best
May I implement Positive Likelihood and Negative Likelihood?
Sure!
To @ogreyesp and others who are interested in AUC in a multiclass prediction task (i.e. softmax as model final output).
I am also running into this scenario. I am not exactly sure what the tf.keras AUC metric computes in this case. If someone could read the code and summarize it, that would be great. Or just test it out with a simple case. I find that the conceptual extension from the binary case to the multiclass case is not trivial.
For now, what I use is a “1 vs other” approach, which reduces the evaluation problem to many separate binary classifications. E.g. if you are to predict red, green, or blue, you can consider p(red) vs. p(not red) and compute AUC as usual for this binary case. Do this separately for each class. Then these AUCs can be considered separately, or you can take their (weighted) average if you want.
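The “1 vs other” idea can be sketched in plain Python as below. The binary AUC here is computed exactly as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as one half; this is not how tf.keras.metrics.AUC works internally (it approximates the curve with a fixed set of thresholds), and the function names and data are made up for illustration.

```python
def binary_auc(labels, scores):
    """Exact ROC AUC via pairwise ranking; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Assumes at least one positive and one negative example.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, num_classes):
    """One AUC per class: class k against everything else."""
    aucs = []
    for k in range(num_classes):
        labels = [1 if t == k else 0 for t in y_true]
        scores = [row[k] for row in probs]  # p(class k) from the softmax
        aucs.append(binary_auc(labels, scores))
    return aucs


y_true = [0, 1, 2, 0]
probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],  # true class 0, but class 1 has the highest score
]
per_class = one_vs_rest_auc(y_true, probs, num_classes=3)
macro_auc = sum(per_class) / len(per_class)
print(per_class, macro_auc)
```

A weighted average would replace the last step with weights proportional to each class's support.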
For the binary case, I think it is possible for AUC and accuracy to differ significantly. Accuracy, Precision, Recall and F1 depend on a “threshold” (this is actually a parameter in the tf.keras metrics). By default, it is 0.5, i.e. if the predicted probability is higher than this, you interpret the prediction as positive. But you can set this threshold higher, at 0.9 for example. Then you will get fewer positives, and most of the time this gives higher precision and lower recall. AUC is independent of this threshold since it is the area under the curve of this trade-off. It is a better summary of the predictive power of your model. But I found that AUC is not an “operational” parameter, since in a real deployment you will need to choose a threshold and get some “definite” precision or recall.
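The threshold effect described above can be seen on a tiny made-up example. This is a plain-Python sketch with a hypothetical helper name, and it assumes a prediction counts as positive when its score is strictly greater than the threshold.

```python
def precision_recall(labels, scores, threshold=0.5):
    """Precision and recall after thresholding the scores."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [1, 1, 1, 0, 0, 1]
scores = [0.95, 0.8, 0.6, 0.7, 0.2, 0.4]

print(precision_recall(labels, scores, threshold=0.5))  # (0.75, 0.75)
print(precision_recall(labels, scores, threshold=0.9))  # (1.0, 0.25)
```

Raising the threshold from 0.5 to 0.9 removes the one false positive but also drops two true positives, so precision goes up while recall goes down; the ranking of the scores, and therefore the AUC, is unchanged.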
I think it would be great if TF 2.0 could include a “best practice” on what an aggregate AUC means for multi-class. I am not sure if this is still an active area of research or too application-specific. If so, writing your own custom metrics may be the way to go.