Issue noticed and described by @wwang500:
The calculation for multiclass evaluation doesn't look correct. Here is one example:
{
  "classification" : {
    "accuracy" : {
      "classes" : [
        { "class_name" : "1", "accuracy" : 0.8878504672897196 },
        { "class_name" : "2", "accuracy" : 0.883177570093458 },
        { "class_name" : "3", "accuracy" : 0.9345794392523364 },
        { "class_name" : "5", "accuracy" : 0.9485981308411215 },
        { "class_name" : "6", "accuracy" : 0.9766355140186916 },
        { "class_name" : "7", "accuracy" : 0.985981308411215 }
      ],
      "overall_accuracy" : 0.8084112149532711
    },
    "multiclass_confusion_matrix" : {
      "confusion_matrix" : [
        {
          "actual_class" : "1",
          "actual_class_doc_count" : 70,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 63 },
            { "predicted_class" : "2", "count" : 7 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 0 },
            { "predicted_class" : "6", "count" : 0 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "2",
          "actual_class_doc_count" : 76,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 7 },
            { "predicted_class" : "2", "count" : 64 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 4 },
            { "predicted_class" : "6", "count" : 1 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "3",
          "actual_class_doc_count" : 17,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 9 },
            { "predicted_class" : "2", "count" : 5 },
            { "predicted_class" : "3", "count" : 3 },
            { "predicted_class" : "5", "count" : 0 },
            { "predicted_class" : "6", "count" : 0 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "5",
          "actual_class_doc_count" : 13,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 0 },
            { "predicted_class" : "2", "count" : 1 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 8 },
            { "predicted_class" : "6", "count" : 3 },
            { "predicted_class" : "7", "count" : 1 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "6",
          "actual_class_doc_count" : 9,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 0 },
            { "predicted_class" : "2", "count" : 0 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 1 },
            { "predicted_class" : "6", "count" : 8 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "7",
          "actual_class_doc_count" : 29,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 1 },
            { "predicted_class" : "2", "count" : 0 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 1 },
            { "predicted_class" : "6", "count" : 0 },
            { "predicted_class" : "7", "count" : 27 }
          ],
          "other_predicted_class_doc_count" : 0
        }
      ],
      "other_actual_class_count" : 0
    }
  }
}
The accuracy for class_name "1" should be 63/70 = 0.9, but it says 0.8878504672897196 above. Also, the overall_accuracy should be at least 0.88, since every per-class accuracy is greater than 0.88. Could you take a look, please?
Here is the job configuration, in case you need it. I was using the 7.7-2227 build.
PUT _ml/data_frame/analytics/glass_identification3
{
  "source": {
    "index": "glass_identification"
  },
  "dest": {
    "index": "dest_glass_identification3"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "glass_type",
      "num_top_classes": 3,
      "training_percent": 80
    }
  }
}
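For context, the result quoted at the top of this issue was presumably obtained with the evaluate data frame analytics API (POST _ml/data_frame/_evaluate). Below is a minimal Python sketch of such a request; the endpoint URL and the predicted field name are assumptions (a local cluster, and the job's default results field "ml", so the predicted class is read from ml.glass_type_prediction).

import requests

# Hypothetical local endpoint; adjust host and credentials for a real cluster.
ES_URL = "http://localhost:9200"

body = {
    "index": "dest_glass_identification3",
    "evaluation": {
        "classification": {
            "actual_field": "glass_type",
            # Assumes the default results field "ml" of the analytics job.
            "predicted_field": "ml.glass_type_prediction",
            "metrics": {
                "accuracy": {},
                "multiclass_confusion_matrix": {}
            }
        }
    }
}

response = requests.post(f"{ES_URL}/_ml/data_frame/_evaluate", json=body)
print(response.json())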
Pinging @elastic/ml-core (:ml)
I believe the numbers are correct. It's the interpretation of the numbers that may not be clear.
Please read the following explanation I gave in the code comment in the Accuracy.java file:
/**
* {@link Accuracy} is a metric that answers the following two questions:
*
* 1. What is the fraction of documents for which predicted class equals the actual class?
*
* equation: overall_accuracy = 1/n * Σ(y == y')
* where: n = total number of documents
* y = document's actual class
* y' = document's predicted class
*
* 2. For any given class X, what is the fraction of documents for which either
* a) both actual and predicted class are equal to X (true positives)
* or
* b) both actual and predicted class are not equal to X (true negatives)
*
* equation: accuracy(X) = 1/n * (TP(X) + TN(X))
* where: X = class being examined
* n = total number of documents
* TP(X) = number of true positives wrt X
* TN(X) = number of true negatives wrt X
*/
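As a minimal Python sketch of those two definitions (the function names are mine, not from Accuracy.java; actual and predicted are parallel lists of class labels):

def overall_accuracy(actual, predicted):
    # 1/n * Σ(y == y'): the fraction of documents whose predicted class equals the actual class.
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_accuracy(actual, predicted, cls):
    # 1/n * (TP(cls) + TN(cls)): a document counts as correct for `cls` when it is either
    # a true positive (both classes equal cls) or a true negative (neither equals cls).
    # A document of some other class misclassified as yet another class therefore still
    # counts as correct with respect to `cls`.
    return sum((a == cls) == (p == cls) for a, p in zip(actual, predicted)) / len(actual)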
So the per-class accuracy for class "3" is calculated as (214 - 9 - 5) / 214 = 0.93457943925, where 214 is the total number of docs and 9 and 5 are the only documents misclassified with respect to class "3" (the class "3" samples predicted as "1" and "2"; nothing else was ever predicted as "3"). In other words, samples of class "2" predicted as "5" count as predicted accurately for the purpose of calculating the accuracy of class "3".
| actual \ predicted | 1  | 2  | 3 | 5 | 6 | 7  | per-class accuracy |
|--------------------|----|----|---|---|---|----|--------------------|
| 1                  | 63 | 7  | 0 | 0 | 0 | 0  | 0.8878504673       |
| 2                  | 7  | 64 | 0 | 4 | 1 | 0  | 0.8831775701       |
| 3                  | 9  | 5  | 3 | 0 | 0 | 0  | 0.9345794393       |
| 5                  | 0  | 1  | 0 | 8 | 3 | 1  | 0.9485981308       |
| 6                  | 0  | 0  | 0 | 1 | 8 | 0  | 0.976635514        |
| 7                  | 1  | 0  | 0 | 1 | 0 | 27 | 0.9859813084       |
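The same values can be reproduced directly from the confusion matrix; here is a small Python check that uses only the counts from the table above (the values in the comments are the expected outputs):

classes = ["1", "2", "3", "5", "6", "7"]
# Rows are actual classes, columns are predicted classes, in the order of `classes`.
matrix = [
    [63,  7, 0, 0, 0,  0],
    [ 7, 64, 0, 4, 1,  0],
    [ 9,  5, 3, 0, 0,  0],
    [ 0,  1, 0, 8, 3,  1],
    [ 0,  0, 0, 1, 8,  0],
    [ 1,  0, 0, 1, 0, 27],
]
n = sum(sum(row) for row in matrix)  # 214 documents in total

# Overall accuracy: diagonal (exact matches) over all documents.
print(sum(matrix[i][i] for i in range(len(classes))) / n)  # 173/214 = 0.8084112149532711

# Per-class accuracy: (TP + TN) / n.
for i, cls in enumerate(classes):
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # actual cls, predicted as something else
    fp = sum(row[i] for row in matrix) - tp  # predicted cls, actually something else
    tn = n - tp - fn - fp
    print(cls, (tp + tn) / n)  # 0.8878..., 0.8831..., 0.9345..., 0.9485..., 0.9766..., 0.9859...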
Does that make sense, @wwang500?
Thanks @przemekwitek. Indeed, when I use "confusion_matrix", "classification_report" and "accuracy_score" from "sklearn.metrics", I get the same results for overall_accuracy, the confusion matrix, recall and precision: 👍
sklearn.metrics confusion_matrix result:
[[63  7  0  0  0  0]
 [ 7 64  0  4  1  0]
 [ 9  5  3  0  0  0]
 [ 0  1  0  8  3  1]
 [ 0  0  0  1  8  0]
 [ 1  0  0  1  0 27]]
sklearn.metrics classification_report result:
              precision    recall  f1-score   support

           1       0.79      0.90      0.84        70
           2       0.83      0.84      0.84        76
           3       1.00      0.18      0.30        17
           5       0.57      0.62      0.59        13
           6       0.67      0.89      0.76         9
           7       0.96      0.93      0.95        29

    accuracy                           0.81       214
   macro avg       0.80      0.73      0.71       214
weighted avg       0.83      0.81      0.79       214
sklearn.metrics accuracy_score result:
0.8084112149532711
However, I am still not sure about the per-class accuracy part. I will see if I can find some metrics to compare it with.
Some more research about the per-class accuracy part. I kind of agree with this comment I found on the internet:
_Accuracy is a global measure, and there is no such thing as class-wise accuracy. The suggestions to normalize by true cases (rows) yields something called true-positive rate, sensitivity or recall, depending on the context. Likewise, if you normalize by prediction (columns), it's called precision or positive predictive value._
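As a quick check of that statement against the numbers in this thread: normalizing the confusion matrix by rows gives the recall column of the sklearn classification_report above, and normalizing by columns gives the precision column. A small sketch, reusing the matrix from the table:

classes = ["1", "2", "3", "5", "6", "7"]
matrix = [
    [63,  7, 0, 0, 0,  0],
    [ 7, 64, 0, 4, 1,  0],
    [ 9,  5, 3, 0, 0,  0],
    [ 0,  1, 0, 8, 3,  1],
    [ 0,  0, 0, 1, 8,  0],
    [ 1,  0, 0, 1, 0, 27],
]
for i, cls in enumerate(classes):
    recall = matrix[i][i] / sum(matrix[i])                    # row-normalized diagonal
    precision = matrix[i][i] / sum(row[i] for row in matrix)  # column-normalized diagonal
    print(cls, recall, precision)
# e.g. class "1": recall 63/70 = 0.9 and precision 63/80 = 0.7875,
# i.e. the 0.90 / 0.79 that sklearn reports for class 1.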
I understand why this per-class accuracy metric may be confusing. Still, I think it can be useful to customers who want to evaluate a multiclass predictor as if it were a binary predictor for one particular class.
@tveasey, WDYT?