Issue noticed and described by @wwang500:
The calculation for multiclass evaluation doesn't look correct. Here is one example:
{
  "classification" : {
    "accuracy" : {
      "classes" : [
        { "class_name" : "1", "accuracy" : 0.8878504672897196 },
        { "class_name" : "2", "accuracy" : 0.883177570093458 },
        { "class_name" : "3", "accuracy" : 0.9345794392523364 },
        { "class_name" : "5", "accuracy" : 0.9485981308411215 },
        { "class_name" : "6", "accuracy" : 0.9766355140186916 },
        { "class_name" : "7", "accuracy" : 0.985981308411215 }
      ],
      "overall_accuracy" : 0.8084112149532711
    },
    "multiclass_confusion_matrix" : {
      "confusion_matrix" : [
        {
          "actual_class" : "1",
          "actual_class_doc_count" : 70,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 63 },
            { "predicted_class" : "2", "count" : 7 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 0 },
            { "predicted_class" : "6", "count" : 0 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "2",
          "actual_class_doc_count" : 76,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 7 },
            { "predicted_class" : "2", "count" : 64 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 4 },
            { "predicted_class" : "6", "count" : 1 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "3",
          "actual_class_doc_count" : 17,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 9 },
            { "predicted_class" : "2", "count" : 5 },
            { "predicted_class" : "3", "count" : 3 },
            { "predicted_class" : "5", "count" : 0 },
            { "predicted_class" : "6", "count" : 0 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "5",
          "actual_class_doc_count" : 13,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 0 },
            { "predicted_class" : "2", "count" : 1 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 8 },
            { "predicted_class" : "6", "count" : 3 },
            { "predicted_class" : "7", "count" : 1 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "6",
          "actual_class_doc_count" : 9,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 0 },
            { "predicted_class" : "2", "count" : 0 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 1 },
            { "predicted_class" : "6", "count" : 8 },
            { "predicted_class" : "7", "count" : 0 }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "7",
          "actual_class_doc_count" : 29,
          "predicted_classes" : [
            { "predicted_class" : "1", "count" : 1 },
            { "predicted_class" : "2", "count" : 0 },
            { "predicted_class" : "3", "count" : 0 },
            { "predicted_class" : "5", "count" : 1 },
            { "predicted_class" : "6", "count" : 0 },
            { "predicted_class" : "7", "count" : 27 }
          ],
          "other_predicted_class_doc_count" : 0
        }
      ],
      "other_actual_class_count" : 0
    }
  }
}
The accuracy for class_name "1" should be 63/70 = 0.9, but it says 0.8878504672897196 above. Also, the overall_accuracy should be at least 0.88, since every per-class accuracy is greater than 0.88. Could you take a look, please?
Here is the job configuration, in case you need it. I was using the 7.7-2227 build.
PUT _ml/data_frame/analytics/glass_identification3
{
  "source": {
    "index": "glass_identification"
  },
  "dest": {
    "index": "dest_glass_identification3"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "glass_type",
      "num_top_classes": 3,
      "training_percent": 80
    }
  }
}
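For context, the result quoted at the top of this issue was presumably obtained with the evaluate data frame analytics API (POST _ml/data_frame/_evaluate). Below is a minimal Python sketch of such a request; the endpoint URL and the predicted field name are assumptions (a local cluster, and the job's default results field "ml", so the predicted class is read from ml.glass_type_prediction).

import requests

# Hypothetical local endpoint; adjust host and credentials for a real cluster.
ES_URL = "http://localhost:9200"

body = {
    "index": "dest_glass_identification3",
    "evaluation": {
        "classification": {
            "actual_field": "glass_type",
            # Assumes the default results field "ml" of the analytics job.
            "predicted_field": "ml.glass_type_prediction",
            "metrics": {
                "accuracy": {},
                "multiclass_confusion_matrix": {}
            }
        }
    }
}

response = requests.post(f"{ES_URL}/_ml/data_frame/_evaluate", json=body)
print(response.json())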
Pinging @elastic/ml-core (:ml)
I believe the numbers are correct. It's the interpretation of the numbers that may not be clear.
Please read the following explanation I gave in the code comment in the Accuracy.java file:
/**
* {@link Accuracy} is a metric that answers the following two questions:
*
* 1. What is the fraction of documents for which predicted class equals the actual class?
*
* equation: overall_accuracy = 1/n * Σ(y == y')
* where: n = total number of documents
* y = document's actual class
* y' = document's predicted class
*
* 2. For any given class X, what is the fraction of documents for which either
* a) both actual and predicted class are equal to X (true positives)
* or
* b) both actual and predicted class are not equal to X (true negatives)
*
* equation: accuracy(X) = 1/n * (TP(X) + TN(X))
* where: X = class being examined
* n = total number of documents
* TP(X) = number of true positives wrt X
* TN(X) = number of true negatives wrt X
*/
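As a minimal Python sketch of those two definitions (the function names are mine, not from Accuracy.java; actual and predicted are parallel lists of class labels):

def overall_accuracy(actual, predicted):
    # 1/n * Σ(y == y'): the fraction of documents whose predicted class equals the actual class.
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_accuracy(actual, predicted, cls):
    # 1/n * (TP(cls) + TN(cls)): a document counts as correct for `cls` when it is either
    # a true positive (both classes equal cls) or a true negative (neither equals cls).
    # A document of some other class misclassified as yet another class therefore still
    # counts as correct with respect to `cls`.
    return sum((a == cls) == (p == cls) for a, p in zip(actual, predicted)) / len(actual)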
So the per-class accuracy for class "3" is calculated as (214 - 9 - 5) / 214 = 0.93457943925, where 214 is the total number of docs and 9 and 5 are the only documents misclassified with respect to class "3" (the class "3" samples predicted as "1" and "2"; nothing else was ever predicted as "3"). In other words, samples of class "2" predicted as "5" count as predicted accurately for the purpose of calculating the accuracy of class "3".
| actual \ predicted | 1  | 2  | 3 | 5 | 6 | 7  | per-class accuracy |
|--------------------|----|----|---|---|---|----|--------------------|
| 1                  | 63 | 7  | 0 | 0 | 0 | 0  | 0.8878504673       |
| 2                  | 7  | 64 | 0 | 4 | 1 | 0  | 0.8831775701       |
| 3                  | 9  | 5  | 3 | 0 | 0 | 0  | 0.9345794393       |
| 5                  | 0  | 1  | 0 | 8 | 3 | 1  | 0.9485981308       |
| 6                  | 0  | 0  | 0 | 1 | 8 | 0  | 0.976635514        |
| 7                  | 1  | 0  | 0 | 1 | 0 | 27 | 0.9859813084       |
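The same values can be reproduced directly from the confusion matrix; here is a small Python check that uses only the counts from the table above (the values in the comments are the expected outputs):

classes = ["1", "2", "3", "5", "6", "7"]
# Rows are actual classes, columns are predicted classes, in the order of `classes`.
matrix = [
    [63,  7, 0, 0, 0,  0],
    [ 7, 64, 0, 4, 1,  0],
    [ 9,  5, 3, 0, 0,  0],
    [ 0,  1, 0, 8, 3,  1],
    [ 0,  0, 0, 1, 8,  0],
    [ 1,  0, 0, 1, 0, 27],
]
n = sum(sum(row) for row in matrix)  # 214 documents in total

# Overall accuracy: diagonal (exact matches) over all documents.
print(sum(matrix[i][i] for i in range(len(classes))) / n)  # 173/214 = 0.8084112149532711

# Per-class accuracy: (TP + TN) / n.
for i, cls in enumerate(classes):
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # actual cls, predicted as something else
    fp = sum(row[i] for row in matrix) - tp  # predicted cls, actually something else
    tn = n - tp - fn - fp
    print(cls, (tp + tn) / n)  # 0.8878..., 0.8831..., 0.9345..., 0.9485..., 0.9766..., 0.9859...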
Does that make sense, @wwang500?
Thanks @przemekwitek. Indeed, when I use "confusion_matrix", "classification_report" and "accuracy_score" from "sklearn.metrics", I get the same results for overall_accuracy, the confusion matrix, recall and precision: 👍
sklearn.metrics confusion_matrix result:
[[63  7  0  0  0  0]
 [ 7 64  0  4  1  0]
 [ 9  5  3  0  0  0]
 [ 0  1  0  8  3  1]
 [ 0  0  0  1  8  0]
 [ 1  0  0  1  0 27]]
sklearn.metrics classification_report result:
              precision    recall  f1-score   support

           1       0.79      0.90      0.84        70
           2       0.83      0.84      0.84        76
           3       1.00      0.18      0.30        17
           5       0.57      0.62      0.59        13
           6       0.67      0.89      0.76         9
           7       0.96      0.93      0.95        29

    accuracy                           0.81       214
   macro avg       0.80      0.73      0.71       214
weighted avg       0.83      0.81      0.79       214
sklearn.metrics accuracy_score result:
0.8084112149532711
However, I am still not sure about the per-class accuracy part. I will see if I can find some metrics to compare it with.
Some more research about the per-class accuracy part. I kind of agree with this comment I found on the internet:
_Accuracy is a global measure, and there is no such thing as class-wise accuracy. The suggestions to normalize by true cases (rows) yields something called true-positive rate, sensitivity or recall, depending on the context. Likewise, if you normalize by prediction (columns), it's called precision or positive predictive value._
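As a quick check of that statement against the numbers in this thread: normalizing the confusion matrix by rows gives the recall column of the sklearn classification_report above, and normalizing by columns gives the precision column. A small sketch, reusing the matrix from the table:

classes = ["1", "2", "3", "5", "6", "7"]
matrix = [
    [63,  7, 0, 0, 0,  0],
    [ 7, 64, 0, 4, 1,  0],
    [ 9,  5, 3, 0, 0,  0],
    [ 0,  1, 0, 8, 3,  1],
    [ 0,  0, 0, 1, 8,  0],
    [ 1,  0, 0, 1, 0, 27],
]
for i, cls in enumerate(classes):
    recall = matrix[i][i] / sum(matrix[i])                    # row-normalized diagonal
    precision = matrix[i][i] / sum(row[i] for row in matrix)  # column-normalized diagonal
    print(cls, recall, precision)
# e.g. class "1": recall 63/70 = 0.9 and precision 63/80 = 0.7875,
# i.e. the 0.90 / 0.79 that sklearn reports for class 1.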
I understand why this per-class accuracy metric may be confusing. Still, I think it can be useful to customers who want to evaluate a multiclass predictor as if it were a binary predictor for one particular class.
@tveasey, WDYT?