Tpot: Balanced accuracy function can incur a "division by zero" error

Created on 5 Jun 2017 · 5 Comments · Source: EpistasisLab/tpot

Context of the issue

This line in the balanced_accuracy function can incur a division by zero error.

Process to reproduce the issue

In multi-class classification problems, y_true == this_class can be False for every entry in the balanced_accuracy function: this_class appears in y_pred but never in y_true, possibly because of cross-validation on a highly imbalanced data set. In that case float(sum((y_true == this_class))) is zero, and dividing by it raises a division by zero error.
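A minimal sketch of the failure, assuming plain Python lists in place of NumPy arrays and looking only at the sensitivity denominator for one class. The inputs are illustrative; the variable names other than y_true, y_pred and this_class are hypothetical.

```python
y_true = [0, 1, 2, 3]
y_pred = [1, 2, 3, 4]
this_class = 4  # predicted, but never present in y_true

# Count of samples that truly belong to this_class (the denominator).
positives = sum(t == this_class for t in y_true)
# Count of samples correctly predicted as this_class (the numerator).
true_positives = sum(
    t == this_class and p == this_class for t, p in zip(y_true, y_pred)
)

try:
    sensitivity = float(true_positives) / float(positives)
    raised = False
except ZeroDivisionError:
    raised = True

print(positives)  # 0
print(raised)     # True
```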

Expected result

No error.

Current result

ZeroDivisionError: float division by zero.

Possible fix

If I understood the function of balanced_accuracy correctly, then the alternative method below may do the same job but, because it uses sklearn's confusion matrix and some simple checks, it should be safe from this error.

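import numpy as np
from sklearn.metrics import confusion_matrix

def balanced_accuracy(y_true, y_pred):
    # Cast to float so each row can be normalized in place.
    cm = confusion_matrix(y_true, y_pred).astype(float)
    for i, r in enumerate(cm):
        if np.sum(r) == 0: continue  # class never appears in y_true
        cm[i] *= 1.0 / np.sum(r)
    return np.trace(cm) / cm.shape[0]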

bug need contributor

Source

KhaledSharif

All 5 comments

This does indeed seem to be possible in cases where the classifier predicts classes that are not in the test set. Probably doesn't happen very often :-) but worth addressing nonetheless.

I'm open to a PR making a change like the one you suggested. Whoever submits the PR, please include several example snippets proving that the outputs are the same in both versions of the function.

rhiever on 5 Jun 2017

The suggested fix is not completely right. I implemented it and ran it on some simple test cases, and it does not yield the same results as the original. The tests follow:

Result on [0, 1, 2, 3] and [0, 1, 2, 3]:
Original:
1.0
Fixed:
1.0
Result on [0, 1, 2, 3] and [1, 1, 2, 3]:
Original:
0.833333333333
Fixed:
0.75
Result on [0, 1, 2, 3] and [1, 2, 3, 4]:
Original:
ZeroDivisionError: float division by zero
Fixed:
0.0
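The gap between 0.8333 and 0.75 in the second test can be reproduced by hand. The sketch below assumes the original function averages, over all classes, the mean of per-class sensitivity and specificity; only the sensitivity term is shown in this thread, so the specificity term is an assumption, chosen because it matches the 0.8333 reported above. The confusion-matrix version amounts to averaging per-class recall only. Both function names are hypothetical, and plain lists stand in for NumPy arrays.

```python
def sens_spec_balanced_accuracy(y_true, y_pred):
    """Mean over classes of (sensitivity + specificity) / 2 (assumed original)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        positives = sum(t == c for t in y_true)
        negatives = len(y_true) - positives
        true_pos = sum(t == c and p == c for t, p in pairs)
        true_neg = sum(t != c and p != c for t, p in pairs)
        # No zero guard: this sketch is only used on inputs where every
        # class occurs in y_true and at least two classes are present.
        scores.append((true_pos / positives + true_neg / negatives) / 2)
    return sum(scores) / len(scores)


def mean_recall(y_true, y_pred):
    """Mean over classes of recall (trace of the row-normalized confusion matrix)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    recalls = []
    for c in classes:
        positives = sum(t == c for t in y_true)
        true_pos = sum(t == c and p == c for t, p in pairs)
        # Classes absent from y_true contribute 0, like the skipped rows.
        recalls.append(true_pos / positives if positives else 0.0)
    return sum(recalls) / len(recalls)


y_true = [0, 1, 2, 3]
y_pred = [1, 1, 2, 3]
print(round(sens_spec_balanced_accuracy(y_true, y_pred), 4))  # 0.8333
print(round(mean_recall(y_true, y_pred), 4))                  # 0.75
```

Class 1 has perfect sensitivity but a specificity of 2/3, because one sample of class 0 is predicted as class 1; class 0 has sensitivity 0 but specificity 1. The recall-only average ignores specificity, which is why the two numbers differ.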

martinzc on 9 Jun 2017

Can we add this condition check?
if float(sum((y_true == this_class))) == 0:
    this_class_sensitivity = 0
else:
    this_class_sensitivity = \
        float(sum((y_pred == this_class) & (y_true == this_class))) /\
        float(sum((y_true == this_class)))
The reason is that if a predicted class does not appear in the test set, then the proportion of positives correctly identified as that class is zero.
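A minimal sketch of that guard as a standalone helper. The helper name class_sensitivity and the use of plain lists instead of NumPy arrays are assumptions for illustration, not TPOT's actual code.

```python
def class_sensitivity(y_true, y_pred, this_class):
    """Fraction of true `this_class` samples that were predicted as `this_class`."""
    positives = sum(t == this_class for t in y_true)
    if positives == 0:
        # Class never occurs in y_true: report 0 instead of dividing by zero.
        return 0.0
    true_positives = sum(
        t == this_class and p == this_class for t, p in zip(y_true, y_pred)
    )
    return true_positives / positives


print(class_sensitivity([0, 1, 2, 3], [1, 2, 3, 4], 4))  # 0.0
print(class_sensitivity([0, 1, 2, 3], [1, 1, 2, 3], 1))  # 1.0
```

Because the guard only changes the branch where the denominator is zero, inputs that worked before should produce the same sensitivity as the unguarded expression.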

martinzc on 10 Jun 2017

@martinzc: you are right. I believe this is because, while both methods aim for the same thing (per-class prediction accuracy that is insensitive to data-set imbalance), they go about it differently. If the goal is to keep the function's output consistent, adding a safety check to the original function is preferable.

KhaledSharif on 10 Jun 2017

Fix is merged into dev branch. Thanks again all.

rhiever on 14 Jun 2017
