This line in the balanced_accuracy function can raise a division by zero error.
In multi-class classification problems, y_true == this_class can be False for every entry. This happens when this_class appears in y_pred but never in y_true, possibly because of cross-validation on a highly imbalanced data set. In that case float(sum((y_true == this_class))) is 0.0, and dividing by it raises a division by zero error.
Expected result: no error.
Current result: ZeroDivisionError: float division by zero.
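A minimal sketch of how the failure can be reproduced. The input arrays are hypothetical and chosen only so that class 4 is predicted but never occurs in y_true; the two expressions mirror the sensitivity lines quoted later in this thread.

```python
import numpy as np

# Hypothetical inputs: class 4 is predicted but never occurs in y_true.
y_true = np.array([0, 1, 2, 3])
y_pred = np.array([1, 2, 3, 4])
this_class = 4

# Denominator of the sensitivity term: number of true members of this_class.
true_count = float(sum(y_true == this_class))
# Numerator: members of this_class that were also predicted as this_class.
hits = float(sum((y_pred == this_class) & (y_true == this_class)))

try:
    sensitivity = hits / true_count
    failed = False
except ZeroDivisionError:
    # Plain Python floats raise here instead of returning nan or inf.
    failed = True

print(true_count, failed)
```

Because both operands are converted to Python floats before dividing, the division raises rather than returning nan, which is why the whole scoring call fails.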
If I understood the function of balanced_accuracy correctly, then the alternative method below may do the same job but, because it uses sklearn's confusion matrix and some simple checks, it should be safe from this error.
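import numpy as np
from sklearn.metrics import confusion_matrix

def balanced_accuracy(y_true, y_pred):
    # Cast to float so the rows can hold per-class fractions instead of integer counts.
    cm = confusion_matrix(y_true, y_pred).astype(float)
    for i, r in enumerate(cm):
        if np.sum(r) == 0:
            continue  # class never appears in y_true; leave the row at zero
        cm[i] = r / np.sum(r)
    return np.trace(cm) / cm.shape[0]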
This does indeed seem to be possible in cases where the classifier predicts classes that are not in the test set. Probably doesn't happen very often :-) but worth addressing nonetheless.
I'm open to a PR making a change like the one you suggested. Whoever submits the PR, please include several example snippets proving that the outputs are the same in both versions of the function.
The suggested fix is not completely right. I implemented it and ran it on some simple test cases, and it does not yield the same results as the original function. The tests follow:
Result on [0, 1, 2, 3] and [0, 1, 2, 3]:
Original:
1.0
Fixed:
1.0
Result on [0, 1, 2, 3] and [1, 1, 2, 3]:
Original:
0.833333333333
Fixed:
0.75
Result on [0, 1, 2, 3] and [1, 2, 3, 4]:
Original:
ZeroDivisionError: float division by zero
Fixed:
0.0
Can we add this condition check?
if float(sum((y_true == this_class))) == 0:
    this_class_sensitivity = 0
else:
    this_class_sensitivity = \
        float(sum((y_pred == this_class) & (y_true == this_class))) /\
        float(sum((y_true == this_class)))
The reason is that if a predicted class never appears in y_true, then no positives of that class can be correctly identified, so its sensitivity is taken to be zero.
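For anyone who wants to check the guarded version against the numbers reported above, here is a self-contained sketch. Only the sensitivity expression comes from this thread; the rest of the function body (averaging sensitivity and specificity per class, then taking the mean over classes) is my reconstruction from the reported outputs, and the name balanced_accuracy_guarded is hypothetical. The guard on specificity is an extra assumption, not something proposed above.

```python
import numpy as np


def balanced_accuracy_guarded(y_true, y_pred):
    """Mean over classes of (sensitivity + specificity) / 2, guarding empty denominators."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Every class seen in either array, as in the original report.
    all_classes = np.union1d(y_true, y_pred)
    scores = []
    for this_class in all_classes:
        is_true = y_true == this_class
        is_pred = y_pred == this_class
        n_pos = np.sum(is_true)    # entries whose true label is this_class
        n_neg = np.sum(~is_true)   # entries whose true label is another class
        # Guard from the thread: no true members means sensitivity is 0.
        sensitivity = np.sum(is_pred & is_true) / n_pos if n_pos else 0.0
        # Assumed guard: the same treatment when every entry is this_class.
        specificity = np.sum(~is_pred & ~is_true) / n_neg if n_neg else 0.0
        scores.append((sensitivity + specificity) / 2.0)
    return float(np.mean(scores))


print(balanced_accuracy_guarded([0, 1, 2, 3], [0, 1, 2, 3]))            # 1.0
print(round(balanced_accuracy_guarded([0, 1, 2, 3], [1, 1, 2, 3]), 4))  # 0.8333
print(balanced_accuracy_guarded([0, 1, 2, 3], [1, 2, 3, 4]))            # 0.375
```

The first two calls match the "Original" values listed above, and the third returns a score instead of raising. Treating an undefined sensitivity as 0 penalizes predictions of unseen classes; skipping such classes would be another defensible choice, but it would change the output.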
@martinzc: you are right. I believe this is because, while both methods aim for the same thing (a per-class prediction accuracy that is insensitive to data-set imbalance), they go about it differently. If the goal is to keep the function output consistent, adding a safety check to the original function is the better option.
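To make the difference concrete, here is a small per-class breakdown for the second test case. It assumes, based on the reported numbers, that the original function averages sensitivity and specificity per class, while the normalized confusion-matrix trace reduces to the mean per-class recall (sensitivity only). Variable names are mine.

```python
import numpy as np

y_true = np.array([0, 1, 2, 3])
y_pred = np.array([1, 1, 2, 3])

recalls, specificities = [], []
# Every class here occurs in y_true and has negatives, so no slice is empty.
for c in np.unique(y_true):
    pos = y_true == c
    # Fraction of true members of c that were predicted as c.
    recalls.append(np.mean(y_pred[pos] == c))
    # Fraction of non-members of c that were not predicted as c.
    specificities.append(np.mean(y_pred[~pos] != c))

recalls = np.array(recalls)
specificities = np.array(specificities)

# What the normalized confusion-matrix trace computes.
mean_recall = float(np.mean(recalls))
# What the original function appears to compute.
mean_balanced = float(np.mean((recalls + specificities) / 2.0))

print(mean_recall)              # 0.75
print(round(mean_balanced, 4))  # 0.8333
```

Class 0 has recall 0 but specificity 1, so it contributes 0.5 to the original score and 0 to the recall-only score, which accounts for the gap between 0.8333 and 0.75.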
Fix is merged into dev branch. Thanks again all.