Currently the built-in metrics seem to omit Average Precision, which is widely used in classification and object detection as a standard evaluation metric. I suggest adding it, as it is usually a better metric than accuracy, especially when the classes are imbalanced.
@pkdogcom could you please provide more details on your feature request and explain why it is not possible to use the built-in precision metric?
For the classification task, the built-in precision metric can compute precision for each class, and the average can easily be computed and reported in a handler, for example like here.
@vfdev-5 By definition, Average Precision is the area under the precision-recall curve, which shows how precision changes as recall increases (usually by varying the confidence-score threshold), whereas a single precision value only reports performance at one unspecified recall value (in the current precision metric implementation, the threshold for binary classification is 0.5).
Take face detection as an example. The detection pipeline usually produces a binary face/non-face classification score. If the precision metric is used rather than Average Precision, one has to choose a confidence-score threshold manually, which in most cases will simply be 0.5. It could then be the case that, at the given threshold, 99% of the bounding boxes classified as faces are actual faces (i.e. precision 0.99, which looks reasonably good), while maybe only 50% of all faces are classified as faces (i.e. recall 0.5, which may not be good enough in some applications). If the application requires the model to detect most faces, say 95%, then one has to lower the confidence-score threshold, which in turn will let in some false faces and thus lower the precision. However, since only a single precision value is reported, it is unknown what the precision will be at recall 0.95, so the overall performance of the model is not fully evaluated. That's why most object detection and many classification tasks use AP (and mAP, which is the mean of Average Precision across all classes).
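To make the difference concrete, here is a minimal pure-Python sketch (not ignite code). The function names `precision_recall_at` and `average_precision` and the toy scores are assumptions made for illustration. AP is computed here as the sum of precision weighted by the increase in recall at each distinct score threshold, which is one common definition; other variants (e.g. interpolated AP) give slightly different numbers.

```python
def precision_recall_at(scores, targets, threshold):
    """Precision and recall when everything scoring >= threshold is called positive."""
    tp = sum(1 for s, t in zip(scores, targets) if s >= threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, targets) if s >= threshold and t == 0)
    n_pos = sum(targets)
    # Convention chosen for this sketch: precision is 0.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / n_pos
    return precision, recall


def average_precision(scores, targets):
    """AP as sum over distinct thresholds of (recall_n - recall_{n-1}) * precision_n."""
    n_pos = sum(targets)
    if n_pos == 0:
        raise ValueError("need at least one positive target")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    i = 0
    while i < len(order):
        # Consume all samples tied at the current score as a single threshold step.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if targets[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Toy data (made up): 4 positives, 2 negatives.
scores = [0.9, 0.8, 0.45, 0.4, 0.3, 0.2]
targets = [1, 1, 0, 1, 0, 1]

high = precision_recall_at(scores, targets, 0.5)   # precision 1.0, recall 0.5
low = precision_recall_at(scores, targets, 0.15)   # precision 4/6, recall 1.0
ap = average_precision(scores, targets)
```

At threshold 0.5 the precision looks perfect but half the positives are missed; lowering the threshold recovers them at the cost of precision. AP summarizes that whole trade-off in one number instead of reporting a single point on the curve.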
@pkdogcom thanks for the explanation!
@pkdogcom I think this is a good idea. Somewhat related, I think we should add metrics for AUC/ROC as well. These metrics (along with existing Precision/Recall) will likely share quite a bit of logic.
Sounds like a great idea. @pkdogcom can you send a PR?
We can take inspiration from tnt/meters and from skorch.
@jasonkriss @alykhantejani I was thinking about adding AUC/ROC and mAP metrics to ignite, so I first tried to adapt the code from tnt/aucmeter. The implementation can be easily adapted, but there are some problems with it:
Another possibility: instead of rewriting these functions, we can provide a sort of EpochMetric, as in skorch, that accumulates predictions and targets and can apply sklearn metrics to them. We can avoid a new dependency by just asking for a callable.
What do you think about this ?
I'm not too familiar with skorch. How would this look in ignite (i.e. what would the user code that uses this metric look like)?
Either way, I think these are useful metrics to have :)
Personally, I would like to have these functions as built-ins (at least AUC/ROC) without having to bring in sklearn. That being said, some sort of EpochMetric sounds like it could be a good idea. Could probably refactor some of the current metrics to utilize that and it would also allow us to piggyback on sklearn where there are gaps in our metrics. At least until we add them directly to ignite.
An EpochMetric implementation that collects predictions is not that complicated, and for AUC it could look like this:
    import torch
    from ignite.exceptions import NotComputableError
    from ignite.metrics import Metric


    class AUC(Metric):
        """Accumulates scores and targets over an epoch, then computes ROC AUC."""

        def reset(self):
            self._scores = torch.tensor([], dtype=torch.float32)
            self._targets = torch.tensor([], dtype=torch.long)

        def update(self, output):
            y_pred, y = output
            assert y_pred.ndimension() == 2, "Predictions should be of shape (batch_size, 1)"
            assert y.ndimension() == 1, "Targets should be of shape (batch_size,)"
            assert torch.equal(y ** 2, y), "Targets should be binary (0 or 1)"

            # keep everything on CPU so the accumulated history does not fill GPU memory
            y_pred = y_pred.squeeze(dim=-1).to('cpu')
            y = y.to('cpu')

            self._scores = torch.cat([self._scores, y_pred.type_as(self._scores)], dim=0)
            self._targets = torch.cat([self._targets, y], dim=0)

        def compute(self):
            n_samples = self._scores.shape[0]
            if n_samples == 0:
                raise NotComputableError('AUC must have at least one example before it can be computed')

            # sort the scores in descending order
            scores, sortind = torch.sort(self._scores, dim=0, descending=True)

            # build the ROC curve, one point per sample plus the origin
            n = n_samples + 1
            tpr = torch.zeros(n, dtype=torch.float64)
            fpr = torch.zeros(n, dtype=torch.float64)

            # THE FOLLOWING LOOP IS SLOW AND NOT CORRECT IF PROBAS ARE THE SAME
            for i in range(1, n):
                if self._targets[sortind[i - 1]] == 1:
                    tpr[i] = tpr[i - 1] + 1
                    fpr[i] = fpr[i - 1]
                else:
                    tpr[i] = tpr[i - 1]
                    fpr[i] = fpr[i - 1] + 1
            # end of the slow loop

            # normalize counts to rates
            targets_sum = self._targets.sum().item() * 1.0
            tpr /= targets_sum
            inv_targets_sum = (n_samples - targets_sum) * 1.0
            fpr /= inv_targets_sum

            # compute the area under the curve using the trapezoidal rule
            n = tpr.shape[0]
            h = fpr[1:n] - fpr[0:n - 1]
            sum_h = torch.zeros_like(fpr)
            sum_h[0:n - 1] = h
            sum_h[1:n] += h
            area = (sum_h * tpr).sum().item() / 2.0

            return area, tpr, fpr
So the idea was to ask the user to write their own compute function.
But I agree that built-in function can be better than sklearn dependency...
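On the caveat flagged in the AUC code above (the per-sample loop is not correct when several samples share the same score), one possible fix is to advance the ROC curve once per distinct score rather than once per sample. The following is a minimal pure-Python sketch under that assumption; `roc_auc` is a hypothetical name, and this is not the tnt or ignite implementation.

```python
def roc_auc(scores, targets):
    """Area under the ROC curve via the trapezoidal rule, grouping tied scores."""
    n_pos = sum(targets)
    n_neg = len(targets) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative target")

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(order):
        # All samples tied at this score move the curve in a single step,
        # so their relative order in the sort cannot change the result.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if targets[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        # Trapezoid between the previous ROC point and the new one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return area


auc_mixed = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])   # 0.75
auc_tied = roc_auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])    # 0.5
```

With all scores tied, the curve is a single diagonal segment and the area is 0.5, which is what a classifier with no ranking ability should get. A per-sample walk would instead return a value that depends on how the sort happened to order the ties.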
Sorry, I don't think I communicated that very clearly. I was agreeing with you for the most part. I was saying an EpochMetric where users can provide their own compute functions sounds like a good idea.
Eventually we can add the built-in metrics but in the meantime, the EpochMetric can make things easier than it is today.
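As a rough illustration of the EpochMetric idea discussed here, the sketch below accumulates predictions and targets across batches and defers the actual computation to a user-supplied callable. It is plain Python with no ignite or torch dependency; the class name `EpochMetricSketch`, the `compute_fn(y_pred, y)` argument order, and the `accuracy_at_half` helper are all assumptions for illustration, not ignite's API.

```python
class EpochMetricSketch:
    """Stores every (prediction, target) pair seen in an epoch, then applies compute_fn."""

    def __init__(self, compute_fn):
        if not callable(compute_fn):
            raise TypeError("compute_fn must be callable")
        self._compute_fn = compute_fn
        self.reset()

    def reset(self):
        # Called at the start of each epoch.
        self._predictions = []
        self._targets = []

    def update(self, output):
        # Called once per batch with a (y_pred, y) pair of equal-length sequences.
        y_pred, y = output
        if len(y_pred) != len(y):
            raise ValueError("predictions and targets must have the same length")
        self._predictions.extend(y_pred)
        self._targets.extend(y)

    def compute(self):
        # Called at the end of the epoch.
        if not self._targets:
            raise ValueError("at least one example is needed before computing")
        return self._compute_fn(self._predictions, self._targets)


def accuracy_at_half(y_pred, y):
    """Example user-provided compute function: accuracy with a 0.5 threshold."""
    correct = sum(int(p >= 0.5) == t for p, t in zip(y_pred, y))
    return correct / len(y)


metric = EpochMetricSketch(accuracy_at_half)
metric.update(([0.9, 0.2], [1, 0]))   # first batch
metric.update(([0.6, 0.7], [0, 1]))   # second batch
result = metric.compute()             # 3 of 4 correct
```

Because only a callable is required, a user who already has sklearn installed could pass something like `lambda y_pred, y: roc_auc_score(y, y_pred)` (note that sklearn expects the targets first) without the library itself depending on sklearn. The trade-off is memory: the whole epoch's predictions are kept until `compute` is called.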
@jasonkriss thanks for the further explanation. So, I can go on with code similar to what I proposed above and we'll see...
@vfdev-5 are you planning on implementing some of these metrics? Perhaps we should open a new issue and just gather a list of metrics we want to implement for the next release
@alykhantejani I'm planning to first provide something like the EpochMetric described here, and maybe later go on with built-in AUC/mAP metrics.
I agree with you on opening new issues; we can go ahead with that too.
I can first create an issue for EpochMetric if we are OK with this approach.
This has now been added in #235