Yolov3: PRECISION-RECALL CURVE

Created on 5 Mar 2020  ·  28 Comments  ·  Source: ultralytics/yolov3

🚀 Feature

Precision Recall curves may be plotted by uncommenting code here when running test.py: https://github.com/ultralytics/yolov3/blob/1dc1761f45fe46f077694e1a70472cd7eb788e0c/utils/utils.py#L171

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --conf 0.001

For yolov3-spp-ultralytics.pt on COCO, the curves for all 80 classes look like this:
PR_curve

For a single class (class 0, person), the curve looks like the plot below. During testing we evaluate the area under the curve as the average precision (AP). Ideally the curve runs from P=1, R=0 at the top left towards P=0, R=1 at the bottom right, so that the full AP (area under the curve) is captured. By varying conf-thres you select a single point on the curve at which to run your model. Depending on your application, you may prioritize precision over recall, or vice versa.
PR_curve (1)
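To make the "area under the curve" idea concrete, here is a minimal sketch of how a PR curve and its AP could be computed for one class. It assumes each detection has already been marked as a true or false positive; the function names, the sentinel points and the step-wise integration are illustrative assumptions and not necessarily what utils.py does.

```python
import numpy as np


def pr_curve(conf, is_tp, n_gt):
    """Return (recall, precision), one point per detection, highest confidence first."""
    order = np.argsort(-np.asarray(conf, dtype=float))  # sort by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    tpc = np.cumsum(tp)          # cumulative true positives
    fpc = np.cumsum(1.0 - tp)    # cumulative false positives
    recall = tpc / (n_gt + 1e-16)
    precision = tpc / (tpc + fpc)
    return recall, precision


def average_precision(recall, precision):
    """Area under the PR curve (assumed scheme: envelope + rectangle sum)."""
    # Sentinel points so the curve spans R=0 (P=1) to R=1 (P=0)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically non-increasing from left to right
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangles between consecutive recall values
    return float(np.sum(np.diff(r) * p[1:]))


# Toy data: 4 detections, 4 ground-truth objects (values are made up)
recall, precision = pr_curve([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], n_gt=4)
ap = average_precision(recall, precision)
```

A detector that finds every object with no false positives would give an AP of 1 under this scheme; missed objects or false positives pull the area down.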

Stale tutorial

All 28 comments

@TheophileBlard I'm thinking that perhaps we should report P, R and mAP at separate --conf-thres values. mAP would naturally be computed near zero (i.e. 0.001), but P and R would perhaps be reported at 0.5 --conf-thres. This would be similar to the results reported by Google AutoML.
https://cloud.google.com/vision/automl/docs/beginners-guide?authuser=1#how_do_i_interpret_the_precision-recall_curves

0.1 confidence

Screen Shot 2020-03-06 at 1 44 44 PM

0.5 confidence

Screen Shot 2020-03-06 at 1 43 11 PM

0.9 confidence

Screen Shot 2020-03-06 at 1 44 53 PM
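The three confidence settings above each correspond to one point on the PR curve. As a rough sketch of that idea, the snippet below computes P and R for a single class at a chosen confidence threshold. The helper name and the toy detections are hypothetical, and it assumes detections were already matched to ground truth.

```python
def precision_recall_at(conf, is_tp, n_gt, conf_thres):
    """Return (precision, recall) using only detections with conf >= conf_thres."""
    kept = [tp for c, tp in zip(conf, is_tp) if c >= conf_thres]
    tp = sum(kept)
    fp = len(kept) - tp
    # Precision is undefined with no detections; report 0.0 in that case
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall


# Toy data: 6 detections, 4 ground-truth objects (values are made up)
conf = [0.95, 0.85, 0.6, 0.4, 0.2, 0.05]
is_tp = [1, 1, 0, 1, 0, 0]

for thres in (0.1, 0.5, 0.9):
    p, r = precision_recall_at(conf, is_tp, n_gt=4, conf_thres=thres)
    print(f"conf-thres={thres}: P={p:.2f} R={r:.2f}")
```

In this toy example, raising the threshold trades recall for precision, which is why P and R reported at 0.001 look so different from P and R reported at 0.5.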

@glenn-jocher Sounds great! Current P&R curves are quite misleading, as the 0.001 threshold is defined in the code.

@TheophileBlard all done in feea9c1a65c73475803847c83545b5e7ee6c528c. Thanks for raising the issue, I think this update will help everyone! Here is a before and after run of the coco_64img.data tutorial. Let me know if you see any other problems.
results

I may be misunderstanding precision and recall at the training stage. The plot below is what I got when training (using my custom data, which has two classes, stop sign and yield sign; I used the default setting to split the data into train/val). You can see that precision, recall and mAP are very poor (I used the default conf).

results

However, when I run the test code for all the data together, as below:
python3 test.py --data data/stopsigns.data --cfg cfg/yolov3-spp-stopsigns.cfg --weights weights/yolov3-spp-ultralytics-stopsigns.pt
I got results:

           Class    Images   Targets         P         R   mAP@0.5        F1
             all       554       543     0.979     0.947     0.991     0.963
        stopsign       554       276      0.97     0.938      0.99     0.954
       yieldsign       554       267     0.988     0.955     0.992     0.971

Does it mean the model overfits the dataset a lot? But when I used the model to predict on some random street pictures downloaded from the internet, the performance seems okay.

@rightly0716 testing on your training data is only useful as a sanity check. It serves no purpose in terms of checking for generalization, which is what the test set is for. Your P and R don't matter, as you select these yourself through conf-thres.

mAP is the metric that matters. If your training results are not to your liking, then it's time for you to experiment with ways to improve them.

I see. I have only ~500 labelled images, and am wondering whether that could be a reason. I will do a deeper analysis and see.

Thanks!

@rightly0716 definitely more data would help. Also make sure you are training at an appropriate image size, and check your train.jpg and test.jpg images for correct labeling.

I uncommented the plotting code in ap_per_class in utils, as follows:
    # Plot
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    ax.plot(recall, precision)
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_xlim(0, 1.01)
    ax.set_ylim(0, 1.01)
    fig.tight_layout()
    fig.savefig('PR_curve.png', dpi=300)
My dataset has two classes, but the PR curve graph shows only one. How can I solve this?
PR_curve

@tinothy22 ah yes, I see what you mean. The graph is inside the for loop, so it will plot one graph per class and save it (overwriting the previous one). If you want to overlay all of your classes you must modify the plotting code a bit, to create the figure before the loop, plot as is, and then save the figure after the loop.
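A minimal sketch of the change described above: the figure is created once before the loop, each class is plotted inside the loop, and the file is saved once after it. The function name plot_pr_curves and the dictionary of per-class curves are hypothetical stand-ins for the variables inside ap_per_class, and the demo curves are synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import numpy as np


def plot_pr_curves(curves, path="PR_curve.png"):
    """Overlay one PR curve per class. curves maps class name -> (recall, precision)."""
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))  # create the figure once, before the loop
    for name, (recall, precision) in curves.items():
        ax.plot(recall, precision, label=name)    # one line per class
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_xlim(0, 1.01)
    ax.set_ylim(0, 1.01)
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=300)                    # save once, after the loop
    plt.close(fig)
    return ax


if __name__ == "__main__":
    # Synthetic curves for two classes, for illustration only
    r = np.linspace(0, 1, 50)
    plot_pr_curves({"stopsign": (r, 1 - r ** 4), "yieldsign": (r, 1 - r ** 2)})
```

Saving inside the loop is what caused the overwrite: every class wrote to the same PR_curve.png, so only the last class survived.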

Thank you, I will try changing the code.

@tinothy22 we definitely want to add this to tensorboard output in the future, for now unfortunately this is the only way to do it.

That's great! Thank you for your guidance, I now have the PR curve.

Hello, thank you for the clear explanation. I just want to clarify my understanding of the threshold for the precision-recall curve, as I have been reading this over and over again.

  1. Is it true that the threshold can vary for each label?
  2. In feea9c1, why did you change the PR_threshold to 0.5, when the current code has it at 0.1? Where should we specify the threshold for drawing the precision-recall curve?
  3. Would the result be the same if this line were changed to precision = tpc / n_p ? https://github.com/ultralytics/yolov3/blob/82f653b0f579db97f8908800d45e8f5287f79bd3/utils/utils.py#L177

Thanks and would like to hear your answers!

@jas-nat the curve has no threshold, it is plotted for all thresholds.

@glenn-jocher Got it. Thanks for answering!

Hello, thank you for the clear explanation. The curves with a slider seem useful. They could help me select conf-thres. I would like to know how to do this.

@risemeup you'd need to code up an interactive version of the plot above with something like plotly dashboard maybe. Let me know if you come up with a solution!

Hi, I want to ask how we can find the best threshold from the curve. Is it in results.txt, or somewhere else?

@jas-nat There does not seem to be such information in results.txt. The optimal threshold is near the turning point of the PR curve, where both precision and recall are high. You can add some code to the ap_per_class function to write out the confidence at every point of the PR curve and find the best conf-thres from that. So it would be very convenient if we could plot the curve with a slider.

@risemeup I am trying to implement it. Can you guide me on how to find the best conf-thres?

I followed this tutorial, but it uses precision_recall_curve from scikit-learn. I am a little confused about finding the corresponding variables in utils.py.

Can high F1 score indicate the best conf-thres?

@risemeup @jas-nat there is no "optimal" or "best" threshold. It is up to the user to set this however they like, depending on the compromise they desire between increasing recall and reducing FPs.
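In line with the point that no threshold is universally best, here is a sketch of one possible project-specific rule: take the operating point with the highest recall that still meets a minimum precision you choose. The helper name, the rule itself and the toy curve are illustrative assumptions, not something provided by the repository.

```python
def pick_conf_thres(conf, precision, recall, min_precision):
    """Return (conf, precision, recall) of the highest-recall point whose
    precision is at least min_precision, or None if no point qualifies."""
    candidates = [
        (r, p, c)
        for c, p, r in zip(conf, precision, recall)
        if p >= min_precision
    ]
    if not candidates:
        return None
    r, p, c = max(candidates)  # highest recall first; ties broken by precision
    return c, p, r


# Toy PR curve sampled at five confidence values (values are made up)
conf = [0.9, 0.7, 0.5, 0.3, 0.1]
precision = [1.0, 0.95, 0.9, 0.7, 0.5]
recall = [0.2, 0.5, 0.7, 0.85, 0.95]

strict = pick_conf_thres(conf, precision, recall, min_precision=0.9)
lenient = pick_conf_thres(conf, precision, recall, min_precision=0.6)
```

With this toy curve, the strict requirement selects conf 0.5 and the lenient one selects conf 0.3, so the "best" threshold changes with the requirement rather than with the model.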

@jas-nat I looked at this tutorial. It uses accuracy to find the corresponding conf-thres, but there is no accuracy in object detection. The article says "a certain binary classification metric" at the beginning, so this method is not suitable.
I don't think the best threshold is calculated by some formula. It depends on your project. For example, some projects need high recall and precision isn't very important, while other projects may require the opposite. So the threshold should be appropriate for your own project.
@glenn-jocher thank you for your reply!

@glenn-jocher @risemeup Thank you for the replies!

Sorry, I am still trying to understand the code.
https://github.com/ultralytics/yolov3/blob/bdf546150df5aaeacd1eb415b5dc830096079880/utils/utils.py#L188
In that line, as far as I understand, it will create a new interpolation point referring to conf[i] for x axis and precision[:, 0] or recall[:, 0] for y axis, am I right?

I have 2 questions:

  1. I don't see when pr_score changes in the code. Doesn't np.interp() need the new points as its first argument to compute the interpolation? If I am missing something, let me know.
  2. When I print p[ci] it only shows 1 value. Is that the generated interpolated value?

For your information, I only train for 1 label.

@jas-nat I will try to answer both questions from my understanding. If I make a mistake, let me know.

  1. pr_score is set to a fixed value. We get a set of precision, recall and conf values when drawing the PR curve, but we only need one precision value to describe the current training status, so we take the precision at the point where conf-thres equals pr_score.
    https://github.com/ultralytics/yolov3/blob/8241bf67bb0cc1c11634bdb4cc76e06ac072192b/utils/utils.py#L167

  2. Yes, p[ci] is generated by interpolation, as explained above.

I hope this helps. If anything is wrong, please point it out.
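The interpolation described in this answer can be sketched as follows. np.interp(x, xp, fp) takes the query point first and requires xp to be increasing, so this sketch assumes the confidences are sorted in descending order and negates them. The arrays are toy values, and the negation is an assumption about how to satisfy np.interp, not a quote of utils.py.

```python
import numpy as np

pr_score = 0.5  # fixed confidence at which a single P and R are reported

# Toy per-class curve, sorted by descending confidence (values are made up)
conf = np.array([0.9, 0.7, 0.3, 0.1])
precision = np.array([1.0, 0.9, 0.7, 0.5])
recall = np.array([0.2, 0.5, 0.8, 0.9])

# Negate both the query point and the x-coordinates so xp is increasing
p_at_score = np.interp(-pr_score, -conf, precision)
r_at_score = np.interp(-pr_score, -conf, recall)

print(p_at_score, r_at_score)  # one scalar each, not a whole curve
```

Because the query point is a single number, the result is a single interpolated value per class, which matches seeing only one value when printing p[ci].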

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@glenn-jocher

  1. According to what you said above, do the P, R, mAP and F1 obtained from training on your own data have no reference value? Is there no value in getting P, R, mAP and F1 from the test? How should the quality of the trained model be evaluated?
  2. Why is conf-thres set to 0.01 in test.py?
  3. I only use one category; do I need to set --single-cls?

thank you!
