Ignite: Metrics Implementation Question

Created on 30 May 2020  路  16Comments  路  Source: pytorch/ignite

Thanks for the great library, especially the metrics. I have a few questions to better understand the implementation:

During the update stage, why are the values converted to Python floats instead of keeping them as torch values (e.g. here)? This operation incurs a device->host transfer, so the operation is blocking, right? Wouldn't it be better to keep the metric values as torch values on the GPU so the update is async? Then, they can be converted to python floats in the compute method.

In the distributed case, the values are put back in a tensor before the all-reduce, so why not keep them as tensors to begin with?

question

Most helpful comment

@n2cholas in some sense maybe it could make sense to update metrics code, such that internal cumulators where is not done become tensors and user can specify the storage by device. We already have this argument but it is unsed in most of the cases... We also should be careful about specfic implementations where double precision is required...

All 16 comments

@n2cholas thanks for good question !

Depending on metric, but, yes, majority of metrics internally stores scalars and that's true that passing those scalars to CPU on update has a slight overhead (~20 ms per 1000 updates).
Reasons for that were: implicit graph detach, no GPU storage (tiny part). However, I agree that we can store them as tensors, use device arg to define where to keep the tensors. This can be a nice enhancement to benchmark.

Thanks for the quick reply! If we keep them as tensors, would it be better to keep them on the GPU or CPU by default? I believe tf.keras metrics are on the GPU by default. This would allow the update method to be completely async.

I would be happy to contribute this enhancement to the metrics, though I don't have experience benchmarking performance to see if we get an improvement/regression.

@n2cholas thank you to contribute on that issue :)

IMO the first point should be to have a performance comparison between keep tensors on GPU or not. Bench could be done on cifar10 for instance.

@n2cholas sounds great about contributing ! If you would like to start right now, please use idist branch as reference code, otherwise we plan to merge it within 1-2 days, so you can work from the master.

About benchmark, I think a simple time measure of 2 codes where we run validation phase with metrics computation several times (e.g. 10) can be OK.

I ran the benchmarks using this script: https://gist.github.com/n2cholas/654850d6345242c4406e0bef23e3d060. Essentially, I evaluated the accuracy metric for training on CIFAR 10 over one epoch. I'd love feedback on this benchmarking script, if it's flawed in any way.

Over 20 runs, here's a summary:
|Accuracy Impl| Avg Time (s) | Std Deviation|
|----------| ---------| ----------|
|Ignite Master| 15.1008 | 0.0706 |
|Custom on CPU| 15.1145 | 0.0640|
|Custom on GPU| 14.9935 | 0.0461|

There is a slight advantage to using the metric with the value on the GPU, but only if the GPU is saturated during training.

I also tracked the traces for each of these implementations.

For the ignite metrics:
ignite-trace

For my accuracy metric where the tensors are on the CPU:
custom-cpu-trace

For my accuracy metric where the tensors are on the GPU:
custom-cuda-trace

These are produced by 3 iterations over the dataset (you can see the code in the benchmark script). When the GPU is saturated, the ignite metrics' .item() call blocks progress on the CPU until the CUDA code has caught up. Similarly, the my custom metric where the tensor is on the CPU has a .to('cpu') call that blocks. Only the custom metric with the cuda tensor does not block progress. However, the impact of this overall is quite small in this example. I'd imagine there would be a bigger difference if there is more complex python code around the PyTorch code.

But in general, people have blocking operations in their code all the time, especially when doing periodic reporting. Making this change doesn't have as big of an impact as I expected, so I'm not sure if it's worth changing the current behaviour (especially given current user expectations). Any thoughts?

@n2cholas Thank you for this work. Your code looks perfect.

Results are interesting and I agree with you. The fact that it could be completely async is nice and could be useful for complex applications. The performance gap is small, that鈥檚 a good point for legacy code. I agree your conclusion, atm no worth to do something.

I鈥檓 curious about similar experiment in testing mode rather than training.

Another interesting point should be to benchmark as you did the ConfusionMatrix metric with segmentation. I鈥檓 too curious 馃槄

@sdesrozis there was an issue on ConfusionMatrix: https://github.com/pytorch/ignite/issues/684
Its implementation uses torch tensors on cpu and double fp for precision.

@n2cholas thanks for the gist, nice code !
I rerun it on Nvidia 1080 Ti and here is my numbers:

Done trace: ignite_cpu_trace
Mean Time: 10.918659210205078s  Std Time: 0.17898264527320862s  All Times: [10.5758359375, 10.5846796875, 10.6189111328125, 10.6439775390625, 10.753939453125, 10.9914912109375, 11.062501953125, 11.0441865234375, 11.043784179687501, 10.942732421875, 11.0252587890625, 10.94737890625, 11.0996953125, 11.072447265625, 10.8777783203125, 10.988880859375, 11.0728125, 10.9588857421875, 11.002666015625, 11.065345703125]
Done trace: custom_cuda_trace
Mean Time: 10.895665168762207s  Std Time: 0.14294369518756866s  All Times: [10.5118779296875, 10.517384765625, 10.74994140625, 10.9200849609375, 11.005560546875, 10.9894736328125, 11.014259765625, 10.959455078125, 10.9215078125, 10.9549453125, 10.9544873046875, 10.968365234375, 10.8974111328125, 10.92003515625, 10.9572177734375, 11.0078251953125, 10.9378408203125, 10.8899716796875, 10.87141796875, 10.9642353515625]



md5-92a7a69f6714fec08615723a4a668b44



custom cuda: Mean Time: 6.645386695861816s  Std Time: 0.23643706738948822s
ignite: Mean Time: 6.786988735198975s  Std Time: 0.16527831554412842s

@vfdev-5 thanks for the rerun and choosing a more realistic batch size.

@sdesrozis Here is a similar script for evaluating a validation loop instead of training. I used a batch size of 512 (like @vfdev-5 ) and 50 runs to get a tighter standard deviation. This and my previous runs were both on a GTX 1080. The results are pretty similar, but the traces are much cleaner. The completely async nature of the custom cuda implementation is much more apparent.

Ignite:

Mean Time: 1.059605598449707s
Std Time: 0.014668694697320461s
All Times: [1.111772705078125, 1.0760010986328126, 1.0779454345703126, 1.0475296630859374, 1.063631591796875, 1.0401522216796875, 1.08554150390625, 1.0513619384765625, 1.0520213623046875, 1.0756719970703126, 1.05441259765625, 1.06146484375, 1.0483077392578126, 1.0805594482421874, 1.047625732421875, 1.050948486328125, 1.0686212158203126, 1.040948974609375, 1.068043701171875, 1.0732552490234375, 1.0576787109375, 1.038017578125, 1.0690546875, 1.0786173095703124, 1.0792237548828125, 1.067673583984375, 1.0534521484375001, 1.048385498046875, 1.0591666259765624, 1.0659741210937501, 1.0362990722656251, 1.0420843505859376, 1.053845947265625, 1.0706207275390625, 1.0574412841796874, 1.05014111328125, 1.0607860107421876, 1.0439991455078126, 1.060889404296875, 1.051964599609375, 1.0492752685546876, 1.070322021484375, 1.0469660644531251, 1.0494393310546875, 1.0538858642578126, 1.0563857421875, 1.0503135986328125, 1.0760842285156251, 1.043965087890625, 1.0625087890625]

Custom on CPU:

Mean Time: 1.0600998401641846s
Std Time: 0.016978954896330833s
All Times: [1.0984403076171876, 1.0494051513671876, 1.035371826171875, 1.0610870361328124, 1.045078125, 1.0598719482421874, 1.0756278076171875, 1.0297471923828125, 1.027935302734375, 1.065294189453125, 1.0613526611328126, 1.0637432861328124, 1.0517838134765625, 1.0540438232421876, 1.0337780761718751, 1.0409871826171875, 1.0708486328125, 1.0449569091796875, 1.0629097900390625, 1.0386883544921874, 1.0573258056640624, 1.0498575439453126, 1.0445811767578126, 1.0379779052734375, 1.0409044189453125, 1.05157470703125, 1.03104833984375, 1.080220703125, 1.0673245849609376, 1.0762749023437501, 1.062837646484375, 1.051602294921875, 1.0489697265625, 1.0705521240234375, 1.070385498046875, 1.057185546875, 1.0765167236328126, 1.0771417236328125, 1.058330322265625, 1.0861689453125, 1.083775146484375, 1.0895032958984374, 1.077263916015625, 1.0667071533203125, 1.0633187255859375, 1.0628101806640626, 1.0629976806640624, 1.0817080078125, 1.0624017333984375, 1.0867723388671875]

Custom on GPU:

Mean Time: 0.9974555969238281s
Std Time: 0.012949894182384014s
All Times: [1.026460693359375, 1.0256392822265625, 0.9810296630859375, 0.992499267578125, 0.9777049560546875, 0.9932820434570313, 0.9784326171875001, 0.9781923828125, 0.990439453125, 0.984932373046875, 1.008616455078125, 1.013127197265625, 0.992532470703125, 0.9864641723632813, 0.9835632934570313, 1.0048839721679688, 0.9925211791992188, 0.9954232177734376, 1.000791015625, 0.999962646484375, 0.9947289428710938, 0.98918603515625, 0.9813984985351563, 1.003957275390625, 0.9944736938476563, 0.9956331787109375, 0.9954109497070313, 1.0007716064453125, 0.99830859375, 1.0178343505859375, 1.0066236572265626, 0.9960570678710937, 0.9823938598632813, 1.0055481567382814, 0.997375, 1.0058368530273438, 0.9903257446289063, 0.9843046264648437, 0.9871902465820312, 1.0250125732421875, 1.0187028198242187, 0.996167236328125, 0.9972598266601562, 0.9860249633789062, 1.012158447265625, 0.98768896484375, 1.024879638671875, 1.0026895141601562, 0.9879009399414063, 1.000437744140625]

Traces:

Ignite:
ignite2

Custom CPU:
custom-cpu2

Custom GPU:
custom-cuda2

@n2cholas in some sense maybe it could make sense to update metrics code, such that internal cumulators where is not done become tensors and user can specify the storage by device. We already have this argument but it is unsed in most of the cases... We also should be careful about specfic implementations where double precision is required...

@n2cholas Thank you for this. We can do a FR on this following @vfdev-5 remarks.

@n2cholas would you be interested in working on this feature request ?

@vfdev-5 @sdesrozis Sorry about not following up: I currently don't have the bandwidth to work on this. I will have more time in August, and would be happy to work on this feature request then, if the feature can wait. If it can't wait, then I think someone else should work on it.

@n2cholas thank you for your answer. Let鈥檚 see if someone take it, otherwise it鈥檚 yours in august 馃槉

I'm available to take this now!

Based on the discussion, the compute method should return tensors with the same device as the input, and update method should move the tensors to that metric's device (if needed). The accumulators would become tensors. This affects the following classes:

  • Accuracy
  • Fbeta
  • Loss
  • MeanAbsoluteError
  • MeanPairwiseDistance
  • MeanSquaredError
  • _BasePrecisionRecall
  • Recall
  • RootMeanSquaredError
  • TopKCategoricalAccuracy

We had some discussion on ConfusionMatrix, but it looks like that was fixed (everything is on the device the user specifies).

In addition, I would change some of the documentation so that examples don't call .item() (e.g. this line: https://github.com/pytorch/ignite/blob/master/ignite/metrics/precision.py#L95 ).

A couple of questions:

  1. Unit-testing wise, I will add tests to make sure accumulators and results are on the right device. Is anything else needed?
  2. Right now, the return type of compute is Union[torch.Tensor, float]. After this change, the return type should just be torch.Tensor. Does that sound ok?

@n2cholas that's great !

Yes, your suggestions sounds good ! Additionally, we have to also ensure that update method breaks the graph for y_pred and we manipulate just tensors.

Unit-testing wise, I will add tests to make sure accumulators and results are on the right device. Is anything else needed?

Yes. Probably, ensure that accumulators are detached.

Right now, the return type of compute is Union[torch.Tensor, float]. After this change, the return type should just be torch.Tensor. Does that sound ok?

I think only internal accumulators can be tensors, let compute result remain scalar (e.g. float) or vector or matrix according to the metric definition.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

samarth-robo picture samarth-robo  路  3Comments

sisp picture sisp  路  3Comments

vfdev-5 picture vfdev-5  路  3Comments

alykhantejani picture alykhantejani  路  3Comments

jphdotam picture jphdotam  路  4Comments