I can't manage to cap memory consumption in my workflow.
I create and release sessions over time, from a single long-running process (an HTTP real-time serving application), in order to run inference on models that are retrained daily.
I create one new session per newly trained model.
Outdated model sessions are closed when replaced by newer ones, but memory usage keeps growing.
Using the C API, I tried to replicate my workflow in a demo app: we can see memory increasing (even though the Go runtime memory stays constant).
git clone https://github.com/glutamatt/onnxleak.git && cd onnxleak
docker build --tag tmponnxleak . && docker run -it --rm tmponnxleak
I need help capping the memory usage of the Go process sooner.
Additional context
demo output
#1 New Session (6 threads, 10 MB resident, 71 MB go sys)
#2 New Session (7 threads, 46 MB resident, 73 MB go sys)
#3 New Session (22 threads, 55 MB resident, 73 MB go sys)
#4 New Session (22 threads, 62 MB resident, 73 MB go sys)
#5 New Session (25 threads, 65 MB resident, 73 MB go sys)
...
#20 New Session (28 threads, 136 MB resident, 73 MB go sys)
...
#50 New Session (30 threads, 238 MB resident, 74 MB go sys)
...
#100 New Session (35 threads, 292 MB resident, 74 MB go sys)
...
#200 New Session (43 threads, 301 MB resident, 74 MB go sys)
...
#500 New Session (50 threads, 321 MB resident, 74 MB go sys)
...
#750 New Session (50 threads, 326 MB resident, 74 MB go sys)
...
#1000 New Session (55 threads, 336 MB resident, 74 MB go sys)
This runs one session after the other (open, infer, close, open, infer, close, ...) with 10000 concurrent predictions for each session.
The model is always the same (a 3 MB ONNX file with 3 hidden dense layers, 1 input (shape [1, 1415]), and 1 output (shape [1, 100])).
The input is always the same zero-filled vector.
(I tried another workflow, running predictions indefinitely over one single session, and there the memory is well capped.)
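For readers who want to reproduce figures like "10 MB resident, 71 MB go sys" in their own process, here is a minimal sketch. It is not taken from the demo repository; it assumes Linux (it reads `/proc/self/status`), and the helper names `parseVmRSS` and `goSysMB` are hypothetical. The point it illustrates is that resident memory covers the whole process, including native allocations made through cgo, while `runtime.MemStats.Sys` only covers memory obtained by the Go runtime.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseVmRSS (hypothetical helper) extracts the VmRSS value, in kB,
// from the text content of /proc/<pid>/status on Linux.
func parseVmRSS(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

// goSysMB (hypothetical helper) returns the memory obtained from the OS
// by the Go runtime, in MB. Native (cgo) allocations are not included.
func goSysMB() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.Sys / (1 << 20)
}

func main() {
	data, err := os.ReadFile("/proc/self/status") // Linux only
	if err != nil {
		fmt.Println("resident memory unavailable:", err)
		return
	}
	rssKB, err := parseVmRSS(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%d MB resident, %d MB go sys\n", rssKB/1024, goSysMB())
}
```

If resident memory grows while the Go figure stays flat, as in the output above, the growth is happening outside the Go heap.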
@glutamatt : Thanks for the well-designed demo case. Some memory may fail to be released in a timely manner, even if it is not leaked. Will try to dig in.
@glutamatt : When setting predPerSession to 1, the memory consumption stabilizes after a few minutes of running. If there were any memory leaks, we should surely see the usage increase on a regular basis, no?
When setting predPerSession to 2 or higher, the memory took noticeably longer to become flat, but it did flatten out eventually.
So for your case, where predPerSession is 10000, maybe we just need to wait with a bit more patience.
The thing is, the Go GC has its own algorithm for deciding when to release unused resources; until then it may just hold on to them for a while.
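To check how much the Go GC itself is holding on to, one option is to force a collection and ask the runtime to return memory to the OS with `debug.FreeOSMemory()` from the standard library. The sketch below is an illustration only; `heapInuse` and `measureRelease` are hypothetical helper names. Note the assumption that matters here: this only affects memory managed by the Go runtime. Memory allocated by native code through cgo (such as a C library's own allocator) is not tracked or released by the Go GC.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapInuse (hypothetical helper) returns the bytes held in in-use heap spans.
func heapInuse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

// measureRelease (hypothetical helper) allocates size bytes on the Go heap,
// drops the reference, forces a GC plus a release to the OS, and returns
// the in-use heap before and after.
func measureRelease(size int) (before, after uint64) {
	buf := make([]byte, size)
	for i := range buf {
		buf[i] = 1 // touch the pages so they are really backed by memory
	}
	before = heapInuse()
	runtime.KeepAlive(buf) // keep buf alive until the first measurement is taken
	buf = nil

	// Forces a garbage collection, then tries to return as much
	// memory as possible to the operating system.
	debug.FreeOSMemory()
	after = heapInuse()
	return before, after
}

func main() {
	before, after := measureRelease(64 << 20)
	fmt.Printf("heap in use: %d MB before, %d MB after\n", before>>20, after>>20)
}
```

Calling `debug.FreeOSMemory()` after closing a session is a cheap experiment: if resident memory does not drop, the retained memory is most likely not Go heap.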
Thanks for spending time on this!
For sure, the growth in memory usage tends to slow down over time (at least it seems to).
But with a decent number of sessions, and continuous predictions per session, the memory overhead is a concern.
Maybe the term _leak_ is misused; the issue is more a "_too much memory_ used" case.
I would expect much more memory to be freed on session close: ideally, closing a session would free all the memory allocated for that session.
Some production figures: each pod serves 8 models concurrently. Performance (latency) is perfect right from pod startup, with about 250 MB of resident memory usage.
After a few days (each model being "reloaded" daily), each pod reaches 1 GB of resident memory usage and keeps growing (because of redeploys and Kubernetes memory limits, I never reached any "approximate flat line" of memory usage).
I am just looking to keep memory usage at about 250 MB.
I saw a memory-related feature in the freshly released version 1.5.1:
Sharing of allocators between multiple sessions. This allows much better utilization of memory ...
Maybe it would help?
@glutamatt
I agree that this is a "too much memory used" case. One possible alternative is to use the ORT C/C++ API, where you are empowered to release anything no longer needed, or you could go with C#, where one can force the GC to collect unused objects. I have tested the C# API on your model: as long as everything is disposed of in a timely manner and GC.Collect(...) is called, the memory is flat.
Shared allocators could indeed help when there are a lot of common weights between sessions; worth trying at least.
Thanks Randy for your work and advice.
One last request, please:
utilize ORT C/C++ API where you are empowered to release anything
:eyes: I already try to release as much as possible with the C API.
Which API exactly are you talking about?
Do you have an example or snippet to help me get started?
Thanks again :pray:
@glutamatt:
https://github.com/microsoft/onnxruntime/blob/master/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp
:-)