Actix-web: HTTP2 memory leak with many concurrent streams

Created on 28 Jul 2019 · 30 Comments · Source: actix/actix-web

Take the static_index example from https://github.com/actix/examples

cd static_index
cargo run --release 2>/dev/null

And then run h2load -n 1000000 -c 10 -m 200000 http://127.0.0.1:8080/

When the benchmark is done, the memory stays in use in the server instead of being released.

All 30 comments

Does it continue to grow on multiple runs?

@fafhrd91 Yes

I ran this on an up-to-date Intel Linux distro a dozen times with the same h2load parameters; memory usage grows to about 530MB, and subsequent runs don't increase it further.

Here's a demonstration of the growing memory usage issue. (Fedora 30 with latest Linux kernel on IBM POWER9)

[Screen recording: Peek 2019-08-01 00-29]

@fafhrd91 I'm interested in helping out on this project. Can you assign me this issue?

@schulace sure. I looked into this; I am not sure where the bug is, or whether it is a bug at all.

@leo-lb can you please delete those comments and use something like a gist? This is unreadable.

Thanks, that looks way better. If you delete those massive comments now, it'll be much easier for everyone to read and keep an overview of the thread.

Taking a quick look at this, it seems that the issue you're seeing here is twofold:

1) Your system allocator is suboptimal, so even after h2load finishes a lot of memory is simply lost to fragmentation. If you switch to jemalloc your memory usage will drop by a lot.
2) There is some buffer/structure bloat in actix-web itself where even after the test finishes its internal structures (which were previously bloated due to a massive amount of requests) are not shrunk back to their normal size. I've attached a flamegraph (courtesy of my memory-profiler; GitHub unfortunately doesn't allow attaching raw svg so I had to gzip it - just unpack it and open in your web browser) where you can see around 140MB that was kept allocated long after h2load finished.
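
The second point, structures that grow under load and are not shrunk afterwards, can be illustrated with a plain `Vec`. This is only a minimal standard-library sketch: `fill_and_clear` is a hypothetical helper, and it does not correspond to any particular actix-web internals, which the thread does not identify.

```rust
/// Hypothetical helper: simulate a burst of `n` queued items, then drain them.
/// `Vec::clear` drops the elements but keeps the allocated capacity.
fn fill_and_clear(n: usize) -> Vec<[u8; 64]> {
    let mut buf = Vec::new();
    for _ in 0..n {
        buf.push([0u8; 64]);
    }
    buf.clear();
    buf
}

fn main() {
    let mut buf = fill_and_clear(200_000);
    // The buffer is logically empty, yet its backing allocation is still held.
    println!("after clear: len = {}, capacity = {}", buf.len(), buf.capacity());

    // Explicitly asking the Vec to shrink hands the spare capacity back
    // to the allocator.
    buf.shrink_to_fit();
    println!("after shrink_to_fit: len = {}, capacity = {}", buf.len(), buf.capacity());
}
```

Whether the memory then goes back to the operating system is a separate question that depends on the allocator, which is what the first point is about.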

I came to a similar conclusion as @koute.

BTW that can't be true. I'm having an unlimited amount of leaks here. My system has 64GB of memory. If I let it run for longer, it can start taking more than 60GB of memory and start causing the system to swap thrash, so no, it's not being lost in fragmentation.... and my system allocator is fine, the rest of my system runs fine.

@fafhrd91
And if there's some structure bloat to be solved, why would you close this issue? That's definitely an issue to be resolved.

Please re-open the issue.

It is not possible to fix this without a reproducible example.

I am confused, I posted the reproduction scenario in the first post, it reproduces 100% of the time for me, whether on x86 machines or my POWER9 machine.

This is a Resource Exhaustion DoS vulnerability.

I think @fafhrd91 means that your example currently relies on a machine with 64GB of RAM and on spamming connections. That makes it hard to reproduce (in terms of both hardware and time) and far from a minimal, verifiable test case. So if you could break this down a bit more, that would probably help a lot.

BTW that can't be true. I'm having an unlimited amount of leaks here. My system has 64GB of memory. If I let it run for longer, it can start taking more than 60GB of memory

When I was testing this on my system, that's not the behavior I got: memory usage grew up to a certain point, then stopped growing and stabilized. I'm guessing that's also what would have happened on your system, except that, probably because you have a lot more cores than I do, the server is able to do more concurrent work, hence the higher maximum memory usage.

and start causing the system to swap thrash, so no, it's not being lost in fragmentation....

How did you come to this conclusion? (: Whether your system starts swapping or not has nothing to do with this. The term "lost to fragmentation" has a very specific meaning in the context of memory allocators, and in this case you can basically think of it as memory which is unused by the application but is still being kept allocated by the memory allocator itself, either because of the allocator's limitations or due to bookkeeping constraints, so from the perspective of the operating system that memory is treated as allocated even though it's in reality unused.
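
One way to tell the two cases apart is to count the bytes the application itself currently has allocated and compare that with what the operating system reports for the process. The following is a minimal sketch under stated assumptions, not something taken from the thread: `CountingAlloc` and `live_bytes` are hypothetical names, and the wrapper simply forwards to the standard library's `System` allocator while keeping a running total.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently allocated by the application (requested sizes only).
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that tracks live bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Current number of bytes the application holds through the allocator.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

fn main() {
    let before = live_bytes();
    // black_box keeps the optimizer from removing the allocation.
    let buf = std::hint::black_box(vec![0u8; 1 << 20]);
    let during = live_bytes();
    drop(buf);
    let after = live_bytes();
    println!("live bytes: before = {before}, during = {during}, after = {after}");
}
```

If the live-byte count returns to its baseline after a load test while the process's resident memory stays high, the difference is being held by the allocator rather than by the application. If the live-byte count itself stays high, the application is still holding on to the memory.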

and my system allocator is fine, the rest of my system runs fine.

That's not what I meant. If you're on Linux you're most likely using glibc, and glibc uses ptmalloc as its allocator, which uses sbrk to allocate memory, which makes it exhibit very pathological behavior in certain cases. Cases which usually look like the scenario you're testing here.

I recommend you do a couple of retests on your machine to investigate this further:

1) Limit actix-web so that it can use at most one or two cores simultaneously. (It'd probably be most effective to patch actix-web or something like that so that it thinks that there are only two cores instead of e.g. setting the affinity and just simply not letting it get scheduled on those cores.)
2) Use jemalloc instead of the system allocator.
3) Limit the cores and use jemalloc simultaneously.

Does your issue reproduce in all of these three cases? What's each case's maximum memory usage?
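
To compare the maximum memory usage of the three cases consistently, one option on Linux is to read the `VmHWM` (peak resident set size) and `VmRSS` (current resident set size) fields from `/proc/<pid>/status`. The sketch below assumes Linux and a readable `/proc`; `parse_status_kb` and `memory_kb` are hypothetical helper names.

```rust
use std::fs;

/// Extract a "<key>:  <value> kB" field from the text of /proc/<pid>/status.
fn parse_status_kb(status: &str, key: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Return (current RSS, peak RSS) in kB for a pid, or None if unavailable.
/// `pid` may also be "self" to inspect the current process.
fn memory_kb(pid: &str) -> Option<(u64, u64)> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    let rss = parse_status_kb(&status, "VmRSS")?;
    let peak = parse_status_kb(&status, "VmHWM")?;
    Some((rss, peak))
}

fn main() {
    // Pass the server's pid as the first argument; defaults to this process.
    let pid = std::env::args().nth(1).unwrap_or_else(|| "self".to_string());
    match memory_kb(&pid) {
        Some((rss, peak)) => println!("pid {pid}: VmRSS = {rss} kB, VmHWM = {peak} kB"),
        None => println!("could not read /proc/{pid}/status (not Linux, or no such pid)"),
    }
}
```

Running it against the server's pid before and after each h2load run gives one current and one peak figure per case.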

Closing, as a proper reproducible example has not been provided.

what the.. the example is provided in the first post, the example cannot get any more basic than this.
The issue is not related to high core count or large amounts of RAM; it also happens on an x86 machine that has 4 cores and 16GB of RAM.

I ran this on an up-to-date Intel Linux distro a dozen times with the same h2load parameters; memory usage grows to about 530MB, and subsequent runs don't increase it further.

And in ANY case, 530MB of idle usage is not acceptable.

I can not reproduce on my Mac.

And that means the issue must be non-existent?

@koute When memory is "lost in fragmentation", it eventually gets reused later by future allocations. Fragmentation does not cause an endless increase in memory usage. If a system starts swap thrashing, it means that the memory isn't being reused and the usage is increasing indefinitely. Glibc's allocator does not cause memory leaks such as this.

You could do the open source community a favor and debug the problem, especially because you can easily reproduce it.

Yes, I regularly do so, and I have already tried, but I haven't had enough time to end up with a positive result. However, that doesn't warrant closing the issue.

@leo-lb the issue is now 5 months old, actix has had a major rewrite due to async/await, and apparently no one on the core team can reproduce this.
I'd suggest toning it down a little and investigating yourself a bit, or trying to find someone who can help you with that.

I'm not sure I understand what you're suggesting here.
I assure you I do not go around inventing issues. This issue isn't a request for anyone to work on anything; it is simply a log, a to-do list, a tool to keep track of things.
So please re-open it?

Try running the app with MALLOC_ARENA_MAX=1

I am getting similar results: the same kind of leak, and it keeps leaking over time with repeated h2load runs.

It seems that the latest examples do not support HTTP/2, at least not without TLS?
Just tried with jemalloc, same leak.
I pushed my modifications here: https://github.com/leo-lb/examples/tree/http2-memleak/static_index
