Lightning: Anormal CPU consumption

Created on 20 Aug 2019  路  28Comments  路  Source: ElementsProject/lightning

When I develop for BTCPayServer, I use a docker-compose which run all the dependencies needed for a comfortable dev environment on regtest. (on 0.7.2)

In this docker-compose I have two clightning: A and B where A open channel to B and send payment to B. (never the reverse)

I noticed that A was taking 100% of my CPU while being idle. Everything seemed to works fine, just that the CPU was strangely high.

I have not yet tried to update to 0.7.2 in production, so I don't know yet if this is a general problem, if this is specific to 0.7.2, or specific to regtest and my own setup.

I thought reporting it in case somebody could reproduce.

voodoo

Most helpful comment

I finally took a look at the issue, and it seems like _SC_OPEN_MAX in docker images is huge:

#include <stdio.h>
#include <unistd.h>
int main() {
  int max = sysconf(_SC_OPEN_MAX);
  printf("_SC_OPEN_MAX %d\n", max);
  return 0;
}

This returns the following:

| Context | Value |
|---------------|---------|
| Host (User) | 1024 |
| Host (root) | 1024 |
| Docker (root) | 1048576 |
| Docker (user) | 1024 |

So no wonder that got a lot slower, we are issuing 1024x syscalls (close()) on each subprocess (in particular calling bitcoind for polling). So my suggested quickfix would be not to run as root in docker (might be a good idea anyway).

@NicolasDorier would that be an option?

All 28 comments

Ok it did not happened to me only.

https://github.com/btcpayserver/btcpayserver/issues/978

I have good tracing tool, will eventually figure out the issue, but there is a bug somewhere in 0.7.2

@rustyrussell @cdecker @ZmnSCPxj here are some flamegraph I took:

https://aois.blob.core.windows.net/public/1.svg

https://aois.blob.core.windows.net/public/2.svg

I tested again with 0.7.1, no problem. In prod, 0.7.2 does not show this problem, seems to be only regtest 0.7.1.

Can you also make flamegraph for 0.7.1? Thanks.

The flamegraphs for 0.7.2 seem to blame something called _arch_local_irq_enable, I am uncertain if this is relevant or not.

@ZmnSCPxj it appears relevant https://aois.blob.core.windows.net/public/3.svg

Note that on 3.svg, most of the traces are inside bitcoin-cli, not inside lightningd, because it's mainly idle.

Ouch. I have no idea what _arch_local_irq_enable is, it is nowhere in our codebase. Looks like digging through the libc and/or kernel is in order ....

@ZmnSCPxj unsure if it can help you our dockerfile is the same as yours outside stuff for tracing https://github.com/btcpayserver/lightning/blob/master/Dockerfile

Might it be the fact we run on alpine... but strange the problem was not here before. And no idea about __clone ?

__clone either means fork or pthread_create, and since we mostly use fork (and we use it a lot) it is likely just fork. Unless some library we use actually uses pthread_create underneath. And since __clone is likely to reset the stack for the new process that is likely the reason why you do not see it attributed properly as to who called the fork, because the information is erased by __clone clearing the stack.

I think there was a recent thing where we had to close fds when starting plugins, lemme try to dig that...

Could you check before and after this commit to see if this introduced the slowdown? https://github.com/ElementsProject/lightning/commit/25ddb8082369451024f471bc2d77fa7e7207cc47

If not then maybe bisect from there, haha...

@ZmnSCPxj or maybe I can just remove stack clearing with some magic compile flag?

testing without the commit

@ZmnSCPxj that's an incredible nose for smelling bugs that you have. Reverting this commit fix the issue. But why I can't see a problem in prod? Maybe because my node has no connection.

@ZmnSCPxj or maybe I can just remove stack clearing with some magic compile flag?

No, clone is a Linux kernel thing. You have to modify the kernel, not just C-Lightning.

@ZmnSCPxj that's an incredible nose for smelling bugs that you have.

Thank you very much. Principle is new code is always suspicious.

The other clue is that the flamegraph partially blames close from within the __clone call. And the commit in question adds a loop of close calls after fork but before exec.

Reverting this commit fix the issue. But why I can't see a problem in prod? Maybe because my node has no connection.

I am unsure. The code is correct functionally. Time to bring in @rustyrussell .

It could be that we are now doing The Right Thing but we erroneously did not do this before. Maybe the extra CPU load is fine?

I have gone back to 0.7.1 for our tests. I keep 0.7.2 in production, will report if user complains.

Will check the value of _SC_OPEN_MAX

so I tried to debug with printf by logging before and after the call.

Sadly nothing get printed. :(
Let me know if I can do something to help debug this.

Does the wizard can take a look at the voodoo? or let me know anything I can do to help?

Get me a replacement for my day job?

Sorry, this is some rather low-level OS stuff. The code added in the commit is correct, from what I know of Linux and POSIX. The extra CPU load might be acceptable. We also do the same code when starting a subdaemon. The difference is now we do the Right Thing (TM) on a bunch of other calls, including bitcoin-cli, which I suspect is the main cause in your particular case. Ping @rustyrussell again as he has more OS-level knowledge I suspect.

mmh maybe another issue might be that the docker is on musl (fetching glib from another source).
I vaguely remember complains about musl on some low performance of some low level stuff like this.

Might try to switch to another base image.

The added loop is a loop on calls to close, and I find it hard to imagine that such a simple syscall could have additional overhead from the C library. Still, it does not hurt to try.

We also now call into bitcoin-cli more often, and I also noticed that we recently greatly increased our running time in our Travis tests, so it seems to me likely, that these changes in total greatly increased our run time for calling into bitcoin-cli (more calls into bitcoin-cli, and more overhead per call due to closeing unused fds before calling into bitcoin-cli). So it might not be just specific to your environment.

I tried moving the docker from alpine to stretch. Same issue.

I finally took a look at the issue, and it seems like _SC_OPEN_MAX in docker images is huge:

#include <stdio.h>
#include <unistd.h>
int main() {
  int max = sysconf(_SC_OPEN_MAX);
  printf("_SC_OPEN_MAX %d\n", max);
  return 0;
}

This returns the following:

| Context | Value |
|---------------|---------|
| Host (User) | 1024 |
| Host (root) | 1024 |
| Docker (root) | 1048576 |
| Docker (user) | 1024 |

So no wonder that got a lot slower, we are issuing 1024x syscalls (close()) on each subprocess (in particular calling bitcoind for polling). So my suggested quickfix would be not to run as root in docker (might be a good idea anyway).

@NicolasDorier would that be an option?

mmh I don't know, I can try. The big problem docker and permission is that images which use non-root users can't share data via volume sharing with other containers.

The c-lightning container is sharing different directory:

  1. The tor hidden_service directory to get the onion address
  2. The bitcoin data directory to get the cookiefile for authenticating on RPC
  3. The lightning-rpc that btcpay container is using

Tor and Bitcoin containers are using their own non root user with different UID, and btcpay container is using root user.

I don't want to make change to several different images to fix this situation, as this can be a can of worms. Fixing one image which also share data might make another image fail, which need me to fix etc... and I am myself not maintainer of some images.

I can probably work around 2. by not relying on cookie auth for rpc authentication, and use a fixed password to RPC (we do this when the user select LND because LND cache the cookiefile value, and end up with bad data in cache if bitcoind reboot), I don't like this because all bitcoin RPC in BTCPay will have the same password.... We don't expose it on internet, but still I don't like it.

I think 3. should not be a problem, because this btcpay container runs as root, it will still have access to lightning-rpc.

For 1. I can fix it by using the same UID between Tor container and c-lightning.

On top of this I will need to change the permissions of all files in the c-lightning data directory, because existing user will have the files owned by root. Maybe it is easy, maybe not. I can try...

Frankly for all those problem I wonder if it is not just easier to cap the max variable to max 1024

Frankly for all those problem I wonder if it is not just easier to cap the max variable to max 1024

That wouldn't really work: though unlikely, a process might open >1024 fds, since running under root allows us to do so, and then only closing the first 1024 would leave the remainder lingering.

So far I only found one way that this works in docker reliably: run the container with --ulimit nofile=1024. This is also supported in docker-compose files. I'm still trying to see if there is a way to build this limit into the docker image itself, since having everybody add that will lead to some confusion, but I think for your use-case that is a solution.

Ok, I found another variant that might work: limit ourselves to only ever open up to 1024 fds, so we don't have to close() anything beyond that range.

Thanks a lot!

Was this page helpful?
0 / 5 - 0 ratings