lnd not responding to stop command

Created on 6 May 2019 · 17 Comments · Source: lightningnetwork/lnd

Background

The lnd daemon does not obey the stop command, and even the default kill signal (TERM) has no effect.

Your environment

  • version of lnd 0.6.0-beta commit=v0.6.0-beta-116-g985902be2779a5dcc7ac1d5b3d7eb7929608816d

Expected behaviour

Before this commit I didn't have this problem.

Actual behaviour

The lnd daemon does not obey the stop command, and even the default kill signal (TERM) has no effect. According to the logs, lnd keeps working as if nothing had happened. By my estimate, this happens on about 50% of servers. Some daemons stop only after several stop or kill attempts.

It looks as if signal masking has been added to the code and the code now runs under a signal mask for a long time. Is that possible?

All 17 comments

I am updating the servers now and this problem is widespread. Sometimes I have to wait 5-20 minutes for lnd to stop.

It feels like lnd catches the signal but postpones it, or tries to finish connections or gossip connections first, and does not stop until everything has finished. Previously, it shut down within a few seconds of the stop command.
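The behaviour described above (the stop request is received, but the process keeps running) can be illustrated with a minimal Go sketch. This is not lnd's code and makes no claim about the actual cause; `worker` and `shutdownMillis` are hypothetical names. Go programs usually receive SIGTERM on a channel via `os/signal.Notify` rather than through a signal mask, and a handler then asks every goroutine to quit. If one goroutine never checks the quit channel, shutdown waits for it however long its current job takes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker models a hypothetical subsystem goroutine (not lnd's code).
// A cooperative worker returns as soon as quit is closed; an uncooperative
// one only returns when its current job is done.
func worker(quit <-chan struct{}, job time.Duration, cooperative bool, wg *sync.WaitGroup) {
	defer wg.Done()
	if !cooperative {
		time.Sleep(job) // never checks quit
		return
	}
	select {
	case <-quit: // shutdown requested: exit at once
	case <-time.After(job): // job finished normally
	}
}

// shutdownMillis requests shutdown by closing quit, as a handler fed by
// os/signal.Notify or a stop RPC might, and returns how many milliseconds
// pass before the worker has exited.
func shutdownMillis(jobMs int, cooperative bool) int64 {
	quit := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go worker(quit, time.Duration(jobMs)*time.Millisecond, cooperative, &wg)

	start := time.Now()
	close(quit)
	wg.Wait() // shutdown blocks here until the worker returns
	return time.Since(start).Milliseconds()
}

func main() {
	fmt.Println("cooperative worker, ms until exit:  ", shutdownMillis(300, true))
	fmt.Println("uncooperative worker, ms until exit:", shutdownMillis(300, false))
}
```

In this sketch the signal is never lost or masked: the request arrives immediately, and the delay comes entirely from waiting on a goroutine that does not observe it.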

Even after multiple stop commands, lnd continues to execute gRPC commands (lncli)... It's stopping hell...

On the servers I updated to lnd v0.6.1-beta-rc1 (commit 7f08c09), things are even worse: there, 100% of the daemons do not obey the stop command.

You need to do something urgently!

The only way I have to stop it is to send signal 9 (KILL), but this may violate data consistency...

After 10 minutes, only one of 8 daemons (version v0.6.1-beta-rc1) had stopped; the others are still running. Worst of all, they have caught the command and will execute it maybe in 10 minutes, maybe in an hour. I have to sit and monitor until they deign to exit. If I leave and they exit while I am away, it will be very bad...

The bug was introduced between these commits: 2b43da42209..985902b
With version 2b43da42209 I didn't have the problem.
Commit 985902b already had it.

I'm unable to reproduce this on any of my nodes. Can you provide a goroutine dump via the profiling feature when you attempt to stop any of the nodes?
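For readers unfamiliar with the term: a goroutine dump is a listing of the stack traces of every goroutine in a Go process, which shows what each one is blocked on. The sketch below produces one inside a small standalone program with the standard `runtime/pprof` package; `goroutineDump`, `dumpMentions`, and `stuck` are hypothetical helper names. Assuming lnd's profiling option exposes the standard `net/http/pprof` handlers, the same text would be served at the `/debug/pprof/goroutine?debug=2` path of the profiling port.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// goroutineDump returns the stack traces of all goroutines in this process.
// The debug level 2 selects the verbose, panic-style text format.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// dumpMentions reports whether the dump contains the given function name.
func dumpMentions(dump, fn string) bool {
	return strings.Contains(dump, fn)
}

// stuck is a hypothetical goroutine that blocks until block is closed,
// standing in for a subsystem that never finishes.
//
//go:noinline
func stuck(started chan<- struct{}, block <-chan struct{}) {
	close(started)
	<-block
}

func main() {
	started, block := make(chan struct{}), make(chan struct{})
	go stuck(started, block)
	<-started // make sure the goroutine is running before dumping

	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	fmt.Print(dump)
	fmt.Println("mentions stuck:", dumpMentions(dump, "stuck"))
	close(block)
}
```

In a dump taken while a shutdown is hanging, the goroutines that are still blocked point to the subsystem that is holding it up.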

In the future, when you update nodes, it would be wise to go with a canary-like upgrade procedure. So you'd update one or two nodes to see if everything went smoothly, or even keep one of them on master at all times, then update the rest once stability is attained for the canary deployment.

@LNBIG-COM after issuing the stop, what subsystems do you still see active in the logs? My hypothesis would be DISC, but if you could confirm/deny that would point us in a direction.

I experience the same problem. Updated to v0.6 today with an old node (had to catch up from 22nd Feb).
Rescanning blocks takes very long (about 10 sec/block), which would take me about 10 hours to complete so I wanted to stop lnd with "lncli stop".
Shutdown Request was announced (see log file) but my node keeps syncing.
I uploaded part of the log file (check timestamp 2019-05-06 21:29:23.206).

Node is still syncing at time of writing this post (35 minutes later)

I am not familiar with a goroutine dump. Sorry.

Debug.txt

I'm unable to reproduce this on any of my nodes. Can you provide a goroutine dump via the profiling feature when you attempt to stop any of the nodes?

I do not know how to do that. Also, please understand: shutdown takes hours and I have no desire to experiment right now.

In the future, when you update nodes, it would be wise to go with a canary-like upgrade procedure. So you'd update one or two nodes to see if everything went smoothly, or even keep one of them on master at all times, then update the rest once stability is attained for the canary deployment.

I usually do that, but an error like this is difficult to detect: to catch it, you have to update a node and then stop it at least once after launch.
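The canary procedure discussed here, with the extra stop check mentioned above, could be sketched roughly as follows. Everything in it is hypothetical: `rollout`, `upgrade`, `restartCheck`, and the node names are illustrative placeholders, not part of lnd or any real tooling. The `restartCheck` callback is assumed to start the upgraded node, stop it, and fail if it does not exit in time.

```go
package main

import (
	"errors"
	"fmt"
)

// errStuck is a hypothetical error for a node that ignores the stop command.
var errStuck = errors.New("node did not stop")

// rollout upgrades nodes in order. The first `canaries` nodes must also pass
// restartCheck (a full start/stop cycle) before any later node is touched.
// It returns the nodes that were upgraded and the first error, if any.
func rollout(nodes []string, canaries int, upgrade, restartCheck func(node string) error) ([]string, error) {
	if canaries < 0 {
		canaries = 0
	}
	if canaries > len(nodes) {
		canaries = len(nodes)
	}
	var upgraded []string
	for i, n := range nodes {
		if err := upgrade(n); err != nil {
			return upgraded, fmt.Errorf("upgrade %s: %w", n, err)
		}
		upgraded = append(upgraded, n)
		if i < canaries {
			if err := restartCheck(n); err != nil {
				// Abort before the remaining nodes are upgraded.
				return upgraded, fmt.Errorf("canary %s: %w", n, err)
			}
		}
	}
	return upgraded, nil
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4"}
	upgrade := func(string) error { return nil }
	failingStop := func(string) error { return errStuck }

	upgraded, err := rollout(nodes, 1, upgrade, failingStop)
	fmt.Println("upgraded:", upgraded)
	fmt.Println("error:", err)
}
```

With a stop check on the canary, a shutdown regression like this one would affect a single node instead of the whole fleet.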

after issuing the stop, what subsystems do you still see active in the logs? My hypothesis would be DISC, but if you could confirm/deny that would point us in a direction.

I will prepare one of the logs now

@LNBIG-COM I think we have identified the source of the issue, the fix should be in #3049. You're welcome to try out that PR and confirm that it fixes the reported issue

@Donno1994 That seems to be a separate issue, but indeed it has an easy fix. See #3050

https://drive.google.com/file/d/1vu-KU_jgOTpHA5SXR05cGQk0tpwpoMpP/view?usp=sharing

It's LND-19.
This log is truncated: I truncated it before the update (I didn't know about this bug at that moment).
After the upgrade (0.6.1-beta) I issued the stop command many times (maybe 40-50 times, using both kill and lncli stop) to check for this bug in the new version (0.6.1-beta).
It could not be stopped for maybe 2-3 hours. The last lines of the log show the moment it finally stopped.

I think we have identified the source of the issue, the fix should be in #3049. You're welcome to try out that PR and confirm that it fixes the reported issue

Do I need to merge this PR into my local branch and try it, right?
I can do it right now with LND-19. It has just stopped, so I can check it quickly.

I did:

cd $GOPATH/src/github.com/lightningnetwork/lnd
git pull origin pull/3049/head
make && make install

Then I started lnd and waited until it had fully started. Then I issued stop: no problem.
I did it twice, and now it stops quickly!

I have now checked this fix (3049) three times.
The last time, I waited 10-15 minutes after starting and then stopped.
The bug is fixed by this PR!

@LNBIG-COM glad to hear that resolved the issue :)

@LNBIG-COM I think we have identified the source of the issue, the fix should be in #3049. You're welcome to try out that PR and confirm that it fixes the reported issue

@Donno1994 That seems to be a separate issue, but indeed it has an easy fix. See #3050

thank you for the quick fix :)

I have now checked this fix (3049) three times.
The last time, I waited 10-15 minutes after starting and then stopped.
The bug is fixed by this PR!

@LNBIG-COM You're welcome.
