Lisk-sdk: Application halts clean up when processing block

Created on 31 Jul 2018  路  20Comments  路  Source: LiskHQ/lisk-sdk

Expected behavior

The app should clean up the all the modules gracefully.

Actual behavior

Cleaning up of block processing gets stuck, due to which the application never restarts and continues logging message: Waiting for block processing to finish...

Steps to reproduce

Unclear. Occurred once in snapshotting process.

Which version(s) does this affect? (Environment, OS, etc...)

1.X

framewornode task

Most helpful comment

To check if pg-promise v8.5.4 solves the problem.

All 20 comments

@SargeKhan please confirm that the extra changes made by Nazar aren't fixing the issue:
https://github.com/LiskHQ/lisk/pull/2270.

Error caught by domain go now through the cleanup process to exit from the app with an error code.

Moreover the new process.on('unhandledRejection') handler was added.

@MaciejBaj, i don't think it's related to #2270. The application logs are this:

[dbg] 2018-07-13 07:39:37 | Cache - onNewBlock | Status: false                         
[dbg] 2018-07-13 07:39:37 | Processing block - 1820992849071984899                     
[dbg] 2018-07-13 07:39:37 | Performing forward tick                                    
[dbg] 2018-07-13 07:39:37 | Cache - onNewBlock | Status: false                         
[dbg] 2018-07-13 07:39:37 | Processing block - 14841452819557376132                    
[dbg] 2018-07-13 07:39:37 | Performing forward tick                                    
[dbg] 2018-07-13 07:39:37 | Cache - onNewBlock | Status: false                         
[dbg] 2018-07-13 07:39:37 | Processing block - 11639934055572720661                    
[dbg] 2018-07-13 07:39:37 | Performing forward tick                                    
[dbg] 2018-07-13 07:39:37 | Performing round snapshot...                               
[dbg] 2018-07-13 07:39:37 | Cache - onNewBlock | Status: false                         
[dbg] 2018-07-13 07:39:37 | Processing block - 9576206185259817746                     
[dbg] 2018-07-13 07:39:37 | Summing round - 237                                        
[dbg] 2018-07-13 07:39:37 | Performing forward tick                                    
[FTL] 2018-07-13 07:39:37 | Error: DB Connection Error                                 
[inf] 2018-07-13 07:39:37 | Cleaning up...                                             
[inf] 2018-07-13 07:39:37 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:39:40 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:39:40 | Queue empty                                                
[inf] 2018-07-13 07:39:45 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:39:45 | Queue empty                                                
[inf] 2018-07-13 07:39:47 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:39:50 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:39:50 | Queue empty                                                
[inf] 2018-07-13 07:39:55 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:39:55 | Queue empty                                                
[inf] 2018-07-13 07:39:57 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:40:00 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:00 | Queue empty                                                
[inf] 2018-07-13 07:40:05 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:05 | Queue empty                                                
[inf] 2018-07-13 07:40:07 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:40:10 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:10 | Queue empty                                                
[inf] 2018-07-13 07:40:15 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:15 | Queue empty                                                
[inf] 2018-07-13 07:40:17 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:40:20 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:20 | Queue empty                                                
[inf] 2018-07-13 07:40:25 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:25 | Queue empty                                                
[inf] 2018-07-13 07:40:27 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:40:30 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:30 | Queue empty                                                
[inf] 2018-07-13 07:40:35 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:35 | Queue empty                                                
[inf] 2018-07-13 07:40:37 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:40:40 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:40 | Queue empty                                                
[inf] 2018-07-13 07:40:45 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:45 | Queue empty                                                
[inf] 2018-07-13 07:40:47 | Waiting for block processing to finish...                  
[inf] 2018-07-13 07:40:50 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:50 | Queue empty                                                
[inf] 2018-07-13 07:40:55 | Releasing enqueued broadcasts                              
[inf] 2018-07-13 07:40:55 | Queue empty                                                
[inf] 2018-07-13 07:40:57 | Waiting for block processing to finish...                 

And this continues endlessly.

@SargeKhan We got DB connection error and triggered the cleanup:

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/db/index.js#L88-L95

That triggered async.eachSeries to cleanup all modules

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/app.js#L764-L768

While blocks module is not going to cleanup unless the active sql query get executed and its not going to finish and connection been dropped.

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/modules/blocks.js#L248-L256

So it's a race condition. It can happen to any other module as well in future. So probably there are three solutions;

  1. Either set a timeout in module cleanup process, and skip any module cleanup if taking more time
  2. If there is DB connection error, simply log fatal error and exit the process
  3. Add a timeout for atomic block write in its promise chain

There is no query timeout feature available directly in pg-promise so we have to use bluebird method timeout in start of the promise chain. Just add a timeout entry after tx call.

https://github.com/LiskHQ/lisk/blob/9fc3ea2873756f79fb4403c8038aa897be87b147/modules/blocks/chain.js#L501-L508

@vitaly-t Can provide more input on it.

@nazarhussain not sure what else to add here, except to drop a link: Query timeout in pg-promise.

@nazarhussain can you explain a bit more as to what exactly is the race condition here? Do you mean that this error will occur when there is an existing database write in progress and I/O is waiting for a response? If so, do you have any suggestion on how we can reproduce this issue?
I've tried to reproduce this issue by losing database connection after applyUnconfirmedStep and before applyConfirmedStep. In this scenario, the application ends gracefully.

I am not sure, but we can try out. Try stopping the database while block write was in progress. That may trigger this scenario.

  1. Add some artificial time delay in the block processing
  2. You stopped the database
  3. Database connection error catches
  4. It trigger application cleanup
  5. It triggers the module cleanup
  6. While block module cleanup didn't responded back as the block processing was active
  7. This way application will keep halted.

@nazarhussain, PGPromise is able to handle the scenario you mentioned. But I was able to reproduce the problem by adding a debug point prior to t.batch here: https://github.com/LiskHQ/lisk/blob/development/modules/rounds.js#L229
And once the breakpoint was triggered, I killed the database server and then resumed the application. The problem I got was that t.batch neither went into the catch block or then block. Causing the application to come to a halt. Unfortunately, I was unable to reproduce the same behavior using an automated test using. I tried killing database connection using https://github.com/vitaly-t/pg-promise#library-de-initialization
and through a custom query which kills all connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity where datname = '{databaseName}'

@vitaly-t Do you have any idea, why batch is not handling the case which normal tx handle well.

My guess - something goes wrong with one of the queries, as they already execute at that point.

Might I suggest, not to create any of those queries until they are needed, i.e. try and do it synchronously instead?

For that, replace each query in the batch with a function that returns the query:

.batch([
        () => t.rounds.clearRoundSnapshot(),
        () => t.rounds.performRoundSnapshot(),
                () => t.rounds.clearVotesSnapshot(),
        () => t.rounds.performVotesSnapshot()
    ])

Let me know ;)

@vitaly-t your suggestion also doesn't work. .then() or .catch() block of .batch never gets called.

Seems related to this: https://github.com/brianc/node-postgres/issues/1454

Perhaps sometimes even an extra check doesn't help.

We need an extra research on how to handle PostgreSQL Error: DB Connection Error generally - for all of the queries we're performing: batch, query, one, tx.

Let us know @vitaly-t if you have some further ideas.

I'm keeping an eye on this one :wink:

@SargeKhan @MaciejBaj - Also https://github.com/LiskHQ/lisk/pull/2260 (merged in 1.2.x) might affect this issue.

@vitaly-t any news on this?

@diego-G Version 8.5.3 started supporting timeouts on the low level, and handle broken connections better.

So upgrading to the latest driver, plus configuring that timeout might be able to help here.

Plus, version 8.5.4 now properly releases dead connections, which is important for a process that runs for a long time.

@nazarhussain F.Y.I.

To check if pg-promise v8.5.4 solves the problem.

FYI @yatki, I've noticed a wrong similar behaviour even after upgrading pg-promise. Sometimes after closing the APP, it takes a few extra seconds until it finally closes. It's not related to the snapshotting process. It happens as well during synchronisation.

@diego-G That sounds like not shutting down the database connection pools on time, with the call to pgp.end().

Closing this since it has not happened on the latest releases

Was this page helpful?
0 / 5 - 0 ratings

Related issues

hendrikhofstadt picture hendrikhofstadt  路  4Comments

ScrewchMcMuffin picture ScrewchMcMuffin  路  3Comments

toschdev picture toschdev  路  3Comments

karek314 picture karek314  路  3Comments

Isabello picture Isabello  路  4Comments