We recently closed https://github.com/LiskHQ/lisk-scripts/issues/73. The existence of that issue demonstrates the fragility of grepping a log file to find out if the snapshotting process is done or not.
To be defined. Propose suggestions in the comments; some have already been suggested in the linked issue above.
One approach that comes to mind: snapshot.sh already creates a file that it uses as a lock, to stop other instances of snapshot.sh from doing anything in case someone starts it twice. We could use that file with flock. Core would hold a flock on it while snapshotting; once core is done and releases the lock, the script obtains it and continues, shutting down core and starting it up normally as it currently does. I'm not sure how well Mac OS supports this, though.
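A minimal sketch of the script side of that idea, assuming core really does hold an exclusive flock on the file for the duration of the snapshot (it does not today) and that the util-linux `flock` command is available. The lock path and the function name are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical lock path; snapshot.sh's real lock file may live elsewhere.
LOCK_FILE="${LOCK_FILE:-/tmp/lisk-snapshot.lock}"

# Block until whoever holds an exclusive flock on the file releases it.
# Assumption: core takes the lock when snapshotting starts and drops it
# when the snapshot is finished.
wait_for_lock_release() {
  local lock_file="$1"
  local timeout="${2:-3600}"
  # flock -w gives up after $timeout seconds with a non-zero exit code.
  # `true` exits at once, so the lock is released again right after we get it.
  flock -w "$timeout" "$lock_file" true
}
```

The script would call `wait_for_lock_release "$LOCK_FILE"` in place of the log-grepping loop and then carry on with shutting core down.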
The only indication given by core that it's done snapshotting is an entry in a log file
All
Referring to the comment in the linked issue: the process should terminate when the snapshot is finished (beta.8). If that's not the case, then it's a bug.
@4miners nice idea.
The issue https://github.com/LiskHQ/lisk/issues/2075 will ensure it's happening.
Then:
The issue https://github.com/LiskHQ/lisk-scripts/issues/80 will ensure that created snapshots are working. After the Lisk Core process has finished and shut down, cold-start a new Lisk Core instance and check whether it stays in sync.
New Jenkins nightly jobs using the provided snapshotting script will then provide tested snapshots every day for Mainnet, Testnet and Betanet.
I confirmed this: after snapshotting, the process terminated successfully.
[inf] 2018-05-30 13:23:27 | Snapshot creation finished
[inf] 2018-05-30 13:23:27 | Cleaning up...
[dbg] 2018-05-30 13:23:27 | Cache - Clean up database
[dbg] 2018-05-30 13:23:27 | Cache - Quit database
[dbg] 2018-05-30 13:23:27 | Export peers to database failed: Peers list empty
[inf] 2018-05-30 13:23:27 | Cleaned up successfully
Since we use pm2 in production, it restarts the script after it exits.
pm2 start app.js -- -s 1
1|app | [inf] 2018-05-30 13:29:25 | Cleaned up successfully
PM2 | App [app] with id [1] and pid [36350], exited with code [1] via signal [SIGINT]
PM2 | Starting execution sequence in -fork mode- for app name:app id:1
PM2 | App name:app id:1 online
1|app | [dbg] 2018-05-30 13:29:26 | Cache Enabled
1|app | [inf] 2018-05-30 13:29:26 | App connected with redis server
1|app | [inf] 2018-05-30 13:29:27 | Socket Cluster ready for incoming connections
But we have a proper config file to prevent that restart... :)
https://github.com/LiskHQ/lisk-scripts/blob/252493405c8e39f103dc1b2f79b3e60a3ed14371/packaged/etc/pm2-snapshot.json#L10
Now I will look into the bash file to see what's wrong there.
I suggest the following approach. In the script lisk_snapshot.sh, where the line is:
until tail -n10 "$LOG_LOCATION" | (grep -q "Snapshot finished"); do
Instead of using logs, we should rely on our process manager, which is `pm2`. So we can use either of the following:
pm2 jlist | jq -c '.[] | select(.name | contains("lisk.snapshot")) | .pm2_env.status'
or
pm2 info lisk.snapshot | grep status | awk '{print $4}'
And use it like this:
until [ "$(pm2 info lisk.snapshot | grep status | awk '{print $4}')" = "stopped" ]; do
This gets the status of the snapshot process from pm2, so any dependency on Lisk Core logic is removed.
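As a rough sketch, the polling loop could be wrapped in a helper with a timeout, so the script cannot hang forever if the process never reaches `stopped`. The function names, the timeout and the polling interval are assumptions, not part of lisk_snapshot.sh; the status lookup relies on the same `pm2 info` table layout as the command above:

```shell
#!/usr/bin/env bash

# Print the pm2 status of the named process (e.g. "online" or "stopped").
# Assumes the status is the 4th whitespace-separated field of the "status"
# row in the `pm2 info` table, as in the command above.
snapshot_status() {
  pm2 info "$1" | awk '/status/ {print $4; exit}'
}

# Poll pm2 until the process is reported as stopped.
# Returns 1 if that has not happened after roughly $timeout seconds.
wait_for_snapshot() {
  local name="$1"
  local timeout="${2:-3600}"   # hypothetical default: one hour
  local interval="${3:-5}"     # hypothetical default: poll every 5 seconds
  local waited=0
  until [ "$(snapshot_status "$name")" = "stopped" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out waiting for $name to stop" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

A call such as `wait_for_snapshot lisk.snapshot || exit 1` would then replace the `until tail ... grep` loop.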
@Nazgolze What are your thoughts?
@nazarhussain I tested this out and it works like a charm.
Yes, it should work, as we are already using similar stuff in lisk.sh at https://github.com/LiskHQ/lisk-scripts/blob/0fc090254df42d1e7b32928b07eb3d515dd8d311/packaged/lisk.sh#L315
@MaciejBaj We can close this issue.