Lightning: Change sqlite page_size to 4096 - massive performance improvement

Created on 10 Jan 2021 · 20Comments · Source: ElementsProject/lightning

Disregard, see next reply

Issue and Steps to Reproduce

I'm having a strange issue with lightningd becoming unresponsive and taking up 100% of a cpu core. There's no output in the logs, and it's not responding to rpc commands, but I do notice that lightningd.sqlite3-journal is continuously growing.

It cleared up after 12 hours or so the first time I let it run, and I assumed it was some kind of database migration or cleanup after upgrading, but it's happening again without me doing anything.

`getinfo` output

It won't respond to getinfo, but I upgraded to the latest master as of this post.

Source

mb300sd

All 20 comments

OK, I think I figured out what is going on.

I went ahead and tried VACUUM on the database, and also read some articles about sqlite performance.

I stopped lightningd, did a PRAGMA page_size = 4096; VACUUM;, and restarted it.

Now I finally got 2 lines of

2021-01-11T03:09:47.067Z INFO xxxxxxxxxxxxxxxxxxxxxxxxxxx-chan#xxxxxx: onchaind complete, forgetting peer

and it started up. It froze on the first one before, but I didn't think anything of it until now.

The database performance with page_size=1024 is horrible. I was monitoring disk activity trying to figure out what was going on before and it was ~30kb/s for both read and write. With page_size=4096, there are no reads and ~200kb/s writes.

Since most disks have a block size of 4096 now, I assume 1024 was thrashing it with read-modify-writes.

Any reason not to make this the default? I see a massive performance improvement, from hours to minutes to forget a peer.

mb300sd on 11 Jan 2021

Thanks for the proposal @mb300sd, I have to admit I'm far from an expert in tweaking the parameters of sqlite3 and your explanation as of why 4KB pages work better makes absolute sense. I'll dig a bit into it and see if there are any downsides to increasing the page size.

May I ask how big your database file is on disk? Prolonged pauses with the tiny transactions we run is something we'd need to look into anyway.

cdecker on 11 Jan 2021

It seems like any database created with sqlite 3.12.0 onwards already has the 4096 byte page size: https://www.sqlite.org/pgszchng2016.html

What version have you compiled c-lightning with? Ubuntu 18.04 onwards are using sqlite 3.22.0 which has the larger page size already (and that's what we use for repro-builds).

cdecker on 11 Jan 2021

The database is around 2GB and pretty old (>2 years). I have 3.27.29 now, no idea what it was created with, but it should have been what ever version Debian Stretch distributed back then.

I had another long running channel closed on me because of the down time, and took the opportunity to experiment a little.

16kb pages perform even better than 4k, but with this one, it's still taking hours. This is on a 3.8GHz Intel CPU with 64GB RAM and NVMe storage. I also tweaked PRAGMA default_cache_size from 2000 to 8000 which seems to have helped a little bit. I'm basing that on the rate that the journal file is increasing in size since there's no other output to be seen.

I'm going to try to migrate to PostgreSQL today. The downtime caused by this seems to have triggered a chain reaction - some other long-term channels have closed on me and it's freezing for hours again.

Just for an idea of how big these deletes are,

SELECT count(*) FROM channel_htlcs;
1246298

mb300sd on 11 Jan 2021

Update - I've migrated the database to pgsql after working through a couple bugs in mcldsp with @fiatjaf

Killed lightningd cancelling a database transaction that was taking over an hour with sqlite, and it completed in just barely over 2 minutes with pgsql.

2021-01-12T03:17:12.211Z INFO    xxxxxxxxxxxxxxxxxxxxxxxxxxxxx-chan#100174: onchaind complete, forgetting peer                                                               
2021-01-12T03:19:16.309Z INFO .....

mb300sd on 12 Jan 2021

Perhaps a new issue should be created for general tuning of SQLite for better performance, and another for docs recommending progresql for larger routing nodes. Googling "sqlite delete slow" comes up with a lot of complaints, so I don't think even optimal tuning will provide acceptable performance for busy nodes.

mb300sd on 12 Jan 2021

SELECT count(*) FROM channel_htlcs;
1246298

Wow, that's a massive number of HTLCs processed, congrats :+1:

cdecker on 12 Jan 2021

@mb300sd If you have hints or references on how to migrate lightningd from Sqlite to PostgreSQL that you can post here - that would be great. My node is nowhere as big as yours but getting bigger every day (mostly thanks to clboss taking care of everything) and migrating to PostgreSQL makes sense from a performance/backup perspective.

hosiawak on 17 Jan 2021

@hosiawak as far as I know he used https://github.com/fiatjaf/mcldsp (after fixing some of my bugs there).

fiatjaf on 18 Jan 2021

👍1

It is generally recommended that you tune your filesystem and your database to match the pagesize of the database to the most efficient for your filesystem. For example, if you are using ZFS, it is recommended that you make a separate dataset for your SQLITE database(s), set its ZFS recordsize to 64k, and set the SQLITE pagesize to 64k as well. https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#sqlite

4k pagesize would probably work well for EXT4. For XFS, XFS has a stripewidth parameter set at filesystem creation time, which it takes from the underlying md if it exists, not certain what it gets set to if you are not on SW RAID, and you should probably tune SQLITE pageszie to same as this.

This applies to most DB engines I believe; you should match some "page size" or "allocation size" parameter of the database to a corresponding parameter in the underlying filesystem. Alternately, some DB engines can actually store directly on raw block devices, which is supposedly more beneficial.

ZmnSCPxj on 18 Jan 2021

👍1

@mb300sd are you willing to provide a copy of your SQLITE database? I could try seeing what can be done on lightningd side.

I want to experiment on a few things:

Channel HTLCs are deleted via foreign key ON DELETE CASCADE, which precludes us from doing this deletion in batches in our own code (we cannot easily change schemas in SQLITE to change the ON DELETE CASCADE that already exists in current databases to an ON DELETE NO ACTION), so we can't actually do that
- Unless we create a new channel_htlcs_2 table that has ON DELETE NO ACTION, copy the existing table to the new table, drop the old table, and rename the new table over the old table. Then we can actually perform deletes by batches, which will take longer overall (and really really really hurt db_write hook) but allows us to return to the mainloop to respond to everything occuring around us. In theory we could do DELETE ..LIMIT x but see https://sqlite.org/compile.html#enable_update_delete_limit and notice that CLBOSS got hit with this: https://github.com/ZmnSCPxj/clboss/issues/47#issuecomment-735214617
Add a specific CREATE INDEX channel_htlcs_channel_id ON channel_htlcs(channel_id);, which should help SQLITE speed up deletions. Sure there is already a UNIQUE constraint that also includes channel_id but maybe SQLITE is not smart enough to use that?
Periodically VACUUM, which should speed up SQLITE in general. On the other hand if it is a few gigabytes already then VACUUM itself could take a good chunk of time, during which we cannot respond to the rest of the world. Alternately PRAGMA auto_vacuum=on;,
Consider WAL mode for journaling, which is reputedly faster, however take note that if you are doing snapshot copies of the database as suggested in BACKUP.md you need to include the WAL file as well.

Do note that sharing the DB file leads to a privacy leak, you are revealing the history of your interaction with others and leak your node ID. I could always try using a copy of my own DB.

ZmnSCPxj on 18 Jan 2021

Add a specific CREATE INDEX channel_htlcs_channel_id ON channel_htlcs(channel_id);, which should help SQLITE speed up deletions. Sure there is already a UNIQUE constraint that also includes channel_id but maybe SQLITE is not smart enough to use that?

On a snapshot copy of my own db this does not speed up deletion.

ZmnSCPxj on 18 Jan 2021

Here is the query plan that SQLITE3 creates for a DELETE FROM peers WHERE id=?:

QUERY PLAN
|--SEARCH TABLE peers USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN TABLE channels
|--SCAN TABLE channels
|--SCAN TABLE channel_state_changes
|--SEARCH TABLE penalty_bases USING COVERING INDEX sqlite_autoindex_penalty_bases_1 (channel_id=?)
|--SEARCH TABLE channel_feerates USING COVERING INDEX sqlite_autoindex_channel_feerates_1 (channel_id=?)
|--SCAN TABLE transaction_annotations
|--SCAN TABLE channeltxs
|--SEARCH TABLE htlc_sigs USING COVERING INDEX channel_idx (channelid=?)
|--SEARCH TABLE channel_htlcs USING COVERING INDEX sqlite_autoindex_channel_htlcs_1 (channel_id=?)
|--SCAN TABLE channel_state_changes
|--SEARCH TABLE penalty_bases USING COVERING INDEX sqlite_autoindex_penalty_bases_1 (channel_id=?)
|--SEARCH TABLE channel_feerates USING COVERING INDEX sqlite_autoindex_channel_feerates_1 (channel_id=?)
|--SCAN TABLE channeltxs
|--SEARCH TABLE htlc_sigs USING COVERING INDEX channel_idx (channelid=?)
|--SEARCH TABLE channel_htlcs USING COVERING INDEX sqlite_autoindex_channel_htlcs_1 (channel_id=?)
|--SCAN TABLE forwarded_payments USING COVERING INDEX sqlite_autoindex_forwarded_payments_1
|--SEARCH TABLE forwarded_payments USING COVERING INDEX sqlite_autoindex_forwarded_payments_1 (in_htlc_id=?)
|--SCAN TABLE forwarded_payments
`--SEARCH TABLE forwarded_payments USING COVERING INDEX sqlite_autoindex_forwarded_payments_1 (in_htlc_id=?)

I believe every SCAN TABLE is a lost opportunity to speed up the deletion; our goal would be to look at each line above that has SCAN TABLE and figure out what index is missing.

For example, forwarded_payments is a fair ~6000 row table on my system (which has a mere ~150,000-row channel_htlcs). According to SQLITE the .schema forwarded_payments is:

CREATE TABLE forwarded_payments
  (  in_htlc_id INTEGER REFERENCES channel_htlcs(id) ON DELETE SET NULL,
     out_htlc_id INTEGER REFERENCES channel_htlcs(id) ON DELETE SET NULL,
     in_channel_scid INTEGER,
     out_channel_scid INTEGER,
     in_msatoshi INTEGER,
     out_msatoshi INTEGER,
     state INTEGER,
     received_time INTEGER,
     resolved_time INTEGER,
     failcode INTEGER,
     UNIQUE(in_htlc_id, out_htlc_id)
  );

So there are two foreign constraints, one on in_htlc_id the other on out_htlc_id. Let us look at the result from determining the query plan related to forwarded_payments:

|--SCAN TABLE forwarded_payments USING COVERING INDEX sqlite_autoindex_forwarded_payments_1
|--SEARCH TABLE forwarded_payments USING COVERING INDEX sqlite_autoindex_forwarded_payments_1 (in_htlc_id=?)
|--SCAN TABLE forwarded_payments
`--SEARCH TABLE forwarded_payments USING COVERING INDEX sqlite_autoindex_forwarded_payments_1 (in_htlc_id=?)

It looks like SQLITE3 is able to scan for in_htlc_id using the autoindex, which presumably was magically created by SQLITE3 from the UNIQUE specification. This suggests to me that SQLITE3 is able to index via a single column from a multi-column index if the that column is the leftmost in the multi-column index --- presumably the UNIQUE(in_htlc_id, out_htlc_id) specification creates a multi-column index and SQLITE3 is only able to use it for single columns for the leftmost in_htlc_id in the index, and not out_htlc_id.

So on my experimental copy of my database I added:

CREATE INDEX forwarded_payments_out ON forwarded_payments(out_htlc_id);

And I got these numbers:

sqlite> delete from peers where id=7;                                
Run Time: real 0.197 user 0.059066 sys 0.042177

Without the index:

sqlite> delete from peers where id=7;
Run Time: real 1.022 user 0.864742 sys 0.060958

That is a massive 80% reduction in time for removing one peer.

ZmnSCPxj on 18 Jan 2021

🚀1

@hosiawak @fiatjaf is correct. I used https://github.com/fiatjaf/mcldsp to migrate it. Was pretty smooth once we worked out some bugs. If your database is very old like mine was, there may be some leftover records causing foreign key violations, those are safe to manually delete from SQLite before migrating.

@ZmnSCPxj I don't think I'd feel comfortable giving out the database. It's quite a large node with a large amount of funds.

mb300sd on 18 Jan 2021

@mb300sd well, if you have a copy of your SQLITE3 database file still, are you willing to try this and report the time spent?

Create two copies, one named with-index.sqlite3 the other named without-index.sqlite3.
For with-index:
- sqlite3 with-index.sqlite3
- CREATE INDEX forwarded_payments_out_htlc_id ON forwarded_payments (out_htlc_id);
- PRAGMA foreign_keys=on; <- notice the plural, if you misspell foreign_keys it will not error so be careful.
- .timer on
- DELETE FROM peers ORDER BY id LIMIT 1; <- this should eventually fail with "foreign key error" but the timing should be accurate. If it complains about ORDER BY or LIMIT it means the sqlite3 is not compiled for this syntax though (it is optional) and that is a bit more complex to figure out.
- .quit
For without-index.sqlite3
- sqlite3 without-index.sqlite3
- PRAGMA foreign_keys=on;
- .timer on
- DELETE FROM peers ORDER BY id LIMIT 1;
- .quit
Delete the with-index.sqlite3 and without-index.sqlite3 to recover the space.

Totally optional though. I just need confirmation whether the proposed fix is large enough to have solved your issue; it is certainly large enough to be much more performant on the database I have, but I am just removing almost a second from a second of delete time.

ZmnSCPxj on 19 Jan 2021

I actually deleted it from the live vm, but I have daily incremental backups. Give me a few hours to run a restore into a temp vm and I'll give it a shot.

mb300sd on 19 Jan 2021

Restore went pretty quick for my RAID being degraded/rebuilding right now...

sqlite3 test-index.sqlite3
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> CREATE INDEX forwarded_payments_out_htlc_id ON forwarded_payments (out_htlc_id);
sqlite> PRAGMA foreign_keys=on;
sqlite> .timer on
sqlite> DELETE FROM peers ORDER BY id LIMIT 1;
Run Time: real 1.875 user 0.078349 sys 0.318645
Error: FOREIGN KEY constraint failed
sqlite> .quit

sqlite3 test-noindex.sqlite3
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> PRAGMA foreign_keys=on;
sqlite> .timer on
sqlite> DELETE FROM peers ORDER BY id LIMIT 1;
Run Time: real 469.100 user 260.638115 sys 208.137506
Error: FOREIGN KEY constraint failed
sqlite> .quit

mb300sd on 19 Jan 2021

Thanks, that looks like a very big decrease in time to search. Now all we have left is SQLITE3 deletion LOL (which was not actually done in the tests here due to the FOREIGN KEY constraint failing).

Going over the query plan for deleting a peer, it looks like the largest bonuses are from indexing forwarded_payments.

SQLITE3 considers foreign key support to be optional --- it is the reason why we have to give PRAGMA foreign_keys=on; each and every time we open a db with it. Thus, we can expect that SQLITE3 does not really optimize well for this support.

From what I gather elsewhere, if foreign key constraints are involved, SQLITE will delete one row at a time, then check each dependent table if that row was involved in some ON DELETE dependency. In the case of our db schema: deleting a row in peers cascades deletes in channels which cascades into channel_htlcs which cascades into forwarded_payments.

We have a low number of channels rows per peers row (usually just one, though if we close and then open to the same peer in rapid succession I expect there might be two), so SCAN TABLE there will not be costly (though as the number of peers/channels rows grows that might become uncomfortable in the future). There are also only a small number of channel_state_changes per channels row as well, so SCAN TABLE there will not be costly. transaction_annotations seems to be intended to be JOINed with channeltxs. Now, channeltxs contains primarily the closing transactions of channels, and if the channel gets closed while it has a few hundred HTLCs in-flight, that would have a few hundred rows per channels row as well, so it is something we should consider in the near future to optimize as well.

But the real cake-taker is forwarded_payments. This is because we have to store every historical HTLC --- a theft attempt could take any state in our history, meaning all those ancient HTLCs remain potential theft vectors (we could have removed this need by rotating how transactions are laid out --- put the ability to revoke first before emitting the state, which I think was first presented in the Decker-Russell-Osuntokun paper, and which will be adopted by the alternate Poon-Dryja-style that Fournier is working on). So we have to remember each and every HTLC so we can correct them.

There could be several thousand rows of channel_htlcs per channels row, and because the forwarded_payments has a dependent key, SQLITE3 will delete one at a time, then scan the forwarded_payments table for the deleted row in channel_htlcs. And since there is no index on forwarded_payments(out_htlc_id) each deleted channel_htlcs row will trigger a scan of all rows of forwarded_payments, greatly multiplying the time taken.

Now, if you are primarily operating as a forwarding node, and not doing a lot of rebalances or sends or receives of money, then for every two rows in channel_htlcs you are likely to have a row in forwarded_payments. So the one-delete-at-a-time of channel_htlcs compounded with the linear SCAN TABLE of forwarded_payments to satisfy the foreign key constraint leads to awful, awful O( n^2 ) time consumption! Creating an index on forwarded_payments(out_htlc_id) should lead to O( n log n ) time instead.

Probably the Postgres advantage is that it might auto-create an index on columns with a foreign key constraint, and possibly batching multiple deletions on the parent table so it can make fewer range queries on the index of the child table, simply because Postgres foreign key support is not an afterthought like in SQLITE3.

Thanks, looks like the root fix is simply to add the index, which is benign on Postgres and will be a massive performance boost on SQLITE3. SQLITE3 deletions will probably still be slow, but shrug that is a lot harder to fix.

ZmnSCPxj on 19 Jan 2021

@mb300sd, @ZmnSCPxj merging #4337 auto-closed this issue. Would you like me to reopen, or shall we create a separate thread for further sqlite3 optimizations?

cdecker on 19 Jan 2021

Suggest separate thread, maybe link some of the relevant text in this thread. Some of the other SCAN TABLEs in the query plan might be worth pursuing but as per my analysis the expected speed increase would be significantly lower.

I could create a short document for tweaking parameters of your SQLITE based on the underlying filesystem characteristics.

ZmnSCPxj on 19 Jan 2021

👍1

Was this page helpful?

0 / 5 - 0 ratings