Chainhammer v55 is fully automated now: benchmark parity with two lines of CLI commands!
The parity aura TPS are still not satisfying, and I am optimistic that you can find a better combination of CLI switches for parity to speed it up. Why? In Q2 I am going to publish a comparison paper, and it would be nice to have better results by then, no? Please help now with finding better CLI settings for parity. Thanks.
Current state: slower than some other clients.
Goal: comparable TPS.
Spinning up a t2.medium machine on AWS using my newest AMI is for sure the safer & easier way. Or alternatively:
git clone https://github.com/drandreaskrueger/chainhammer CH
cd CH
scripts/install.sh
and accept each step of the installation script (complex, not recommended. Use the AMI.).
networks/parity-configure-aura.sh v1.11.11
CH_TXS=10000 CH_THREADING="sequential" ./run.sh ParityAura parity
Then just wait (perhaps watch the logfile: tail -n 10 logs/network.log).
If all goes well, you are told when the experiment has ended, and you will then have a summary file in results/runs/ - which includes time series diagrams, and TPS estimates.
You first want to read the script run.sh to understand which (eight or ten) steps are executed when running one whole experiment. Then:
networks/parity-configure-aura.sh v2.2.3
CH_TXS=10000 CH_THREADING="sequential" ./run.sh ParityAura parity
... should be a bit faster than v1.11.11.
The above "sequential" is hammering transactions at parity in a simple for loop, non-async. Obviously, that is not the fastest possible way. However, unfortunately parity v2.x.y has an unsolved issue with multi-threaded sending of transactions, but you can try this with v1.11.11 where it always worked:
networks/parity-configure-aura.sh v1.11.11
CH_TXS=10000 CH_THREADING="threaded2 20" ./run.sh ParityAura parity
It uses a queue with 20 concurrent worker threads, and should result in higher TPS than the "sequential" approach above.
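For illustration only, here is a minimal sketch of that queue-plus-workers pattern, assuming web3.py, a node at http://localhost:8545, and an already unlocked first account; it is not chainhammer's actual send.py code:

```python
# Minimal sketch of the idea behind CH_THREADING="threaded2 20":
# a job queue feeding 20 worker threads that each fire eth_sendTransaction calls.
import queue, threading
from web3 import Web3, HTTPProvider

RPC_URL = "http://localhost:8545"   # assumption: local node RPC
NUM_WORKERS = 20
NUM_TX = 10000

w3 = Web3(HTTPProvider(RPC_URL))
sender = w3.eth.accounts[0]         # assumption: first account is unlocked
jobs = queue.Queue()

def worker():
    while True:
        i = jobs.get()
        if i is None:               # poison pill -> stop this worker
            break
        # each worker sends its own transaction; the node assigns nonces and queues them
        w3.eth.send_transaction({"from": sender, "to": sender,
                                 "value": 0, "gas": 21000})
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for i in range(NUM_TX):
    jobs.put(i)
jobs.join()                          # wait until all jobs are processed
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
```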
When you try to start the latter with v2.2.3 instead of v1.11.11, it might never reach its planned end, because parity very often just stops accepting new transactions, usually after a few thousand TX. The above-mentioned issue.
Then when you are out of patience, and interrupt the experiment manually, you will end up with dangling processes. This script here helps:
scripts/kill-leftovers.sh
Warning: It is rather radical, and e.g. removes all docker containers from that system, so (a) first read the script, and (b) only run it on a disposable virtualbox or cloud machine. Plus it is not 100% complete yet, so keep your eyes open for other processes that might have survived when you end the experiment manually before it has fully run through.
Have a look at the new "whole laboratory in one command" scripts run-all_large.sh and run-all_small.sh and the instructions in docs/reproduce.md#how-to-replicate-the-results.
README.md#install-and-run
docs/cloud.md#readymade-amazon-ami
networks/parity-start.sh, networks/parity-stop.sh, and networks/parity-clean.sh.
and perhaps there are remaining parity.md issues that can now be solved too, with chainhammer v55?
Hope this helps. Please keep me posted. Thanks.
Hey @drandreaskrueger,
I have re-run the results and made some modifications to the Parity flags. Below please find a detailed analysis of what was going on.
Chainhammer installation issues (Ubuntu 18.04)
There is a bunch of things missing in the installation script; afair it was:
snap install docker
That later caused issues when killing containers, because of apparmor, so I disabled it without further investigation:
systemctl stop apparmor && systemctl disable apparmor
apt install pkg-config ipython autoreconf dh-autoreconf
pip3 install ipython ipykernel
apt install secp256k1
Results?
Test machine was Scaleway's START1-M with Ubuntu 18.04.
ALL tests were run with 50k transactions and the "threaded2 20" concurrency mode.
(start1-m-Geth) Geth v1.8.14 with 50000 txs: 120.6 TPS
(start1-m-Parity-instantseal) Parity v2.3.4 with 50000 txs: 135.1 TPS
(start1-m-Parity-aura) Parity v2.3.4 with 50000 txs: 128.6 TPS
(start1-m-Quorum) Quorum v1.7.2 with 50000 txs: 138.0 TPS
Note the results were just run once, and have no statistical significance whatsoever.
I just ran them to get a rough order of magnitude and to resolve the issue.
Find the *.md files from tests attached:
results.zip
How?
Parity flags:
--nodes 4 --config aura --geth --gasprice 0 --gas-floor-target=40000000 --jsonrpc-server-threads 8 --jsonrpc-threads=0 --tx-queue-mem-limit 0 --tx-queue-per-sender 8192 --tx-queue-size 32768 --no-discovery --fast-unlock
Chainhammer modifications:
sed -i send.py s/duration=3600/duration=0/
Aura block time: 10s
Explanations
--gas-floor-target=40M
The starting gas limit in the spec files is 40M, but the default floor target is 8M, so authorities start with 40M but then keep voting the block gas limit down, and the blocks get smaller. This option prevents that.
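As a quick way to see that effect, a hedged web3.py sketch (the node URL is an assumption, not part of this comment) that prints the gas limit of the most recent blocks:

```python
# Watch whether the authorities vote the block gas limit down over time.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))
latest = w3.eth.block_number
for n in range(max(0, latest - 20), latest + 1):
    block = w3.eth.get_block(n)
    # without --gas-floor-target=40000000 this tends to drift from 40M towards 8M
    print(n, block.gasLimit)
```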
--jsonrpc-server-threads 8
The eth_sendTransaction requests are received over RPC and are sent from multiple threads asynchronously; to make this more efficient we spawn 8 RPC server threads to process them in parallel. 8 was chosen as 2*cores on my machine (by default it's 4).
--jsonrpc-threads=0
In recent versions this is a no-op, but RPC requests used to be dispatched to another thread pool for processing. In this setup that is redundant (we don't need to maximize RPC throughput): spawning additional threads and passing data between them is just a waste of time. It's fine for us to process fewer requests per second, because RPC is not the limiting factor anyway.
--tx-queue-per-sender 8192
The new transaction queue implementation (since 1.10, afair) has a different set of limits. There is a limit on the total number of transactions in the queue, but also a limit on the number of transactions from a single sender. In real-world scenarios there are not that many queued transactions from a single sender. The lack of this setting, together with an issue recently fixed in #10375, was a cause of #9582. Setting this option allows Parity to run in concurrency mode as well.
--no-discovery
We're running a closed network so it doesn't make sense to even try to discover new peers.
Lack of --force-sealing
--force-sealing is a recommended setting for aura authorities, especially if there are no transactions in the network (it gives predictable block production). However, it has the unintended consequence of all authorities producing a pending block, even if it's not their slot to seal one. This is just a waste of resources if all nodes are running on a single machine (it basically means that every pending block is processed 4 (force-sealed) + 3 (propagated) = 7 times).
--fast-unlock
And the last one, but the most important setting of all. This flag changes the way Parity Ethereum handles the JSON keyfiles and local transaction signing.
By default (for security reasons) Parity Ethereum does not store RAW wallet secrets in memory for longer than it is required to sign the payload (we store the password instead). So it means that for every incoming signing request we have to:
a) Read the keyfile from disk
b) Decrypt the keyfile with password (AES)
c) Perform key derivation (PBKDF2/Scrypt) to get the secret.
This is obviously extremely costly, but prevents storing raw secrets in memory. The flag changes the behaviour to actually cache RAW secrets instead, although it's effective only if the account is unlocked permanently (either via CLI or via unlockAccount without duration or with duration=0).
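For reference, a hedged sketch of such a permanent unlock via the personal_unlockAccount RPC, using web3.py's generic make_request (node URL and empty passphrase are assumptions; the exact handling of the duration argument may differ between client versions):

```python
# Unlock the benchmark account "forever" so --fast-unlock can cache the raw secret.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))
account = w3.eth.accounts[0]          # assumption: the benchmark sender account
passphrase = ""                       # assumption: empty dev-chain password

# duration None (or 0) = unlock indefinitely, instead of the 3600s chainhammer used before
w3.provider.make_request("personal_unlockAccount", [account, passphrase, None])
```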
Hence I had to modify the chainhammer code a bit for this setting to be effective.
Block duration: 10s
I've noticed that, since the limiting factor is signing and not block processing, the blocks were only half-full. Extending the block duration makes them better utilised, and prevents wasting too much time on their production.
Note this might have the negative consequence of lowering the TPS, since the last block is going to be half-empty and may just wait for up to 9 seconds without processing any more transactions.
Opinion
The results are pretty similar between the implementations, because transaction signing is clearly the limiting factor.
Since all implementations use exactly the same secp256k1 under the hood, the results are almost identical.
So the benchmark currently measures how fast modern processors can sign secp256k1, plus a little bit of RPC performance.
Note that since all transactions come from a single sender, importing into the queue has to be done sequentially for all clients (Quorum might be doing some batch importing thanks to their async method). This completely fails to reflect a real-world scenario, where transactions are received over the network in batches (on mainnet even in batches of 8k in one packet) and can then be verified and imported in parallel. See suggestions below.
Suggestions for improvements
I'd say it might be worth specifying more detailed objectives of what we want to compare between implementations.
Some suggestions can be found below.
Make sure that blocks are always full
Currently the results depend heavily on how the transactions get distributed over the blocks and on how long the block time is.
With longer block times the TPS goes down, because the last block is always mostly empty.
I'd suggest making sure that all blocks are full (calculate how many transactions there should be and keep the block gas limit constant), or just discarding the last block from the TPS calculation.
This explains the lower TPS on Aura vs Instantseal.
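A hedged sketch of the second option (discarding the last block), assuming web3.py and that first_block/last_block bracket the experiment; this is not chainhammer's actual TPS code:

```python
# TPS estimate that leaves the (mostly empty) last block out of the calculation.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))

def tps_without_last_block(first_block, last_block):
    # last_block is excluded by range(); assumes last_block > first_block
    blocks = [w3.eth.get_block(n) for n in range(first_block, last_block)]
    tx_count = sum(len(b.transactions) for b in blocks)
    duration = blocks[-1].timestamp - blocks[0].timestamp or 1  # avoid division by zero
    return tx_count / duration

print(tps_without_last_block(1, w3.eth.block_number))
```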
Import pre-signed transactions
To avoid testing the signing speed, we could prepare pre-signed transactions and submit them via eth_sendRawTransaction.
Unfortunately this will still be a bit sequential; perhaps we could introduce another RPC method that allows importing multiple RAW transactions at once, which would greatly improve importing speed.
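For illustration, a hedged web3.py sketch of pre-signing everything up front and only submitting via eth_sendRawTransaction afterwards (the throwaway key and node URL are assumptions; in a real run the funded benchmark account's key would be used):

```python
# Split the expensive secp256k1 signing from the part that is actually measured.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))
acct = w3.eth.account.create()        # placeholder; really: the funded benchmark account
CHAIN_ID = w3.eth.chain_id

# 1) signing phase: pre-sign all transactions (this is the costly secp256k1 part)
nonce = w3.eth.get_transaction_count(acct.address)
signed = []
for i in range(1000):
    tx = {"to": acct.address, "value": 0, "gas": 21000, "gasPrice": 0,
          "nonce": nonce + i, "chainId": CHAIN_ID}
    signed.append(w3.eth.account.sign_transaction(tx, acct.key))

# 2) submission phase: only this part is measured
for s in signed:
    # attribute is .rawTransaction or .raw_transaction depending on web3.py version
    w3.eth.send_raw_transaction(s.rawTransaction)
```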
Send transactions from multiple accounts
This would emulate the real world a bit more; that's also what transaction pools are optimized for (note that Parity was failing for you due to an incorrect per-sender limit setting: #9582).
Consider testing network behaviour, not the local node's pool and RPC
The idea for this test is to:
eth_sendTransaction or eth_sendRawTransaction or eth_sendBatchOfRawTransactions (but it's not part of the test)
Consider submitting transactions to multiple nodes
At some point the RPC might become a bottleneck; to test the nodes' communication it might be better to issue requests to multiple nodes and see how they consolidate their transaction pools.
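As a rough sketch of that idea (the node URLs are assumptions; it reuses pre-signed transactions as in the previous example):

```python
# Spread raw-transaction submissions round-robin over several nodes, so a single
# node's RPC is less likely to be the bottleneck; the pools then consolidate via p2p.
from itertools import cycle
from web3 import Web3, HTTPProvider

NODE_URLS = ["http://localhost:8545", "http://localhost:8546",
             "http://localhost:8547", "http://localhost:8548"]
nodes = cycle(Web3(HTTPProvider(url)) for url in NODE_URLS)

def submit_round_robin(signed_transactions):
    for signed in signed_transactions:
        w3 = next(nodes)
        # .rawTransaction / .raw_transaction depending on the web3.py version
        w3.eth.send_raw_transaction(signed.rawTransaction)
```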
Consider running the test code on a separate machine from the test network
Currently the test code might affect the execution (it spawns 20 threads); it might be worth separating the two.
Closing since I believe there are no issues on our side.
Hooray, you did it! Congrats!
I am very happy that you found a way - and that I kept on believing in parity. In all those 185 days since I first submitted this problem, I could not believe that parity should be slower than geth; that is why I was so persistent.
And now you have finally proven the point. Thanks a million. Very good.
In the next posts, I am going through your suggestions. I think you have mainly solved it, but there are some remaining -hopefully minor- problems.
Chainhammer installation issues (Ubuntu 18.04)
There is a bunch of things missing in the installation script, afair it was:
...
Thanks a lot.
Those issues must be Ubuntu-related, because on Debian, I have never seen them.
For now, I have simply put a note pointing to your instructions into docs/FAQ.md#install-on-Ubuntu - please click. Perhaps you could contribute those scripts? Thanks.
Great work, well done!
(start1-m-Geth) Geth v1.8.14 with 50000 txs: 120.6 TPS
(start1-m-Parity-instantseal) Parity v2.3.4 with 50000 txs: 135.1 TPS
(start1-m-Parity-aura) Parity v2.3.4 with 50000 txs: 128.6 TPS
(start1-m-Quorum) Quorum v1.7.2 with 50000 txs: 138.0 TPS
That is looking very good.
I knew it ... So parity is just not configured optimally when run out-of-the-box.
Your suggested code & CLI changes are now in this branch "issues/parity10382" https://github.com/drandreaskrueger/chainhammer/compare/issues/parity10382 - please have a look.
--nodes 4 --config aura --geth --gasprice 0 --gas-floor-target=40000000 --jsonrpc-server-threads 8 --jsonrpc-threads=0 --tx-queue-mem-limit 0 --tx-queue-per-sender 8192 --tx-queue-size 32768 --no-discovery --fast-unlock
Thanks. THAT is exactly what I had been hoping for, in the last 6 months. Great. Well done!
Those switches, together with parity version v2.3.4, seem to do the trick: parity keeps on accepting transactions, so https://github.com/paritytech/parity-ethereum/issues/9582 is probably solved. Most of the time, though not always (on a 1 CPU machine, I have still seen runs where it got stuck).
Two issues remain when trying instantseal with "threaded2 20": Once it happened that parity instantseal was stalling again, even with your new CLI switches. Plus, I now have a serious new problem: not all transactions ended up in the chain! Repeatedly, send.py reported this:
Check control sample.
Waiting for 50 transaction receipts, can possibly take a while ...
Bad: Timeout, received receipts only for 45 out of 50 sampled transactions.
Sample of 45 transactions checked ... hints at: -AT LEAST PARTIAL- FAILURE :-(
Is instantseal perhaps not sealing the very last block?
When you debug that, you can use the terminator script - then you always also see the output in send.py.log. I also explain it in the new video at minute 10.
For now, I keep instantseal with "sequential" in run-all_large.sh and run-all_small.sh, but for the aura runs it is now changed to "threaded2 20".
I assume duration=0 means "keep it open forever"?
sed -i send.py s/duration=3600/duration=0/
The syntax of the sed command is the other way around. And it is in clienttools.py not in send.py.
I have now introduced this parity-specific part into the unlockAccount() call. Thanks for that.
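Roughly, the change could look like this hedged sketch (names here are illustrative only; the real code lives in clienttools.py):

```python
# Illustrative sketch: unlock indefinitely for parity so --fast-unlock can cache
# the raw secret, keep the previous 3600s behaviour for the other clients.
def unlock_account(w3, account, passphrase, node_type):
    if node_type == "parity":
        duration = 0          # 0 / None = unlock "forever"; needed for --fast-unlock
    else:
        duration = 3600       # previous chainhammer default
    return w3.provider.make_request("personal_unlockAccount",
                                    [account, passphrase, duration])
```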
Explanations
Thanks a lot! Very helpful.
- Lack of
--force-sealing
I tried without it, but it did not work, so I have re-introduced it, keeping all your switches but adding --force-sealing at the end. See the above-mentioned branch.
--force-sealing is a recommended setting for aura authorities
Yes. Afri told me to always use that switch.
especially if there is no transactions in the network (this gives a predictable block production).
Yes, and apart from other problems that I saw today when omitting it ... I also would not be able to get the final 10 empty blocks, for better plotting of the diagrams.
However this has an unintended consequence of all authorities producing a pending block, even if it's not their slot to seal one.
oh, oops.
I suggest you change that part in your aura algorithm then.
An authority knows when it's NOT its slot, so it can simply not seal then - right?
This is just waste of resources if all nodes are running on a single machine (basically means that every pending block is processed 4 (force-sealed) + 3 (propagated) = 7 times).
I see.
Better find a way to uncouple the good effects of that switch from those unintended side effects.
--fast-unlock
Nice one, thanks.
- Block duration
10s
I've noticed that, since the limiting factor is signing and not block processing, the blocks were only half-full. Extending the block duration makes them better utilised, and prevents wasting too much time on their production.
Do you really think that would make a huge difference? (and if 10s is better, then why not directly go for e.g. 30s?)
You can now run a whole batch of experiments easily, to prove or disprove that point. Have a look here, I made that for you: scripts/run-parity-vary-blocktime.sh. I tried a bit, but quickly ran into issues with gas-full blocks, so the gas limit has to be increased, which in turn seems to cause problems when running with short block times (and ddorgan's parity-deploy provides no option to simply configure a different gas limit, so it needs patching). But that new script scripts/run-parity-vary-blocktime.sh is there now, so it should be relatively easy to get some numbers.
(Me, I have no time for testing & debugging all that right now ... sorry.).
Note this might have the negative consequence of lowering the TPS, since the last block is going to be half-empty and may just wait for up to 9 seconds without processing any more transactions.
With 2000 transactions that is a considerable difference, but with 20000 or 50000 transactions submitted, i.e. many blocks in that experiment, the difference should become negligible, right?
Plus the other clients (quorum, geth) are facing the exact same situation anyways, right?
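A back-of-the-envelope check, using the ~130 TPS order of magnitude measured above and a worst-case 9 second tail wait for the last block (numbers are illustrative only):

```python
# Rough arithmetic: how much does a ~9s idle tail (one 10s block period, mostly
# empty last block) distort the measured TPS for different experiment sizes?
TPS = 130.0        # order of magnitude from the results above
TAIL_WAIT = 9.0    # worst case: last block waits almost a full 10s period

for num_tx in (2000, 20000, 50000):
    busy_time = num_tx / TPS
    penalty = TAIL_WAIT / (busy_time + TAIL_WAIT) * 100
    print(f"{num_tx:6d} txs: ~{busy_time:5.0f}s of hammering, "
          f"worst-case tail penalty ~{penalty:4.1f}%")
# roughly: 2000 txs -> ~37% penalty, 20000 -> ~5%, 50000 -> ~2%
```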
your Opinion and ... Suggestions ...
Great thanks. I have linked to it in docs/TODO.md#other-peoples-suggestions so that it won't get forgotten.
Now even though you have found this optimization, I still do not think that you want to change the default behaviour of parity, right? What about this instead:
--gas-floor-target=40000000
--jsonrpc-server-threads 8
--jsonrpc-threads=0
--tx-queue-mem-limit 0
--tx-queue-per-sender 8192
--tx-queue-size 32768
--no-discovery
--fast-unlock
Your 8 CLI switches to configure parity optimally ... are just too many, IMHO. And finding this exact combination among the ~100 parity CLI switches ... is almost impossible for an end consumer who just wants to run a fast parity. Even though geth and quorum can perhaps be optimized further, they already run fast "out of the box", without any such clever CLI switches (and without waiting 185 days, lol).
So, my suggestion for you: What about creating "profiles" that combine many different CLI switches ?
Example: I would get all your 8 switches enabled in one go, if I simply type this:
parity --profile aura1 or parity --profile aurafasttps1 (or however you want to call it).
Then you can leave the default setup of parity as it currently is, but additionally provide a quickstart for people who want to run parity with the fastest possible TPS setup.
What do you think about that?