Chainhammer v55 is fully automated now: benchmark parity with two lines of CLI commands!
The parity aura TPS are still not satisfying, and I am optimistic that you can find a better combination of CLI switches for parity to speed it up. Why? In Q2 I am going to publish a comparison paper, and it would be nice to have better results by then, no? Please help now with finding better CLI settings for parity. Thanks.
Current state: slower than some other clients.
Goal: comparable TPS.
Spinning up a t2.medium machine on AWS using my newest AMI is for sure the safer & easier way. Or alternatively:
git clone https://github.com/drandreaskrueger/chainhammer CH
cd CH
scripts/install.sh
and accept each step of the installation script (complex, not recommended. Use the AMI.).
networks/parity-configure-aura.sh v1.11.11
CH_TXS=10000 CH_THREADING="sequential" ./run.sh ParityAura parity
Then just wait (perhaps watch the logfile: tail -n 10 logs/network.log).
If all goes well, you are told when the experiment has ended, and you will then have a summary file in results/runs/ - which includes time series diagrams, and TPS estimates.
You first want to read the script run.sh to understand which (eight or ten) steps are executed when running one whole experiment. Then:
networks/parity-configure-aura.sh v2.2.3
CH_TXS=10000 CH_THREADING="sequential" ./run.sh ParityAura parity
... should be a bit faster than v1.11.11.
The above "sequential" is hammering transactions at parity in a simple for loop, non-async. Obviously, that is not the fastest possible way. However, unfortunately parity v2.x.y has an unsolved issue with multi-threaded sending of transactions, but you can try this with v1.11.11 where it always worked:
networks/parity-configure-aura.sh v1.11.11
CH_TXS=10000 CH_THREADING="threaded2 20" ./run.sh ParityAura parity
It uses a queue with 20 concurrent worker threads, and should result in higher TPS than the "sequential" approach above.
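For illustration only, here is a minimal sketch of that queue-plus-workers pattern, assuming web3.py, a node at http://localhost:8545, and an already unlocked first account; it is not chainhammer's actual send.py code:

```python
# Minimal sketch of the idea behind CH_THREADING="threaded2 20":
# a job queue feeding 20 worker threads that each fire eth_sendTransaction calls.
import queue, threading
from web3 import Web3, HTTPProvider

RPC_URL = "http://localhost:8545"   # assumption: local node RPC
NUM_WORKERS = 20
NUM_TX = 10000

w3 = Web3(HTTPProvider(RPC_URL))
sender = w3.eth.accounts[0]         # assumption: first account is unlocked
jobs = queue.Queue()

def worker():
    while True:
        i = jobs.get()
        if i is None:               # poison pill -> stop this worker
            break
        # each worker sends its own transaction; the node assigns nonces and queues them
        w3.eth.send_transaction({"from": sender, "to": sender,
                                 "value": 0, "gas": 21000})
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for i in range(NUM_TX):
    jobs.put(i)
jobs.join()                          # wait until all jobs are processed
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
```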
When you try to start the latter with v2.2.3 instead of v1.11.11, it might never reach its planned end, because parity very often just stops accepting new transactions, usually after a few thousand TX. The above-mentioned issue.
Then when you are out of patience, and interrupt the experiment manually, you will end up with dangling processes. This script here helps:
scripts/kill-leftovers.sh
Warning: It is rather radical, and e.g. removes all docker containers from that system, so (a) first read the script, and (b) only run it on a disposable virtualbox or cloud machine. Plus it is not 100% complete yet, so keep your eyes open for other processes that might have survived when you end the experiment manually before it has fully run through.
Have a look at the new "whole laboratory in one command" scripts run-all_large.sh and run-all_small.sh and the instructions in docs/reproduce.md#how-to-replicate-the-results.
README.md#install-and-run
docs/cloud.md#readymade-amazon-ami
networks/parity-start.sh, networks/parity-stop.sh, and networks/parity-clean.sh.
and perhaps there are remaining parity.md issues that can now be solved too, with chainhammer v55?
Hope this helps. Please keep me posted. Thanks.
Hey @drandreaskrueger,
I have re-run the results and made some modifications to the Parity flags. Below please find a detailed analysis of what was going on.
Chainhammer installation issues (Ubuntu 18.04)
There is a bunch of things missing in the installation script; afair it was:
snap install docker
That later caused issues when killing containers, because of apparmor, so I disabled it without further investigation:
systemctl stop apparmor && systemctl disable apparmor
apt install pkg-config ipython autoreconf dh-autoreconf
pip3 install ipython ipykernel
apt install secp256k1
Results?
Test machine was Scaleway's START1-M with Ubuntu 18.04.
ALL tests were run with 50k transactions and the "threaded2 20" concurrency mode.
(start1-m-Geth) Geth v1.8.14 with 50000 txs: 120.6 TPS
(start1-m-Parity-instantseal) Parity v2.3.4 with 50000 txs: 135.1 TPS
(start1-m-Parity-aura) Parity v2.3.4 with 50000 txs: 128.6 TPS
(start1-m-Quorum) Quorum v1.7.2 with 50000 txs: 138.0 TPS
Note the results were just run once, and have no statistical significance whatsoever.
I just ran them to get a rough order of magnitude and to resolve the issue.
Find the *.md files from tests attached:
results.zip
How?
Parity flags:
--nodes 4 --config aura --geth --gasprice 0 --gas-floor-target=40000000 --jsonrpc-server-threads 8 --jsonrpc-threads=0 --tx-queue-mem-limit 0 --tx-queue-per-sender 8192 --tx-queue-size 32768 --no-discovery --fast-unlock
Chainhammer modifications:
sed -i send.py s/duration=3600/duration=0/
Aura block time: 10s
Explanations
--gas-floor-target=40M
The starting gas limit in the spec files is 40M, but the default floor target is 8M, so authorities start with 40M but then keep voting the block gas limit down, and the blocks get smaller. This option prevents that.
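As a quick way to see that effect, a hedged web3.py sketch (the node URL is an assumption, not part of this comment) that prints the gas limit of the most recent blocks:

```python
# Watch whether the authorities vote the block gas limit down over time.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))
latest = w3.eth.block_number
for n in range(max(0, latest - 20), latest + 1):
    block = w3.eth.get_block(n)
    # without --gas-floor-target=40000000 this tends to drift from 40M towards 8M
    print(n, block.gasLimit)
```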
--jsonrpc-server-threads 8
The eth_sendTransaction requests are received over RPC and are sent from multiple threads asynchronously; to make this more efficient we spawn 8 RPC server threads to process them in parallel. 8 was chosen as 2*cores on my machine (by default it's 4).
--jsonrpc-threads=0
In recent versions this is a no-op, but RPC requests used to be dispatched to another thread pool for processing. In this setup that is redundant (we don't need to maximize RPC throughput): spawning additional threads and passing data between them is just a waste of time. It's fine for us to process fewer requests per second, because RPC is not the limiting factor anyway.
--tx-queue-per-sender 8192
The new transaction queue implementation (since 1.10, afair) has a different set of limits. There is a limit on the total number of transactions in the queue, but also a limit on the number of transactions from a single sender. In real-world scenarios there are not that many queued transactions from a single sender. The lack of this setting, together with an issue recently fixed in #10375, was a cause of #9582. Setting this option allows Parity to run in concurrency mode as well.
--no-discovery
We're running a closed network so it doesn't make sense to even try to discover new peers.
Lack of --force-sealing
--force-sealing is a recommended setting for aura authorities, especially if there are no transactions in the network (it gives predictable block production). However, it has the unintended consequence of all authorities producing a pending block, even if it's not their slot to seal one. This is just a waste of resources if all nodes are running on a single machine (it basically means that every pending block is processed 4 (force-sealed) + 3 (propagated) = 7 times).
--fast-unlock
And the last one, but the most important setting of all. This flag changes the way Parity Ethereum handles the JSON keyfiles and local transaction signing.
By default (for security reasons) Parity Ethereum does not store RAW wallet secrets in memory for longer than it is required to sign the payload (we store the password instead). So it means that for every incoming signing request we have to:
a) Read the keyfile from disk
b) Decrypt the keyfile with password (AES)
c) Perform key derivation (PBKDF2/Scrypt) to get the secret.
This is obviously extremely costly, but prevents storing raw secrets in memory. The flag changes the behaviour to actually cache RAW secrets instead, although it's effective only if the account is unlocked permanently (either via CLI or via unlockAccount without duration or with duration=0).
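For reference, a hedged sketch of such a permanent unlock via the personal_unlockAccount RPC, using web3.py's generic make_request (node URL and empty passphrase are assumptions; the exact handling of the duration argument may differ between client versions):

```python
# Unlock the benchmark account "forever" so --fast-unlock can cache the raw secret.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))
account = w3.eth.accounts[0]          # assumption: the benchmark sender account
passphrase = ""                       # assumption: empty dev-chain password

# duration None (or 0) = unlock indefinitely, instead of the 3600s chainhammer used before
w3.provider.make_request("personal_unlockAccount", [account, passphrase, None])
```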
Hence I had to modify the chainhammer code a bit for this setting to be effective.
Block duration: 10s
I've noticed that, since the limiting factor is signing and not block processing, the blocks were only half-full. Extending the block duration makes them better utilised, and prevents wasting too much time on their production.
Note this might have the negative consequence of lowering the TPS, since the last block is going to be half-empty and may just wait for up to 9 seconds without processing any more transactions.
Opinion
The results are pretty similar between the implementations, because transaction signing is clearly the limiting factor.
Since all implementations use exactly the same secp256k1 under the hood, the results are almost identical.
So the benchmark currently measures how fast modern processors can sign secp256k1, plus a little bit of RPC performance.
Note that since all transactions come from a single sender, importing into the queue has to be done sequentially for all clients (Quorum might be doing some batch importing thanks to their async method). This completely fails to reflect a real-world scenario, where transactions are received over the network in batches (on mainnet even in batches of 8k in one packet) and can then be verified and imported in parallel. See suggestions below.
Suggestions for improvements
I'd say it might be worth specifying more detailed objectives of what we want to compare between implementations.
Some suggestions can be found below.
Make sure that blocks are always full
Currently the results depend heavily on how the transactions get distributed over the blocks and on how long the block time is.
With longer block times the TPS goes down, because the last block is always mostly empty.
I'd suggest making sure that all blocks are full (calculate how many transactions there should be and keep the block gas limit constant), or just discarding the last block from the TPS calculation.
This explains the lower TPS on Aura vs Instantseal.
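A hedged sketch of the second option (discarding the last block), assuming web3.py and that first_block/last_block bracket the experiment; this is not chainhammer's actual TPS code:

```python
# TPS estimate that leaves the (mostly empty) last block out of the calculation.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))

def tps_without_last_block(first_block, last_block):
    # last_block is excluded by range(); assumes last_block > first_block
    blocks = [w3.eth.get_block(n) for n in range(first_block, last_block)]
    tx_count = sum(len(b.transactions) for b in blocks)
    duration = blocks[-1].timestamp - blocks[0].timestamp or 1  # avoid division by zero
    return tx_count / duration

print(tps_without_last_block(1, w3.eth.block_number))
```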
Import pre-signed transactions
To avoid testing the signing speed, we could prepare pre-signed transactions and submit them via eth_sendRawTransaction.
Unfortunately this will still be a bit sequential; perhaps we could introduce another RPC method that allows importing multiple RAW transactions at once, which would greatly improve importing speed.
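For illustration, a hedged web3.py sketch of pre-signing everything up front and only submitting via eth_sendRawTransaction afterwards (the throwaway key and node URL are assumptions; in a real run the funded benchmark account's key would be used):

```python
# Split the expensive secp256k1 signing from the part that is actually measured.
from web3 import Web3, HTTPProvider

w3 = Web3(HTTPProvider("http://localhost:8545"))
acct = w3.eth.account.create()        # placeholder; really: the funded benchmark account
CHAIN_ID = w3.eth.chain_id

# 1) signing phase: pre-sign all transactions (this is the costly secp256k1 part)
nonce = w3.eth.get_transaction_count(acct.address)
signed = []
for i in range(1000):
    tx = {"to": acct.address, "value": 0, "gas": 21000, "gasPrice": 0,
          "nonce": nonce + i, "chainId": CHAIN_ID}
    signed.append(w3.eth.account.sign_transaction(tx, acct.key))

# 2) submission phase: only this part is measured
for s in signed:
    # attribute is .rawTransaction or .raw_transaction depending on web3.py version
    w3.eth.send_raw_transaction(s.rawTransaction)
```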
Send transactions from multiple accounts
This would emulate the real world a bit more; that's also what transaction pools are optimized for (note that Parity was failing for you due to an incorrect per-sender limit setting: #9582).
Consider testing network behaviour, not the local node's pool and RPC
The idea for this test is to:
eth_sendTransaction or eth_sendRawTransaction or eth_sendBatchOfRawTransactions (but it's not part of the test)
Consider submitting transactions to multiple nodes
At some point the RPC might become a bottleneck; to test the nodes' communication it might be better to issue requests to multiple nodes and see how they consolidate their transaction pools.
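As a rough sketch of that idea (the node URLs are assumptions; it reuses pre-signed transactions as in the previous example):

```python
# Spread raw-transaction submissions round-robin over several nodes, so a single
# node's RPC is less likely to be the bottleneck; the pools then consolidate via p2p.
from itertools import cycle
from web3 import Web3, HTTPProvider

NODE_URLS = ["http://localhost:8545", "http://localhost:8546",
             "http://localhost:8547", "http://localhost:8548"]
nodes = cycle(Web3(HTTPProvider(url)) for url in NODE_URLS)

def submit_round_robin(signed_transactions):
    for signed in signed_transactions:
        w3 = next(nodes)
        # .rawTransaction / .raw_transaction depending on the web3.py version
        w3.eth.send_raw_transaction(signed.rawTransaction)
```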
Consider running the test code on a separate machine from the test network
Currently the test code might affect the execution (it spawns 20 threads); it might be worth separating the two.
Closing since I believe there are no issues on our side.
Hooray, you did it! Congrats!
I am very happy that you found a way - and that I kept on believing in parity. In all those 185 days since I first submitted this problem, I could not believe that parity should be slower than geth; that is why I was so persistent.
And now you have finally proven the point. Thanks a million. Very good.
In the next posts, I am going through your suggestions. I think you have mainly solved it, but there are some remaining -hopefully minor- problems.
Chainhammer installation issues (Ubuntu 18.04)
There is a bunch of things missing in the installation script, afair it was:
...
Thanks a lot.
Those issues must be Ubuntu-related, because on Debian, I have never seen them.
For now, I have simply put a note pointing to your instructions into docs/FAQ.md#install-on-Ubuntu - please click. Perhaps you could contribute those scripts? Thanks.
Great work, well done!
(start1-m-Geth) Geth v1.8.14 with 50000 txs: 120.6 TPS
(start1-m-Parity-instantseal) Parity v2.3.4 with 50000 txs: 135.1 TPS
(start1-m-Parity-aura) Parity v2.3.4 with 50000 txs: 128.6 TPS
(start1-m-Quorum) Quorum v1.7.2 with 50000 txs: 138.0 TPS
That is looking very good.
I knew it ... So parity is just not configured optimally when run out-of-the-box.
Your suggested code & CLI changes are now in this branch "issues/parity10382" https://github.com/drandreaskrueger/chainhammer/compare/issues/parity10382 - please have a look.
--nodes 4 --config aura --geth --gasprice 0 --gas-floor-target=40000000 --jsonrpc-server-threads 8 --jsonrpc-threads=0 --tx-queue-mem-limit 0 --tx-queue-per-sender 8192 --tx-queue-size 32768 --no-discovery --fast-unlock
Thanks. THAT is exactly what I had been hoping for, in the last 6 months. Great. Well done!
Those switches, together with parity version v2.3.4, seem to do the trick: parity keeps on accepting transactions, so https://github.com/paritytech/parity-ethereum/issues/9582 is probably solved. Most of the time, though not always (on a 1 CPU machine, I have still seen runs where it got stuck).
Two issues remain when trying instantseal with "threaded2 20": Once it happened that parity instantseal was stalling again, even with your new CLI switches. Plus, I now have a serious new problem: not all transactions ended up in the chain! Repeatedly, send.py reported this:
Check control sample.
Waiting for 50 transaction receipts, can possibly take a while ...
Bad: Timeout, received receipts only for 45 out of 50 sampled transactions.
Sample of 45 transactions checked ... hints at: -AT LEAST PARTIAL- FAILURE :-(
Is instantseal perhaps not sealing the very last block?
When you debug that, you can use the terminator script - then you always also see the output in send.py.log. I also explain it in the new video at minute 10.
For now, I keep instantseal with "sequential" in run-all_large.sh and run-all_small.sh, but for the aura runs it is now changed to "threaded2 20".
I assume duration=0 means "keep it open forever"?
sed -i send.py s/duration=3600/duration=0/
The syntax of the sed command is the other way around. And it is in clienttools.py not in send.py.
I have now introduced this parity-specific part into the unlockAccount() call. Thanks for that.
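Roughly, the change could look like this hedged sketch (names here are illustrative only; the real code lives in clienttools.py):

```python
# Illustrative sketch: unlock indefinitely for parity so --fast-unlock can cache
# the raw secret, keep the previous 3600s behaviour for the other clients.
def unlock_account(w3, account, passphrase, node_type):
    if node_type == "parity":
        duration = 0          # 0 / None = unlock "forever"; needed for --fast-unlock
    else:
        duration = 3600       # previous chainhammer default
    return w3.provider.make_request("personal_unlockAccount",
                                    [account, passphrase, duration])
```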
Explanations
Thanks a lot! Very helpful.
- Lack of
--force-sealing
I tried without it, but it did not work, so I have re-introduced it, keeping all your switches but adding --force-sealing at the end. See the above-mentioned branch.
--force-sealing is a recommended setting for aura authorities
Yes. Afri told me to always use that switch.
especially if there is no transactions in the network (this gives a predictable block production).
Yes, and apart from other problems that I saw today when omitting it ... I also would not be able to get the final 10 empty blocks, for better plotting of the diagrams.
However this has an unintended consequence of all authorities producing a pending block, even if it's not their slot to seal one.
oh, oops.
I suggest you change that part in your aura algorithm then.
An authority knows when it's NOT its slot, so it can simply not seal then - right?
This is just waste of resources if all nodes are running on a single machine (basically means that every pending block is processed 4 (force-sealed) + 3 (propagated) = 7 times).
I see.
Better find a way to uncouple the good effects of that switch from those unintended side effects.
--fast-unlock
Nice one, thanks.
- Block duration
10s
I've noticed that, since the limiting factor is signing and not block processing, the blocks were only half-full. Extending the block duration makes them better utilised, and prevents wasting too much time on their production.
Do you really think that would make a huge difference? (and if 10s is better, then why not directly go for e.g. 30s?)
You can now run a whole batch of experiments easily, to prove or disprove that point. Have a look here, I made that for you: scripts/run-parity-vary-blocktime.sh. I tried a bit, but quickly ran into issues with gas-full blocks, so the gas limit has to be increased, which in turn seems to cause problems when running with short block times (and ddorgan's parity-deploy provides no option to simply configure a different gas limit, so it needs patching). But that new script scripts/run-parity-vary-blocktime.sh is there now, so it should be relatively easy to get some numbers.
(Me, I have no time for testing & debugging all that right now ... sorry.).
Note this might have the negative consequence of lowering the TPS, since the last block is going to be half-empty and may just wait for up to 9 seconds without processing any more transactions.
With 2000 transactions that is a considerable difference, but with 20000 or 50000 transactions submitted, i.e. many blocks in that experiment, the difference should become negligible, right?
Plus the other clients (quorum, geth) are facing the exact same situation anyways, right?
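A back-of-the-envelope check, using the ~130 TPS order of magnitude measured above and a worst-case 9 second tail wait for the last block (numbers are illustrative only):

```python
# Rough arithmetic: how much does a ~9s idle tail (one 10s block period, mostly
# empty last block) distort the measured TPS for different experiment sizes?
TPS = 130.0        # order of magnitude from the results above
TAIL_WAIT = 9.0    # worst case: last block waits almost a full 10s period

for num_tx in (2000, 20000, 50000):
    busy_time = num_tx / TPS
    penalty = TAIL_WAIT / (busy_time + TAIL_WAIT) * 100
    print(f"{num_tx:6d} txs: ~{busy_time:5.0f}s of hammering, "
          f"worst-case tail penalty ~{penalty:4.1f}%")
# roughly: 2000 txs -> ~37% penalty, 20000 -> ~5%, 50000 -> ~2%
```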
your Opinion and ... Suggestions ...
Great thanks. I have linked to it in docs/TODO.md#other-peoples-suggestions so that it won't get forgotten.
Now even though you have found this optimization, I still do not think that you want to change the default behaviour of parity, right? What about this instead:
--gas-floor-target=40000000
--jsonrpc-server-threads 8
--jsonrpc-threads=0
--tx-queue-mem-limit 0
--tx-queue-per-sender 8192
--tx-queue-size 32768
--no-discovery
--fast-unlock
Your 8 CLI switches to configure parity optimally ... are just too many, IMHO. And finding this exact combination among the ~100 parity CLI switches ... is almost impossible for an end consumer who just wants to run a fast parity. Even though geth and quorum can perhaps be optimized further, they already run fast "out of the box", without any such clever CLI switches (and without waiting 185 days, lol).
So, my suggestion for you: What about creating "profiles" that combine many different CLI switches ?
Example: I would get all your 8 switches enabled in one go, if I simply type this:
parity --profile aura1 or parity --profile aurafasttps1 (or however you want to call it).
Then you can leave the default setup of parity as it currently is, but additionally provide a quickstart for people who want to run parity with the fastest possible TPS setup.
What do you think about that?