Diem: Significant perf regression in throughput as measured by cluster test.

Created on 11 Dec 2019 · 4 Comments · Source: diem/diem

@ankushagarwal found out that the regression starts with PR #1948.
Background: PR #1948 changes which epoch the last version of epoch n maps to, from n+1 (which was bad for a number of reasons) to n.
That implies that version 0 now belongs to epoch 0.

My current theory is that something is wrong with the client requests that causes the server to send the epoch change proofs for every update_to_latest_ledger request. For example, if the client version is not set and defaults to 0, every update_to_latest_ledger request would now cause the response to carry the epoch change proofs.
I updated the client version in the tx emitter to always be 1 (PR #1983), but @ankushagarwal verified that it didn't help. Going to look further.
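
To make the theory concrete, here is a minimal sketch of the boundary mapping and the "does this response need an epoch change proof" check. Everything in it is an assumption for illustration: `BoundaryMapping`, `epoch_of`, `needs_epoch_change_proof` and the `epoch_ends` slice are hypothetical names, not the actual storage code. It assumes version 0 is the last version of epoch 0 and that a proof is attached whenever the client's known version falls in an earlier epoch than the latest version.

```rust
/// How the last version of an epoch is assigned to an epoch number.
/// (Hypothetical model, not the real implementation.)
#[derive(Clone, Copy)]
enum BoundaryMapping {
    /// Before PR #1948: the last version of epoch n is reported as epoch n + 1.
    NextEpoch,
    /// After PR #1948: the last version of epoch n is reported as epoch n.
    SameEpoch,
}

/// `epoch_ends[n]` is assumed to hold the last version of epoch n.
/// The epoch of `version` is the number of epoch boundaries it has passed.
fn epoch_of(version: u64, epoch_ends: &[u64], mapping: BoundaryMapping) -> u64 {
    epoch_ends
        .iter()
        .filter(|&&end| match mapping {
            // A boundary version already counts as belonging to the next epoch.
            BoundaryMapping::NextEpoch => end <= version,
            // A boundary version still belongs to the epoch it closes.
            BoundaryMapping::SameEpoch => end < version,
        })
        .count() as u64
}

/// Assumed rule: a proof is needed when the client is in an older epoch.
fn needs_epoch_change_proof(
    client_known_version: u64,
    latest_version: u64,
    epoch_ends: &[u64],
    mapping: BoundaryMapping,
) -> bool {
    epoch_of(client_known_version, epoch_ends, mapping)
        < epoch_of(latest_version, epoch_ends, mapping)
}

fn main() {
    // Assumption: version 0 is the last version of epoch 0.
    let epoch_ends = [0u64];
    let latest = 1_000;
    for (label, mapping) in [
        ("before #1948", BoundaryMapping::NextEpoch),
        ("after #1948", BoundaryMapping::SameEpoch),
    ] {
        for known in [0u64, 1] {
            println!(
                "{label}: client_known_version={known} -> proof needed: {}",
                needs_epoch_change_proof(known, latest, &epoch_ends, mapping)
            );
        }
    }
}
```

Under these assumptions, a request with a known version of 0 only starts pulling the epoch change proof after the mapping change, while a known version of 1 avoids it. That is why bumping the client version looked like a plausible fix.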

cluster_test perf

All 4 comments

@ankushagarwal deployed some logging to the server: we see that while some client requests are sending the known version of 1 (probably related to the fix in tx_emitter), most of the requests are still sent with client_known_version of 0.

Oh, the majority of the update_to_latest_ledger requests are coming from the VM validator: the submit_transaction_to_mempool function calls the VM validator's validate_transaction function, which just sets client_known_version to 0 when updating to the latest ledger.
This approach worked fine prior to reconfiguration, but doesn't work in the multi-epoch world.

Generally, the validate_transaction function looks suspicious: it invokes an update_to_latest_ledger request for every incoming txn, but doesn't do anything with the returned proofs (and doesn't maintain a trusted validator set to verify the LedgerInfo anyway).

@wqfish implemented a lightweight storage API for the specific requests currently required by mempool in #2016 and #2024 (currently closed). Fingers crossed it's going to solve the perf regression.
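
For readers following along, a rough sketch of why a lightweight read path helps. This is not the API from #2016 or #2024: `MockStorage`, `ProofCarryingResponse`, `latest_sequence_number` and `mock_storage` are hypothetical names, the mock `update_to_latest_ledger` only borrows the request name from this thread, and "attach a proof whenever client_known_version is 0" is a deliberate simplification of the behavior described above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a proof-carrying response.
struct ProofCarryingResponse {
    sequence_number: Option<u64>,
    /// Placeholder proof payload; the caller in this scenario never verifies it.
    proof_bytes: Vec<u8>,
}

/// Mock storage, only meant to contrast the two access patterns.
struct MockStorage {
    sequence_numbers: HashMap<String, u64>,
    /// Assumed size of the epoch change proof attached to a response.
    epoch_proof_size: usize,
}

impl MockStorage {
    /// Heavy path: always builds a proof-carrying response.
    /// Simplification: a known version of 0 always pulls the epoch change proof.
    fn update_to_latest_ledger(
        &self,
        client_known_version: u64,
        account: &str,
    ) -> ProofCarryingResponse {
        let proof_len = if client_known_version == 0 {
            self.epoch_proof_size
        } else {
            0
        };
        ProofCarryingResponse {
            sequence_number: self.sequence_numbers.get(account).copied(),
            proof_bytes: vec![0u8; proof_len],
        }
    }

    /// Lightweight path: returns only the value the caller needs, no proofs.
    fn latest_sequence_number(&self, account: &str) -> Option<u64> {
        self.sequence_numbers.get(account).copied()
    }
}

/// Builds a small mock with one account (made-up values).
fn mock_storage() -> MockStorage {
    let mut sequence_numbers = HashMap::new();
    sequence_numbers.insert("alice".to_string(), 7);
    MockStorage {
        sequence_numbers,
        epoch_proof_size: 4096,
    }
}

fn main() {
    let storage = mock_storage();
    let heavy = storage.update_to_latest_ledger(0, "alice");
    let light = storage.latest_sequence_number("alice");
    println!(
        "heavy path: seq={:?}, discarded proof bytes={}",
        heavy.sequence_number,
        heavy.proof_bytes.len()
    );
    println!("light path: seq={:?}", light);
}
```

Both paths give the caller the same sequence number; the difference is the proof payload that the heavy path builds and ships per transaction only for it to be thrown away.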

Perf is back! Closing the task.
