Again relates to this: https://github.com/AcalaNetwork/Acala/issues/135
Now it happens on every full node.
Here are my observations:
This happens on all nodes including other validators.
It should record this block is bad and ban it, instead of keep trying to import it and not able to do anything.
Some more logs
https://gist.github.com/xlc/8395857944b02fdad71c811e90daaded
The root cause be more serious as why a invalid block was produced by a honest node. The transaction is sent by one of our bot and it was always working (i.e. sending valid transaction).
The troublesome extrinsic:
0x310284ff9e22b64c980329ada2b46a783623bcf1f1d0418f6a2b5fbfb7fb68dbac5abf0f01b02802078d3fb8fb7d48170ec21129abe4f53c02da03aa5e0386384fb35dce6e1257c02d048f22d99cfeca9c5f58232c30d2519fc9c25b390dc76642145f6c808501311200190003b233dc04b300540c00000000000000000158ed63772ec9c5bec501000000000000
Signature:
0xb02802078d3fb8fb7d48170ec21129abe4f53c02da03aa5e0386384fb35dce6e1257c02d048f22d99cfeca9c5f58232c30d2519fc9c25b390dc76642145f6c80
Signer:
5Fe3jZRbKes6aeuQ6HkcTvQeNhkkRPTXBwmNkuAPoimGEv45
What I don't understand is that some validators accepts the signature. If that's the case, it indicates some serious issue on sr25519.
We changed sr25519 11 months ago in April 2019 to set the high bit of byte 31 of a signature, which marks the signature as sr25519 to speed up batch verification when used along side ed25519.
Your byte 31 is 6e 12 so you're using sr25519 from before April 2019. It should verify if you change that 12 to a 92 I think.
We delayed yanking schnorrkel 0.1.1 for testnet reasons, but its yanked now. If anyone knows any method to actually purge the old code from crates.io then please let me know.
Does polkadot.js use the new or old signature? If it is using the new version, which polkadot.js version we should be using? @jacogr
It only generates new signatures (0.8+) as of the polkadot-js API 1.0 version, as per the CHANGELOG. 0.100, i.e. pre 1.0, was the last version with old signature support. To do this it has to rely on @polkadot/util v2+ & @polkadot/wasm v1 which uses schnorrkel 0.8.5. (Anything older than that is 0.1.1 sigs)
https://github.com/polkadot-js/common/releases/tag/v2.0.1 (first entry)
https://github.com/polkadot-js/wasm/releases/tag/v0.20.1 (first entry, was renamed to 1.0 since the interfaces are stable in the current iteration)
However, Substrate master still does the old "deprecated pre-audit verify" last time I checked, so both 0.1.1 and 0.8+ would be accepted. (That PR has not gone in yet)
Thanks @jacogr
I have reviewed out dependencies and pretty sure we are not using the old version.
The transaction is generated by our oracle server and the dependencies can be found here:
https://github.com/laminar-protocol/oracle-server/blob/master/yarn.lock
I also checked our Cargo.lock file and it is using the new version of shnorrkel
https://github.com/AcalaNetwork/Acala/blob/0.3.2/Cargo.lock
It's always possible I missread the hex 64 byte signature you posted there, like maybe someone does something else funky, or used an ed25519 signature where you expect sr25519. Are you sure it's not ed25519 being used here by mistake? Or sr25519 being verified by mistake?
Just fyi, the "deprecated pre-audit verify" option was a more recent change, like last August after the audit. I'll make the first schnorrkel release without pre-audit support this week, which I suppose should be 0.9.* not 0.8.6. I've yanked all schnorrkel version that create incompatible pre-audit signatures.
I am sure it is a sr25519 signature. I am not sure if it is possible for sr25519 being verified by mistake because that implies some critical issue in block authoring logic.
We are constantly updating with Substrate & polkadot.js version so if anything goes wrong, it is more likely introduced in recent substrate, rather than we using outdated version.
Substrate still verifies 0.1.1 as well as 0.8.5 signatures, so it is not a version mismatch. (And decoding the full extrinsics does show that the sig is pre-pended by a 01 which is the enum for sr25519).
The signature could well be valid and the payload cannot be properly constructed, not all info signed is in the tx, so it is not trivial to grab the extrinsic and re-construct the payload. (Do-able, but mortal eras may create some issues)
Signature payload logged by crashing node: [25, 0, 3, 178, 51, 220, 4, 179, 0, 84, 12, 0, 0, 0, 0, 0, 0, 0, 0, 1, 88, 237, 99, 119, 46, 201, 197, 190, 197, 1, 0, 0, 0, 0, 0, 0, 133, 1, 49, 18, 0, 45, 1, 0, 0, 183, 188, 237, 202, 60, 214, 85, 69, 177, 135, 204, 40, 135, 212, 162, 120, 176, 7, 196, 193, 130, 30, 48, 10, 9, 242, 26, 148, 84, 243, 132, 255, 77, 75, 102, 129, 161, 28, 226, 209, 152, 150, 236, 24, 16, 131, 217, 18, 207, 136, 40, 53, 5, 7, 188, 179, 186, 212, 134, 216, 199, 152, 123, 67]
This PR might be related https://github.com/paritytech/substrate/pull/4916
We don't check signatures since then when producing blocks, because we check them all in transaction pool. In your setup, is it possible that block was built using transaction not from transaction pool somehow?
Looking on your code the answer is probably no, but I might miss something.
We have offchain workers submitting unsigned transactions and I initially was thinking it was the culprit of the bug but unable to find any evidences.
But regardless the root cause, Substrate should definitely handle validator producing bad blocks. Otherwise one malicious validator can easily bring the whole network down.
Does Substrate slash block producer producing bad blocks? I think it detects for equivalence but what happen if validator only produce one bad block when it is their turn?
As written in Riot, cargo run --release just synced 2k blocks and than I was already at the tip. Did you restart your network already?
If we can not even reproduce it, we will need to close this issue. However, I really would like to look into it to make sure nothing is broken in Substrate.
Yes we have deployed a new testnet.
Please use this commit to reproduce the issue: https://github.com/AcalaNetwork/Acala/tree/081449d0e33f8ed43937a16867f6a80dece3a5e3
@arkpar we also have seen this behavior in Kusama. Couldn't we ban a block when it failed to import X times from Y different nodes? Banning should only be temporary (maximum the runtime of the node), but could prevent that a bad block brings down a network. Or at least don't try to fetch the block that fast again.
There was code for tracking bad blocks initially. Looks like it was lost in one of the refactorings.
@xlc I synced to the tip with the given commit you said, it stopped at 107276, but I did not see any problems :(
@bkchr can you try add --reserved-nodes /dns4/libp2p.rz.onfinality.io/tcp/29466/p2p/QmPkQQky7dPprrP111EwyReGeJwDorRrFRfDgSw2h9991p
--reserved-nodes /dns4/libp2p.jm.onfinality.io/tcp/26442/p2p/QmULeoVyuuSXNLjwxughsw7hk2VQsBKhZ8tFLymwHx3H9S?
This should make sure you are connected to one of the validator that have 107277.
I will try to reproduce it again just to make sure the above line still works.
Yeah, that works. (I see the bad transaction)
Any idea why these bad block happen?
So, I looked into this and can not find anything that seems suspicious. IMHO there are only 2 possibilities:
There is an option to include an invalid transaction now (since we are not checking signatures during block authoring) if there is a runtime upgrade changing the signature verification in the meantime. Imagine a situation, where:
Tx1 to the pool at block 5. We know it's was reported valid at that block.6 and renders Tx1 invalid now.7 and import blindly Tx1 assuming it's valid, since it's in the pool already.That's a very specific case and I don't know if it applies to your chain though. Potentially we should force revalidation of all transactions in the pool in case of runtime upgrade, or check that the validity information is provided by the recent runtime version and if not, do full checks when importing to a pending block.
Seems that CheckSignature::No option in UncheckedExtrinsic is also skipping construction of SignedPayload, which in turn triggers additional_signed() call that may fail.
Within that call we for instance check Era validity:
https://github.com/paritytech/substrate/blob/3ca7ccef985407105d525b3ca2d1a79a820f3318/frame/system/src/lib.rs#L1387
so it should most likely be fixed too, but speaking with @bkchr seems it's not happening for that particular case.
Seems that also during check we are doing a account index to account lookup:
https://github.com/paritytech/substrate/blob/3ca7ccef985407105d525b3ca2d1a79a820f3318/primitives/runtime/src/generic/unchecked_extrinsic.rs#L127
If that changes in between validation for transaction pool and actual execution it would render the signature invalid. Overall I think skipping signature checks when authoring blocks is a pretty bad idea :/
Yeah, I will probably change it to the batching/parallelization like I was going to from the beginning.
Meanwhile, it's easy to not use this, just replace
https://github.com/AcalaNetwork/Acala/blob/f4ca6e0a879d661e8a9a1327f3f26a09710d478a/runtime/src/lib.rs#L709
with apply_extrinsic instead of apply_trusted_extrinsic
I'm working on reverting the pr that introduced it.
I believe I have a theory of how this can happen.
For some reason our testnet have ~10% fork rate (which we want to avoid on mainnet, but happy for testnet as it helps to catch fork related bugs).
The transaction payload includes mortality, and signature payload includes the block hash of the birth block.
When fork happens, the block hash of the birth block could change, and hence invalid signature.
But at the time tx pool see the transaction, it is still valid, and because signature wasn't revalidated during block production, this issue happens. Yeah skip signature validation during block production is a bad idea.
I will leave this issue open for now as a malformed block shouldn't bring the network down.
Is the birth block when the transaction got created? We've account counters to address transaction freshness I thought? What does including some block hash achieve?
tx mortality.
Most helpful comment
There is an option to include an invalid transaction now (since we are not checking signatures during block authoring) if there is a runtime upgrade changing the signature verification in the meantime. Imagine a situation, where:
Tx1to the pool at block5. We know it's was reported valid at that block.6and rendersTx1invalid now.7and import blindlyTx1assuming it's valid, since it's in the pool already.That's a very specific case and I don't know if it applies to your chain though. Potentially we should force revalidation of all transactions in the pool in case of runtime upgrade, or check that the validity information is provided by the recent runtime version and if not, do full checks when importing to a pending block.