utf8.decode() utilized by Web3 throws an uncatchable error if there is an invalid utf-8 character. Contracts with lack of or improper user input validation on public methods.
Web3js catches the error and skips the record
Irrecoverable error
Catch the error and skip the record
utf8 throws an uncatchable exception when it cannot decode hex.
From UTF8:
Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)
For my implementation purposes I have patched line 218 of utils:
https://github.com/ethereum/web3.js/blob/1.0/packages/web3-utils/src/utils.js#L218
With
let returnValue;
try {
returnValue = utf8.decode(str);
}
catch(e) {
returnValue = "";
}
finally {
return returnValue;
}
commit (don't mind the meta)
I'm not going to be submitting a PR because this is not a universally safe patch, but works for my implementation. Leaving open.
here's the hex that broke web3 for me:
0x0e00000000000f8c0000000000000005000f0000070000000909
Cheers.
Maybe https://github.com/mathiasbynens/wtf-8 might be useful for that case?
@hexagon6 Agreed, but this would be a bigger change than the above hack.
This would appear to be a serious bug that really needs to be fixed, as any web3 dependent service using this library, which takes any form of user input, could be taken down from this bug.
I am considering adding an “unsafeString” type or something to ethers.js that would help web3 deal with this issue, which would allow for invalid surrogate pairs, overlong sequences and invalid ordinal bits set.
It is a hard problem to solve, because not sanitizing the output can lead to serious security implications for dapps; if you were to hash the string, it would not match, for example, which may be used in a commit-reveal. If I can trick you into committing to a wrong hash, you cannot reveal, which may mean you cannot properly challenge, for example.
I would also recommend that Web3.js also add UTF8 sanitization to all contract calls, so it fails before it hits the blockchain.
That said, there are still multiple ways to try interpreting invalid UTF8 sequences. The current UTF8 library allows a “allowInvalid” flag (which is never set to true, but it was easy to add and potentially useful) which uses the “ignore” strategy (which simply drops the naughty bits), but for this I think the replace strategy makes more sense, replacing the invalid sequence with the placeholder, And continues to the next octet with the ordinal but cleared... I believe (please correct me if I’m wrong), that’s what the native JS string does.
I think for now, you could also use the “bytes” type for decoding, which doesn’t care about data representation. But that needs to be further in, since the contract sighash needs the string “string” to match.
Still thinking about this though...
I would also recommend that Web3.js also add UTF8 sanitization to all contract calls, so it fails before it hits the blockchain.
Agreed 👌
but for this I think the replace strategy makes more sense, replacing the invalid sequence with the placeholder, And continues to the next octet with the ordinal but cleared... I believe (please correct me if I’m wrong), that’s what the native JS string does.
I totally agree with you on this!
I think for now, you could also use the “bytes” type for decoding, which doesn’t care about data representation
Yep, this should work.
I was checking if there are other libs who do have the replacement logic as for example python does have (@ricmoo showed me the functionalities python has). But sadly there is currently no lib which does handle this as I would expect.
Luckily is @ricmoo already creating such an improved UTF-8 lib based on already existing code đź’Ş
I will wait until Richard has finished his lib and will use it for the web3.utils.toUtf8call and all other utility functions we have. This means this specific issue will be fixed as soon as @ricmoo has finished his lib in a stable manner or the new AbiCoder v5 which does include this fix got released as a stable version.
Copying over my comments on how an API design change might look to work with this. I assume the use of the TextEncoder standard in the comments, but probably the ethers.js codec can be swapped in if that is the way forward here.
Referring to getPastEvents, I would propose something like the following:
Add an option, acknowledging that parsing event data might fail due to string processing, maybe called stringParsingErrorMode or something. A parsing error might occur for some non-empty subset of the events being gathered. We have replace which can just replace the bad sequence starts with ďż˝, skip which skips that subset and returns the events for which string parsing didn't fail, or error which rejects with the encoding error encountered. It's fine if the default behavior remains error, as long as these other modes are supported.
In particular, skip would be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.
Just a note: we tried to reproduce this bug in #3497 without success.
If anyone has a reproduction case, please advise.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions
@cgewecke At one point I had the exact block number and event that caused utf8 to throw exception, but it has been more than ~3~ 2 years since I reported this.
@cgewecke In the original thread I mentioned a hex that brought up the issue for me. You could include that hex string in a unit test against whatever module is presently implemented for utf-8
0x0e00000000000f8c0000000000000005000f0000070000000909
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. If you believe this was a mistake, please comment.
Most helpful comment
Copying over my comments on how an API design change might look to work with this. I assume the use of the TextEncoder standard in the comments, but probably the
ethers.jscodec can be swapped in if that is the way forward here.Referring to
getPastEvents, I would propose something like the following:Add an option, acknowledging that parsing event data might fail due to string processing, maybe called
stringParsingErrorModeor something. A parsing error might occur for some non-empty subset of the events being gathered. We havereplacewhich can just replace the bad sequence starts with ďż˝,skipwhich skips that subset and returns the events for which string parsing didn't fail, orerrorwhich rejects with the encoding error encountered. It's fine if the default behavior remainserror, as long as these other modes are supported.In particular,
skipwould be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.