Web3.js: getPastEvents throws uncaught exception when processing invalid UTF-8 characters [1.0.0-beta.34]

Created on 4 May 2018  Â·  12Comments  Â·  Source: ChainSafe/web3.js

Problem

  • Solidity contract methods can accept invalid UTF-8 characters from user input (by design?)
  • Solidity contracts can log an invalid UTF-8 character
  • Web3js aggregates logs
  • Javascript is unicode, doesn't play nice with invalid UTF-8 characters.
  • If Web3js encounters a log with an invalid UTF-8 character, it has an irrecoverable error
  • utf-8 module's utf8.decode() utilized by Web3 throws an uncatchable error if there is an invalid utf-8 character.
  • The uncaught exception prevents the developer from proceeding

Impact

Contracts with lack of or improper user input validation on public methods.

What I expect to happen

Web3js catches the error and skips the record

What actually happens

Irrecoverable error

Possible Solution

Catch the error and skip the record

1.x 2.x Stale clarification needed

Most helpful comment

Copying over my comments on how an API design change might look to work with this. I assume the use of the TextEncoder standard in the comments, but probably the ethers.js codec can be swapped in if that is the way forward here.


Referring to getPastEvents, I would propose something like the following:

Add an option, acknowledging that parsing event data might fail due to string processing, maybe called stringParsingErrorMode or something. A parsing error might occur for some non-empty subset of the events being gathered. We have replace which can just replace the bad sequence starts with ďż˝, skip which skips that subset and returns the events for which string parsing didn't fail, or error which rejects with the encoding error encountered. It's fine if the default behavior remains error, as long as these other modes are supported.

In particular, skip would be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.

All 12 comments

utf8 throws an uncatchable exception when it cannot decode hex.

From UTF8:

Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)

For my implementation purposes I have patched line 218 of utils:
https://github.com/ethereum/web3.js/blob/1.0/packages/web3-utils/src/utils.js#L218

With

let returnValue;
try {
  returnValue = utf8.decode(str);
}
catch(e) {
  returnValue = "";
}
finally {
  return returnValue;
}

commit (don't mind the meta)

I'm not going to be submitting a PR because this is not a universally safe patch, but works for my implementation. Leaving open.

here's the hex that broke web3 for me:
0x0e00000000000f8c0000000000000005000f0000070000000909

Cheers.

Maybe https://github.com/mathiasbynens/wtf-8 might be useful for that case?

@hexagon6 Agreed, but this would be a bigger change than the above hack.

This would appear to be a serious bug that really needs to be fixed, as any web3 dependent service using this library, which takes any form of user input, could be taken down from this bug.

I am considering adding an “unsafeString” type or something to ethers.js that would help web3 deal with this issue, which would allow for invalid surrogate pairs, overlong sequences and invalid ordinal bits set.

It is a hard problem to solve, because not sanitizing the output can lead to serious security implications for dapps; if you were to hash the string, it would not match, for example, which may be used in a commit-reveal. If I can trick you into committing to a wrong hash, you cannot reveal, which may mean you cannot properly challenge, for example.

I would also recommend that Web3.js also add UTF8 sanitization to all contract calls, so it fails before it hits the blockchain.

That said, there are still multiple ways to try interpreting invalid UTF8 sequences. The current UTF8 library allows a “allowInvalid” flag (which is never set to true, but it was easy to add and potentially useful) which uses the “ignore” strategy (which simply drops the naughty bits), but for this I think the replace strategy makes more sense, replacing the invalid sequence with the placeholder, And continues to the next octet with the ordinal but cleared... I believe (please correct me if I’m wrong), that’s what the native JS string does.

I think for now, you could also use the “bytes” type for decoding, which doesn’t care about data representation. But that needs to be further in, since the contract sighash needs the string “string” to match.

Still thinking about this though...

I would also recommend that Web3.js also add UTF8 sanitization to all contract calls, so it fails before it hits the blockchain.

Agreed 👌

but for this I think the replace strategy makes more sense, replacing the invalid sequence with the placeholder, And continues to the next octet with the ordinal but cleared... I believe (please correct me if I’m wrong), that’s what the native JS string does.

I totally agree with you on this!

I think for now, you could also use the “bytes” type for decoding, which doesn’t care about data representation

Yep, this should work.

I was checking if there are other libs who do have the replacement logic as for example python does have (@ricmoo showed me the functionalities python has). But sadly there is currently no lib which does handle this as I would expect.

Luckily is @ricmoo already creating such an improved UTF-8 lib based on already existing code đź’Ş

I will wait until Richard has finished his lib and will use it for the web3.utils.toUtf8call and all other utility functions we have. This means this specific issue will be fixed as soon as @ricmoo has finished his lib in a stable manner or the new AbiCoder v5 which does include this fix got released as a stable version.

Copying over my comments on how an API design change might look to work with this. I assume the use of the TextEncoder standard in the comments, but probably the ethers.js codec can be swapped in if that is the way forward here.


Referring to getPastEvents, I would propose something like the following:

Add an option, acknowledging that parsing event data might fail due to string processing, maybe called stringParsingErrorMode or something. A parsing error might occur for some non-empty subset of the events being gathered. We have replace which can just replace the bad sequence starts with ďż˝, skip which skips that subset and returns the events for which string parsing didn't fail, or error which rejects with the encoding error encountered. It's fine if the default behavior remains error, as long as these other modes are supported.

In particular, skip would be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.

Just a note: we tried to reproduce this bug in #3497 without success.

If anyone has a reproduction case, please advise.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions

@cgewecke At one point I had the exact block number and event that caused utf8 to throw exception, but it has been more than ~3~ 2 years since I reported this.

@cgewecke In the original thread I mentioned a hex that brought up the issue for me. You could include that hex string in a unit test against whatever module is presently implemented for utf-8

0x0e00000000000f8c0000000000000005000f0000070000000909

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. If you believe this was a mistake, please comment.

Was this page helpful?
0 / 5 - 0 ratings