Go-ipfs: Switch to base36 by default in all text output (overriding ipfs/go-ipfs/issues/4143)

Created on 25 May 2020 · 13 Comments · Source: ipfs/go-ipfs

Now that base36 support is landing in go-ipfs for unrelated reasons, I would like to resurrect my pitch from 3 years ago to default to base36 rendering instead of base32. The internals of go-ipfs operate on binary CIDs, so the proposed change is strictly concerned with "UX".

Let me know your thoughts!

✅ Pros

Consistency across the board: libp2p-key and unixfs-entry pointers all look like what the wider public knows as "a CID"
Brings "semi-avalanche effect" to identity CIDs. E.g.

| Field | Original payload | One character changed |
| --- | --- | --- |
| Data: | sufficiently long payload | sufficient1y long payload |
| B32Cid: | bafkqagttovtgm2ldnfsw45dmpeqgy33om4qhaylznrxwczak | bafkqagttovtgm2ldnfsw45brpeqgy33om4qhaylznrxwczak |
| B36Cid: | kumq0v1xsl33xy9ospzfmf6uuibi0jxtdsrmhovipx6o0a | kumq0v1xsl33xy9ospzfmbp3zrpszmf3tbmvd50ou83pje |

Increases the usable size of an identity CID from 34 to 36 bytes of "blocksize" while remaining usable in subdomains (max 63 characters). The encoded length is 1 + ceil( (4+blocksize) * log(256) / log(base) ), where base is 32 or 36.
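The 34 versus 36 byte figures can be checked by evaluating the length formula directly. This is a rough sketch with hypothetical helper names (`encodedLen`, `maxBlockSize`); it treats the formula as a worst-case estimate, since an actual encoded string can come out shorter. It uses `8 / log2(base)`, which is the same ratio as `log(256) / log(base)`.

```go
package main

import (
	"fmt"
	"math"
)

// encodedLen evaluates 1 + ceil((4 + blockSize) * log(256) / log(base)):
// one multibase prefix character plus the digits needed for the 4 header
// bytes and the identity payload.
func encodedLen(blockSize int, base float64) int {
	bits := float64(4+blockSize) * 8
	return 1 + int(math.Ceil(bits/math.Log2(base)))
}

// maxBlockSize returns the largest payload size whose encoded length still
// fits within limit characters.
func maxBlockSize(base float64, limit int) int {
	n := 0
	for encodedLen(n+1, base) <= limit {
		n++
	}
	return n
}

func main() {
	const dnsLabelLimit = 63 // maximum length of a single DNS label
	for _, base := range []float64{32, 36} {
		n := maxBlockSize(base, dnsLabelLimit)
		fmt.Printf("base%.0f: up to %d payload bytes (%d characters)\n",
			base, n, encodedLen(n, base))
	}
}
```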

❌ Cons

Early adopters are already familiar with bafy...; another switch would create further confusion
Somewhat slower (on my MacBook a base36 round trip takes 2672 ns, so I can encode and decode 1,000,000,000 / 2672 ≈ 374251 b36 CIDs per second):

go-multibase$ go test -count=1 ./... -cpu=1 -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/multiformats/go-multibase
BenchmarkRoundTrip/base32            2190152           541 ns/op         256 B/op          4 allocs/op
BenchmarkRoundTrip/base36             464097          2672 ns/op         336 B/op          5 allocs/op
BenchmarkRoundTrip/base58btc          564238          2220 ns/op         336 B/op          5 allocs/op
BenchmarkEncode/base32               5402474           234 ns/op         192 B/op          3 allocs/op
BenchmarkEncode/base36                608466          1966 ns/op         192 B/op          3 allocs/op
BenchmarkEncode/base58btc             739843          1585 ns/op         192 B/op          3 allocs/op
BenchmarkDecode/base32               4150816           278 ns/op          64 B/op          1 allocs/op
BenchmarkDecode/base36               1604271           742 ns/op         144 B/op          2 allocs/op
BenchmarkDecode/base58btc            1804045           643 ns/op         144 B/op          2 allocs/op
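To put the round-trip numbers side by side, a small sketch (the helper `opsPerSecond` is made up for illustration) converts the ns/op figures copied from the output above into round trips per second:

```go
package main

import "fmt"

// opsPerSecond converts a benchmark's ns/op figure into whole operations
// per second.
func opsPerSecond(nsPerOp int64) int64 {
	return 1_000_000_000 / nsPerOp
}

func main() {
	// Round-trip timings copied from the benchmark output above.
	results := []struct {
		name    string
		nsPerOp int64
	}{
		{"base32", 541},
		{"base36", 2672},
		{"base58btc", 2220},
	}
	baseline := float64(results[0].nsPerOp)
	for _, r := range results {
		fmt.Printf("%-10s %8d round trips/s (%.1fx the base32 time)\n",
			r.name, opsPerSecond(r.nsPerOp), float64(r.nsPerOp)/baseline)
	}
}
```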

kind/enhancement


ribasushi

👍1

All 13 comments

I'm pretty late to the "how do we encode a CID"-Party... but why don't we use base62 for encoding?

I mean, I don't think we encounter a system which can't distinguish between lower and upper case and entering CIDs by hand isn't really a thing either.

The much shorter CID, on the other hand, would display better. Currently it's hard to fit, for example, a table with the full CID next to a name on a console; e.g. the status list of ipfs-cluster always wraps the line.

On the other hand, a checksum character at the end, as with IBANs, would be neat, since it would make it possible to detect copy-and-paste errors.

If we use the same text-to-binary transformation, it would be nice if the type of data we encode were visible to a human, so CIDs and public keys cannot be mixed up. Not sure how to do this space-efficiently, though.

RubenKelevra on 25 May 2020

> I don't think we encounter a system which can't distinguish between lower and upper case

DNS is one such system. This is why we migrated away from b58 in the first place. For more background see https://github.com/ipfs/go-ipfs/issues/5982#issue-408601450

ribasushi on 25 May 2020

@RubenKelevra I appreciate your enthusiasm to contribute but again, please take some time to read the relevant background before chiming in on discussions. It would have taken you less time to google "base32 cid ipfs" and read the first link than it took me to write this comment.

Stebalien on 25 May 2020

👍1

There is a performance penalty, but this is mostly about UX.
Having that in mind, I'd like to highlight what @RubenKelevra said here:

> If we use the same text to binary transformation it would be nice if the type of data we encode would be visible to a human

I wonder whether there is a UX net-positive in having different defaults (and canonical representations in subdomains) for mutable and immutable identifiers:

keep base32 for /ipfs/baf...
use base36 as the default for PeerIDs/libp2p-keys (/ipns/k2...)
- we will use it for longer keys such as ed25519 anyway, so we are half-way there

This would make it easy to eyeball identifiers (baf.. vs k2..), potentially reducing confusion between "CIDs" and "PeerIDs".

Any downsides?

> I don't think we encounter a system which can't distinguish between lower and upper case

I wish that was the case :-)

Unfortunately, the default text representation of a CID needs to be case-insensitive for pragmatic reasons. Some of them are:

DNS is one thing,
but actual code in browser vendors (and probably a bunch of libraries/SDKs used in the wild) is forcing all characters in the "authority" component of the URI to be lowercase, _even if URI does not start with http*_ (!)
Windows filesystems are case-insensitive (only recently Windows 10 build 17107 introduced experimental opt-in per directory to make them case-sensitive)
and probably many more
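The practical consequence of the points above can be sketched in a few lines. This is only an illustration: `survivesLowercasing` is a hypothetical helper, the first sample is an arbitrary base58btc CIDv0 chosen as an example, and the other two samples are taken from the table in the issue description. Any system that forces lowercase rewrites a mixed-case identifier into a different string, while base32 and base36 text passes through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// survivesLowercasing reports whether an identifier is unchanged after being
// forced to lowercase, as a DNS label or a URI authority component may be.
func survivesLowercasing(id string) bool {
	return strings.ToLower(id) == id
}

func main() {
	samples := []struct {
		encoding string
		id       string
	}{
		// Sample base58btc CIDv0, used only as an illustration.
		{"base58btc", "QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG"},
		// The next two come from the table in the issue description.
		{"base32", "bafkqagttovtgm2ldnfsw45dmpeqgy33om4qhaylznrxwczak"},
		{"base36", "kumq0v1xsl33xy9ospzfmf6uuibi0jxtdsrmhovipx6o0a"},
	}
	for _, s := range samples {
		fmt.Printf("%-10s survives lowercasing: %v\n", s.encoding, survivesLowercasing(s.id))
	}
}
```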

lidel on 25 May 2020

@Stebalien sorry.

@lidel

I think adding an optional parity section would be better for user interaction, too.

Humans are particularly bad at comparing hash sums, so adding an optional parity section can help increase the resistance against attacks, since the parity section would need to be forged as well.

A visual separator like a dash, between the CID and the parity section would help to break it visually apart.

So if a user inserts a CID w/ parity (or a PeerID w/ parity) we can check the full address for spoofing attacks and wrong input.

The Argon2 algorithm is probably a good choice: The checksum part would be the password, while the hash definition would be the salt.

The output could be encoded to a very short string and appended. If someone enters a CID without parity bits at the end, a warning could be displayed, while a CID with parity bits at the end would be processed without a warning.

If we need to mass-import CIDs we could offer a flag to ignore this part.

Clearly out of scope for this ticket, I will move it to a feature request.

Edit: Made the idea a bit clearer.

RubenKelevra on 25 May 2020

@RubenKelevra the limitation here is that IPFS uses the CIDv0/CIDv1 spec, and adding a checksum would mean creating CIDv2. I think it's worth discussing, but as you noticed it's out of scope here, so please file an issue in https://github.com/multiformats/cid 🙏

lidel on 26 May 2020

> This would make it easy to eyeball identifiers (baf.. vs k2..), potentially reducing confusion between "CIDs" and "PeerIDs".
>
> Any downsides?

Well, people will start assuming that baf means CID and k2 means peer ID. Worse, people will start writing tools to match these prefixes. On the other hand, that's likely going to happen anyways.

Stebalien on 26 May 2020

👍1

Apart from performance and en/decoder complexity, another downside of base36 is decreased legibility, because it adds 0, 1 and 8 to the character set, which can be mistaken for I, l or O. Probably not worth it for the small increase in data density.
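For reference, the exact set of characters that base36 adds over base32 can be listed with a short sketch. The alphabets are the lowercase RFC 4648 base32 alphabet and the digits-then-letters base36 alphabet; `addedChars` is a made-up helper. Besides the three characters named above, the difference also includes 9.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	base32Alphabet = "abcdefghijklmnopqrstuvwxyz234567"     // RFC 4648, lowercase
	base36Alphabet = "0123456789abcdefghijklmnopqrstuvwxyz" // digits, then letters
)

// addedChars returns the characters of to that do not occur in from,
// in the order they appear in to.
func addedChars(from, to string) string {
	var sb strings.Builder
	for _, r := range to {
		if !strings.ContainsRune(from, r) {
			sb.WriteRune(r)
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(addedChars(base32Alphabet, base36Alphabet)) // prints "0189"
}
```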

bmwiedemann on 27 May 2020

> Well, people will start assuming that baf means CID and k2 means peer ID. Worse, people will start writing tools to match these prefixes. On the other hand, that's likely going to happen anyways.

To clarify, I wouldn't consider the ability to distinguish between CIDs and peer IDs based on encoding a feature.

I'm slightly leaning towards base36 everywhere:

To be consistent with CIDs.
To _stop_ users from trying to distinguish between "is a cid" and "is not a cid" by looking at the base encoding. They're all CIDs.

@bmwiedemann

> Apart from performance and en/decoder-complexity,
> another downside of base36 is decreased legibility, because it adds 0, 1 and 8 to the character set that can be mistaken for I, l or O
> probably not worth it for the small increase in data density.

The goals are:

To shave off two bytes from specific IPNS keys so they fit in subdomains.
To be consistent.

Stebalien on 28 May 2020

> To shave off two bytes from specific IPNS keys so they fit in subdomains.

Maybe soon there will be other keys that need another 2 bytes?

Others have avoided the 63-char limit by allowing a dot:
https://tools.ietf.org/html/rfc6376#section-3.6.2.1

Code complexity is also still a concern to me, because it makes alternative implementations harder and thereby reduces IPFS adoption:
https://github.com/bmwiedemann/perl-ipfs/blob/master/multibase.pm#L16
needs just ~20 lines to implement encoding and decoding for base32, base64, base16 and base2,
and has no base58 or base36 support. Luckily both filestore and CIDv1 default to base32.

bmwiedemann on 28 May 2020

TLDR: @Stebalien has accurately stated the goal, @bmwiedemann's solution is explored in #7318 and has serious problems for important use cases.

> Maybe soon there will be other keys that need another 2 bytes?

Yes, and we'll likely have to deal with that at some point, however the "just use a dot" suggestion is not as simple as you are advertising. Take a look at #7318 for more context and feel free to add any suggestions there if you have some ideas.

Hopefully we can work to ensure that we don't have to deal with this problem "soon" and that by then some of the issues in #7318 will become manageable. Potentially we could even sidestep DNS name restrictions entirely by relying more heavily on an IPFS protocol handler instead of just using a DNS name.

In the meantime we'd like to handle #6916 while still supporting resolution in subdomains. While we could do that by using base36 in some places (e.g. subdomains) and base32 in other places, this proposal advocates consistently outputting base36.

aschmahmann on 28 May 2020

❤1

@bmwiedemann I contributed a scaffold b36/b58 encoder here as a "oneliner": https://github.com/bmwiedemann/perl-ipfs/commit/83cb86c33eceafe50980ca0a1ed3409e8a8c2c9a#r39496064

> Maybe soon there will be other keys that need another 2 bytes?

We are now painfully aware that the hard limit for user-visible things is 40 bytes, and are going to aggressively design so as not to go over it in the future (case in point: the current situation was the result of an oversight).

ribasushi on 28 May 2020

Dropping a reference for when there is a decision on this: one of the main spots to fixup would be https://github.com/ipfs/go-ipfs/blob/79d571f272d640f0d4164f6f9d94c5fb44983822/core/corehttp/hostname.go#L381-L396

ribasushi on 10 Jul 2020
