Now that base36 support is landing into go-ipfs for unrelated reasons I would like to resurrect my pitch from 3 years ago to default to base36 rendering as opposed to base32. The internals of go-ipfs operate over binary CIDs so the proposed change is strictly concerned with "UX".
Let me know your thoughts!
✅ Pros
| | | |
| --- | --- | --- |
| Data: | sufficiently long payload | sufficient1y long payload |
| B32Cid: | bafkqagttovtgm2ldnfsw45dmpeqgy33om4qhaylznrxwczak | bafkqagttovtgm2ldnfsw45brpeqgy33om4qhaylznrxwczak |
| B36Cid: | kumq0v1xsl33xy9ospzfmf6uuibi0jxtdsrmhovipx6o0a | kumq0v1xsl33xy9ospzfmbp3zrpszmf3tbmvd50ou83pje |
1 + ceil( (4+blocksize) * log(256) / log(base32-or-36) )

❌ Cons
bafy..., another switch would create further confusion
( 1 * 1000 * 1000 * 1000 ) / 2672 = 374251 b36 CIDs in 1 second ):
go-multibase$ go test -count=1 ./... -cpu=1 -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/multiformats/go-multibase
BenchmarkRoundTrip/base32 2190152 541 ns/op 256 B/op 4 allocs/op
BenchmarkRoundTrip/base36 464097 2672 ns/op 336 B/op 5 allocs/op
BenchmarkRoundTrip/base58btc 564238 2220 ns/op 336 B/op 5 allocs/op
BenchmarkEncode/base32 5402474 234 ns/op 192 B/op 3 allocs/op
BenchmarkEncode/base36 608466 1966 ns/op 192 B/op 3 allocs/op
BenchmarkEncode/base58btc 739843 1585 ns/op 192 B/op 3 allocs/op
BenchmarkDecode/base32 4150816 278 ns/op 64 B/op 1 allocs/op
BenchmarkDecode/base36 1604271 742 ns/op 144 B/op 2 allocs/op
BenchmarkDecode/base58btc 1804045 643 ns/op 144 B/op 2 allocs/op
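The round-trip rate quoted in the first post follows directly from the ns/op column. A minimal sketch of that arithmetic, assuming the figures above are representative (the helper name `opsPerSecond` is made up for illustration):

```go
package main

import "fmt"

// opsPerSecond converts a benchmark ns/op figure into whole operations
// per second, truncating toward zero.
func opsPerSecond(nsPerOp int64) int64 {
	const nsPerSecond = 1000000000
	return nsPerSecond / nsPerOp
}

func main() {
	// ns/op values copied from the BenchmarkRoundTrip lines above.
	roundTrip := []struct {
		name    string
		nsPerOp int64
	}{
		{"base32", 541},
		{"base36", 2672},
		{"base58btc", 2220},
	}
	for _, r := range roundTrip {
		fmt.Printf("%-10s %8d round trips/s\n", r.name, opsPerSecond(r.nsPerOp))
	}
}
```

On these numbers base36 is roughly five times slower than base32 per round trip, yet still in the hundreds of thousands of CIDs per second on a single core.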
I'm pretty late to the "how do we encode a CID"-Party... but why don't we use base62 for encoding?
I mean, I don't think we encounter a system which can't distinguish between lower and upper case and entering CIDs by hand isn't really a thing either.
The much smaller CID, on the other hand, would allow for better display. Currently it's hard to create, for example, a table with the full CID next to a name on a console. E.g. the status list of ipfs cluster always wraps onto a second line.
On the other hand, a checksum at the end, as with IBANs, would be neat, since it would allow detecting copy-and-paste errors.
If we use the same text to binary transformation it would be nice if the type of data we encode would be visible to a human, so CIDs and Public Keys cannot be mixed up. Not sure how to do this space-efficiently, though.
> I don't think we encounter a system which can't distinguish between lower and upper case
DNS is one such system. This is why we migrated away from b58 in the first place. For more background see https://github.com/ipfs/go-ipfs/issues/5982#issue-408601450
@RubenKelevra I appreciate your enthusiasm to contribute but again, please take some time to read the relevant background before chiming in on discussions. It would have taken you less time to google "base32 cid ipfs" and read the first link than it took me to write this comment.
There is a performance penalty, but this is mostly about UX.
Having that in mind, I'd like to highlight what @RubenKelevra said here:
> If we use the same text to binary transformation it would be nice if the type of data we encode would be visible to a human
I wonder, perhaps there is a UX net-positive in having different defaults (and canonical representation in subdomains) for mutable and immutable identifiers:
/ipfs/baf...
/ipns/k2...

This would make it easy to eyeball identifiers (baf.. vs k2..), potentially reducing confusion between "CIDs" and "PeerIDs".
Any downsides?
> I don't think we encounter a system which can't distinguish between lower and upper case
I wish that was the case :-)
Unfortunately the default text representation of CID needs to be case-insensitive for pragmatic reasons, some of them are:
http*_ (!)

@Stebalien sorry.
@lidel
I think adding an optional parity section would be better for user interaction, too.
Humans are particularly bad at comparing hash sums, so adding an optional parity section can help increase the strength against attacks, since the parity section would need to be forged as well.
A visual separator like a dash, between the CID and the parity section would help to break it visually apart.
So if a user inserts a CID w/ parity (or a PeerID w/ parity) we can check the full address for spoofing attacks and wrong input.
The Argon2 algorithm is probably a good choice: The checksum part would be the password, while the hash definition would be the salt.
The output could be encoded as a very short string and appended. If someone enters a CID without the additional parity bits at the end, a warning could be displayed, while a CID with the additional parity bits at the end would be processed without a warning.
If we need to mass-import CIDs we could offer a flag to ignore this part.
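A rough sketch of the dash-separated check suffix described above. Nothing here is part of any CID spec: the helper names are hypothetical, and a truncated SHA-256 stands in for the Argon2 suggestion only because it is available in the Go standard library. The dash is unambiguous as a separator because neither the base32 nor the base36 alphabet contains it.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strconv"
	"strings"
)

// checkSuffix derives a short base36 check string from the CID text.
// Stand-in choice: the first two bytes of SHA-256, not Argon2.
func checkSuffix(cid string) string {
	sum := sha256.Sum256([]byte(cid))
	return strconv.FormatUint(uint64(binary.BigEndian.Uint16(sum[:2])), 36)
}

// withCheck appends the check string after a dash separator.
func withCheck(cid string) string {
	return cid + "-" + checkSuffix(cid)
}

// verify splits the input at the last dash. hasCheck reports whether a
// check section was present; valid is only meaningful when hasCheck is true.
func verify(input string) (cid string, hasCheck, valid bool) {
	i := strings.LastIndexByte(input, '-')
	if i < 0 {
		return input, false, false // no check section: caller may warn
	}
	cid = input[:i]
	return cid, true, input[i+1:] == checkSuffix(cid)
}

func main() {
	cid := "bafkqagttovtgm2ldnfsw45dmpeqgy33om4qhaylznrxwczak"
	protected := withCheck(cid)
	fmt.Println(protected)

	_, hasCheck, valid := verify(protected)
	fmt.Println("with check:", hasCheck, valid)

	_, hasCheck, _ = verify(cid)
	fmt.Println("without check, should warn:", !hasCheck)
}
```

A two-byte check only catches accidental errors with probability of about 65535 in 65536; resisting deliberate forgery would need a longer suffix.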
Clearly out of scope for this ticket, I will move it to a feature request.
Edit: Make the idea a bit more clear.
@RubenKelevra the limitation here is that IPFS uses the CIDv0/CIDv1 spec, and adding a checksum would mean creating CIDv2. I think it's worth discussing, but as you noticed it's out of scope here, so please file an issue in https://github.com/multiformats/cid :pray:
> This would make it easy to eyeball identifiers (baf.. vs k2..), potentially reducing confusion between "CIDs" and "PeerIDs".
> Any downsides?
Well, people will start assuming that baf means CID and k2 means peer ID. Worse, people will start writing tools to match these prefixes. On the other hand, that's likely going to happen anyways.
Apart from performance and en/decoder-complexity,
another downside of base36 is decreased legibility, because it adds 0, 1 and 8 to the character set that can be mistaken for I, l or O
probably not worth it for the small increase in data density.
> Well, people will start assuming that baf means CID and k2 means peer ID. Worse, people will start writing tools to match these prefixes. On the other hand, that's likely going to happen anyways.
To clarify, I wouldn't consider the ability to distinguish between CIDs and peer IDs based on encoding a feature.
I'm slightly leaning towards base36 everywhere:
@bmwiedemann
> Apart from performance and en/decoder-complexity,
> another downside of base36 is decreased legibility, because it adds 0, 1 and 8 to the character set that can be mistaken for I, l or O
> probably not worth it for the small increase in data density.
The goal is:
To shave off two bytes from specific IPNS keys so they fit in subdomains.
Maybe soon there will be other keys that need another 2 bytes?
Others have avoided the 63-char limit by allowing a dot:
https://tools.ietf.org/html/rfc6376#section-3.6.2.1
and code complexity is still a concern to me, because it makes alternative implementations harder and thereby reduces ipfs adoption:
https://github.com/bmwiedemann/perl-ipfs/blob/master/multibase.pm#L16
has just ~20 lines to implement encoding+decoding for base32+64+16+2
and no base58 or 36 support in there. Luckily both filestore and CIDv1 default to base32.
TLDR: @Stebalien has accurately stated the goal, @bmwiedemann's solution is explored in #7318 and has serious problems for important use cases.
> Maybe soon there will be other keys that need another 2 bytes?
Yes, and we'll likely have to deal with that at some point, however the "just use a dot" suggestion is not as simple as you are advertising. Take a look at #7318 for more context and feel free to add any suggestions there if you have some ideas.
Hopefully we can work to ensure that we don't have to deal with this problem "soon" and that by then some of the issues in #7318 will become manageable. Potentially we could even sidestep DNS name restrictions entirely by relying more heavily on an IPFS protocol handler instead of just using a DNS name.
In the meantime we'd like to handle #6916 while still supporting resolution in subdomains. While we could do that by using base 36 in some places (e.g. subdomains) and base 32 in other places, this proposal is advocating for consistently outputting base 36.
@bmwiedemann I contributed a scaffold b36/b58 encoder here as a "oneliner" : https://github.com/bmwiedemann/perl-ipfs/commit/83cb86c33eceafe50980ca0a1ed3409e8a8c2c9a#r39496064
> Maybe soon there will be other keys that need another 2 bytes?
We are now painfully aware that the hard limit for user-visible things is 40 bytes, and are going to aggressively design to not go over this in the future (case in point: the current situation was the result of an oversight).
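The 40-byte figure can be checked against the 63-character label limit mentioned earlier in the thread. The sketch below computes the worst-case text length (one multibase prefix character plus the digits of the largest n-byte value), which is the same bound as the `1 + ceil(n * log(256) / log(base))` formula in the first post but uses exact integer arithmetic. The function name `maxTextLen` is made up for illustration.

```go
package main

import (
	"fmt"
	"math/big"
)

// maxTextLen returns the longest possible multibase string for n bytes
// in the given base (2..62): one prefix character plus the number of
// digits needed for the largest n-byte value, 256^n - 1.
func maxTextLen(n, base int) int {
	largest := new(big.Int).Lsh(big.NewInt(1), uint(8*n)) // 256^n
	largest.Sub(largest, big.NewInt(1))
	return 1 + len(largest.Text(base))
}

func main() {
	const dnsLabelLimit = 63
	for _, base := range []int{32, 36} {
		l := maxTextLen(40, base)
		fmt.Printf("base%d: %d chars, fits in a DNS label: %v\n",
			base, l, l <= dnsLabelLimit)
	}
	// Prints:
	// base32: 65 chars, fits in a DNS label: false
	// base36: 63 chars, fits in a DNS label: true
}
```

This is a worst-case bound; a particular value can encode slightly shorter in base36, while 41 bytes would already exceed the limit in both bases.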
Dropping a reference for when there is a decision on this: one of the main spots to fixup would be https://github.com/ipfs/go-ipfs/blob/79d571f272d640f0d4164f6f9d94c5fb44983822/core/corehttp/hostname.go#L381-L396