Eth2.0-specs: Remove bitlist length bit from SSZ

Created on 3 Jul 2019  路  17Comments  路  Source: ethereum/eth2.0-specs

There was a change recently that changed the bitfield types to Bitlist[N] in #1224. Part of this change was to also change SSZ to include special handling for the underlying data type, a byte array.

I've talked with a few folks about the rationale for the changes in #1224. From what I understand, these changes were included for readability in the Python specification/implementation. However, this added a change that isn't strictly related to readability: Bitlist bit length in SSZ encoding.

Bitlist[N]
Note that from the offset coding, the length (in bytes) of the bitlist is known. An additional leading 1 bit is added so that the length in bits will also be known.

as_integer = (1 << len(value)) + sum([value[i] << i for i in range(len(value))])
return as_integer.to_bytes((as_integer.bit_length() + 7) // 8, "little")

specs/simple-serialize.md

Prior to the add bitlist length in SSZ, we checked the number of bytes in the bitfield byte array to ensure it was the expected length. Maybe an issue here is that you can only check for fields in multiples of 8 bits (1 byte). So if we had a bitfield that represented indices in a committee where the committee length was 10 then we would check the bitfield had 2 bytes, but there were 6 bits unused. An additional layer of verification could be to set the 6 most significant bits to 0 in the second byte or simple disregard their values entirely. In either case, we had a solution without Bitlist length in SSZ. The length enforcement should be handled in the application layer rather than the serialization layer.

With these changes, we now have this added consequence of a non-primitive type handling in SSZ. Up until this PR, you could use SSZ and other encoders interchangeably with adding special rules. You could encode any data structure with a bitfield to JSON or others where the underlying bytes array would naturally be appropriately represented in the same way. We have to treat some byte arrays different than other byte arrays and this adds unnecessary complexity. The key takeaway here is that we could previously fully populate an SSZ yaml file to a data structure using the existing serialization tooling for yaml and/or json. i.e. byte arrays were simply byte arrays, not byte arrays with some extra metadata.

If its helpful, I had started collecting notes on this change and trying to piece together the puzzle of #1224. Link

There's a good bit about this in @rauljordan's article in the section called "Why Not More Basic Types!?" link

TL;DR - Bitfield length information in SSZ is unnecessary complexity and we should remove it. Let's keep simple serialize simple.

SSZ

Most helpful comment

My recommendation is that you just don't treat bitfields as bytes, but as a separate object, that is just a list of booleans with a special way of serializing it.

I think part of the friction is around the fact that practically speaking, any performant implementation is going to want to represent a bitlist as bytes, like a bitfield. But by imbuing a bitlist with its length information, we can no longer just use the bytes primitive as is. This is the first type that is not a primitive in any language to be added to ssz. The non-standard convention of the padding bit / length information practically requires an additional type to be supported.

Its this disconnect between this pseudo-bitfield and the widely-known pattern of bitfield usage that is a little troubling. The switch to thinking about bitfields as "lists of bits" creates different machinery and potentially different semantics than a normal bitfield.

Does the bitlist need to grow/shrink? Does bit access beyond the bitlength but within the bytelength throw an error? What operations does it support? These are new questions that need to be addressed by a 'bitlist' implementor. The lack of specification and unknown future usage of bitlists makes it difficult to build out what amounts to a custom data type library.

I'm curious if there are known tangible benefits from tracking the bitlist length? I understand the typing is more robust, but why do we need/want it within ssz? What will we need to be doing with bitlists that we can't be doing with bitfields/bytes in our application?

All 17 comments

+1 on this. Although it adds further readability on perhaps the Python implementation, it pushes further complexity on languages, such as Go, which already have a harder time implementing SSZ due to a lack of generics. Having bitlists and bitvectors as primitives as @prestonvanloon mentioned also make implementation assumptions within SSZ that limit us from using the tooling we use in Prysm with other encoders. We kindly ask researchers to reconsider this change and possibly revert in favor of keeping SSZ simple.

Since I was the driver behind this change I am very happy to answer any questions about it. The way I understand it, the main criticism is this:

The length enforcement should be handled in the application layer rather than the serialization layer.

It was the main intention of this PR to have the length check in the SSZ layer. The reason is that we came across a number of these bitfields in different places, and it seemed that it would be much more natural to have this length check at a lower layer. We think it does increase readability of the spec.

You could encode any data structure with a bitfield to JSON or others where the underlying bytes array would naturally be appropriately represented in the same way. We have to treat some byte arrays different than other byte arrays and this adds unnecessary complexity.

I don't understand this part, could you clarify? My recommendation is that you just don't treat bitfields as bytes, but as a separate object, that is just a list of booleans with a special way of serializing it. That should be possible in any implementation, even if you serialize to yaml or JSON?

Here are some answers to your question in the Hackmd:

This appears to have moved from the phase 0 spec into the custody game spec for phase 1. It also appears to have been renamed to get_bit.

This is just temporary because there will be larger changes to the phase 1, and we will make it completely use the SSZ bitfields. There will be no more need for get_bit, it will all be handled by SSZ.

What isn鈥檛 clear is how does ssz know what the committee size is? Removing this validation loosens overall validation on justification and finalization during epoch processing. I recommend client implementors keep this validation and we restore it in the specification.

SSZ does not know the committee size, but it now knows the bitfield size. There was a bug where this size was not checked, which has now been fixed by Justin in #1264 .

Bitlists with padding bit seems like a change we don鈥檛 need given the added cost and no clear benefit.

The benefit is that we have a well defined length of every bitfield object.

I don't understand this part, could you clarify? My recommendation is that you just don't treat bitfields as bytes, but as a separate object, that is just a list of booleans with a special way of serializing it. That should be possible in any implementation, even if you serialize to yaml or JSON?

I'll try to come up with a contrived example.

Imagine that we have a bitlist of length 8 where the data is 0b00111111.
Also imagine we have a data structure that exists of primitive types.

type Data struct {
   Num uint64      `json:"num"`
   Bitfield []byte `json:"bitifeld"`
   Name string     `json:"name"`
}

Now if we look at the JSON representation of this object:

{
   "num": 5,
   "bitfield": "8w==", // Or 0xF3 if you prefer
    "string": "hello"
}

This JSON can populate the go struct without any special handling or processing. It just works.

Before #1226, the SSZ representation of the bitfield would be 0xF3 as well. These data representations are interchangeable between the two.

Now, you have this padding bit metadata. So the bitlist would be serialized to 0x3F01 in SSZ. If we were to put this same data in JSON and use that data to populate our primitive []byte field, we would have data that looks like

[]byte{0b00111111, 0b00000001} 

So now we have to do a second round a processing this data.

  1. Populate the structure with the data from JSON
  2. Then go back and clean up any bitfields, removing the length bit.

Likewise with marshaling to JSON and remain compatible with SSZ primitives

  1. Edit any bitfields to add the padding bit
  2. Marshal to JSON

An alternative is to change our struct to

type Data struct {
   Num uint64      `json:"num"`
   Bitfield struct{
      Data []byte `json:"data"`
      Length uint64 `json:"length"`
   } `json:"bitifeld"`
   Name string     `json:"name"`
}

But now our JSON object from earlier doesn't map so we have to change that representation to the following:

{
   "num": 5,
   "bitfield": {
       "data": "8w==", // Or 0xF3 if you prefer
        "length": 8,
     },
    "string": "hello"
}

SSZ does not know the committee size, but it now knows the bitfield size.

SSZ doesn't need to know either of these things. The bitlist is an unbounded/dynamic array of bytes in ssz. I offered a solution in the original issue description for handling expected bitlist lengths that were not a multiple of 8.

The benefit is that we have a well defined length of every bitfield object.

We already knew what to expect for bits length in processing.

I strong urge you to reconsider this change. It may improve readability in python but comes with a non-trivial increase in complexity.

What if you just have your bitfield Bitfield []byte already be in the serialized SSZ format? Then all the information would be contained in this data structure.

SSZ doesn't need to know either of these things.

The argument here is that SSZ is a strictly better typing library if it handles lengths of lists safely, i.e. it is aware of the exact length of a list and can catch an out of bounds access. This change makes bitlists behave in exactly the same way any other kind of lists behave. It pushes a very slight bit of complexity into the SSZ layer, and removes a larger amount of complexity from the application layer.

While I do think it is natural to enforce bitlist lengths in consensus via Merkleisation length mix-ins, there is wiggle room to experiment with alternative serialisations if the currently suggested serialisation is too painful to deal with. Indeed, the consensus part of phase 0 (deposit contract, state transition function, fork choice rule) is agnostic to serialisation. In other words, serialisation is an implementation detail not part of the remit of the spec freeze.

Indeed, the consensus part of phase 0 (deposit contract, state transition function, fork choice rule) is agnostic to serialisation

I would think alternative serializations would have an impact to changing state transition function. Ex: if we don't enforce length check in the SSZ layer, state transition will now need to check block body's operation lengths, validator length for shuffling...etc

Go is somewhat painfull here due to the lack of static-type parametrization, but you can work around it reasonably well. I.e. you keep the leading bit in memory for bitlists to derive the length from. And for bitvectors you declare it as byte array (fixed size) and put a Count method on it to tell the bit length. Your SSZ lib can instantiate a default value, and derive the length from there, since there are no type/class/static-methods in Go to work with. Or you make Count run on a pointer receiver, and you can use a typed nil to call the bit-count function on.

Here's a POC: https://play.golang.org/p/fiCVRA5_1jS

This should enable you to read it from hex/yaml, like any other []byte and [N]byte. The functions attached to the type will handle the logic, no constructor or encode-decode overrides necessary. Go sure is different than Python, JS, etc.

Bitvectors cost you a few lines to define a different length, but there's simply no generics/parametrization/inheritance to do it with. And it's not a struct, so no embedding either. But this works well enough, although Go certainly is not "pretty" here.

Carelessly being able to convert from Bitvector[:] to []byte to Bitlist is the benefit however, and SSZ serialization/deserialization is basically free (copy over bytes in one go, check length, and done).

Hi @dankrad

What if you just have your bitfield Bitfield []byte already be in the serialized SSZ format? Then all the information would be contained in this data structure.

This is highly non-standard and enforces one more cognitive leap for someone attempting to read our serialization code + use our libraries. It adds more "gotchas" that shouldn't exist in the first place. I can't think of an encoding library that would do this. We certainly would _never_ expect the canonical Go JSON encoder to enforce that a list of something never exceeds 10 elements, and for good reason.

@protolambda

Bitvectors cost you a few lines to define a different length, but there's simply no generics/parametrization/inheritance to do it with. And it's not a struct, so no embedding either. But this works well enough, although Go certainly is not "pretty" here.

Why do we need to work around that? Why not just handle it at the application layer in the first place? Aside from being quite messy and difficult to grok, using something akin to that Go playground would add large complexity to our SSZ and lack of readability. That also makes assumptions about the structure of our SSZ library. It is unreasonable to need to do that sort of logic for something that could have been avoided in the first place.

Our point is less about the difficulty of doing it in Go and more that this adds significant constraints to our tooling and forces SSZ to make assumptions that other serialization libraries would never enforce given they are supposed to be agnostic to use cases.

When approaching SSZ, we sought out to build it following the mindset and API's of other popular, agnostic serialization libraries that focus simplicity and do not take on application layer or use-case complexity. This change means we have to use SSZ for everything, and encourages lock-in, as it is what does length enforcement instead of it living in the application layer.

On behalf of our entire team and keeping in mind the goals of SSZ as a generic, agnostic, and _simple_ serialization library, we kindly ask to please reconsider this change, as v0.7.1 SSZ maintained a higher level of simplicity and a more robust, agnostic specification with respect to these primitives than what we see in v0.8's design decisions.

My recommendation is that you just don't treat bitfields as bytes, but as a separate object, that is just a list of booleans with a special way of serializing it.

I think part of the friction is around the fact that practically speaking, any performant implementation is going to want to represent a bitlist as bytes, like a bitfield. But by imbuing a bitlist with its length information, we can no longer just use the bytes primitive as is. This is the first type that is not a primitive in any language to be added to ssz. The non-standard convention of the padding bit / length information practically requires an additional type to be supported.

Its this disconnect between this pseudo-bitfield and the widely-known pattern of bitfield usage that is a little troubling. The switch to thinking about bitfields as "lists of bits" creates different machinery and potentially different semantics than a normal bitfield.

Does the bitlist need to grow/shrink? Does bit access beyond the bitlength but within the bytelength throw an error? What operations does it support? These are new questions that need to be addressed by a 'bitlist' implementor. The lack of specification and unknown future usage of bitlists makes it difficult to build out what amounts to a custom data type library.

I'm curious if there are known tangible benefits from tracking the bitlist length? I understand the typing is more robust, but why do we need/want it within ssz? What will we need to be doing with bitlists that we can't be doing with bitfields/bytes in our application?

This is tricky because SSZ implicitly assumes that there are two levels of "staticness of length" of a type: (i) known at "compile time", (ii) not known until you receive the object. Case (i) corresponds to Vector[foo, n], case (ii) corresponds to List[foo,n]. But attestations are a strange third type, where the length is NOT known at compile time, but it's also known before we get the list, because it's proportional to the length of the committee size. So a variable length type is necessary, but we don't technically _have to_ specify the exact length. So this specific kind of dynamicness that we see with attestations is worst-case for SSZ.

I could foresee usecases in the future where the fully dynamic bitlist is valuable; representing Merkle paths in SSZ objects is one big example.

I'm personally okay with going back to bitfields being an application-layer concept if desired; it's a small change either way.

Pros:

  • ~None?~
  • Spec simplicity (depending on your perspective)
  • Ability to populate a clear bitlist abstraction from the encoding data.

Cons:

  • When len(bitfield) % 8 == 0, the encoding size increases by 1 byte
  • The expected length can be inferred at runtime for verification already
  • Including the bitlist length in ssz encoding requires workarounds
  • Most implementations would verify the expected length in runtime processing at the application layer in addition to decoding any ssz container
  • Increased complexity of ssz libraries

~I think its a clear win to revert this, especially if its a small change.~ Let me know what you think.

Edit: Updated with a nice pro after chatting with @dankrad on the phone. It's not such a clear win now :)

if we don't enforce length check in the SSZ layer, state transition will now need to check block body's operation lengths, validator length for shuffling...etc

Right, you can do that if you want :) The spec freeze does not prevent implementers from not enforcing the length check in their (de)serialisation layer, and instead pushing the check at a higher layer to keep the abstract state transition equivalent and maintain compatibility with the frozen spec. It's an implementation detail.

Pros:

* None?

* Spec simplicity (depending on your perspective)

  • Consistency with other kinds of lists (all of them have a clearly defined length, bitlist is the exception here -- why should a programmer have to think different about bitlists compared to other kinds of lists?)
  • Better readability of the spec (and also any other program that makes use of SSZ+bitlists in the future)
  • Type safety (In this particular case, the length is known from another place, but in general, it is preferrable that all types come with the same kind of guarantees)
  • assert len(bitlist) == expected_length is easier to read and reason about than verify_bitlist(...)

Cons:

* When len(bitfield) % 8 == 0, the encoding size increases by 1 byte

This difference is extremely small (one bit on average!)

* The expected length can be inferred at runtime for verification already

That's not a con, that's a neutral one. It only applies in this case.

* Including the bitlist length in ssz encoding requires workarounds
* Most implementations would verify the expected length in runtime processing at the application layer in addition to decoding any ssz container
* Increased complexity of ssz libraries

I think the increase in complexity here is a bit overblown :)

I think its a clear win to revert this, especially if its a small change. Let me know what you think.

Well, I think your list of pros and cons is a bit biased. I think we're all willing to engage in the discussion here, but please let's keep it at the technical layer.

@prestonvanloon Would you be happier if every Bitlist[N] was replaced by Container{bits: Bitvector[N], length: uint256}, with the length check done in the state transition function? :)

I just had a quick chat with @prestonvanloon about this. Currently, the solutions available to us seem:

  1. Go with @protolambda's POC, which basically always uses the serialized version in memory. Then Bitlists can still be represented by bytes without a problem with naturally using yaml/json libraries.
  • In this case, the SSZ libraries should still present the appropriate methods for getting/setting bits and for getting the length, with the corresponding safety in terms of length checking.
  • We might want to consider to change Merkleizationif we go this way, because currently, the length checking bit is not added for Merkleization [Disadvantage: a bitlist of max length 2^n could have a bitlength of 2^n+1, needing 2^(n+1) bits at the bottom of the Merkle tree effectively...]
  1. @JustinDrake's suggestion; another possibility would be to use this as the representation of the Bitlist inside SSZ, thus getting rid of the leading bit

  2. Removing Bitlists entirely and going back to the status before the change, with separate Bitlist checks inside the spec

Thanks for the recap @dankrad. The offline discussion was much needed.

The primary motivation for using a bitlist abstraction is that the bit getters/setters are much more intuitive than bitwise operations. Different languages may represent and maintain the underlying data in different ways, i.e. python might be a list of booleans while others might use some container with bytes / length or use the ssz encoded format with the length bit present.

The trade off is clear: you cannot simply construct this bitlist abstraction _without_ the length information.

With that said, our original position was option 3 as we did not fully understand the trade offs. Now, we are leaning more towards option 1 where we can keep the length information in the optimized form as suggested by @dankrad with an abstraction pattern suggested by @protolambda or a backing container in our implementation as @JustinDrake suggested.

I'll go ahead and close this.

Happy to reopen if anyone has any further ideas or comments. Thanks all!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mratsim picture mratsim  路  4Comments

mratsim picture mratsim  路  5Comments

paulhauner picture paulhauner  路  4Comments

djrtwo picture djrtwo  路  3Comments

dangerousfood picture dangerousfood  路  5Comments