Open an issue for following up our discussion on gitter.
Anything else? :)
cc @vbuterin @djrtwo @arnetheduck @mratsim @paulhauner @NatoliChris @poseyy
Repost the gitter discussion about integer types:
size % 8 == 0?2^p. int128 and int256 works but int40 or int24 don’t. That’s not super blocking as we can always represent them as an array of bytes or packed uint8 + uint16. i.e. in memory it’s just easier to use uint32 constrained to a certain range for uint24 for example. For serialization we would use an array of bytes.int / uint: Signed and unsigned integers of various sizes. Keywords uint8 to uint256 in steps of 8 (unsigned of 8 up to 256 bits) and int8 to int256. But don’t use this: uint and int are aliases for uint256 and int256, respectively. <— that’s too confusing in our use cases.Good call @hwwhww!
Here's our README: https://github.com/sigp/lighthouse/tree/master/ssz
EDIT: Oh, I see it's already in your list :)
This is great, I love having so much information all in one place. A few quick comments:
Some of the above on uint vs int and explicit types is covered here and will hopefully clear up this small(?) issue.
I'm writing a blog post on the basics of ssz and the rationale/motivation for introducing a new serialization format, I think the biggest issue with something like this is getting people to understand the rationale for a new format/understanding why it's necessary. Cheers.
One thing I did want to question: in RLP the main motivations were simplicity and guaranteed absolute byte-perfect consistency. Obviously simplicity is carried over, it seems like guaranteed absolute byte-perfect consistency is also a part of simple serialize. Can someone confirm this is the case?
I have some archaeological discovery that might shed some light on this issue.
One of our core principles in Ethereum is simplicity; the protocol should be as simple as possible, and the protocol should not contain any black boxes. Every single feature of every single sub-protocol should be precisely 100% documented on the whitepaper or wiki, and implemented using that as a specification (ie. test-driven development).
Buterin, V. 2014 RLP v.s. external technologies
Here is Nim implementation, simple tests and the discussion for a common test format.
One thing that I'm not too sure of regarding current SSZ are serializing container types. For example:
# CrystallizedState
fields = {
# List of validators
'validators': [ValidatorRecord],
# Last CrystallizedState recalculation
'last_state_recalc': 'int64',
# What active validators are part of the attester set
# at what slot, and in what shard. Starts at slot
# last_state_recalc - CYCLE_LENGTH
'shard_and_committee_for_slots': [[ShardAndCommittee]],
# The last justified slot
'last_justified_slot': 'int64',
# Number of consecutive justified slots ending at this one
'justified_streak': 'int64',
# The last finalized slot
'last_finalized_slot': 'int64',
# The current dynasty
'current_dynasty': 'int64',
# Records about the most recent crosslink `for each shard
'crosslink_records': [CrosslinkRecord],
# Used to select the committees for each shard
'dynasty_seed': 'hash32',
# Start of the current dynasty
'dynasty_start': 'int64'
}
# ValidatorRecord
fields = {
# The validator's public key
'pubkey': 'int256',
# What shard the validator's balance will be sent to
# after withdrawal
'withdrawal_shard': 'int16',
# And what address
'withdrawal_address': 'address',
# The validator's current RANDAO beacon commitment
'randao_commitment': 'hash32',
# Current balance
'balance': 'int128',
# Dynasty where the validator is inducted
'start_dynasty': 'int64',
# Dynasty where the validator leaves
'end_dynasty': 'int64'
}
# ShardsAndCommittee
fields = {
# The shard ID
'shard_id': 'int16',
# Validator indices
'committee': ['int24']
}
# CrosslinkRecord
fields = {
# What dynasty the crosslink was submitted in
'dynasty': 'int64',
# What slot
'slot': 'int64',
# The block hash
'hash': 'hash32'
}
Some thoughts in the past 2 days:
Replying to @AlexeyAkhunov https://github.com/ethereum/beacon_chain/issues/94 and https://github.com/ethereum/eth2.0-pm/issues/8
As mentioned by @raulk in https://github.com/ethereum/eth2.0-pm/issues/8 call, Protocol Labs (libp2p) will probably develop specialized Wireshard dissectors for Eth2.0 which should help traffic analysis a lot.
However, I do think type prefixes would be useful at the top level for BeaconBlock, ActiveState, CrystallizedState i.e. treat them as tagged unions/sum types, but probably unneeded for nested types like ShardAndCommittee, CrosslinkRecord, ValidatorRecord, AttestationRecord (unless you serialize and send them as standalone?)
For instance let's look at ActiveState and CrystallizedState
# ActiveState
fields = {
# Attestations that have not yet been processed
'pending_attestations': [AttestationRecord],
# Most recent 2 * CYCLE_LENGTH block hashes, older to newer
'recent_block_hashes': ['hash32']
}
# CrystallizedState
fields = {
# List of validators
'validators': [ValidatorRecord],
# Last CrystallizedState recalculation
'last_state_recalc': 'int64',
# What active validators are part of the attester set
# at what slot, and in what shard. Starts at slot
# last_state_recalc - CYCLE_LENGTH
'shard_and_committee_for_slots': [[ShardAndCommittee]],
# The last justified slot
'last_justified_slot': 'int64',
# Number of consecutive justified slots ending at this one
'justified_streak': 'int64',
# The last finalized slot
'last_finalized_slot': 'int64',
# The current dynasty
'current_dynasty': 'int64',
# Records about the most recent crosslink `for each shard
'crosslink_records': [CrosslinkRecord],
# Used to select the committees for each shard
'dynasty_seed': 'hash32',
# Start of the current dynasty
'dynasty_start': 'int64'
}
To be able to deserialize and distinguish between those efficiently you would need to send or agree on the types beforehand.
Also having type prefixes for top-level types would allow to concat multiple messages in a single one. I don't think we should use type prefixes for 'hash32' or 'address' unless they are serialized stand-alone that would add a lot of bytes especially for types that can appear in a list.
From https://github.com/ethereum/beacon_chain/issues/94 as well
I think we should have 2 bytes at the start to encode SSZ major.minor version
https://github.com/ethereum/beacon_chain/issues/94 and https://github.com/ethereum/eth2.0-pm/issues/8
The argument for moving at the end is because otherwise you would have to iterate on the data structure once to get the length then once again to get feed the data. ==> All languages even Haskell have a vector implementation that tracks length so you wouldn't need the first iteration to get length.
Furthermore putting length at the end means you need specific token to indicate end of data like ‘\0’ in cstring. This is causing a lot of issues in C code also this prevents reading by chunk into a buffer, you would have to read token by token (or read by chunk and backtrack).
As mentioned by @raulk in ethereum/eth2.0-pm#8 call, Protocol Labs (libp2p) will probably develop specialized Wireshard dissectors for Eth2.0 which should help traffic analysis a lot.
A community member has made progress with developing Wireshark dissectors for some basic libp2p protocols, like SecIO (security), Multistream (protocol selection), Yamux (multiplexing), etc. However, the challenge is accessing the session secrets to further decrypt the payload. We've briefly discussed that in this issue: https://github.com/libp2p/specs/issues/46.
Wireshark uses an "onion" approach to dissect frames. It calls dissectors iteratively to decapsulate the different layers in the message. At one point the dissection is going to encounter an application specific protocol in the payload (eth2.0 protocol), and to decrypt it, a Wireshark dissector should be registered for it. It's best if those dissectors are developed and maintained by the Ethereum 2.0 community, as they are application-specific and they'll need to evolve in lockstep with the protocol.
EDIT: the existing dissectors are likely outdated and not exhaustively tested.
EDIT 2: Contributions super welcome. Let’s _buidl_ this tooling together.
@mratsim
All languages even Haskell have a vector implementation that tracks length so you wouldn't need the first iteration to get length.
the argument is that you don't keep all data in memory at all - ie not even in a vector. Consider a linked list - there are two common implementations - one that has O(n) size and one with O(1) - the latter stores size in a central location but this breaks several nice properties of the linked lists.. likewise, if you have the data in a database, you may want to stream it instead of putting it in a vector first - this will make the max-memory requirement of your application go down. Both these things can be solved with a two-pass approach which is sometimes appropriate
Furthermore putting length at the end means you need specific token to indicate end of data like ‘\0’ in cstring. This is causing a lot of issues in C code also this prevents reading by chunk into a buffer, you would have to read token by token (or read by chunk and backtrack).
This depends - if you know the size of the full buffer (as is common when working with a datagram-style protocol, or if like above, you're storing the total size somewhere), you can work backwards from there to figure out the positions of the fields (just like when it's at the front, and you're implicitly assuming the buffer starts at "0").
Based on my understanding, one of the priorities of the format was efficient in-memory access (along with a willingness to pay a price in terms of space efficiency)
I just went through the exercise of hardening our implementation against overruns and misaligned pointers (https://github.com/status-im/nim-beacon-chain/pull/7 - hopefully caught all of them :smile:) - the following things are noteworthy, when writing cross-platform code:
mmap and the like to directly map the data, when loading from a trusted sourceA possible solution to this issue is to introduce a rule to the serializer that ensures that padding is mandatory and deterministic:
For a good example of a well-aligned format, check out https://google.github.io/flatbuffers/flatbuffers_internals.html
It is worth to note that modern X86 processors have mostly solved unaligned access and the performance penalty is small - nonetheless, for several cases of optimized access, even X86 requires aligned data as the optimization level of the compiler gets cranked up - sse4.2 for example, as seen here: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
I was reading through the current spec. Couple of thoughts:
hash96/97 for BLS aggregate public keys is confusing. It is not a hash function in the normal cryptographical sense. (Though admittedly it shares some properties). Why not bytes96 instead?bytesN for fixed size byte strings much like the fixed size integers?len(list) * sizeof(element)), but the rest of the spec and reference implementation support variable sized elements. It is thus not clear whether variable sized list items are supported. This potentially makes a O(n) vs O(1) difference for random indexing performance, so clarity would be appreciated.['uint8'] * len(s). For a given schema there always exists an invalid bytestring (proof: take a valid bytestring and append a byte).@arnetheduck Here's another nice example of a well-designed serialization format meant for fast mmaping: https://capnproto.org/index.html I do think that takes alignment into consideration will necessarily not be 'simple'.
@Recmo
it does deal with alignment: https://capnproto.org/encoding.html
also, flatbuffers has lots of valuable discussion on the topic: https://google.github.io/flatbuffers/flatbuffers_internals.html
@arnetheduck I know it handles alignment, that's why i brought it up. :)
The point I failed to make was that handling alignment and creating a "Simple" format are somewhat conflicting goals.
Let's take discussion of alignment and it's relevance to another issue if necessary
Most helpful comment
A community member has made progress with developing Wireshark dissectors for some basic libp2p protocols, like SecIO (security), Multistream (protocol selection), Yamux (multiplexing), etc. However, the challenge is accessing the session secrets to further decrypt the payload. We've briefly discussed that in this issue: https://github.com/libp2p/specs/issues/46.
Wireshark uses an "onion" approach to dissect frames. It calls dissectors iteratively to decapsulate the different layers in the message. At one point the dissection is going to encounter an application specific protocol in the payload (eth2.0 protocol), and to decrypt it, a Wireshark dissector should be registered for it. It's best if those dissectors are developed and maintained by the Ethereum 2.0 community, as they are application-specific and they'll need to evolve in lockstep with the protocol.
EDIT: the existing dissectors are likely outdated and not exhaustively tested.
EDIT 2: Contributions super welcome. Let’s _buidl_ this tooling together.