Zeronet: Consider replacing MessagePack with Cap'n'proto

Created on 24 Apr 2019 · 4 Comments · Source: HelloZeroNet/ZeroNet

https://capnproto.org/ - more efficient than protobuf

All 4 comments

Looking great, but it was not meant to be a replacement for msgpack, and according to my quick and simple test it's about 3.3 times slower than msgpack and the encoded message is almost 3 times larger:

$ python3 test.py
capnp Encode 1000x1000 peers 40048 in 0.919s
capnp Decode 1000x1000 peers 1000 ['127.0.0.0', 1234] in 1.271s
msgpack Encode 1000x1000 peers 14010 in 0.282s
msgpack Decode 1000x1000 peers 1000 ['127.0.0.0', 1234] in 0.385s

test.py

import os
import capnp
import time

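# Note: the timer starts before the schema is loaded,
# so the capnp encode time below also includes capnp.load().
s = time.time()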

this_dir = os.path.dirname(__file__)
pattern = capnp.load(os.path.join(this_dir, 'peer.capnp'))

for t in range(1000):
    req  = pattern.PeerList.new_message()
    peers = req.init('peers', 1000)
    for i in range(0, 1000):
        peers[i].ip = "127.0.0.0"
        peers[i].port = 1234

    req_data = req.to_bytes()

print("capnp Encode 1000x1000 peers", len(req_data), "in %.3fs" % (time.time() - s))

s = time.time()
for t in range(1000):
    decoded = pattern.PeerList.from_bytes(req_data)
    found = []
    for peer in decoded.peers:
        found.append([peer.ip, peer.port])

print("capnp Decode 1000x1000 peers", len(found), found[0], "in %.3fs" % (time.time() - s))


import msgpack
s = time.time()
for t in range(1000):
    req = {"peers": []}
    for i in range(1000):
        req["peers"].append(["127.0.0.0", 1234])

    req_data = msgpack.packb(req, use_bin_type=True)

print("msgpack Encode 1000x1000 peers", len(req_data), "in %.3fs" % (time.time() - s))


s = time.time()
for t in range(1000):
    decoded = msgpack.unpackb(req_data, raw=False)
    found = []
    for peer in decoded["peers"]:
        found.append([peer[0], peer[1]])

print("msgpack Decode 1000x1000 peers", len(found), found[0], "in %.3fs" % (time.time() - s))

Looks like the number of messages is way bigger: 40k vs 14k
capnp Encode 1000x1000 peers 40048 in 0.919s
msgpack Encode 1000x1000 peers 14010 in 0.282s

That's the size in bytes of the generated message for 1000 ip/port pairs, not a message count.
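The 14010 figure for msgpack can be reproduced by hand from the MessagePack format rules. The following is only a sketch of that arithmetic, assuming the encoder picks the most compact representation for each value (fixmap, fixstr, array 16, uint 16); the helper names are made up for illustration and are not part of the msgpack library.

```python
# Byte counts for the payload built in test.py:
#   {"peers": [["127.0.0.0", 1234], ... 1000 entries ...]}

FIXMAP_HEADER = 1  # a map with up to 15 entries needs a single header byte


def fixstr_size(s: str) -> int:
    """Strings of up to 31 bytes: 1 header byte plus the UTF-8 bytes."""
    data = s.encode("utf-8")
    if len(data) > 31:
        raise ValueError("longer strings use str 8/16/32, not modelled here")
    return 1 + len(data)


def uint_size(n: int) -> int:
    """Smallest unsigned-integer encoding that can hold n."""
    if n < 0x80:
        return 1  # positive fixint
    if n <= 0xFF:
        return 2  # uint 8
    if n <= 0xFFFF:
        return 3  # uint 16
    if n <= 0xFFFFFFFF:
        return 5  # uint 32
    return 9      # uint 64


def array_header_size(count: int) -> int:
    """fixarray (up to 15 items), array 16, or array 32 header."""
    if count <= 15:
        return 1
    if count <= 0xFFFF:
        return 3
    return 5


def peer_size(ip: str, port: int) -> int:
    """One [ip, port] pair encoded as a two-element array."""
    return array_header_size(2) + fixstr_size(ip) + uint_size(port)


def message_size(n_peers: int, ip: str, port: int) -> int:
    """Whole message: map header, the "peers" key, list header, then the pairs."""
    return (FIXMAP_HEADER
            + fixstr_size("peers")
            + array_header_size(n_peers)
            + n_peers * peer_size(ip, port))


per_peer = peer_size("127.0.0.0", 1234)
total = message_size(1000, "127.0.0.0", 1234)
print(per_peer, total)  # 14 14010
```

Each pair costs 14 bytes (1 for the array header, 10 for the string, 3 for the port), and the remaining 10 bytes are the outer map and list headers.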

Yeah, there are a few reasons why this test turns out to favor msgpack:

  • Normally msgpack suffers from the fact that field names are encoded on the wire. However, the bulk of the data in this test is ip/port pairs, which you've chosen to encode as two-element lists rather than as objects with named fields. This seems to be a hand-rolled optimization for msgpack that makes the code less readable (admittedly not hugely so in this particular case). Without this optimization, msgpack would be closer to Cap'n Proto, maybe larger.
  • Meanwhile, the data contains lots of small strings (IP addresses). This is particularly bad for Cap'n Proto, because any variable-width value needs to be represented as an 8-byte pointer pointing to a separate object whose size is itself rounded up to a multiple of 8. The IP address "127.0.0.1" is 10 bytes (with NUL terminator) which rounds up to 16. The Peer struct itself appears to contain the 8-byte pointer plus a UInt16 port number which again rounds up to 16 bytes. So you end up with 32 bytes for each ip/port pair, compared with a theoretical best of 12.
  • OTOH, if you were to encode the IPv4 address as a 32-bit integer rather than a string, then the ip/port pair would be only 8 bytes under Cap'n Proto, probably beating msgpack by a decent margin.
  • With regard to speed, the Cap'n Proto Python implementation is not as heavily optimized as the C++ implementation. In C++, Cap'n Proto would almost certainly be faster (even with the larger encoding). But that doesn't help you, obviously, if your project is Python. :/
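
The size arithmetic in the points above can be checked with a few lines of Python. This is only a sketch of the 8-byte rounding described in the comment, not a Cap'n Proto encoder. The Peer layout (one Text pointer plus a UInt16 port) is assumed from the comment, since peer.capnp is not shown, and all function names are hypothetical.

```python
import ipaddress

WORD = 8  # Cap'n Proto allocates objects in 8-byte words


def round_up_to_word(n_bytes: int) -> int:
    """Round a byte count up to the next multiple of 8."""
    return (n_bytes + WORD - 1) // WORD * WORD


def capnp_peer_bytes_text_ip(ip: str) -> int:
    """Assumed Peer layout: Text pointer + UInt16 port, plus the text object."""
    text_object = round_up_to_word(len(ip.encode("ascii")) + 1)  # + NUL byte
    peer_struct = round_up_to_word(WORD + 2)  # 8-byte pointer + 2-byte port
    return text_object + peer_struct


def capnp_peer_bytes_uint32_ip() -> int:
    """Alternative layout: UInt32 address + UInt16 port, no pointer at all."""
    return round_up_to_word(4 + 2)


def msgpack_peer_bytes_named_fields(ip: str) -> int:
    """msgpack size of {"ip": ip, "port": port} instead of [ip, port].

    Assumes ip fits in a fixstr (at most 31 bytes) and the port needs uint 16.
    """
    return 1 + (1 + len("ip")) + (1 + len(ip)) + (1 + len("port")) + 3


def ipv4_to_uint32(ip: str) -> int:
    """Dotted-quad string to the 32-bit integer a UInt32 field would hold."""
    return int(ipaddress.IPv4Address(ip))


def uint32_to_ipv4(value: int) -> str:
    """Inverse of ipv4_to_uint32."""
    return str(ipaddress.IPv4Address(value))


print(capnp_peer_bytes_text_ip("127.0.0.1"))         # 32
print(capnp_peer_bytes_uint32_ip())                  # 8
print(msgpack_peer_bytes_named_fields("127.0.0.1"))  # 22
print(ipv4_to_uint32("127.0.0.1"))                   # 2130706433
```

Under these assumptions the string-based layout costs 32 bytes per pair and the integer-based one 8, matching the figures in the comment, while msgpack with named fields would need 22 bytes per pair instead of 14.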