Kotlinx.serialization: Serialize a ByteArray in a ProtoBuf

Created on 13 Dec 2017 · 19 Comments · Source: Kotlin/kotlinx.serialization

In general, Protocol Buffers support storing raw data. However, this currently fails. For example:

@Serializable class ProtoClass(@SerialId(id = 1) val value: ByteArray)

throws

kotlinx.serialization.SerializationException: Any type is not supported

Storing a ByteArray doesn't seem to be possible with the current serializer API, is it? That is, KOutput doesn't have a method to write a ByteArray.

design

All 19 comments

Yes, native array types were not included in the prototyping phase of this library for the sake of simplicity. There is a design debate on how to implement them. One option is to create additional serializers (like ArrayListSerializer, MapSerializer, etc.); this is easy to add, but each element of the array would be written separately, which may lead to poor performance. The other is to create additional functions in KInput/KOutput (writeIntArray alongside writeInt, and so on), which can exploit the packed nature of an array; however, adding a lot of primitive functions will bloat the interface and make it even harder to implement.

Currently, we try to separate low-level interfaces, in which you can override every function for performance, from high-level interfaces, that can be easily implemented to prototype new formats. Of course, this question would also be considered there.

I see this problem. But my problem is slightly different. In Protocol Buffer a byte array (a string of bytes) is a "native" type:
https://developers.google.com/protocol-buffers/docs/proto3#scalar

While storing a ByteArray using JSON is not directly possible, it is natively supported in Protocol Buffer and CBOR. To make the full power of Protocol Buffer and CBOR available I think it would be good to add a writeByteArray method to the interface. (For JSON you could provide a default implementation that serializes the ByteArray to a base64 encoded string...)

Currently, the inability to store a raw ByteArray in a Protocol Buffer prevents me from using the serialization package: I don't want to waste space by encoding the ByteArray into a string.
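The proposal above (a writeByteArray method with a base64 default for text formats) can be sketched as follows. This is a minimal illustration, not the library's API: `ByteAwareOutput`, `JsonLikeOutput`, and `BinaryLikeOutput` are hypothetical names, the single-byte length prefix is an arbitrary simplification, and `java.util.Base64` assumes a JVM target.

```kotlin
import java.io.ByteArrayOutputStream
import java.util.Base64

// Hypothetical, simplified stand-in for KOutput; not the real interface.
interface ByteAwareOutput {
    fun writeString(value: String)

    // Default for formats without a native byte type: fall back to a base64 string.
    fun writeByteArray(value: ByteArray) {
        writeString(Base64.getEncoder().encodeToString(value))
    }
}

// A text format that simply keeps the default.
class JsonLikeOutput : ByteAwareOutput {
    val out = StringBuilder()
    override fun writeString(value: String) {
        out.append('"').append(value).append('"')
    }
}

// A binary format that overrides the default and stores the raw bytes
// (one length byte followed by the payload, to keep the sketch short).
class BinaryLikeOutput : ByteAwareOutput {
    val out = ByteArrayOutputStream()
    override fun writeString(value: String) = writeByteArray(value.toByteArray(Charsets.UTF_8))
    override fun writeByteArray(value: ByteArray) {
        require(value.size < 128) { "sketch only handles short arrays" }
        out.write(value.size)
        out.write(value)
    }
}

fun main() {
    val json = JsonLikeOutput().apply { writeByteArray(byteArrayOf(1, 2, 3)) }
    val binary = BinaryLikeOutput().apply { writeByteArray(byteArrayOf(1, 2, 3)) }
    println(json.out)                       // base64 text, 6 characters including quotes
    println(binary.out.toByteArray().size)  // 4 bytes: length prefix plus payload
}
```

The binary variant spends one extra byte on the length, while the text fallback grows by roughly a third because of the base64 encoding, which is the space cost the comment above wants to avoid.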

Has any design decision been made on adding writeByteArray, writeIntArray, and others as functions in KInput/KOutput vs creating new serializers?

I prototyped adding write/readByteArray functions to KInput/KOutput, and that went pretty well. For arrays of the other types I had a few thoughts:

  1. Only add write/readByteArray to KInput/KOutput, but provide serializers for IntArray and other types that delegate to writeByteArray. They can expand each value into multiple bytes in the ByteArray. This would be more efficient than converting to a string or writing each element independently with only a new serializer, but perhaps less efficient if some formats have optimizations for arrays other than bytes.
  2. Add functions for all array types, but provide default implementations which delegate to writeByteArray as above.
  3. Only add write/readByteArray and punt on other types. This allows users to implement their own serializers for the other types, which can delegate to the ByteArray functions for performance/size, and it only ties the framework to ByteArray.

If adding write/readByteArray to KInput/KOutput sounds okay, I can clean up my prototype and put up a PR for it.

@jaccozilla the current vision of this problem is the following:

Adding a lot of new methods to KOutput is undesirable, but there is a need for standard serializers for ByteArray, IntArray, etc. that are discoverable by the plugin. Then, if a format has specific support for byte arrays, it should intercept writeSerializableElementValue, check that the serializer there is ByteArraySerializer, and write the object accordingly.

The idea of adding a single writeByteArray method is probably not bad, but delegating to it from IntArray and the others would probably be cumbersome and inefficient (because you would likely need to copy the whole array).
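To make the copying concern concrete, here is a rough sketch of what delegating an IntArray to a byte-array writer would involve. The helper names `intsToBytes` and `bytesToInts` are hypothetical, and the little-endian byte order is an arbitrary choice made for this example; `java.nio.ByteBuffer` assumes a JVM target.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical helper: packs an IntArray into a freshly allocated ByteArray.
// Every element is copied once, and the result is four times the element count.
fun intsToBytes(values: IntArray): ByteArray {
    val buffer = ByteBuffer.allocate(values.size * Int.SIZE_BYTES).order(ByteOrder.LITTLE_ENDIAN)
    for (v in values) buffer.putInt(v)
    return buffer.array()
}

// The reverse direction needs a second copy on the reading side.
fun bytesToInts(bytes: ByteArray): IntArray {
    require(bytes.size % Int.SIZE_BYTES == 0) { "length must be a multiple of 4" }
    val buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    return IntArray(bytes.size / Int.SIZE_BYTES) { buffer.getInt() }
}

fun main() {
    val packed = intsToBytes(intArrayOf(1, 256, -1))
    println(packed.size)                        // 12 bytes for 3 ints
    println(bytesToInts(packed).joinToString()) // round-trips to the original values
}
```

Both directions allocate a full intermediate array before the format sees any data, which is the overhead the comment above refers to.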

Chatted with @sandwwraith on slack a little, here are the requirements I'm hearing

  • Standard serializers for all primitive array types (ByteArray, IntArray, etc.) which are discoverable by the plugin.
  • Uptake in new formats should be easy. Keep the number of methods added to KInput/KOutput to a minimum.
  • It should be obvious to users that there is an efficient way to write ByteArrays.
  • Performance should be good. Take advantage of any format's built-in support for ByteArrays. Don't add a bunch of array copies.

Here are some approaches, with my thoughts on how they meet the requirements.

Don't add any new methods to KInput/Output, check for known array serializers in each format's writeSerializableValue

Do something like the following in formats which have support for writing primitive arrays

override fun <T : Any?> writeSerializableValue(saver: KSerialSaver<T>, value: T) {
    if (saver === ByteArraySerializer) doSomething(value) else saver.save(this, value)
}

Benefits | Negatives
------------ | -------------
Formats can implement whatever optimizations they want | Implementing a new format requires copy/paste of the case statements
No changes required to plugin | When writing custom serializers it is not obvious that there is an efficient way to write a primitive array.

Add new method for each primitive array to KInput/Output

Benefits | Negatives
------------ | -------------
Provides good visibility that there are efficient ways to write arrays | Adds a lot of new methods to KInput/Output that are required when implementing a new format
Allows format specific optimizations for all array types | Requires changes to plugin
Formats may optimize all array types |

Add writeByteArray, delegate others array types to it

(https://github.com/jaccozilla/kotlinx.serialization/tree/readWriteByteArray)

Benefits | Negatives
------------ | -------------
Only adds a single method to KInput/Output | Not efficient and would require copying primitive arrays into a ByteArray to be written
Still provides decent visibility for writing arrays. But not obvious what to do when writing non-byte arrays. | Prevents format specific optimizations for different array types (such as protobuf's packed https://developers.google.com/protocol-buffers/docs/encoding#packed)

Add writePrimitiveArray method which takes a view into a primitive array, delegate all array serializers to it

Instead of adding a method which takes a ByteArray and delegating to it, or adding a new method per type, take some interface which is a view into a primitive array. See
https://github.com/jaccozilla/kotlinx.serialization/tree/readWritePrimitiveArray for a POC. That adds the following sealed class. It is sealed because it should be a 1-to-1 mapping to primitive array types, and so that format implementations can do an easy when on the type:

sealed class PrimitiveArrayView<T : Number> {

    abstract val size: Int
    abstract operator fun iterator(): Iterator<T>

    class ByteArrayView(val array: ByteArray) : PrimitiveArrayView<Byte>() {
        override val size = array.size
        override operator fun iterator(): ByteIterator = array.iterator()
    }

    // other types

    companion object {
        fun adapt(array: ByteArray): ByteArrayView = ByteArrayView(array)
        // other types
    }
}

Then a primitive array serializer would look something like:

object ByteArraySerializer : KSerializer<ByteArray> {
    override val serialClassDesc: KSerialClassDesc = PrimitiveDesc("kotlin.ByteArray")

    override fun save(output: KOutput, obj: ByteArray) = output.writePrimitiveArrayValue(PrimitiveArrayView.adapt(obj))
    override fun load(input: KInput): ByteArray = (input.readPrimitiveArrayValue(Byte::class) as PrimitiveArrayView.ByteArrayView).array
}

Benefits | Negatives
------------ | -------------
Only adds a single method to KInput/Output | Read method takes generics to know which primitive type to read out, similar to reading an enum.
Still provides decent visibility for writing arrays, but requires callers to wrap their arrays into the view | All format implementations will have similar when statements to handle the different types
Prevents having to copy an IntArray into a ByteArray, but does add some extra object creation for every invocation |
Allows format specific optimization of array types |
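The "similar when statements" in each format could look roughly like the sketch below. It re-declares a cut-down, non-generic version of the view with only two variants so that it is self-contained; `ArrayView` and `SketchFormat` are hypothetical names, and the log strings merely stand in for real output.

```kotlin
// Simplified re-declaration of the view idea, with two variants only.
sealed class ArrayView {
    abstract val size: Int

    class Bytes(val array: ByteArray) : ArrayView() {
        override val size: Int get() = array.size
    }

    class Ints(val array: IntArray) : ArrayView() {
        override val size: Int get() = array.size
    }
}

// Hypothetical format: bytes are written as one block, ints element by element.
class SketchFormat {
    val log = mutableListOf<String>()

    fun writePrimitiveArrayValue(view: ArrayView) {
        // The sealed hierarchy lets the compiler check that every array type is handled.
        when (view) {
            is ArrayView.Bytes -> log.add("bytes[${view.size}] as one block")
            is ArrayView.Ints -> view.array.forEach { log.add("int $it") }
        }
    }
}

fun main() {
    val format = SketchFormat()
    format.writePrimitiveArrayValue(ArrayView.Bytes(byteArrayOf(1, 2)))
    format.writePrimitiveArrayValue(ArrayView.Ints(intArrayOf(7, 8)))
    format.log.forEach(::println)
}
```

Each call allocates one small wrapper object, but the underlying array is passed by reference and never copied.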

As a thought (I haven't really deeply considered all aspects of it), there is yet another way: use a higher-order function (reading is not provided in the example, but is analogous):

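open class KOutput {
//...
    // Returns a format-specific writer for the given type, or null if there is none (the default).
    open fun <T : Any> optimizedWriter(type: KClass<T>): ((T) -> Unit)? = null
    inline fun <reified T : Any> optimizedWriter() = optimizedWriter(T::class)
    // Syntactic sugar to avoid client complexity: use the optimized writer if present, else the generic path.
    fun <T : Any> writeOptimizedOrDefault(type: KClass<T>, obj: T) {
        optimizedWriter(type)?.invoke(obj) ?: writeValue(obj)
    }
//...
}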

object ByteArraySerializer: KSerializer<ByteArray> {
    override fun save(output:KOutput, obj: ByteArray) {
        output.optimizedWriter<ByteArray>()?.invoke(obj) ?: output.writeValue(obj)
    }
}

Benefits | Drawbacks
------------ | -------------
Adds only a single method | Naive implementations require a class per specialised writer (this could be encapsulated in object state, like the output has to be)
Client knows about optimized writing and can reuse it | Requires lookup on serialization (perhaps invokedynamic can be used to optimize this: easy for generated code, harder for custom code)
Not restricted to arrays | Higher-order function
Does not leak/require type specific details |
Does not require reflection |

@pdvrieze I implemented your suggestion at https://github.com/jaccozilla/kotlinx.serialization/tree/optimizedWriter. I think it does a nice job of leaving the option to add future optimized serialization impls without having to change the API. I don't know anything about generating code or how it would fit into the compiler plugin.

But I don't think this approach will be easily usable for someone trying to write their own custom serializers. You really need to know beforehand that the specific type you are trying to write has an optimized impl. If you know that, then you might as well just call writeSerializableValue with its provided serializer. But maybe that is okay? With writePrimitiveArray it is pretty obvious from the interface that there is a better way to write arrays.

My suggestion was really from the perspective of what can be used well from generated serializers. But non-generated ones can use this as well. The decision whether or not something has an optimized storage is completely left to the format specific code rather than the type specific code. If you add the syntactic sugar it can even use optimized code automatically. The syntactic sugar version could be an extension function.

I had a look at your code. In the implementation of optimizedReader/writer you are using an unneeded extra lambda. You could just directly look up the readers/writers from the map by storing them in the map (that map is instance level so can hold the needed link to the actual serializer)
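The suggestion to store the writers directly in an instance-level map might look something like this. It is only a sketch under assumptions: `RegistryOutput` and `BinaryRegistryOutput` are hypothetical stand-ins for KOutput and a concrete format, and the log entries replace real encoding.

```kotlin
import kotlin.reflect.KClass

// Hypothetical, simplified output; not the library's KOutput.
open class RegistryOutput {
    val log = mutableListOf<String>()

    // Instance-level registry: a format registers writers for the types it can store natively.
    protected val writers = mutableMapOf<KClass<*>, (Any) -> Unit>()

    open fun writeValue(obj: Any) {
        log.add("generic write")
    }

    // Direct map lookup, with no extra lambda wrapped around the stored writer.
    fun <T : Any> optimizedWriter(type: KClass<T>): ((T) -> Unit)? = writers[type]

    fun <T : Any> writeOptimizedOrDefault(type: KClass<T>, obj: T) {
        val writer = optimizedWriter(type)
        if (writer != null) writer(obj) else writeValue(obj)
    }
}

class BinaryRegistryOutput : RegistryOutput() {
    init {
        writers[ByteArray::class] = { value -> log.add("raw bytes x${(value as ByteArray).size}") }
    }
}

fun main() {
    val out = BinaryRegistryOutput()
    out.writeOptimizedOrDefault(ByteArray::class, byteArrayOf(1, 2, 3)) // takes the registered writer
    out.writeOptimizedOrDefault(String::class, "x")                     // falls back to writeValue
    out.log.forEach(::println)
}
```

Because the map belongs to the output instance, the stored lambdas can capture the format's own state, which is the point made above about the map holding the link to the actual serializer.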

Let me chime in on general features we desire to have in the serialization library and what options it leaves us for efficient support of byte arrays.

First of all, we are trying hard to maintain the following separation of work between a "serializer" (which is typically plugin-generated, but may also be a custom piece of code for a given class) and an "encoder" (which is format-specific implementation of KOutput interface in the current prototype, to be renamed):

  • Serializer (class-specific) makes the decision on how to represent a given class instance to the format. Built-in and generated serializers are straightforward, but, for example, some kind of Color class that is stored as an Int in memory can, in fact, be serialized as a structure with three elements (r, g, b), as an array of three bytes, or in some other shape. It is up to the serializer implementation for the Color class to decide.

  • Encoder (format-specific) makes decision on how a given serialized structure is represented in a particular format. It encodes structures, collections, maps, enums, etc according to the specifics of the format.

The above general picture leaves a question of performance optimization completely outside of the equation (we'll get to it later). According to the above picture, when a particular format (like Protobuf or CBOR) has some format-specific representation for byte array, then it should check the type of element when it is being asked to write a collection by _looking at its serialization descriptor_ (not by looking at the instance of a type) and, having detected that it is being asked to write a collection of non-nullable bytes, use format-native representation for it (note that current prototype does not have sufficient infrastructure in serialization descriptor to make this check, but it is being extended in this direction).

That is the most general solution and it ensures that it works not only for ByteArray type in the original Kotlin source code, but also for List<Byte> and for any other Kotlin type whose serializer had chosen to represent its type as a collection of bytes (various fancy byte-buffer-like classes may serialize themselves as collection of bytes).
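A descriptor-based check of this kind could be sketched as below. The descriptor model here is entirely hypothetical (`Kind`, `Descriptor`, and `isByteCollection` are invented for illustration), since the comment notes that the prototype did not yet expose enough descriptor information to do this.

```kotlin
// Hypothetical descriptor model, invented for this sketch.
enum class Kind { BYTE, INT, LIST }

data class Descriptor(
    val kind: Kind,
    val nullable: Boolean = false,
    val element: Descriptor? = null
)

// True when a serializer describes "a collection of non-nullable bytes",
// regardless of which Kotlin type (ByteArray, List<Byte>, a Color class) produced it.
fun isByteCollection(d: Descriptor): Boolean {
    val element = d.element
    return d.kind == Kind.LIST && element != null && element.kind == Kind.BYTE && !element.nullable
}

fun main() {
    val bytes = Descriptor(Kind.LIST, element = Descriptor(Kind.BYTE))
    val nullableBytes = Descriptor(Kind.LIST, element = Descriptor(Kind.BYTE, nullable = true))
    val ints = Descriptor(Kind.LIST, element = Descriptor(Kind.INT))
    println(isByteCollection(bytes))          // candidate for the format-native byte representation
    println(isByteCollection(nullableBytes))  // nullable elements rule it out
    println(isByteCollection(ints))           // not bytes
}
```

An encoder would run such a check when it is asked to begin a collection and switch to its native byte-string encoding only when the check passes.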

Unfortunately, this is not very efficient for ByteArray. In case of an actual ByteArray instance being serialized, format might take some advantage of the fact that it has the whole byte array instance at once, instead of being asked to write it byte-by-byte. So how do we make ByteArray not only supported, but also efficient?

We have the following options:

  • _Don't do anything_. Implement just the generic solution (above), test it, and see if it already has good enough performance. Note that there will be no boxing (since we have a dedicated function to write a primitive byte) and the only potential overhead comes from call choreography between built-in byte array serializer and implementation of the format.

  • _Check for known array serializers_ as explained by @jaccozilla. Note, however, that it should be done in addition to (not instead of) the generic solution, so that a Color that represents itself as a collection of 3 bytes is written as a native byte array in Protobuf.

  • _Add new method for primitive arrays_ (can be only for byte array or for all primitive arrays). We can provide a default impl for these methods. The decision on whether to support only byte arrays or all primitive arrays shall be based on research of whether there are any formats that _natively_ support arrays of primitives other than bytes.

We cannot delegate all primitive arrays to "writeByteArray". For example, Protobuf (as a format) only supports byte arrays. It does not specify any representation for arrays of ints, for example. A particular array of ints may choose to serialize itself as an array of bytes, but it will have to decide on byte order and on some fixed/variable size encoding of ints. That is going to be a serializer-specific decision, not something mandated by Protobuf format (which does not have any special provisions for arrays of ints).
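The two decisions mentioned here (byte order, and fixed versus variable-size encoding) lead to different byte sequences for the same IntArray, as this small sketch shows. The function names are hypothetical; the variable-size variant uses the base-128 varint scheme (7 payload bits per byte, high bit set when more bytes follow), and the big-endian order in the fixed variant is an arbitrary choice.

```kotlin
import java.io.ByteArrayOutputStream

// Fixed-width choice: 4 bytes per int, big-endian.
fun encodeFixed(values: IntArray): ByteArray {
    val out = ByteArrayOutputStream()
    for (v in values) {
        // OutputStream.write(Int) keeps only the low 8 bits of its argument.
        out.write(v ushr 24)
        out.write(v ushr 16)
        out.write(v ushr 8)
        out.write(v)
    }
    return out.toByteArray()
}

// Variable-width choice: base-128 varints; small values take fewer bytes.
fun encodeVarints(values: IntArray): ByteArray {
    val out = ByteArrayOutputStream()
    for (v in values) {
        var rest = v
        while ((rest and 0x7F.inv()) != 0) {
            out.write((rest and 0x7F) or 0x80) // low 7 bits plus continuation flag
            rest = rest ushr 7
        }
        out.write(rest)
    }
    return out.toByteArray()
}

fun main() {
    val values = intArrayOf(1, 300)
    println(encodeFixed(values).size)    // 8 bytes
    println(encodeVarints(values).size)  // 3 bytes
}
```

A reader must know which of the two schemes was used, which is why the comment treats this as a serializer-level decision.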

We don't want to introduce any wrappers, either. We already have a wrapper-less "visitor" model of interaction between a serializer and an encoder that works quite well.

@elizarov I would like to suggest another requirement that should continue to be supported (when the encoder can/does): the use case of supporting an externally defined schema. What I mean is that it must be possible to force the encoder to serialize according to a specific schema due to compatibility constraints (maybe the schema is a standard, maybe for legacy reasons).

Encoder/decoder introspection should also remain possible from a serializer. For example in my code I have an xml encoder/decoder that will allow me to directly embed an xml fragment as a child. In this case the serializer for the fragment code must know about the more efficient serialization (and the encoder will not have such knowledge).

Just saw this issue. Haven't really read the discussion, but the way I handled it in https://github.com/cretz/pb-and-k is to use a single ByteArr wrapper class to get hashCode, equals, etc. For other "repeated" primitive types, I still consider those lists which, while they are not as memory conscious or performant, allow me to use a list that tracks size. Just figured I'd mention it in the discussion, again haven't read the arguments and have no opinion on them.

Any progress on this issue? Could ByteArray be supported first, and the other arrays afterwards?

Any updates on this guys? @sandwwraith @jaccozilla @elizarov @czeidler @pdvrieze

Supporting ByteArray for plain Protobuf dump / load seems like a crucial feature. Any suggestions for a fast workaround?

With https://github.com/Kotlin/kotlinx.serialization/pull/509 (and release version 0.13.0, I believe), all primitive array serializers were added to the runtime library, with a fast path for ByteArray in Protobuf.

Is it documented how to write raw byte arrays? The example in custom serializers is still writing hexadecimal strings! 🤨
@sandwwraith?

I'm trying to figure out whether ByteArrays are serialized efficiently for CBOR, can anyone clarify?

@travisfw try something like encodeSerializableValue(ByteArraySerializer(), bytes) or encodeSerializableElement(descriptor, i, ByteArraySerializer(), bytes)

@sanity No, CBOR doesn't have any special fast-path for bytearrays. If you can confirm that this is a problem, please create an issue.
