Chapel: Review of protobuf user-end APIs

Created on 6 Aug 2020  路  31Comments  路  Source: chapel-lang/chapel

We now have a Chapel plugin for the protoc compiler and a user module for supporting protobuf operations. The pull request for integrating the work with chapel master is under review.

Meanwhile, it is important that we have a good feedback on the names of user-end functions. For now, the generated Chapel code for a message includes 2 user-facing methods-

  • writeToOutputFile(ch) - This function enables the user to serialize proto messages into protobuf binary wire format through a writing channel to an output file.

  • parseFromInputFile(ch) - This function enables the user to parse proto messages from an input file through a reading channel.

Example usage of the methods through a generated chpl file from a source proto file.

_example.proto_(source proto file)

syntax = "proto3";

message exampleMessage {
  int64 testField = 1;
}

_example.chpl_ (Mock generated chapel file)

module example {  // Generated by the protocol buffer compiler.
  ...
  record exampleMessage {
    var testField: int(64);
    proc writeToOutputFile(ch) throws { ... }
    proc parseFromInputFile(ch) throws { ... }
  }
}

_write.chpl_(code written by a user to set message field values and serialize these to byte stream)

use example;
use IO;

var file = open("out", iomode.cw);
var writingChannel = file.writer();

var messageObj: exampleMessage;
messageObj.testField = 9223;

messageObj.writeToOutputFile(writingChannel);

_read.chpl_(code written by another user to parse the input byte stream and access the set message field values )

use example;
use IO;

var messageObj: exampleMessage;
var file = open("out", iomode.r);
var readingChannel = file.reader();

messageObj.parseFromInputFile(readingChannel);

writeln(messageObj.testField == 9223); // Prints true

What names should we use for the methods currently called writeToOutputFile and parseFromInputFile in Chapel for writing and parsing respectively?

Common method names used by other languages are -

| Language | Write function | Parse function |
| ------------- | ------------- | ------------- |
| C++ | SerializeToOstream | ParseFromIstream |
| C# | WriteTo | ParseFrom |
| Python | SerializeToString | ParseFromString |
| JAVA | writeTo | parseFrom |

Most helpful comment

It would be nice if we could make it distinguishable that a record is a special protoc-generated type, e.g. if we were using classes, I imagine we'd make it a subclass of a ProtoMessage class or something. However, I don't currently have any suggestions on how we might do this (and this is a slight tangent from the method naming discussion).

This would also be acceptable.

@dlongnecke-cray - No, as far as I know, the users are only allowed to use pre-defined methods in the record(only the generated code). I don't think there will be a need for a user to extend generated records with new methods.

Alright then, sounds like there's no need to worry!

I think there might be sufficient agreement that serialize / deserialize are the ways to go. Do we want to put it to a vote?

All 31 comments

@lydia-duncan @mppf

I'd advocate for naming the methods writeProto and parseProto, personally. To me, the To and From suffixes are redundant because the operation already implies directionality. However it would be nice to know _what_ format the object is writing (thus the Proto suffix).

I'd also like to ask (forgive me if this has been asked before) whether or not we should overload writeThis to call writeToOutputFile and readThis to call readFromInputFile. The upside being that myObj <~> ch looks awesome, and the downside being that users may not expect that behavior. Perhaps this new behavior could occur only if the channel is writing binary?

Awesome work @Aniket21mathur!

@dlongnecke-cray thanks for the feedback. :)

However it would be nice to know _what_ format the object is writing (thus the Proto suffix

Correct me if I got it wrong, but I wanted to add that the method is not writing to a proto format. I think it might be more reasonable to say that the method is writing to protobuf binary wire format. The writeProto function will write a byte stream to the output file through a writing channel.

whether or not we should overload writeThis to call writeToOutputFile and readThis to call readFromInputFile.

I guess it would be better if @mppf answers this. But yeah the channel is a binary channel.

The most natural names to me are writeTo(ch) and readFrom(ch).

However, given the special format being used, I support writeProto and readProto.

  • While technically this is not a "proto" format, I think the Proto suffix gives the reader an easy-to-grasp idea of what these methods are for. writeProto surely beats writeBinaryWireFormat and writeBWF.

  • Despite parse used as part of the name in other languages, I suggest using read as a generic term for obtaining data from a file. "parse" is too technical. OTOH we have parseToml, so there is a precedent for "parse".

Correct me if I got it wrong, but I wanted to add that the method is not writing to a proto format. I think it might be more reasonable to say that the method is writing to protobuf binary wire format.

That's a good point. We don't want to imply that we're writing the text based format. Maybe a method name like serialize or serializeProto would be better, where the behavior changes depending on the channel direction. This would be similar to how we implement the <~> operator.

module example {  // Generated by the protocol buffer compiler.
  ...
  record exampleMessage {
    var testField: int(64);
    proc serializeProto(ch) throws where ch.writing { ...}
    proc serializeProto(ch) throws where !ch.writing { ... }

    // Maybe? Maybe not? Not sure...
    proc readWriteThis(ch) throws { serializeProto(ch); }
  }
}

Ugh, but maybe we want deserializeProto instead for reading channels? 馃槮

But deserializeProto is too wordy...

Regarding readWriteThis -- it makes sense to use the compiler-generated version of this method. This way writeln(messageObj); produces the same output as if record exampleMessage were defined by the user. Good consistency.

I view BW-formatted reads/writes as specialized operations that deserve separate name(s).

+1 to serialize or serializeProto, assuming where-clauses are acceptable. deserializeProto sounds fine by me, too.

Let us use serialize if we like using BWF for exchanging these records between locales, for example, upon Chapel's regular remote variable references. Otherwise, use serializeProto.

@Aniket21mathur and I have talked about wiring it up to writeThis/readThis/readWriteThis but that is future work. As far as I know we would still want to have the explicit method names.

The current names like writeToOutputFile seem slightly wrong to me since they operate on a channel not a file.

So options that I know of that have been proposed are:

  • writeTo / parseFrom
  • writeProto / parseProto
  • writeToChannel / readFromChannel
  • serialize / deserialize
  • serializeProto / deserializeProto

I personally like serialize / deserialize with 2nd favorite writeTo / parseFrom. serialize / deserialize make sense to me but there might be a good reason the other languages don't use those names.

Anyway, I don't think it's necessary in a typed language like Chapel to include the argument type in the method name.

My understanding is that the arguments to these two methods are regular Chapel channels. What differs is the format being used.

So I propose that writing/serializing in format does NOT occur upon:

  • writeln() that the user may insert for debugging or otherwise
  • remote accesses, where the compiler relies on a serialize() method when present

So I propose to avoid tying these to read/writeThis.

We can still use serialize/deserialize names because these two methods accept a channel argument and so will not match/be used for remote accesses.

Let's get back to the method names conversation and save the writeThis conversation for a separate issue (we can certainly open it now to move that conversation there, though).

It sounds like serialize is the favorite but that its counterpart (deserialize) is more controversial?

I'm for serializeProto and deserializeProto as I want to know the format I'm writing to/from (and I also think that with serialize/deserialize in the name, this removes most/all ambiguity that the methods refer to the binary wire format, and not the text based object representation).

I'll happily take serialize and deserialize as well, though.

It sounds like serialize is the favorite but that its counterpart (deserialize) is more controversial?

I think that was just me being picky 馃槃. In serialization I think bidirectionality is super appealing (a la <~>), but sadly we refer to it as "deserializing", so that should probably be preferred over overloading serialize for each direction.

Re serializeProto - I'm not super opposed to this - but isn't the fact that it will write in proto format given by the fact that the record type is a type generated by the protobuf compiler from a .proto file?

writeThis conversation for a separate issue (we can certainly open it now to move that conversation there, though)

@lydia-duncan we can open a new issue or maybe we can rename this issue latter?

Let's open a new issue, for organization's sake (and link to the comments that were related)

Some thoughts after reading the entire discussion-

  • I am not sure, whether overloading serialize for each direction will be helpful in the case of protobuf. In all the official and many third-party implementations, I have seen methods with two different names. So, it might be better if we follow that approach for the sake of consistency among protobuf implementations. Also, if we keep that aside I personally think that a user should have a clear distinction between the two methods. It might be possible that the two methods are used in a single file many times, in that case it might not be good to rely on arguments to distinguish between them.
    (P.S - Just a personal opinion)

  • serialize\deserialize seems fine to me, serializeProto\deserializeProto are also good. But yeah one thing that we can consider is that these functions are only available through a record type generated from the protoc compiler, as @mppf stated above. This implicitly dedicate these methods to proto serialize\deserialize.

Re serializeProto - I'm not super opposed to this - but isn't the fact that it will write in proto format given by the fact that the record type is a type generated by the protobuf compiler from a .proto file?

Would it be reasonable to expect that users will extend these generated records with new methods within application code? My thought was that if the records are tightly integrated in the code it would help to be more explicit with the method name in order to reduce ambiguity.

If that's not a legitimate concern then I'm totally cool with just serialize/deserialize.

Would it be reasonable to expect that users will extend these generated records with new methods within application code?

@dlongnecke-cray - No, as far as I know, the users are only allowed to use pre-defined methods in the record(only the generated code). I don't think there will be a need for a user to extend generated records with new methods.

I also like serialize and deserialize for being clear and concise. I think it's OK to omit the format from the method names (or arguments), because (1) proto is slightly confusing as it's not the _actual_ format being written, and (2) the record is generated by the protoc compiler and can be thought of as a special proto message type.

It would be nice if we could make it distinguishable that a record is a special protoc-generated type, e.g. if we were using classes, I imagine we'd make it a subclass of a ProtoMessage class or something. However, I don't currently have any suggestions on how we might do this (and this is a slight tangent from the method naming discussion).

It would be nice if we could make it distinguishable that a record is a special protoc-generated type, e.g. if we were using classes, I imagine we'd make it a subclass of a ProtoMessage class or something. However, I don't currently have any suggestions on how we might do this (and this is a slight tangent from the method naming discussion).

This would also be acceptable.

@dlongnecke-cray - No, as far as I know, the users are only allowed to use pre-defined methods in the record(only the generated code). I don't think there will be a need for a user to extend generated records with new methods.

Alright then, sounds like there's no need to worry!

I think there might be sufficient agreement that serialize / deserialize are the ways to go. Do we want to put it to a vote?

Do we want to put it to a vote?

@lydia-duncan do we :P?

I take the thumbs-ups on @dlongnecke-cray 's comment as a mark of agreement :)

I take the thumbs-ups on @dlongnecke-cray 's comment as a mark of agreement

As do I XD

In order to make progress, I'll accept serialize/deserialize.

Still, a couple of points above make me want to respond...

isn't the fact that it will write in proto format given by the fact that the record type is a type generated by the protobuf compiler from a .proto file?

Yes. However, as far as Chapel user code, I would like these records to be as "normal" as possible so the user does not worry about where they came from.

It would be nice if we could make it distinguishable that a record is a special protoc-generated type

This comes to mind:

// to be defined by protoc compiler
proc type exampleMessage.isProtobufMessage() param return true;

// either in the modules or the user can do it themselves
proc isProtobufMessage(type t) param
  return canResolveTypeMethod(t, "isProtobufMessage") &&
         t.isProtobufMessage();

the users are only allowed to use pre-defined methods in the record

I am not sure this constraint is enforced or makes sense.
I suggest allowing secondary methods on such records, just like on normal records.

I agree with most of Vass's previous comment, with the following exception

I am not sure this constraint is enforced or makes sense.
I suggest allowing secondary methods on such records, just like on normal records.

I don't think we intend to do anything to prevent this from occurring (nor can we without compiler support), but I think in practice it should not be done - users should expect that they can call a method in both languages. It would be confusing if there are methods only available to Python or Rust users, say, so I don't see why a method only available in Chapel would be good practice.

It would be confusing if there are methods only available to Python or Rust users, say, so I don't see why a method only available in Chapel would be good practice.

I agree with @lydia-duncan, protobuf documentation does have an indication of some pre-defined methods and I think it would be good if we adhere to those.

While the naming scheme is converging on serialize / deserialize, it might be still relevant to discuss the idea of making the message record distinguishable as a protoc-generated type, as part of the broader interface discussion.

What do other language interfaces do w.r.t. this? At a quick glance I see that Python uses a Message superclass and Go uses structs (indistinguishable from user-written structs).

Maybe isProtobufMessage() as @vasslitvinov suggests above is sufficient.

What do other language interfaces do w.r.t. this?

I don't think other languages do anything specific to let the user know that these classes/methods are related to protobuf. Though at the same time I don't think there is any negative if we do that in chapel :).

For now, we have a package specifier in the proto file, through which a user can control the generated file name/module name in chapel. Even though the user does not use the specifier, the generated file name/module name is the same as the proto file name.

Other languages like C# just use namespaces with package name to wrap the entire code, which I think is done by modules in case of chapel.

At a quick glance I see that Python uses a Message superclass

As fas as I know Python don't use any superclass in the generated code, it only uses descriptor objects for proto fields and messages. Something like ->

test = _reflection.GeneratedProtocolMessageType('test', (_message.Message,), {
  'DESCRIPTOR' : _TEST,
  '__module__' : 'any_pb2'
  # @@protoc_insertion_point(class_scope:anypackage.test)
  })
_sym_db.RegisterMessage(test)

@ben-albrecht you might have looked at the user support library for Python I guess. There they have a Message class and all. In chapel we have the https://github.com/chapel-lang/chapel/blob/36fffdc165ed1bb9e1910af98ff0897d3fa87480/modules/packages/ProtobufProtocolSupport.chpl module for that. I think here we are talking about modifying the generated records from the compiler.

@Aniket21mathur - thanks for the details.

you might have looked at the user support library for Python I guess.

You are correct. Looks like I took _too_ quick of a glance.

Other languages like C# just use namespaces with package name to wrap the entire code, which I think is done by modules in case of chapel.

Could you elaborate what you mean by this? What would this look like in Chapel?

Could you elaborate what you mean by this? What would this look like in Chapel?

Sure!

Suppose we have below proto syntax file with a package specifier->

with_package.proto

syntax="proto3";

package my_Package;

message foo {
  int32 a = 1;
}

The generated chapel module will have the name same as the package name.

my_Package.chpl

module my_Package {
...
}

In case the user do not specify a certain package name, the chapel module will take the name of the proto file.

without_package.proto

syntax="proto3";

message foo {
  int32 a = 1;
}

Generated file

without_package.chpl

module without_package {
...
}

@ben-albrecht I hope this helps :)

It seems we have converged on serialize / deserialize as the names. AFAIK that is sufficient design agreement for PR #16105 to be merged.

I can see two open design questions (that should get their own issue) that are future work:

  • Should we have the generated record include a method along the lines of proc type exampleMessage.isProtobufMessage() param return true; per https://github.com/chapel-lang/chapel/issues/16207#issuecomment-672342917 ?
  • Should we support initialization of a message record from a channel / bytes as an alternative to default-initialization followed by deserialize ? E.g. var myMessage = new exampleMessage(deserializeFrom=myChannel).

I'm in favor of something like this. A function written to take a protobuf message could then assert this in the where clause.

  • Should we support initialization of a message record from a channel / bytes as an alternative to default-initialization followed by deserialize ? E.g. var myMessage = new exampleMessage(deserializeFrom=myChannel).

This seems like a nice convenience.

This issue has served its purpose so I'm closing it.

https://github.com/chapel-lang/chapel/issues/16286 asks questions about some methods specific to supporting Any types

Was this page helpful?
0 / 5 - 0 ratings