protobuf 🚀 - proto3 and unknown fields

I too am wondering about this. I am looking into migrating what is essentially a messaging system to gRPC (where proto3 seems to be recommended). In my case, clients send messages (text plus rendering information) to each other via a server where the server needs to understand the text and certain parts of the rendering info. I want to allow client developers to experiment with new features (pre-release) without having to deploy server code for every change.

Essentially, it's a case where I want a shared proto definition between the client(s) and server, but don't want to require the server proto definition to be the latest to process requests.

dhendry on 6 May 2015

I'd like to hear about the explanation, too.

The behavior of proto2 makes sense to me.

solicomo on 23 Jul 2015

I have a lot of concerns about silently deleting data upon deserialization, to the point that even though we have internally been using proto3 for several months, I am considering changing things back to proto2. This change would be a lot easier to stomach if there was a message option to allow serialization and deserialization of unknown fields instead of discarding them.

jeremyong on 14 Mar 2016

Being unable to add unknown fields that persist is also unacceptable for us. Reading the code, it's pretty clear the decision to omit unknown fields happens at compile time rather than at runtime (based on the generated code), so it seems proto3 is a no-go. Personally, I very much liked most of the changes to the new version except this one. Changing the default behavior alone might have been ok, especially given that the new behavior is well-documented, but doing so without a way to restore old behavior seems like a misstep. Supporting a plugin that reverts that behavior seems too expensive relative to the cost of just using proto2 with restrictions (optional only, etc).

jeremyong on 14 Mar 2016

👍1

Still no answers to this? This is a fundamental issue which is seriously hindering our adoption of protobuf in many areas.

dhendry on 19 Apr 2016

+1 proto2 is a permanent fixture for us. Changing default behavior is one thing but changing it in a way that doesn't let the user even control it is a strict loss in my opinion. What I foresee moving forward is a huge fragmentation in the client ecosystem. Maintaining support for both proto2 and proto3 semantics is too much to chew for most developers, and I'm already seeing some client libraries do this awkward dance where they have some proto2 properties and some proto3 properties. The easiest example of this causing a problem in history is the move from Python2 to Python3. One possible solution might be a file level option that informs the protobuf compiler not to strip unknown fields.

jeremyong on 19 Apr 2016

The proto3 spec doesn't forbid preserving unknown fields. Instead, it allows each implementation to choose whether to preserve unknowns. The current C++/Java implementations chose to drop the unknowns, though. We are currently looking into the issue and will keep this thread posted.

liujisi on 20 Apr 2016

Thanks @pherl for providing the update. FWIW, I think it is worth considering how the behavior might be standardized, for the same reason people argue against undefined behavior in C or C++. Undefined behavior, where it exists, should really only be due to a lack of foresight, but for something like this, we might as well come up with an actual solution since we're already aware of the problem.

jeremyong on 20 Apr 2016

Thanks for keeping this issue alive. I'd just like to add that we are interested in support for Go, but that might need to be addressed in golang/protobuf.

joshuarubin on 20 Apr 2016

@pherl Any progress on this front?

jeremyong on 6 May 2016

+1 for preserving unknown fields.

I accept that you can not trivially maintain compatibility with the JSON format (at least as long as you want to marshal fields with their names), but I think a lot of shops would be happy to pay this price for not having to release their low-level infrastructure in lock step with their newest clients.

In fact Kenton seems to wonder himself (https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html): Apparently, version 3 of Protocol Buffers, aka “proto3”, removes this feature. I honestly don’t know what they’re thinking. This feature has been absolutely essential in many of Google’s internal systems.

In my opinion the right approach would be to make this an option of the proto compiler on compiling the proto: this way everybody can decide for themselves whether the benefits outweigh the downsides.

For now I have overridden the PreserveUnknownFields function in both cpp_helpers.h and java_helpers.h in the compiler code to always return true and this seems to work, but I would appreciate it if someone from google could confirm.

gfecher on 12 Jun 2016

👍6

Some updates: we tried to gather data to prove "unknown fields are essential for Google systems", but the result is not so convincing (the experiment is done in a Google sub-system, not the whole of Google).

For those of you who are interested in adding back unknown fields in proto3, could you describe your use case in more detail and explain why unknown fields are required (e.g., can the same use case be supported using some other proto3 features)? We need to prove unknown fields are needed in some common use cases in order to add them back.

xfxyjwf on 12 Jun 2016

Here is a use case I developed internally that makes heavy usage of unknown
fields:

In addition to the message itself, we often annotate the message before
sending it over the wire with metadata indicating if a field was deleted or
not, if it was set to a default field, etc. Internally, we use a diff-ing
scheme to create a protobuf message "diff" which handles maps, fields, and
messages (recursively applied). The application of the diff itself is
associative, so many diffs can accumulate into one, and this makes for a
fairly elegant scheme for updating state for a particular message across
many clients that may or may not be online.

Generalizing this use case, any protobuf message that is derived from the
reflection API must necessarily leverage the unknown field set, since by
definition, we cannot know the shape of the message a priori. Think of this
as a "higher order message" whereas messages that are schema defined are
first order messages.

jeremyong on 12 Jun 2016

👍4

Hi,

We have a use case with a mixture of data validation/data transformation and storage.
Our infrastructure component understands certain bits of the schema that it validates/changes, but it is oblivious to the rest of the payload. It does store it, however, and clients running on the new schema expect the newly introduced fields to be returned intact.

In general, any component that needs only a partial understanding of the message could benefit from preserving unknown fields, especially where the bits the component does care about do not change often, but the rest of the schema does. I can think of routing, storage, certain types of data transformation, etc.

I would be interested in knowing how you managed to solve these use cases (which I'm sure you have internally at google) without preserving unknown fields.

gfecher on 13 Jun 2016

👍4

We need unknown fields, because it's one of the ways we know on the server-side that our proto definition is out of date, and needs to be re-synchronized. Without unknown fields, we would have to resort to polling or some other less authoritative way of detecting when the client has added fields.

Also while I understand trying to reduce feature surface area, unknown fields don't exactly cause a problem, do they? Dropping them has more negatives than positives, please add them back to proto3.

InfinitiesLoop on 18 Jun 2016

👍7

If the proto3 way was to set some option, like option (ProtoOptions).preserveUnknownFields = True; that would allow those of us who need it to keep it and those of you who don't need it to do without it.

Best of both worlds. :)

JesseChisholm on 18 Jun 2016

👍3

I would absolutely want the ability to preserve or strip unknown fields at runtime. There are levels of our system which get deployed regularly, are kept up to date, and should be validating the well-known schema (and stripping unknown fields), but there are other internal layers which get deployed far less frequently and are not directly exposed to clients or potentially malicious actors, where preserving unknown fields is highly desirable so we don't have to do full and extensive deploys for every little change.

dhendry on 28 Jun 2016

👍1

Hey guys,

We would love to have this feature, too :) During my relatively long time at Google, I was aware of many services that relied on this behavior from proto2.

Essentially, think of any set of three or more services where A talks to C via B, and we don't want to redeploy B when a proto that is being passed between A and C gets a new field added to it. (I also posted this as a question on stackoverflow.)

Would be great to have an update for supporting this feature and/or an alternative mechanism that you believe can solve this problem for us.

Thanks,
Rohit

rohitsaboo on 14 Jul 2016

👍6

Still no word on _what the original justification was_ too.

jeremyong on 4 Aug 2016

The use-case we have is the following:

We use stream processors, namely kafka-streams, that rearrange protobuf messages. For example, we have two streams of protobuf messages that we join with each other. The join just outputs a joined message having the two others as fields. Sometimes we also aggregate streams into lists of messages from previous streams. The stream processors only know about the fields relevant to them (join fields, group-by fields, ...); all the other fields are carried along as unknown fields.

This allows the stream processor to continue working even when upstream schema changes happen, we do not need to redeploy our stream processing application, and the new fields end up in the output for free.

To add some drama: I think losing the unknown fields will force us to move to Avro.

Kaiserchen on 24 Aug 2016

👍12

This is a bit of a deal breaker for us too. We have the same use case where A sends data to B which reads some fields and forwards the message to C. We don't want to have to constantly update B when the schema changes even though it doesn't read any of the new fields. The current behaviour is quite dangerous since C can't tell if one of the new fields was set to the default value or if B is just out of date and lost data.

matthewrj on 31 Aug 2016

Would really appreciate an update on the feedback here. Whether Proto3 is going to ever support unknown fields can impact decisions being made even for folks still on Proto2, because if it isn't, we may need to invent other ways of solving our problems in order to avoid rearchitecting things when/if we move to proto3.

InfinitiesLoop on 31 Aug 2016

I have two use cases, both of which have sub-optimal workarounds:

1) Include a signature in the same protobuf as the payload to be signed. To verify the signature, I deserialize, extract and remove the signature, reserialize and verify the signature. This breaks if the signed message contains any new fields unknown to the process verifying the signature. The workaround is to serialize in two levels, with the inner (signed) message serialized as bytes in the outer message.

2) A server is the ultimate source of small update packets that are then routed peer-to-peer. Unserializing and reserializing before passing the message on to other peers strips out unknown fields. The workaround is for peers to share the original bytes instead of deserializing and reserializing.
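
The first workaround (signing the inner message as opaque bytes) can be sketched without the protobuf library. This is only an illustration under stated assumptions: the envelope layout (field 1 = payload bytes, field 2 = signature bytes), the HMAC-SHA256 signature, and all helper names are hypothetical and not from the original comment. A real system would likely use asymmetric signatures.

```python
import hashlib
import hmac


def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def len_field(number, payload):
    """Encode one length-delimited field (wire type 2)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def read_len_fields(buf):
    """Parse a message made only of length-delimited fields into {number: bytes}."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        if key & 7 != 2:
            raise ValueError("sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        fields[key >> 3] = buf[pos:pos + length]
        pos += length
    return fields


def seal(payload, key):
    """Wrap already-serialized payload bytes together with a signature over them."""
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return len_field(1, payload) + len_field(2, signature)


def open_envelope(envelope, key):
    """Verify the signature over the raw payload bytes; the payload is never re-serialized."""
    fields = read_len_fields(envelope)
    payload, signature = fields.get(1, b""), fields.get(2, b"")
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    return payload


KEY = b"demo-key"  # hypothetical shared secret
# Inner message: field 1 = "hello", plus a field 99 the verifier has never heard of.
inner = len_field(1, b"hello") + encode_varint(99 << 3) + encode_varint(1)
envelope = seal(inner, KEY)
```

Because the verifier checks the signature against the bytes exactly as they arrived, fields it does not know about cannot break verification. The same idea underlies the second workaround, where peers forward the original bytes untouched.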

chmod007 on 5 Oct 2016

One thing to keep in mind is that proto2 is not going away. We are still actively improving it and plan to keep doing so indefinitely, so proto2 is still a good choice if you have a use case that depends on unknown fields. The one main drawback is that a few languages (such as C# and Ruby) are currently proto3-only, but if you're not using those languages then that's not a problem.

@chmod007 , have you thought about using proto2 for your two use cases? Is that possible or do your schemas have to be proto3 for another reason?

acozzette on 7 Oct 2016

I'll add a few usecases.

We have a gRPC service proxying RPC traffic. It would be awfully nice to not have a hard requirement to deploy the proxy first upon schema changes in any of the services it proxies.
We also maintain stream processing services which are processing protos from other parts of the organization. If they add a field, I'd prefer that field doesn't disappear unexpectedly just by flowing through our stream processor. There's some pretty awful documentation / tooling / coupling implications of needing to redeploy stream jobs any time upstream producers evolve their schema. Depending on any cycles in data flows, there may be no topological order that produces valid schema updates without doing a 2-step deploy: 1) upgrade proto schema, redeploy all the (many) things that might rely on it 2) update producer to fill in field, deploy producer. Pray all the systems were updated.

re: proto2 vs. proto3, it's kind of annoying to mix and match. It's pretty counterintuitive to only use proto2 to maintain unknown fields, but have proto3 definitions for gRPC servers. I agree with most of the design choices in proto3 (e.g. removing optional/required fields, map types), but not this.

I'd actually been unaware proto3 removed unknown field support until I expected it to maintain an unknown field and it didn't (and came to report it as an issue). I'd touted unknown field support as a huge selling point for protobufs when we'd first implemented them.

The protobuf website originally recommended that new projects use proto3, which is why we'd adopted it, but this is a pretty huge issue for us. We'll likely be forking the compiler similarly to @gfecher as the proto3 ship has long since sailed and this behavior is very important to helping us produce robust infrastructure.

Xorlev on 18 Nov 2016

@pherl @xfxyjwf Do you have suggestions for how to work around this with proto3? If this was removed, what techniques were used to avoid requiring this pattern within Google?

As far as I see it, this was the chief benefit of protobuf:

+----------+                        +----------+
|          |   +----------------+   |          |
|          |   |                |   |          |
| Producer +--->  Intermediate  +---> Consumer |
|          |   |                |   |          |
|          |   +----------------+   |          |
+----------+                        +----------+

Producer and Consumer could be updated with new fields, while intermediate can remain on the same version. If intermediate is a proxy of sorts, then this is important.
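
A minimal sketch of that intermediate, written against the raw wire format rather than the protobuf library so the behavior is visible. The schema is an assumption for illustration: the intermediate knows only field 1 (a string it upper-cases), while the producer has added a field 2 it has never seen. The function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def len_field(number, payload):
    """Encode one length-delimited field (wire type 2)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(number, value):
    """Encode one varint field (wire type 0)."""
    return encode_varint(number << 3) + encode_varint(value)


def iter_fields(buf):
    """Yield (field_number, wire_type, value, raw_bytes) for each top-level field."""
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("sketch only handles varint and length-delimited fields")
        yield number, wire_type, value, buf[start:pos]


def forward(buf, preserve_unknown):
    """Intermediate: rewrite known field 1, optionally carry unknown fields along."""
    text, unknown = b"", []
    for number, wire_type, value, raw in iter_fields(buf):
        if number == 1 and wire_type == 2:
            text = value
        else:
            unknown.append(raw)  # keep the exact bytes of anything unrecognized
    out = len_field(1, text.upper()) if text else b""
    if preserve_unknown:
        out += b"".join(unknown)
    return out


# Producer on a newer schema: field 1 = "hello", new field 2 = 7.
message = len_field(1, b"hello") + varint_field(2, 7)
print(forward(message, preserve_unknown=True))   # field 2 survives the hop
print(forward(message, preserve_unknown=False))  # field 2 is silently dropped
```

With `preserve_unknown=True` the consumer still receives field 2 even though the intermediate was never redeployed; with `False` the data is lost without any error.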

stevvooe on 18 Nov 2016

👍4

@stevvooe We've been continuing to use proto2 for the intermediate proxy type thing since they are binary compatible. Throughout our codebase, we've been propagating proto2 everywhere since it's really annoying to maintain two different semantics for the proto definitions themselves but if you wanted, producer and consumer could use proto3.

I do have some plans eventually to do a separate C++ compiler entirely that consumes proto3 syntax but retains the API of the unknown fields unless someone else gets to it first. I want to do other changes like using more STL containers (vectors and maps) as the backing in-memory storage and fix the oddities with the arenas we've been seeing.

jeremyong on 18 Nov 2016

@stevvooe one possible solution for the intermediate is to preserve the raw payload (if it doesn't need to update the fields). We could also introduce language specific parsing APIs to preserve the unknown fields for such cases.

liujisi on 18 Nov 2016

👍1

I have exactly the same situation as @stevvooe. In my case the intermediate does update some fields. Is there any work around for when the intermediate does update fields?

matthewrj on 19 Nov 2016

@pherl Thank you for the response!

@stevvooe one possible solution for the intermediate is to preserve the raw payload

This is what proto2 did, automatically, and allowed updates.

It seems like I could create a gogo plugin (or a patch for gogo) to preserve the unrecognized data.

stevvooe on 19 Nov 2016

👍1

@pherl Thanks for the response!

Do you have any insight into why unknown fields were removed in proto3? Was it to put a nail in the coffin of extensions? I'll admit, unknown fields make it harder to have deterministic serialization, but the introduction of map<> types have similar faults. That said, if your message has a map it's known to be potentially non-deterministic whereas unknown fields made it a message instance by message instance question.

Even still, an option along the lines of option java_allow_unknown_fields or option cpp_allow_unknown_fields (or a per-message option) would be my ideal resolution here, as it makes supporting unknown fields a language-specific problem and makes it quite explicit in your proto whether they are used or not. A linter can help prevent use of these protos in situations such as a proto being used as a join key.

The presence of those options also serves as documentation that that behavior is _not_ handled by default.

I don't want to maintain my own fork of protoc going forward, so it's certainly in my selfish interests to add a user-accessible switch to the mainline compiler. I also realize there may have been good reasons for removing them and I'd be interested in hearing those. I recognize that adding any additional switches to such a prolific project has definite implications going forward as well.

Xorlev on 19 Nov 2016

Thanks for the feedback. @acozzette is looking into this issue and will keep this thread posted. Potentially we would introduce some APIs to optionally preserve the unknowns in proto3.

liujisi on 29 Nov 2016

🎉4 👍1

Yes please! I had already begun forking the compiler and this will save me a ton of time.

jeremyong on 29 Nov 2016

@pherl For the most part, I think we have a way out for the docker use cases. It would just be good to get a clear understanding of the design decision. I am sure there is a good reason, but I am having trouble inferring. Even more so, does this reasoning apply to our use case, as in, are we doing something "bad"?

stevvooe on 29 Nov 2016

👍1

The original motivation is to let the language implementation decide whether to preserve unknown fields, i.e. the spec does not require that implementation must preserve unknowns. This simplifies implementations and enables struct-like API. There's nothing wrong with preserving unknowns.

liujisi on 29 Nov 2016

👍6

If I'm not mistaken, that's simply not consistent with what the documentation has said which explicitly states "removal of unknown fields" as a "feature" of the proto 3 spec. Either way, glad it's being looked at.

jeremyong on 30 Nov 2016

Once again, we are gonna recommend internally in my company proto2 due to this lack of unknown field propagation in proto3.
We have a major use case in which a dozen Backends are communicating together via the same message, each intermediate computing a little part of the content.
The interest of using unknown fields is simply development efficiency by removing team dependencies. Usually one or two BE in the row are interested in the change. Forcing all 12 to update the version in coordination is what we cannot afford.

Having an option would just be the most flexible solution by far, and this for all languages please since we need it at least for C++ Java and Python...

fducat on 12 Dec 2016

@pherl Hi, is the option to re-introduce the preservation of unknown fields actively being considered? Our company has recently adopted proto3 ( with no proto2 legacy ) under the false assumption that unknown fields were retained. We may have to fall back to proto2 if there will not be a path to optionally support unknown fields in the near future. Any feedback would be appreciated.

mark-e-hoffman on 12 Dec 2016

Just to nudge this issue again.

@pherl - You mentioned in late November that you were looking at the possibility of exposing APIs to allow unknown fields to be preserved. Have there been any decisions regarding this?

Much like other contributors on this thread we're on the verge of moving back to proto2 but would prefer not to go through that exercise if at all possible.

JemDay on 19 Jan 2017

Hi Jem,

The plan is that:
1) prepare a doc listing the rationale of dropping unknown fields
2) collect use cases when unknown fields are needed; brainstorm and go
through the use case and figure out workarounds/alternative without adding
unknown fields back.
3) if the alternatives in (2) do not work, or if there's no workaround, we
will then preserve the unknowns.

Currently we are on (1) and (2). Will share the docs when they are ready.

liujisi on 20 Jan 2017

👍1

@pherl - Thanks for the response, glad to hear you guys are taking a look at this.

One of our use-cases is very similar to the one described by @stevvooe where intermediate processes in an invocation chain are decorating messages as they pass through.

I hate to ask (because I hate it when people ask me!), but do you have a sense of when you might have further information to share?

JemDay on 26 Jan 2017

I also arrived at this page while debugging a missing unknown field. My use case is a proto that looks like this:

message PluginData {
  SomeData field_a = 1;
  SomeOtherData field_b = 2;
  // Remaining fields are available for plugin-specific implementations.
}

The design, here, was explicitly to allow the users of the API to pass through their own fields. The API is the storage layer, and the user provides both a client and a backend plugin. I considered using Any for this, but my experiences with Any on other projects have led me to consider it harmful. Given that no one but the client needs to understand the other fields, and the client has their proto definition, passed-through unknown fields seemed like the ideal solution.

Any update on an ETA for rationale?

EDIT: I decided it may be useful to clarify two more points:

The experiences I had with Any that led me to consider it harmful centered around receiving data from clients. Some of our tooling wanted to validate the Any on receipt, which made the semantic "Any of the protos you had at compile time" rather than "Any possible proto". Given that that system also wanted dynamically-defined data, this was a nightmare.
In my use case above, the client already knows what type they want, so the string type_url in the Any is a waste of I/O. As a result, I considered adding a plain bytes field to my PluginData proto, which the client would then operate on, assuming it was their type. However, the only difference between that and using unknown fields, on the wire, is that the byte field approach has an extra tag and length stanza, which again, is a waste of I/O, the only benefit being that it would allow me to add more fields to my parent proto in the future. I decided to solve that by reserving a field or two for later.
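
The I/O comparison above can be made concrete by counting wire-format bytes. This is a rough sketch: the field number 3 for the wrapper, the 300-byte payload, and the type URL are hypothetical, and the only protobuf facts relied on are the tag/length encoding and the layout of Any (type_url = 1, value = 2).

```python
def varint_len(n):
    """Number of bytes a non-negative int occupies as a base-128 varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def len_field_overhead(field_number, payload_len):
    """Extra bytes for a length-delimited field: the tag plus the length prefix."""
    return varint_len((field_number << 3) | 2) + varint_len(payload_len)


def any_overhead(type_url, payload_len):
    """Extra bytes for an Any body: the type_url field plus the value field's tag and length."""
    url_len = len(type_url.encode())
    return len_field_overhead(1, url_len) + url_len + len_field_overhead(2, payload_len)


PAYLOAD_LEN = 300                                      # hypothetical plugin payload size
TYPE_URL = "type.googleapis.com/example.PluginExtras"  # hypothetical type name

inline_unknown_fields = 0                               # fields sit directly in the parent
bytes_wrapper = len_field_overhead(3, PAYLOAD_LEN)      # tag + length stanza
any_wrapper = any_overhead(TYPE_URL, PAYLOAD_LEN)       # before embedding in the parent

print(inline_unknown_fields, bytes_wrapper, any_wrapper)
```

The bytes wrapper costs only a few bytes per message, while Any repeats the full type URL every time, which matches the trade-off described above.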

kd8azz on 26 Feb 2017

We are planning to bring unknown fields back in proto3. Please take a look at the doc describing the general plan: https://docs.google.com/document/d/1KMRX-G91Aa-Y2FkEaHeeviLRRNblgIahbsk4wA14gRk/edit#heading=h.w8dtggryroj4

liujisi on 13 Mar 2017

🎉18 👍15 ❤2

Yes! Excellent news @pherl. Thank you for keeping us up to date. :)

Xorlev on 13 Mar 2017

Awesome news @pherl. Thanks a lot for the feature and for the clear upcoming implementation plan.

fducat on 14 Mar 2017

@pherl Thanks for the great response!

The provided document addresses all the major concerns. I hope we can also work with unofficial generators, like gogo/protobuf, to coordinate the rollout.

stevvooe on 14 Mar 2017

@pherl - Thanks for the follow-up, much appreciated.

JemDay on 14 Mar 2017

Hi,

We are planning to implement this at least for the Java and C# parts.
@pherl Could you add more detail to the design documentation about what kind of flag will be used to activate this option?

Should this be:

Defined when compiling protoc
Defined as a protoc command line flag
Defined as a Builder option

Regards,
F.

dopuskh3 on 28 Mar 2017

Will conforming implementations be required to preserve the original ordering of unknown fields when serializing messages?

danburkert on 5 Jun 2017

Looking at UnknownFieldSet.java it looks like the order of unknown fields is entirely dependent on the backing Map. My guess would be "no", but it might be worth asking the question as to whether deterministic serialization mode should be extended to interleave unknown fields by ascending tag id in the output.

Xorlev on 5 Jun 2017

@Xorlev I bring it up because the two usecases mentioned in the doc ('intermediary servers', and 'read-modify-write') have subtle edge-cases when field re-ordering is combined with oneof fields. Serializing unknown fields by ascending tag id (or naively serializing unknown fields after known fields) can change the meaning of messages.

For instance consider the following schemas known to the intermediate server and end-user, respectively:

// Intermediate server schema.
message Schema {
  oneof test_oneof {
    string s = 1;
  }
}

// End-user schema.
message Schema {
  oneof test_oneof {
    string s = 1;
    int32 i = 2;
  }
}

Now consider a non-canonical serialized message: Schema { i = 42, s = "foo" } (note: when there are duplicate values for a tag in a serialized message, the last value wins). When re-serializing this message the intermediate server will output Schema { s = "foo", i = 42 } if ordering by ascending tag or unknown fields last. An end-user would interpret the original message as Schema { s = "foo" }, and interpret the message from the intermediate server as Schema { i = 42 }.
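
The reordering hazard can be reproduced on the raw wire format. This is a minimal sketch, assuming an intermediate that writes its known field (1) first and appends unknown fields afterwards; the helper names are hypothetical, and the end-user decoder simply applies the last-value-wins rule for the oneof.

```python
def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def iter_fields(buf):
    """Yield (field_number, value, raw_bytes) for varint and length-delimited fields."""
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = decode_varint(buf, pos)
        if key & 7 == 0:
            value, pos = decode_varint(buf, pos)
        elif key & 7 == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("sketch only handles varint and length-delimited fields")
        yield key >> 3, value, buf[start:pos]


def intermediate_reserialize(buf, known=frozenset({1})):
    """Old-schema intermediate: known fields first, unknown fields appended last."""
    known_parts, unknown_parts = [], []
    for number, _value, raw in iter_fields(buf):
        (known_parts if number in known else unknown_parts).append(raw)
    return b"".join(known_parts + unknown_parts)


def end_user_oneof(buf):
    """New-schema reader: whichever oneof member appears last on the wire wins."""
    result = None
    for number, value, _raw in iter_fields(buf):
        if number == 1:
            result = ("s", value.decode())
        elif number == 2:
            result = ("i", value)
    return result


# Non-canonical message: i = 42 (field 2) written before s = "foo" (field 1).
original = encode_varint(2 << 3) + encode_varint(42) + b"\x0a\x03foo"
forwarded = intermediate_reserialize(original)

print(end_user_oneof(original))   # ('s', 'foo')
print(end_user_oneof(forwarded))  # ('i', 42)
```

No bytes were lost in transit, yet the end-user sees a different oneof member, which is exactly the change in meaning described above.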

Edit: I should note, this issue is already discussed in the proto3 language guide, but the ramifications are a little more serious if/when people implement intermediate servers which should maintain message contents.

danburkert on 5 Jun 2017

@danburkert That's a good point about the oneof edge case. However, I think it would be fairly rare for anyone to actually hit that edge case, because you would have to both create a non-canonical serialized message and also add or remove an item from a oneof definition, which we already warn against doing in the docs you linked to. Trying to update all implementations to preserve unknown field ordering would probably also be too time-consuming and difficult to be practical.

This issue has actually come up in the past, because we are thinking about a C++ optimization that would involve stripping out unused fields at link time by effectively treating them as unknown fields. (Actually we already do something like this for Java lite). But when we do this we need to be careful not to mess with oneofs because naively treating them like unknown fields could cause the problem you described.

acozzette on 5 Jun 2017

The upcoming 3.4 release will provide APIs to explicitly drop unknowns for Java. There are already APIs in C++ and Python (DiscardUnknownFields). Make sure you are using those APIs explicitly if you rely on the current behavior.

liujisi on 25 Jul 2017

I have a kind of related issue which I am not sure fits here, but I'm asking anyway :) Using proto3 with python (protobuf __version__ is 3.3.0).

I am intending to use the intermediary pattern as mentioned above. However, I am using the oneof keyword. If I try to deserialize and reserialize a message with an unknown oneof, it discards the unknown oneof data. For example:

message.proto

message Message1 {
  bool flag = 1;
}
message Message2 {
  bool other_flag = 1;
}
message MasterMessageV1 {
  oneof payload {
    Message1 m1 = 1;
  }
}
message MasterMessageV2 {
  oneof payload {
    Message1 m1 = 1;
    Message2 m2 = 2;
  }
}

python test code:

>>> from message_pb2 import Message1, Message2, MasterMessageV1, MasterMessageV2
>>> test_m2 = Message2(other_flag=True)
>>> master = MasterMessageV2(m2=test_m2)
>>> encoded = master.SerializeToString()
>>> print(encoded)
b'\x12\x02\x08\x01'
>>> decoded = MasterMessageV1.FromString(encoded)
>>> print(decoded.SerializeToString())
b''

Essentially, if I serialize a MasterMessageV2 message with the m2 field set, when deserializing as a MasterMessageV1 it discards payload. If I then reserialize the decoded object and then deserialize as a MasterMessageV2 the m2 data is missing.
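
To see why the payload vanishes, it helps to decode the four bytes by hand. The following is a small sketch that does not use the protobuf library; `parse` and `reserialize_as_v1` are hypothetical helpers that mimic a reader which knows only field 1.

```python
def decode_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def parse(buf):
    """Return a list of (field_number, wire_type, value, raw_bytes) tuples."""
    fields, pos = [], 0
    while pos < len(buf):
        start = pos
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("sketch only handles varint and length-delimited fields")
        fields.append((number, wire_type, value, buf[start:pos]))
    return fields


def reserialize_as_v1(buf, keep_unknown):
    """Mimic MasterMessageV1, which knows only field 1 (m1)."""
    known = b"".join(raw for number, _, _, raw in parse(buf) if number == 1)
    unknown = b"".join(raw for number, _, _, raw in parse(buf) if number != 1)
    return known + (unknown if keep_unknown else b"")


encoded = b"\x12\x02\x08\x01"
outer = parse(encoded)          # one field: number 2 (m2), length-delimited
inner = parse(outer[0][2])      # inside it: field 1 (other_flag), varint 1

print(outer[0][:3], inner[0][:3])
print(reserialize_as_v1(encoded, keep_unknown=False))  # b''
print(reserialize_as_v1(encoded, keep_unknown=True))   # b'\x12\x02\x08\x01'
```

The whole message is a single field numbered 2, which MasterMessageV1 does not define, so a parser that discards unknown fields has nothing left to write back.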

I realise that I can just change the type of payload to be bytes and decode them separately with some sort of payload_type enum, but then that breaks a lot of the niceness that comes with protobuf. Is this a bug? If not, is there a way that everyone else is handling this?

jeremyherbert on 14 Aug 2017

@jeremyherbert This is the same issue: proto3 currently doesn't preserve unknown fields. It should be addressed in the next couple of releases.

Note that adding new fields into oneofs is risky. Even with unknown fields preserved, the new field will not be visible in the oneof of the old binary. Instead of seeing an unrecognized type, the old message will treat the oneof as not set. You would have to dig into the unknown fields to distinguish between an unset oneof vs an unrecognized oneof.

liujisi on 14 Aug 2017

@pherl, the pattern "save unknown fields and then discard them" seems excessive to me. Isn't it better just to pass a flag to the parsing function telling it whether or not to save unknown fields while parsing? It would save memory and CPU when you don't need these fields while retaining all the desired benefits. In our workflows we sometimes have most of the fields in a message as unknown, and I'm afraid that parsing them will degrade our performance.

Actually, I would like to have such flag in proto2 too.

vozbu on 7 Sep 2017

@vozbu what language are you using? We do have an API to skip unknown fields in Java. Other languages chose to have a discard-unknown-fields API that runs after parsing is finished, mostly to reduce implementation complexity.

liujisi on 7 Sep 2017

@pherl, I'm talking about C++. I haven't seen the implementation, so I can't judge it. I'm speaking as a user.

vozbu on 8 Sep 2017

@pherl, the doc you shared states "3.4 release (ETA: Q3 2017): Google protobuf implementation for each language will provide APIs to explicitly drop or preserve unknowns for proto3. A temporary flag will be introduced for the default parsing behavior - default to drop unknowns."

3.4 is released. Did that actually make it in? I'm using Java and I see the flag for retaining unknowns, explicitDiscardUnknownFields in CodedInputStream, but the parsing code I see is using:
final boolean shouldDiscardUnknownFieldsProto3() { return explicitDiscardUnknownFields ? true : proto3DiscardUnknownFieldsDefault; }
So even if you don't set that flag you get proto3DiscardUnknownFieldsDefault, which defaults to false and appears not to have any way for external users to change.

jbolla on 14 Sep 2017

The plan is only to provide APIs to explicitly drop unknowns, for those who depend on that behavior. The default is for testing only. In 3.5 we will flip the default.

liujisi on 14 Sep 2017

All languages will be fixed in 3.5.x releases.

liujisi on 11 Dec 2017

@liujisi Now that the direction has changed and support for preserving unknown fields has been added to some implementations, will this recommendation in the official proto3 documentation be changing?

Proto3 implementations can parse messages with unknown fields successfully, however, implementations may or may not support preserving those unknown fields. You should not rely on unknown fields being preserved or dropped. For most Google protocol buffers implementations, unknown fields are not accessible in proto3 via the corresponding proto runtimes, and are dropped and forgotten at deserialization time.

Ref: https://developers.google.com/protocol-buffers/docs/proto3#unknowns

leighmcculloch on 17 Jul 2018

@leighmcculloch Good catch, I'll update that documentation to say that unknown fields are now preserved for proto3 messages as of version 3.5.

acozzette on 17 Jul 2018

Is there a public method to detect if a deserialized message has unknown fields?

This would be useful to check a message which is coming from an untrusted source.
I do not want to relay the message to other services if I am not sure it complies with my proto format. Also, in my case I cannot reserialize it because the serialized message bytes are cryptographically signed (the serializer is not deterministic across different protobuf implementations).

I'm about to replace protobuf with JWT for this :(
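
One way to approach the check described above without reserializing is to scan the signed bytes themselves and compare field numbers against the expected schema. The following is a rough pure-Python sketch, not a protobuf library API: `find_unknown` and the nested-dict schema format are hypothetical, the example schemas are assumptions, and it does not validate wire types or handle groups.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def find_unknown(buf, schema, path=()):
    """Return paths (tuples of field numbers) of fields missing from `schema`.

    `schema` maps a field number to None for a scalar field, or to a nested
    dict of the same shape for a sub-message field.
    """
    unknown = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x7
        payload = b""
        if wire_type == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            pos += 8
        elif wire_type == 2:      # length-delimited
            length, pos = read_varint(buf, pos)
            payload = buf[pos:pos + length]
            pos += length
        elif wire_type == 5:      # fixed 32-bit
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if pos > len(buf):
            raise ValueError("truncated message")
        here = path + (number,)
        if number not in schema:
            unknown.append(here)
        elif schema[number] is not None and wire_type == 2:
            # Known sub-message: check its fields against the nested schema.
            unknown.extend(find_unknown(payload, schema[number], here))
    return unknown


signed_bytes = b"\x12\x02\x08\x01"   # field 2 holding a sub-message with field 1

NEW_SCHEMA = {2: {1: None}}          # assumption: a reader that knows field 2
OLD_SCHEMA = {1: None}               # assumption: a reader that only knows field 1

print(find_unknown(signed_bytes, NEW_SCHEMA))  # []
print(find_unknown(signed_bytes, OLD_SCHEMA))  # [(2,)]
```

Because the scan never rebuilds the message, the original signed bytes can be forwarded untouched when the returned list is empty.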

MalteJ on 26 Aug 2018

There are methods to get a list of unknown fields. But:

In Go the field name suggests it should not be used ("XXX_unrecognized").
And the C++ docs say:

Get the UnknownFieldSet for the message.

This contains fields which were seen when the Message was parsed but were not recognized according to the Message's definition. For proto3 protos, this method will always return an empty UnknownFieldSet.

https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message#Reflection.GetUnknownFields.details

MalteJ on 26 Aug 2018

In Go, there is not currently a reliable way to programmatically interact with unknown fields. At best, you can use proto.DiscardUnknown to recursively discard all unknown fields. However, there is no stable API to iterate and/or modify the current set of unknown fields.

Furthermore, not all unknown fields are stored in XXX_unrecognized; unknown fields in the extension ranges are stored in proto.XXX_InternalExtensions. The current state of affairs is unfortunate, and we're working on v2 of the API, which will provide a stable way to read, modify, and write unknown fields.

dsnet on 4 Sep 2018

I'm coming to this party rather late... I've just upgraded a C# application that uses Protocol Buffers from version 3.4.0 to 3.6.1. The application relies on unknown fields not being preserved. Now by default they ARE preserved, and I've seen a significant and unacceptable increase in memory consumption. (The ratio of known to unknown fields is about 1:5.) There is mention here of APIs being available to explicitly discard the unknown fields, but it's not clear to me whether these were temporary and have now been removed or still exist. What is the current situation? Do these APIs still exist in the version 3.6.1 C# distribution? If so, where can I find details?

kditrj2d on 18 Mar 2019

From my understanding (though I don't work on protobufs, I've just been a part of this thread for a _long_ time), these APIs are here to stay -- you will be able to keep or discard unknown fields depending on your use case.

https://github.com/protocolbuffers/protobuf/blob/e479410564727d8954e0704254f4345f97e3d844/csharp/src/Google.Protobuf/MessageParser.cs#L333-L340 Appears to be what you want -- applied to a MessageParser, it returns a new MessageParser which discards/doesn't discard unknown fields.

Xorlev on 19 Mar 2019

Thanks for the reply. Found it, tried it, code now works again.

kditrj2d on 19 Mar 2019
