Google-cloud-go: bigquery: infer schema from protoc generated struct

Created on 24 Aug 2018  路  17Comments  路  Source: googleapis/google-cloud-go

Client

BigQuery

Describe Your Environment

macOS 10.13.6

Expected Behavior

bigquery.InferSchema() works for structs generated using protoc.

Actual Behavior

googleapi: Error 400: Field EventTime.XXX_NoUnkeyedLiteral is type RECORD but has no schema, invalid

The protobuf compiler generates the following field when generating a Go struct:

XXX_NoUnkeyedLiteral struct{}             `json:"-"`

This field prevents me from inferring the BigQuery schema from the struct. It would be nice to be able to infer the schema from structs generated by the protobuf compiler. Otherwise I'd have to manually specify the schema, which means that I also need to keep it in sync with any changes made to the protobuf message.

bigquery feature request

Most helpful comment

@shollyman @jadekler @jba Apologies for the broad ping, just want to check in on this issue!

Now that the v2 API is out and the XXX fields are gone, it would be absolutely superb if the schema inference understood well-known-types (timestamps most of all) but possibly also the google.type messages from https://github.com/googleapis/googleapis.

At my current shop we're using a hand-rolled protoc plugin to generate BigQuery conversion functions that correctly handle types like google.protobuf.Timestamp and google.type.Date. I believe if support for these types were added to the core BigQuery library, we could do away with this plugin altogether.

From a wider ecosystem perspective, I think supporting the google.protobuf and google.type messages would also promote usage of protobuf in general, since long-term warehousing of protobuf messages in BigQuery would become next to trivial. I can only assume this is how it already works inside Google with Dremel?

Curious to hear your thoughts here!

All 17 comments

Out of curiousity, what version of protoc are you using, and are you using the latest protoc-go-gen?

Oh, sorry. I'm using proto3 with libprotoc 3.6.0 and the latest version of protoc-go-gen (1.2.0 I suppose, since go get will fetch the latest version or master even?).

@pongad Any idea what's up with this?

Hi, any news on this?

Hi @simpajj - sorry, no news yet. We've had a bout of out of office activity these last two weeks. We're catching up on issues this week and next, and hope to get to this soon.

@jadekler Alright :) Thanks for the info!

Unfortunately, InferSchema is working as it should. Its contract is to turn the exported fields of a struct into schema fields, and it's doing that. (Hence my change from bug to FR.)

We could add special logic that says if the struct implements proto.Message, then ignore any fields beginning XXX_. @dsnet, do you think that's a reasonable heuristic? Any plans to add extra fields to proto messages that do not have that prefix?

That would certainly help us out, if it is deemed as a good enough heuristic.

I suppose another alternative would be for the golang protobuf compiler plugin to provide an option for skipping the generation of the XXX_ fields, similar to how gogo/protobuf does it with protoc-gen-gogofaster. However, this might be the wrong place for such a discussion.

In the future, the XXX fields will disappear entirely (https://github.com/golang/protobuf/issues/276). Adding options to make them go away at the cost of reduced protobuf functionality is just a temporarily hack.

We could add special logic that says if the struct implements proto.Message, then ignore any fields beginning XXX_. @dsnet, do you think that's a reasonable heuristic? Any plans to add extra fields to proto messages that do not have that prefix?

Well, there are discussions for an "opaque" API that has no exported fields, where everything is accessed through getter/setter methods. Also, we plan on having official support for dynamic messages, which also would not have any exported fields.

It seems to me that if you wanted to specially treat protobuf messages, you should go all the way and support them fully. With the upcoming v2 API, you can iterate over all the protobuf fields and derive your schema from that. This approach seems to be a lot more robust.

SGTM. We will wait for proto API v2, then use that to implement this feature.

@dsnet Are there any plans for a stable release of the v2 API in the near future?

We're aiming for a stable v2 release hopefully at end of Q1 of 2019.

Thanks. We eagerly await the new API and will wait on this issue until its release.

cc @shollyman

Looks like we're still not yet there on the proto v2 work, https://github.com/golang/protobuf/issues is tracking a variety of blocking issues.

@shollyman @jadekler @jba Apologies for the broad ping, just want to check in on this issue!

Now that the v2 API is out and the XXX fields are gone, it would be absolutely superb if the schema inference understood well-known-types (timestamps most of all) but possibly also the google.type messages from https://github.com/googleapis/googleapis.

At my current shop we're using a hand-rolled protoc plugin to generate BigQuery conversion functions that correctly handle types like google.protobuf.Timestamp and google.type.Date. I believe if support for these types were added to the core BigQuery library, we could do away with this plugin altogether.

From a wider ecosystem perspective, I think supporting the google.protobuf and google.type messages would also promote usage of protobuf in general, since long-term warehousing of protobuf messages in BigQuery would become next to trivial. I can only assume this is how it already works inside Google with Dremel?

Curious to hear your thoughts here!

Update: My team have started putting together a BigQuery encoding for protobufs here, using the v2 reflection APIs:

https://github.com/einride/protobuf-bigquery-go

It's still under development and not complete, but we're already getting some mileage out of it.

I do think that functionality like this could live in the main bigquery package (all the dependencies are already there, and I think reading/writing protobufs to BigQuery is a common use case).

For example as a new ValueSaver implementation (to avoid breaking changes to existing code) that takes protobuf semantics into account (converting google.protobuf.Timestamp to TIMESTAMP, and so forth).

Until then, we're happy to accept pull requests on the encoding above, to make it more complete and robust.

Was this page helpful?
0 / 5 - 0 ratings