As per the stable API versioning policy, we should ideally be freezing the v2 API additions at EOQ. At this point, any API additions/deprecations would take place in the v3 APIs. To do this, we need to open up v4alpha and make v3 canonical. There should be an escape hatch available for any v2 changes that are strongly motivated after EOQ.
My current proposal for the coming week is the following:
The protoxform pipeline is switched to take v3 as the input and write out v4alpha. Envoy does not consume v4alpha yet internally. All new protos and proto modifications happen in v3 canonically. We will send an e-mail out to the community providing clear instructions as this PR merges.
After 1.14.0, we should move Envoy to consuming v4alpha internally, via API boosting tooling, some time between now and EOY. There are perhaps good arguments to delay this, since distributions like Envoy Mobile probably just want to ship with a single stable major version (CC @junr03) and even regular Envoy deployments might want to benefit from not having to take the version upgrade performance hit when running with the latest major version.
CC @envoyproxy/api-shepherds
SGTM!
There are perhaps good arguments to delay this, since distributions like Envoy Mobile probably just want to ship with a single stable major version
😅 so we would have v2, v3, and v4alpha compiled protos, right? I mean, conversely, instead of delaying consuming v4alpha I could prioritize https://github.com/envoyproxy/envoy/issues/10198. I have that issue on the back burner for now, but as we move into having three versions of the protos compiled in, I would want to start working on it. My plan was to just hack something together along the lines you described in #10198 (similar to what I described in https://github.com/lyft/envoy-mobile/issues/710) and get a feel for it.
Otherwise, the plan to freeze v2 sgtm! Envoy Mobile is already fully on V3 (with the exception of the metrics service, but that's easy to upgrade, as we control the server-side service).
How will the v2 API freeze impact call-outs like rate limit, ext authz, ALS?
Today these explicitly call out to the v2 service but use v3 messages, which might be problematic if v2 freezes, as you will not be able to generate an accurate server based on the v2 proto definition.
for reference:
rate limit:
service: https://github.com/envoyproxy/envoy/blob/4e1f9f95618ca354b255fb207a5a6e3097d687d0/source/extensions/filters/common/ratelimit/ratelimit_impl.cc#L26
message: https://github.com/envoyproxy/envoy/blob/4e1f9f95618ca354b255fb207a5a6e3097d687d0/source/extensions/filters/common/ratelimit/ratelimit_impl.cc#L59
ext auth:
service: https://github.com/envoyproxy/envoy/blob/4e1f9f95618ca354b255fb207a5a6e3097d687d0/source/extensions/filters/common/ext_authz/ext_authz_grpc_impl.cc#L21
message: https://github.com/envoyproxy/envoy/blob/4e1f9f95618ca354b255fb207a5a6e3097d687d0/source/extensions/filters/common/ext_authz/ext_authz_grpc_impl.cc#L39
access log:
service: https://github.com/envoyproxy/envoy/blob/4e1f9f95618ca354b255fb207a5a6e3097d687d0/source/extensions/access_loggers/grpc/grpc_access_log_impl.cc#L70
message: https://github.com/envoyproxy/envoy/blob/4e1f9f95618ca354b255fb207a5a6e3097d687d0/source/extensions/access_loggers/grpc/grpc_access_log_impl.cc#L41
@yuval-k I don't anticipate freezing to break anything. It will only mean that we don't add new fields to the frozen protos. Access log service also has a v3 set of endpoints FWIW, so, over time I expect that folks interested in new features in access log service will move to v3.
@htuch thank you for your reply!
I don't think the issue is the need to upgrade the servers/clients; I'm not sure I understand what the upgrade path is.
Right now all these clients send the latest version message (right now v3, but soon v4) to a v2 service. Assuming we change v3 or v4 in a way that's not backwards compatible, IIUC compatibility will silently break? I.e. the only reason it is working right now is that v3 and v2 are binary compatible.
The server owner has no clear way of creating a server with the right messages, as the v2 server proto contains the v2 messages (with the current code you need a v2 server package with a vN server message).
What would the upgrade path look like?
One option is to add a flag to the clients indicating which server version to use, though I'm not sure which message this flag would appear on.
If the flag is set, would the clients explicitly use the old messages? I'm asking as generally I see the latest version in internal code.
@yuval-k this is a really good point. There's a few issues here:
1. Service selection; do we want to connect to the v2/v3/v4alpha etc. endpoint when making a gRPC call? For xDS services, we have the ability to specify explicitly the resource/transport version for messages/services in ConfigSource. For other services, this is largely hardcoded today.
2. What messages does Envoy receive from a service? Envoy internally always operates at the latest version. Due to the stylized way that we manage changes between major versions, a v2 message can always be interpreted as v4alpha etc.
3. What messages does Envoy send to a service? Here again, Envoy internally always operates at the latest version. If we have new fields in v3 that are not present in v2, these will appear to the management server as unknown fields. This might be an issue or not depending on how the management server treats unknown fields.
If we want to avoid confusion around (3) and for functional reasons for (1), we probably want to add explicit selection of the service for gRPC access logs, ext_authz, rate limiting, tracing, metrics, LRS and HDS. This would be a new field when specifying the gRPC service, similar to what we do today for xDS. This is some reasonable chunk of work, mostly mechanical plumbing, but we need tests etc.
I'm inclined to proceed with the freeze and immediately after address (1) and (3). We won't have any issues with unknown fields until we take PRs that do things that are v3 only, so we have a small window to get this right.
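To make the unknown-field behavior in the third point concrete, here is a minimal sketch of the protobuf wire format for string fields, written by hand in Python so it has no dependencies. The field names `domain` and `v3_only_option` and their tag numbers are hypothetical and not taken from any Envoy proto; the sketch only assumes the standard protobuf encoding (varint key of `field_number << 3 | wire_type`, wire type 2 for length-delimited data).

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string_field(field_number: int, value: str) -> bytes:
    """Encode one length-delimited (wire type 2) string field."""
    payload = value.encode("utf-8")
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def decode(buf: bytes, schema: dict[int, str]) -> tuple[dict, dict]:
    """Split a message into fields the schema knows and fields it does not."""
    known, unknown = {}, {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        value = buf[pos:pos + length].decode("utf-8")
        pos += length
        if field_number in schema:
            known[schema[field_number]] = value
        else:
            unknown[field_number] = value
    return known, unknown


# Hypothetical schemas: the v3 message adds field 2, which v2 never had.
V2_SCHEMA = {1: "domain"}
V3_SCHEMA = {1: "domain", 2: "v3_only_option"}

# Envoy (operating at v3) sends both fields to a server built from v2 protos.
wire = encode_string_field(1, "edge") + encode_string_field(2, "strict")
known, unknown = decode(wire, V2_SCHEMA)
```

In this sketch the v2 decoder still recovers `domain`, and the v3-only field ends up in `unknown`, keyed by its tag number. Whether a real server ignores, preserves, or rejects such fields depends on its protobuf runtime and validation settings.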
@mattklein123 @lizan do you folks have thoughts on ^^? @yuval-k does that answer the question?
@htuch thank you for the detailed reply!
If I understand correctly, in scenarios (2, 3) Envoy will be sending/receiving v3/v4alpha messages to/from a v2 service.
This sounds mostly fine. I can see edge cases that will still trigger a silent failure, mainly if the protobuf tags change, breaking binary compatibility.
This can happen in ways that the stylized workflow allows today. For example, if a field is deprecated in v2 and a new field is introduced in its place. If I understand the workflow correctly, this will result in the deprecated field being removed from v3/v4alpha.
This is effectively a protobuf tag change. In this case, in scenario 3 above, the semantics of existing behavior will change: Envoy will not send a field that the server expects, and will send an unknown field in its place (the same goes for the response path, though here Envoy can potentially convert the message to the latest version).
One "easy" way to solve this is to disallow removing deprecated fields from anything under api/envoy/service, even across major versions, until Envoy stops supporting calling out to the external service with the version in which the field is not deprecated.
what do you think?
Yeah, scenario 3 is the problem case. If a new behavior is added to replace a deprecated one, that would have occurred in v2 though, not in v3, as field replacement needs to happen within a major version, with the old deprecated field still around. Hence it should still be safe to send the v3 proto as v2 servers should know about both deprecated and new field in v2 (if they are up-to-date). When v3 is canonical and Envoy is using v4alpha internally, we can get breakages that you describe if we're still sending v2 though, since v3 might deprecate and replace.
We're not planning on moving Envoy to v4alpha for some time though, even though we will start generating v4alpha protos.
I think that as long as Envoy is still doing hardcoded v2 service selection, any v3 breaking changes shouldn't be allowed as you suggest. What I'd like to do is fix these services to properly handle v2/v3/v4alpha service selection right away, as soon as we do the freeze. I've filed https://github.com/envoyproxy/envoy/issues/10609 to track this. We should not move to v4alpha in Envoy until this is fixed.
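The v4alpha breakage described in this comment can be sketched at the level of field tags. This is a toy model, not Envoy or protobuf code: messages are plain dicts keyed by tag number, and the field names `legacy_setting` and `replacement_setting` and their tags are hypothetical. It assumes v3 deprecates a field and adds a replacement, and that v4alpha then drops the deprecated field.

```python
# Hypothetical tag -> name layouts for one service message across versions.
V2_SERVER_FIELDS = {1: "legacy_setting"}
V3_FIELDS = {1: "legacy_setting", 3: "replacement_setting"}  # tag 1 deprecated in v3
V4ALPHA_FIELDS = {3: "replacement_setting"}                  # tag 1 removed


def serialize(message: dict[str, str], schema: dict[int, str]) -> dict[int, str]:
    """Map field names to tag numbers according to the sender's schema."""
    name_to_tag = {name: tag for tag, name in schema.items()}
    return {name_to_tag[name]: value for name, value in message.items()}


def parse(wire: dict[int, str], schema: dict[int, str]) -> tuple[dict, dict]:
    """Split received tags into fields the receiver knows and unknown tags."""
    known = {schema[tag]: value for tag, value in wire.items() if tag in schema}
    unknown = {tag: value for tag, value in wire.items() if tag not in schema}
    return known, unknown


# Envoy operating at v4alpha can only populate the replacement field.
wire = serialize({"replacement_setting": "strict"}, V4ALPHA_FIELDS)

# A server generated from the frozen v2 protos sees nothing it recognizes.
known, unknown = parse(wire, V2_SERVER_FIELDS)
```

Here `known` is empty and the setting arrives only as unknown tag 3, so a v2 server would silently fall back to its default for `legacy_setting`. A sender still at v3 could populate both tags during the deprecation window, which is why the hazard only appears once Envoy moves to v4alpha while service selection is still hardcoded to v2.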
@mattklein123 @lizan do you folks have thoughts on ^^?
+1 this sounds like the right plan. Thanks for calling this out @yuval-k!
Noting that #10612 and #10614 track some of the limitations that the current set of PRs for #10355 imply.