Opentelemetry-specification: Add flush API to processor/exporter

Created on 12 Nov 2019  路  16Comments  路  Source: open-telemetry/opentelemetry-specification

Currently there is no way to ensure that spans are processed and exported. The existing shutdown() API indicates that cleanup can be done which may include flushing any pending data but the processor/exporter is not usable afterwards.

This is a significant problem for environments like AWS lambda where the process is not shutdown but frozen once the function finished. As the process is still running but frozen using shutdown() seems to be wrong.
It is also not specified if the process will be terminated or gets woken up again after some (possible long) time therefore it's vital to push span data to consumer.

Most helpful comment

If the issue is just the name flush, I suggest choosing one that looks more dangerous to call, e.g. force_send_data_now or force_flush_now (I hope users think twice before calling something that starts with force_).

All 16 comments

Hi @Flarna, just following up, sorry for the slow reply.

Why we avoid Flush

We would like to avoid having a method called Flush in the API. This is due to end users seeing this method, and believing it is their responsibility to manage data export. This results in inefficient code, as repeated calls to a manual Flush interact poorly with Exporters which are design to manage their own flushing.

Instead of Flush, we prefer have Close be a synchronous operation which also flushes the remaining spans. This ensures that all the data is exported before the process exists, but does not encourage users to be calling Flush on a timer while their program is running.

Suspend and Awake

In this case, it sounds like there is an event in javascript/lambda that is similar to Close, but not quite. Without knowing the details of this environment, it sounds like Suspend and Awake may be states we want to add to the SDK.

Suspend is a synchronous call, which returns once the SDK has exported all of its data, and is no longer executing operations in the background.

Awake is a synchronous call, which returns the SDK to a working state.

However, I notice that in their most basic form, Suspend and Awake are no different from Close and NewSDK. I would like to understand what would really be gained by adding these. For example, why not make a new SDK every time the program returns to operation? Where are the improvements expected, what are the issues, etc?

More Notes

If the code which must manage Suspend and Awake is fairly generic, then Suspend and Awake should be lifted into the API. Close used to be on the SDK, and was moved to the API for this reason. If there is no generic way of handling these hooks, and this all feels experimental, these methods should remain in the SDK for now.

I understand that in JS, "synchronous" may actually mean "asynchronous with a callback on completion." So, no suggestion on the particular shape of this API, just the intent.

The only other concern I would raise: please make sure this API change is something which is generally relevant, and no overfitted to a particular platform. Hope this is helpful, please let me know if you ahve any questions.

A few arguments for Supend/Awake (or really suspension notification in general):

First, it currently does not seem to be likely that we need an Awake, only Suspend or rather OnPossibleSuspensionPending.

Second, Suspend and Awake are potentially cheaper to implement than Close and NewSDK.

Third, I'm not sure if all languages, especially these using globals, have anticipated restarting a new SDK in the same process after a shutdown. For example, in Python you may only set the global Tracer once. If you shut that down, you can't use that one anymore but you also can't set a new global tracer.

Lastly, in practice shutdown is often implemented sloppily because the assumption is that the whole process will be torn down anyway. It is not unheard of to run into "file in use" errors or deadlocks when attempting to resurrect a shut-down subsystem.

@Oberon00 totally agree with your points about Close being sloppy, and languages not assuming a second instance of the SDK being created and attached as a global.

It would be good to understand what the requirements are for Suspend. For example, what happens if the SDK is long polling. Should it not try to reestablish a connection after Suspend is called? What happens to these network connections when a program is frozen and then resumed? If there are potential issues, then there should also be an Awake notification.

(Honestly, my take on "lambda" or "serverless" is that it is actually an ill-defined concept, as most of the languages being run "serverless" have no such concept within the language itself. This is why I am concerned about presuming that the requirements for one serverless implementation could be considered universal. It's anything from a VM-level suspension, to a bespoke/proprietary language runtime, to "actually we just kill and restart your process rather frequently.")

I don't care that much about the name choosen for a flush like operation. But I think we should select a name which describes the operation done and not one of the usecases where it fits.
I agree that just adding flush may result in users calling it even if it is not needed. But this is more a documenation topic.

Close and Re-create sounds like an overkill as it has more side effects:

  • any open span would get lost
  • in case flush was not able to send all spans in time they would be lost even if process gets resumed and a retry would work fine
  • Configurations received from backend would be lost (exect if they were persisted somewhere)
  • ...

I have AWS lambda and similar serverless concepts in mind but I assume there are other usecases where something like this is helpful. Serverless frameworks support Node, Java, Python,.. therefore I think this is a specification topic.

Regarding long polling and persistent connections used by exporter: They get most likely interrupted and exporter has to reconnect. But network interrupts can happen at any time so nothing special.

Serverless is currently not that "friendly" for usecases like OTel. But honestly speaking if OTel ignores them or misses good support for them it will be a serious limitation for OTel.

As of now it's anyway unclear if flush could be really implemented reliable for Node.JS on AWS lambda. There is no hook allowing to trigger an async flush and sync is effective not possible. But this will change in future and lifecycle events will be emitted to do housekeeping for addons like an OTel tracer.

I agree with @Flarna here that users calling Flush is a documentation issue and shouldn't necessarily impact how we design the API. I'm not opposed to an onSuspend hook, but I still think Flush more accurately depicts what we're trying to do. An onSuspend hook depends on exporter developers understanding what suspension means and what they need to do about it. Especially for custom exporters, this could be a problem. Flush is unambiguous.

I agree that Flush() sounds like the most straightforward and simple approach. The alternate suggestions are IMO overthinking and complicating the API with callbacks and events just to make it more difficult for users to call Flush() too frequently. I don't think it is worth it. Just clearly documenting performance implications of calling Flush() is IMO enough.

@Flarna thank you for your responses to my questions.

I would like to see the following, please:

Positive evidence as to what actually happens when a serverless environment suspends

It is still not clear to me whether flush would actually solve the problem being proposed. I do not want to add something, only to discover that it did not really solve the original problem.

For example:

As of now it's anyway unclear if flush could be really implemented reliable for Node.JS on AWS lambda. There is no hook allowing to trigger an async flush and sync is effective not possible. But this will change in future and lifecycle events will be emitted to do housekeeping for addons like an OTel tracer.

Please include a set of examples from several major serverless runtimes, to explain how this would work in practice.

Flush has created a lot of support issues in the past.

In general, names and semantics matter quite a bit. It will be very difficult to remove or fix API decisions after this project has reached v1.0, so this is not a situation where we try things and iterate later. Arguments that start with "I don't care that much" and "its just a documentation issue" raise a red flag for me. Please _do_ care about the names, and care about how users will naively interact with this API.

Again, we have had a huge problem with Flush in the past, so this is not a theoretical issue. If an API returns an error, users believe they need to handle it. If an API includes a Flush, users believe they need to manage flushing themselves. There was documentation then - users do not read the fine print. Please think hard about whether there is an equally acceptable solution which would avoid this issue.

I am not arguing for a complicated API, or any particular API in this case. And I completely agree that serverless is a valid use case. But, honestly, this feels a bit rushed, and it does not look to me like the use cases and requirements have been fully researched. My primary request - with all API changes going forwards - is to please show examples and thoroughly explore the edge cases, so that there is a solid justification at the time we add it to spec.

In the past, we have made spec changes without doing this work up front, only to create trouble as multiple languages hit the same problem during implementation. So I am trying to change this practice.

This Flush method is not a user-facing API. It's an SDK-internal API, right?

@jmacd yes

@tedsuo drafting a response to your questions now

@jmacd Well, as the user is responsible to instantiate SpanProcessor and SpanExporter and the user is also responsible to call shutdown on them it's user facing.
But it is a different API group then that one to create spans and add attributes to them.

In general I have a tooling in mind on top of the various OpenTelementry components which eases the use by putting together all pieces. This tooling could/should also take care about calling shutdown/flush if needed to avoid the burden all users to take care about this.
Clearly such a tool is tech/application and therefore not part of the spec but APIs should allow to create such tooling.

@tedsuo

Please _do_ care about the names, and care about how users will naively interact with this API.

I care about the name therefore I suggested flush to reflect what it does. But I'm open for anything else as this is nothing I would block.

It will take a while to get concrete examples for several serverless environments and also other languages then Node.js as I haven't used them till now.
For Node.Js on AWS Lambda the AWS controls when the process is started and when it get's recycled. There are no lifecycle hooks.
User supplies a function which is called triggered by an external event. This function get's an event object as argument holding the input parameters and callbacks to tell the runtime when the function is done which its work and to provide the results. Once user signals work is done the runtime freezes the process until a new request comes in or the process is recycled.

A user can delay signaling that it's function is done until flush is complete. So there is an option to do this correct even with async exporters.
The hard part is to create a tooling described above which would do this automagically.

Thanks @tedsuo for your response. Lets see what i can address

Does this help?

It is still not clear to me whether flush would actually solve the problem being proposed. I do not want to add something, only to discover that it did not really solve the original problem.

I completely understand the concern. My experience is mostly with NodeJS on AWS Lambda, but I believe it to be more generally applicable than that. I actually ran into this problem originally while trying to instrument a NodeJS Lambda function with OpenTelemetry and found that spans were being lost because they were still in the BatchSpanProcessor queue when the function returned and the process was suspended and/or killed.

I was able to find 3 solutions:

  1. Using the SimpleSpanProcessor fixes the issue because there are never spans waiting in a queue to be sent, but in my case it was causing export to be called many times with a single span at a time.
  2. Calling the (currently private) _flush method of the BatchSpanProcessor before the function invocation completed caused spans to be exported properly.
  3. Shutting down the BatchSpanProcessor causes it to flush and shut down its exporter. It then must be recreated on each function invocation. Because the same function instance may be used, the function was actually sometimes started with the shut down span processor and exporter already registered with the global tracer. In order to get around this i had to register a new global tracer, span processor, and exporter on each invocation, even if they were already created (there is no way to undo the shutdown of a span processor or exporter).

This proposal is to make (2) the "officially correct" way to do this. (1) and (3) both introduce unnecessary and unintuitive inefficiencies.

This is not specific to any particular implementation of Functions as a Service, but rather to the designed behavior of the batch span processor. If the function completes and the batch span processor has not yet reached the scheduledDelayMillis timeout to export a batch, the function may be killed and the spans may not be exported. The way to solve this is to flush the span processor explicitly before the function returns.

FWIW I have verified this behavior on GCP and Azure as well.

Support issues

@jmacd yes this is an sdk level api, but that doesn't mean end users won't call it. I think what @tedsuo is getting at though is still valid since, at least in JS, editor integrations would happily suggest flush as an autocomplete option on any span processor instance. Typically, there is also inline tooltip documentation though that is simple and easy to understand.

In general, names and semantics matter quite a bit. It will be very difficult to remove or fix API decisions after this project has reached v1.0, so this is not a situation where we try things and iterate later. Arguments that start with "I don't care that much" and "its just a documentation issue" raise a red flag for me. Please _do_ care about the names, and care about how users will naively interact with this API.

I don't think Flarna was trying to imply he doesn't care, but rather that he isn't married to Flush as a name and whatever name is eventually chosen will be ok with him.

Again, we have had a huge problem with Flush in the past, so this is not a theoretical issue. If an API returns an error, users believe they need to handle it. If an API includes a Flush, users believe they need to manage flushing themselves. There was documentation then - users do not read the fine print. Please think hard about whether there is an equally acceptable solution which would avoid this issue.

As stated above, there needs to be a way to flush spans before the function execution completes. In my opinion, giving it a name that does not accurately describe the behavior is unnecessary indirection and may cause confusion.

I am not arguing for a complicated API, or any particular API in this case. And I completely agree that serverless is a valid use case. But, honestly, this feels a bit rushed, and it does not look to me like the use cases and requirements have been fully researched. My primary request - with all API changes going forwards - is to please show examples and thoroughly explore the edge cases, so that there is a solid justification at the time we add it to spec.

In the past, we have made spec changes without doing this work up front, only to create trouble as multiple languages hit the same problem during implementation. So I am trying to change this practice.

For the record, I think of "user" as the developers who write instrumentation with an OTel API. I think of "operator" as the developers who control the main() function and the deployment environment--essentially the SDK configuration. That's the sense in which I consider Flush not a user-facing API.

If the issue is just the name flush, I suggest choosing one that looks more dangerous to call, e.g. force_send_data_now or force_flush_now (I hope users think twice before calling something that starts with force_).

@tedsuo would force_flush or some more sinister sounding name alleviate your concern about users calling it? If so I can modify the PR.

Ahh, sorry for being a bottleneck. I really appreciate your answers.

  • I'm glad to see proof that this works in NodeJS. I was concerned that this was a bit theoretical but you've convinced me.
  • I agree that ForceFlush is better that Flush.
  • It helps that ForceFlush is an SDK method. This makes it clearer that instrumentation packages should not be calling it.

As an aside: this also points to the need for a good streaming protocol, which sends every span as it completes while avoiding an excessive overhead, like network handshakes. @rnburn has done some really good work on designing one for C++, that shows how far you can go with this pattern. I am convinced that you will still need ForceFlush or some kind of trigger, to avoid a race condition on the final span. But this could help avoid some kind of bottleneck occurring on suspend.

I still wonder whether we will see issues related to tearing down and reestablishing network connections... but that can be handled separately, as part of an improved streaming or lambda-specific SpanProcessor.

The use-case I had in mind for https://github.com/open-telemetry/opentelemetry-cpp/pull/45 is a little different.

Flush is intended to support a mode where a tracer is run without any networking or background thread. Since a tracer doesn't have access to IO in that case, it needs for the flush (or something similar) to be invoked externally by the user.

This mode is for applications (e.g. Envoy) that want to integrate a tracer into their own networking/threading architecture. For background, see
https://github.com/envoyproxy/envoy/issues/1196#issuecomment-317050421
or how flushing is used for lightstep's tracer in envoy
https://github.com/envoyproxy/envoy/blob/master/source/extensions/tracers/lightstep/lightstep_tracer_impl.cc#L118-L122

Was this page helpful?
0 / 5 - 0 ratings

Related issues

yurishkuro picture yurishkuro  路  5Comments

andrewhsu picture andrewhsu  路  3Comments

XSAM picture XSAM  路  3Comments

cijothomas picture cijothomas  路  3Comments

tigrannajaryan picture tigrannajaryan  路  5Comments