Vector: Sending distributions to DataDog

Created on 20 Jul 2020 · 9 comments · Source: timberio/vector

We would like for Vector to support sending DataDog's distribution data type to their servers. At the moment, it's unclear how we should go about supporting this.

As far as I can tell, the rough lifecycle of a distribution in the normal DataDog stack of tools is as follows:

  1. The application invokes dog.distribution(...) from its DogStatsD library
  2. The DogStatsD library submits that datapoint as a statsd packet with the d type identifier (see the sketch after this list)
  3. The DataDog agent collects those points into a sketch map
  4. Those sketches are then serialized and sent to what appears to be a beta API endpoint
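For reference, here is a minimal sketch of what step 2 looks like on the wire, assuming the standard DogStatsD text format where d is the distribution type identifier; the metric name, value, and tag below are invented for illustration:

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Bind an ephemeral local port; DogStatsD listens on UDP 8125 by default.
    let socket = UdpSocket::bind("0.0.0.0:0")?;

    // DogStatsD distribution packet shape: <metric name>:<value>|d|#<tags>
    // "request.latency" and "env:prod" are illustrative, not from this issue.
    let packet = format!("{}:{}|d|#{}", "request.latency", 3.2, "env:prod");

    socket.send_to(packet.as_bytes(), "127.0.0.1:8125")?;
    Ok(())
}
```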

The sketch implementation itself is relatively well known, so the challenge would mostly be to determine how those sketches should be serialized and sent to the correct API endpoint. Some further code-diving suggests that there are both protobuf and JSON representations, with protobuf seemingly preferred.

On the other hand, it seems that some DataDog client libraries may send distribution datapoints directly. The Python lib references a distribution_points API resource that appears accessible via a simple POST of JSON data. This could be a much more lightweight option for Vector if it's viable, allowing us to sidestep (at least temporarily) the complexity of aggregation.

To summarize, we should try to answer the following questions:

  1. Is it supported for third-party tools to send distributions to the DataDog API?
  2. If yes, should we be sending aggregated sketches or collections of samples?
  3. Which is the best-supported API endpoint and format for doing so?
Labels: data model, metrics, outside help, requirements, datadog, datadog_metrics, enhancement

All 9 comments

@jamtur01 assigning this to you since we just need to unblock this work. I've reached out to our contact at Datadog and have not received a response. We should try to answer these questions this sprint, if possible.

DataDog supports third-party tools sending distribution data. The format of the JSON sketch isn't well documented at this point, but the distribution_points endpoint is simple and supported. It is nearly identical to the metric submission format, except that instead of a point being a timestamp and a single float value, a "point" is a timestamp and a list of values. This endpoint does the sketch conversion immediately on intake, server-side. They intend to allow collection and submission of sketches in addition to raw points, to support use cases where collecting and serializing a large volume of samples is not feasible, but they don't have a supported endpoint for that at this time.

So the recommendation is to go ahead and use the distribution_points API endpoint, sending points as timestamp-and-list-of-values tuples.
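To make the recommended format concrete, here is a minimal sketch of a distribution_points request body as described above, built with serde_json. The metric name, timestamp, sample values, and tags are invented, and the exact field set (for example whether a type field is required) is an assumption rather than something confirmed in this thread:

```rust
use serde_json::json;

fn main() {
    // Each "point" is a timestamp paired with a list of raw sample values,
    // rather than the single float used by the regular series endpoint.
    let body = json!({
        "series": [{
            "metric": "request.latency",               // illustrative name
            "points": [[1595232000, [0.3, 1.2, 4.5]]], // (timestamp, [values])
            "tags": ["env:prod"],
            "type": "distribution"                     // assumed field, not confirmed above
        }]
    });

    // The sink would POST this JSON to the distribution_points endpoint
    // along with the API key; printing it here stands in for that request.
    println!("{}", body);
}
```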

@lukesteensen It's unblocked. Let's chat about where it could fit.

Sounds good! I think this should be a relatively straightforward expansion of where we ended up in #2913 (/cc @ktff). The representation that we send will be the same as our existing distribution; we'll just need to serialize it and route it to the correct DataDog endpoint. It looks like the Python library has some code we could follow along with.

So, the datadog_metrics sink currently transforms a batch of events into a single HTTP request, but we need to support two requests per batch, since a batch can contain both distribution and non-distribution metrics, which go to different endpoints and therefore different URIs. Some ways to do this are:

  1. Extend the build_request method in the HttpSink trait to return Vec<Request>, and update all of its call sites, and so on.
  2. Extend the datadog_metrics sink to contain two sinks internally, one for each endpoint, and split the events between them.
  3. Remove batching.

The first seems like the better option if we expect more cases like this. Otherwise, option 2 is a more local change and there shouldn't be much duplicated code, but there will be an issue with the Acker, since we would need to synchronize those two sinks somehow (a rough sketch of the split follows this comment). Option 3 is the easiest, but we would lose a feature.
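A rough sketch of the split behind option 2 (and the partitioning discussed below), assuming simplified stand-ins for Vector's metric types and hypothetical endpoint paths; the real sink would feed these groups into the existing batch/partition machinery rather than hand-rolling the routing:

```rust
// Simplified stand-ins for Vector's internal metric types.
enum MetricValue {
    Counter { value: f64 },
    Distribution { values: Vec<f64> },
}

struct Metric {
    name: String,
    value: MetricValue,
}

// Hypothetical paths, used only to illustrate the per-endpoint routing.
const SERIES_URI: &str = "/api/v1/series";
const DISTRIBUTION_URI: &str = "/api/v1/distribution_points";

/// Split one batch into two groups so each group becomes its own HTTP request.
fn partition_by_endpoint(batch: Vec<Metric>) -> (Vec<Metric>, Vec<Metric>) {
    batch
        .into_iter()
        .partition(|m| !matches!(m.value, MetricValue::Distribution { .. }))
}

fn main() {
    let batch = vec![
        Metric { name: "requests".into(), value: MetricValue::Counter { value: 1.0 } },
        Metric { name: "latency".into(), value: MetricValue::Distribution { values: vec![0.3, 1.2] } },
    ];

    let (series, distributions) = partition_by_endpoint(batch);
    // Each non-empty group would then be serialized and sent to its own URI.
    println!("{} -> {} metric(s)", SERIES_URI, series.len());
    println!("{} -> {} metric(s)", DISTRIBUTION_URI, distributions.len());
}
```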

cc. @lukesteensen

This sounds reminiscent of the partitioning we're doing in the aws_cloudwatch_logs sink.

Yes, that's exactly what's needed.

I think the best approach is probably to split into two sinks internally and partition events across them. Hopefully we can handle acks the same way we do for other partitioned sinks.

Yep. Luckily our sink utils mesh together quite nicely, so I was able to reuse the partitioning logic, although adding some high-level documentation of the whole service/batch/buffer/partition stack is something we should consider.
