Vector: metric sum with batch to `aws_cloudwatch_metrics` or an aggregation transform

Created on 12 Sep 2019 · 10 comments · Source: timberio/vector

Hi.

The CloudWatch Metrics API (PutMetricData) has some hard limits:

  • 20 items (MetricDatum) per request;
  • 40 KB for HTTP POST requests.

I am trying to use Vector in production for counting requests by category inside my application, but the application receives 10k rpm, so I need to batch it all before sending it to CloudWatch.

I am getting this error using the aws_cloudwatch_metrics sink:

Sep 11 20:25:07 xxxxxxxxxxxxxxxxxxxxxxxxxxxxx vector[21662]: Sep 11 20:25:07.470 ERROR sink{name="cloudwatch_metrics"}: log: encountered non-retriable error. error=unknown error  log.target="vector::sinks::util::retries" log.module_path="vector::sinks::util::retries" log.file="src/sinks/util/retries.rs" log.line=86
Sep 11 20:25:07 xxxxxxxxxxxxxxxxxxxxxxxxxxxxx vector[21662]: Sep 11 20:25:07.470 ERROR sink{name="cloudwatch_metrics"}: vector::sinks::util::retries: encountered non-retriable error. error=unknown error
Sep 11 20:25:07 xxxxxxxxxxxxxxxxxxxxxxxxxxxxx vector[21662]: Sep 11 20:25:07.471 ERROR sink{name="cloudwatch_metrics"}: log: request failed. error=unknown error  log.target="vector::sinks::util" log.module_path="vector::sinks::util" log.file="src/sinks/util/mod.rs" log.line=200
Sep 11 20:25:07 xxxxxxxxxxxxxxxxxxxxxxxxxxxxx vector[21662]: Sep 11 20:25:07.471 ERROR sink{name="cloudwatch_metrics"}: vector::sinks::util: request failed. error=unknown error

My config for the sink:

[sinks.cloudwatch_metrics]
  inputs = ["log_to_metric"]
  type = "aws_cloudwatch_metrics"
  namespace = "Application"
  region = "sa-east-1"
  batch_size = 10000
  batch_timeout = 60

[sinks.cloudwatch_metrics.buffer]
  type = "memory"
  when_full = "block"
  num_items = 5000

Is there a way to sum the counter values of the metrics according to their tags before sinking them?

If not, should Vector be responsible for doing it?

If so, what would be the better approach: creating a transform to aggregate the data, or doing something in the sink and its buffer/batch configuration?

Labels: requirements, aws_cloudwatch_metrics, bug

Most helpful comment

Hi! I've done some tests (in production, haha…) with the new version once I saw the fix merged.
The behaviour was correct! I think we can close this issue. :slightly_smiling_face:

All 10 comments

Hi @gumieri, thanks for the details. We are currently designing metrics aggregation, and your input is very much welcome.

Unfortunately, setting batch_size to 10000 will not work, because it will try to put 10000 metrics into a single request, which exceeds the limits (as you've said).

Could I ask a few questions to understand your use case better?

Which types of metrics do you use? Is it only counters?
Are there many unique tag combinations?
Would you tolerate the loss of precise timestamp information because of aggregation?
Do you think sampling could help?

Unfortunately, setting batch_size to 10000 will not work, because it will try to put 10000 metrics into a single request, which exceeds the limits (as you've said).

Oh, I see. I took the batch_size configuration from the aws_cloudwatch_logs sink, where it is described in bytes, so I was not sure how much was too much.
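
For reference, a sink configuration that stays within those request limits might look like the sketch below. This is only a sketch: it assumes batch_size for this sink counts metric events rather than bytes, as the reply above implies, and it reuses the option names from the config already shown in this issue.

[sinks.cloudwatch_metrics]
  inputs = ["log_to_metric"]
  type = "aws_cloudwatch_metrics"
  namespace = "Application"
  region = "sa-east-1"
  # assumption: batch_size is a count of MetricDatum items for this sink,
  # so keep it at or below the 20-item PutMetricData limit
  batch_size = 20
  batch_timeout = 60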

Could I ask a few questions to understand your use case better?

Sure!

Which types of metrics do you use? Is it only counters?

At the moment I have some counter metrics:

  • number of HTTP requests to a variety of "microservices";
  • number of messages in a queue;
  • number of consumers (workers) listening to a queue;
  • the ratio of messages to consumers for a queue (to help scale those consumers).

Are there many unique tag combinations?

At the moment there are 22 unique tag combinations for a specific application.
Since the namespace is a constant, the application name must be a tag.
Otherwise I would have to start another Vector instance or configure another sink for each application.

Would you tolerate the loss of precise timestamp information because of aggregation?

I cannot see a way of aggregating without losing precision. The only precision needed is relative to the batch_timeout: the moment the metrics are sent, or something like the average of the timestamps…

Do you think sampling could help?

I cannot see sampling being beneficial; I would only use it if there were no other option.

Thanks @gumieri, we appreciate the detailed information. I think it's worth having the team weigh in and produce a spec for resolving this. Ideally, you would not have to worry about this, since the default CloudWatch metrics limits seem to be liberal:

40 KB for HTTP POST requests. PutMetricData can handle 150 transactions per second (TPS), which is the maximum number of operation requests you can make per second without being throttled.

You can request a limit increase.

We'll also think about a way you can control and specify loss of precision.

One thing we could do that might not be that invasive is to "dedup" each batch of events that arrive in the sink before sending them. It wouldn't work for every data type right now, but counters and gauges are two that allow for relatively straightforward merging.

That could either happen in the sink itself, before encoding, or we could build a smarter batch type that knows how to either insert or update existing values based on the type/name/tags combo. A batch type would likely work better with the various batching configs.
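
As a rough illustration of that idea, the sketch below (hypothetical, simplified types; not Vector's actual internals) keys counters by their name and tag set and sums their values, which is the kind of merge a smarter batch type could perform before encoding:

  use std::collections::HashMap;

  // Hypothetical, simplified counter type; Vector's real internal metric types differ.
  struct Counter {
      name: String,
      tags: Vec<(String, String)>, // kept sorted so equal tag sets compare equal
      value: f64,
  }

  // Sum counters sharing the same (name, tags) key, so a batch of many small
  // increments collapses into far fewer MetricDatum-sized items.
  fn merge_counters(batch: Vec<Counter>) -> Vec<Counter> {
      let mut merged: HashMap<(String, Vec<(String, String)>), f64> = HashMap::new();
      for Counter { name, tags, value } in batch {
          *merged.entry((name, tags)).or_insert(0.0) += value;
      }
      merged
          .into_iter()
          .map(|((name, tags), value)| Counter { name, tags, value })
          .collect()
  }

  fn main() {
      // 100 counters of 1.0 with identical name/tags collapse into one counter of 100.0.
      let batch: Vec<Counter> = (0..100)
          .map(|_| Counter {
              name: "http_requests".into(),
              tags: vec![("service".into(), "api".into())],
              value: 1.0,
          })
          .collect();
      let merged = merge_counters(batch);
      assert_eq!(merged.len(), 1);
      assert_eq!(merged[0].value, 100.0);
  }

Gauges would presumably merge with a different rule (for example, keeping the latest value rather than summing), which fits the point above that counters and gauges need type-specific but still straightforward merging.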

@gumieri We have merged the deduplication/aggregation of metrics in the buffer. For instance, if the buffer contains 100 Counters with a value of 1.0 (and the same name and tags), it will emit only a single Counter carrying the value of 100.0.

Looking forward to your feedback!

I am really amazed by the work of your team and the attention I received on this and another issue I opened. I feel these are issues without much significance, but even so your team was very considerate and is delivering fixes with great agility. Thank you very much!

At the moment I am using Vector in production to feed a Prometheus/Grafana setup, and it has been amazing.
Now, with this aggregation feature for aws_cloudwatch_metrics, I will be able to optimize the actual cost of some very important metrics. Thank you! :heart:

I will not be able to test this feature this month, but I will be back with feedback as soon as possible. :wink:

Hi @loony-bean, I've tried running (make run) the master branch (commit ref: 55766802be0a), but I've seen no changes. Should I try a different branch or change something in the vector.toml config?

Thanks for pointing that out, @gumieri!
Apparently there is a bug in the implementation; I'm looking into it.

Hi @gumieri, did you get a chance to check the new version?

Hi! I've done some tests (in production, haha…) with the new version once I saw the fix merged.
The behaviour was correct! I think we can close this issue. :slightly_smiling_face:
