Sidekiq: Remove duplicate/redundant metrics

Created on 14 Nov 2019 · 21Comments · Source: mperham/sidekiq

Ruby version: 2.4.7
Sidekiq / Pro / Enterprise version(s): 5.2.5 / 4.0.4 / 1.8.1

Sidekiq.configure_server do |config|
  Sidekiq::Pro.dogstatsd = ->{ CustomMetric.dogstatsd } # we have a dogstatsd instance we use for all metrics so we pass it to Sidekiq too
  config.server_middleware do |chain|
    require 'sidekiq/middleware/server/statsd'
    chain.add Sidekiq::Middleware::Server::Statsd
  end

  logging_frequency_seconds = 30
  config.retain_history(logging_frequency_seconds)
end

Sidekiq currently has duplicate metrics for jobs.count, jobs.failure, jobs.success with a worker tag for each. We also have metrics for jobs.<worker_name>.count|failure|success. Outside of this, we have jobs.<worker_name>.perform metrics which we don't have in the more general jobs.perform style like count/failure/success above. I'd like to propose a new design.

Remove all jobs.<worker_name>.<metric> metrics
Keep jobs.count|failure|success
To this add, jobs.perform related metrics
Both of the above should have worker tags

Benefits of above approach:

Redundancy removal. This simplifies understanding, finding, and graphing metrics. It also reduces the metrics footprint of Sidekiq which can have implications on cost depending on # of workers/plan you pay for/etc. Currently, for 100 workers, you'd get 300 redundant custom metrics.
It unlocks more global graphs with the jobs.perform metric where users can use that data to find poorly written workers, notice worker time spikes (potentially due to external or internal services timing out, etc.)

Are you using an old version? Yes
Have you checked the changelogs to see if your issue has been fixed in a later version? Yes

Source

nakulpathak3

All 21 comments

This is a good issue but it likely requires a major version bump unless we make it opt-in.

On Nov 13, 2019, at 20:00, Nakul Pathak notifications@github.com wrote:

Ruby version: 2.4.7
Sidekiq / Pro / Enterprise version(s): 5.2.5 / 4.0.4 / 1.8.1

Sidekiq.configure_server do |config|
Sidekiq::Pro.dogstatsd = ->{ CustomMetric.dogstatsd } # we have a dogstatsd instance we use for all metrics so we pass it to Sidekiq too
config.server_middleware do |chain|
require 'sidekiq/middleware/server/statsd'
chain.add Sidekiq::Middleware::Server::Statsd
end

logging_frequency_seconds = 30
config.retain_history(logging_frequency_seconds)
end
Sidekiq currently has duplicate metrics for jobs.count, jobs.failure, jobs.success with a worker tag for each. We also have metrics for jobs..count|failure|success. Outside of this, we have jobs..processing_time metrics which we don't have in the more general jobs.perform style like count/failure/success above. I'd like to propose a new design.

Remove all jobs.. metrics
Keep jobs.count|failure|success
To this add, jobs.processing_time related metrics
Both of the above should have worker tags
Benefits of above approach:

Redundancy removal. This simplifies understanding, finding, and graphing metrics. It also reduces the metrics footprint of Sidekiq which can have implications on cost depending on # of workers/plan you pay for/etc.
It unlocks more global graphs with the jobs.processing_time metric where users can use that data to find poorly written workers, notice worker time spikes (potentially due to external or internal services timing out, etc.)
Are you using an old version? Yes
Have you checked the changelogs to see if your issue has been fixed in a later version? Yes

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.

mperham on 14 Nov 2019

Great idea.

Same goes for Ent Historical Metrics https://github.com/mperham/sidekiq/wiki/Ent-Historical-Metrics

The sidekiq.enqueued.#{x} - Current Size of queue x metrics could be reported as a single metric sidekiq.enqueued with a tag for each queue, following recommendation from https://docs.datadoghq.com/dashboards/faq/when-i-query-can-i-use-wildcards-in-metric-names-and-events/

However I don't think tags are standard to the statsd protocol, so this would be a Datadog (or whoever supports tags) specific feature?

Is there any recommendation to do this today without a major version? Perhaps you could add a way to disable this metric when calling retain_history so we could follow a pattern similar to reporting queue latency, for queue size?

Something like...

  config.retain_history(30, except: ['enqueued']) do |s|
    s.batch do |b|
      %w(penny nickel dime).each do |qname|
        q = Sidekiq::Queue.new(qname)
        tag_name = "queue:#{q.name}"
        b.gauge("sidekiq.queue.latency", q.latency, tags: [tag_name])
        b.gauge("sidekiq.queue.size", q.size, tags: [tag_name])
      end
    end
  end

jeffblake on 7 Jun 2020

👍1

Based on the Ent changelog for 6.1.0/2.1.0, we expected to see metrics reported to Datadog such as:

sidekiq.enqueued.#{name} -> sidekiq.queue.size with tag queue:#{name}
sidekiq.latency.#{name} -> sidekiq.queue.latency with tag queue:#{name}

While we did loose the queue name in the metric, I am not seeing queue tags applied.
Possible bug or is more required to get this change enabled?

We are seeing metrics and tags applied as:

job.WorkerName.count -> job.count with tag worker:WorkerName
job.WorkerName.perform -> job.perform with tag worker:WorkerName
job.WorkerName.failure -> job.failure with tag worker:WorkerName

as expected from the Pro changelog.

mcg on 7 Jul 2020

          stats.queues.each_pair do |q, length|
            s.gauge("sidekiq.queue.size", length, queue: q)
          end
          s.gauge("sidekiq.queue.latency", stats.default_queue_latency, queue: "default")

You can see the queue tag applied right here. I'm not sure why you aren't seeing it.

mperham on 7 Jul 2020

Ah, it needs to be tags: ["queue:default"], etc. My bug.

mperham on 7 Jul 2020

ruby stats.queues.each_pair do |q, length| s.gauge("sidekiq.queue.size", length, tags: ["queue:#{q}"]) end s.gauge("sidekiq.queue.latency", stats.default_queue_latency, tags: ["queue:default"])

Here's the fix.

mperham on 7 Jul 2020

Very nice! When will this be released?

mcg on 7 Jul 2020

You can monkeypatch capture_default if you need a fix immediately. I'll release Ent 2.1.1 Friday along with Sidekiq 6.1.1.

mperham on 7 Jul 2020

👍1

@mperham Sorry to have to bug again. We just deployed 6.1.1 and are still not seeing queue tags assigned to sidekiq.queue.size and sidekiq.queue.latency. We are now seeing both worker and queue tags being assigned to job.count, job.perform and job.failure. From the changelog we should only have worker for these.

mcg on 22 Jul 2020

Ent 2.1.1 has not been released.

On Jul 22, 2020, at 09:58, Matthew notifications@github.com wrote:

@mperham Sorry to have to bug again. We just deployed 6.1.1 and are still not seeing queue tags assigned to sidekiq.queue.size and sidekiq.queue.latency. We are now seeing both worker and queue tags being assigned to job.count, job.perform and job.failure. From the changelog we should only have worker for these.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

mperham on 22 Jul 2020

Sigh, yeah just noticed that. Changelog has it as being cut. Sorry for the noise.

mcg on 22 Jul 2020

Fixed, sorry for the confusion!

Released Sidekiq Enterprise 2.1.1

mperham on 22 Jul 2020

I slipped application preload into 2.1.1 and that distracted me from this promise, hope it was worth the wait!

https://github.com/mperham/sidekiq/wiki/Ent-Multi-Process#application-preload

mperham on 22 Jul 2020

Thanks for getting the gem out and it does fix the issue, new issue with the metric name tho. Metrics from the Ent integration have include an extra "sidekiq". Datadog requires a "sidekiq" namespace the Sidekiq integration and with the Pro metrics this is fine(no extra sidekiq), but it looks like you're adding the namespace in the metric name. Basically "sidekiq.queue.size" becomes "sidekiq.sidekiq.queue.size". I can create a new issue for this if needed.

mcg on 22 Jul 2020

All of the history metrics have a sidekiq prefix and have since day one. Do you suggest I remove sidekiq from all here?

```ruby
s.gauge("sidekiq.failed", stats.failed)
s.gauge("sidekiq.processed", stats.processed)
s.gauge("sidekiq.enqueued", stats.enqueued)
s.gauge("sidekiq.retries", stats.retry_size)
s.gauge("sidekiq.dead", stats.dead_size)
s.gauge("sidekiq.scheduled", stats.scheduled_size)
s.gauge("sidekiq.busy", stats.workers_size)

      stats.queues.each_pair do |q, length|
        s.gauge("sidekiq.queue.size", length, tags: ["queue:#{q}"])
      end
      s.gauge("sidekiq.queue.latency", stats.default_queue_latency, tags: ["queue:default"])

```

mperham on 22 Jul 2020

The thinking here is that "jobs" metrics are part of your app metrics whereas the history metrics are more like infrastructure metrics and part of Sidekiq itself, so your namespace should be "appname". Then you will get metrics like:

appname.job.perform
appname.sidekiq.queue.size
etc

In other words, my working assumption is that your statsd namespace is not "sidekiq".

mperham on 22 Jul 2020

Unsure, but to match what the Pro metrics send, I think so. Confused as to why we were not seeing the double sidekiq until this change tho. Anyhow, if you don't use the "sidekiq" namespace in the Datadog integration all metrics from sidekiq are treated as custom. I think a better approach, that works better with the DD integration at least, would be to send:

appname.job.perform
appname.queue.size

Where appname is not the namespace.

The Datadog integration will append sidekiq, via the namespace. So these become:

sidekiq.appname.job.perform
sidekiq.appname.queue.size

Right now these become:

sidekiq.job.perform
sidekiq.sidekiq.queue.size

And for clarification this is the DD setup I am referencing(https://docs.datadoghq.com/integrations/sidekiq/):

  require 'datadog/statsd' # gem 'dogstatsd-ruby'

    Sidekiq::Pro.dogstatsd = ->{ Datadog::Statsd.new('metrics.example.com', 8125, namespace:'sidekiq') }

You must use namespace:'sidekiq' here.

mcg on 22 Jul 2020

Sounds good. I can remove the sidekiq prefix in 2.2. I don't want to do it in a patch release like 2.1.2.

mperham on 23 Jul 2020

Do you want to open a new issue so you can track it and I can link the issue in the changelog?

mperham on 23 Jul 2020

👍1

You must use namespace:'sidekiq' here.

@mcg where did you get this requirement from? We use the same integration without that namespace and it works fine.

nakulpathak3 on 14 Aug 2020

@nakulpathak3 From Datadog support. It “works” without the namespace, but the metrics will be billed as custom. Are your metrics showing up as custom?

mcg on 15 Aug 2020

Was this page helpful?

0 / 5 - 0 ratings