Thanos: Query: rate function returns different results in Thanos Query and in the Prometheus instance

Created on 24 Jul 2018 · 5 Comments · Source: thanos-io/thanos

Thanos, Prometheus and Golang version used


Thanos: improbable/thanos:master-2018-07-18-a412835
Prometheus: v2.3.0

What happened

I queried the rate of Kubernetes container CPU usage with the expression rate(container_cpu_usage_seconds_total{image!="",pod_name!="",container_name!="POD"}[30s]) in the Thanos Query UI, and no data was returned.

But the same expression returned some series when I queried the Prometheus instance directly.

What you expected to happen

The query result in the Thanos Query UI should be the same as in the original Prometheus instance.

How to reproduce it (as minimally and precisely as possible):

Set the scrape interval to 15s in the Prometheus scrape config, then query with the expression rate(container_cpu_usage_seconds_total{image!="",pod_name!="",container_name!="POD"}[30s])
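
This setup is sensitive to a single lost sample: with a 15s scrape interval, a 30s range typically holds only two samples, and rate needs at least two samples in its window to return a value. The sketch below is a simplified Python model of that rule, not Prometheus code. `simple_rate` is a hypothetical helper, the window is assumed to be closed at both ends, and the extrapolation to the window boundaries that the real rate performs is left out.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """Per-second increase over the samples whose timestamp lies in [start, end].

    Simplified model (assumption): no extrapolation to the window boundaries.
    Returns None when fewer than two samples fall inside the window.
    """
    window = [s for s in samples if start <= s[0] <= end]
    if len(window) < 2:
        return None  # a single sample cannot yield a rate

    duration = window[-1][0] - window[0][0]
    if duration <= 0:
        return None  # guard against duplicate timestamps

    increase = 0.0
    prev = window[0][1]
    for _, value in window[1:]:
        # A drop in value is treated as a counter reset: count up from zero.
        increase += value if value < prev else value - prev
        prev = value
    return increase / duration


# Hypothetical samples scraped every 15 seconds.
scraped = [(0.0, 100.0), (15.0, 103.0), (30.0, 109.0)]
print(simple_rate(scraped, 10.0, 40.0))      # two samples fall in the window
print(simple_rate(scraped[:2], 10.0, 40.0))  # only one sample: no result
```

If the query layer drops one of the two samples, the second case applies and the expression returns no data, which matches the empty result described above.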

Full logs to relevant components

There were no related query logs in Thanos Query or the Sidecar.

Anything else we need to know

Here is the Prometheus scrape configuration:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    monitor: prometheus
    replica: prom-thanos-s3-0
scrape_configs:
- job_name: prometheus
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: kubernetes-apiservers
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: default;kubernetes;https
    replacement: $1
    action: keep
- job_name: kubernetes-nodes
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace
- job_name: kubernetes-cadvisor
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
- job_name: kubernetes-service-endpoints
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: node-exporter
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-state-metrics
    replacement: $1
    action: drop
- job_name: kubernetes-pods
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_app]
    separator: ;
    regex: frostmourne
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_namespace]
    separator: ;
    regex: data
    replacement: $1
    action: drop
- job_name: node-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - devops
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_host_ip]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: node-exporter
    replacement: $1
    action: keep
  metric_relabel_configs:
  - source_labels: [node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
- job_name: kube-state-metrics
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - kube-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-state-metrics
    replacement: $1
    action: keep
  metric_relabel_configs:
  - source_labels: [pod]
    separator: ;
    regex: (.*)
    target_label: pod_name
    replacement: $1
    action: replace
  - source_labels: [container]
    separator: ;
    regex: (.*)
    target_label: container_name
    replacement: $1
    action: replace
bug

All 5 comments

Thanks for reporting. The scrape configuration is not that useful. What would be useful is the time range you queried. In rc1 downsampling is enabled, but it does not always work. Might that be the issue?

@jojohappy Can you share how you tracked down that bug? 0.o Does it fix this issue, or is it something additional you found?

Anyway, great job!

I didn't fully know the query steps for either Prometheus or Thanos, so I first spent time tracing the code to find the line where the result is fetched.

Based on the query steps, here is the checklist of things to compare with the original Prometheus:

  1. query expression
  2. the range of time
  3. the params for function querier.Select
  4. the storage.SeriesSet returned by querier.Select
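
The first checklist items can also be compared from the outside, since Thanos Query and Prometheus both serve the `/api/v1/query` HTTP endpoint. The following is a minimal sketch in Python that compares two already-parsed response bodies. `series_keys` and `missing_series` are hypothetical helpers, and ignoring the `monitor` and `replica` labels rests on the assumption that Thanos attaches the external labels from the config above while a direct Prometheus query does not.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


def series_keys(response: dict, ignore_labels: Iterable[str] = ()) -> Set[LabelSet]:
    """Label sets of all series in a parsed /api/v1/query response body."""
    ignored = set(ignore_labels)
    keys: Set[LabelSet] = set()
    for series in response["data"]["result"]:
        labels = {k: v for k, v in series["metric"].items() if k not in ignored}
        keys.add(frozenset(labels.items()))
    return keys


def missing_series(
    prometheus: dict,
    thanos: dict,
    ignore_labels: Iterable[str] = ("monitor", "replica"),  # assumption, see lead-in
) -> List[Dict[str, str]]:
    """Series present in the Prometheus response but absent from the Thanos one."""
    ignore = tuple(ignore_labels)
    missing = series_keys(prometheus, ignore) - series_keys(thanos, ignore)
    # Sort the label pairs so the output order is deterministic.
    return [dict(pairs) for pairs in sorted(sorted(key) for key in missing)]


# Hand-written example responses (hypothetical data in the documented format).
prom_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod_name": "a"}, "value": [1532400000, "0.4"]},
            {"metric": {"pod_name": "b"}, "value": [1532400000, "0.1"]},
        ],
    },
}
thanos_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod_name": "a", "monitor": "prometheus"},
             "value": [1532400000, "0.4"]},
        ],
    },
}

print(missing_series(prom_response, thanos_response))
```

An empty list means both sides returned the same set of series for that expression and timestamp. Any entries point at series that were lost somewhere between the store and the evaluated result.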

Luckily, the storage.SeriesSet from the Thanos source was the same as the one from the original Prometheus, so I continued by tracing the evaluation process. I guessed that rate failed to evaluate because there was not enough data.

So I printed debug logs in the function eval at L766 in both Thanos and the original Prometheus. I found that the Points at L870 differed: Prometheus had two, Thanos had one. From the checklist above, I had already checked that the series returned by both Thanos and Prometheus had two samples within 30s. That confirmed the problem was in matrixIterSlice, which fetches the samples from the SeriesIterator implemented by CounterSeriesIterator.

Following this clue, I found that the PromQL engine creates an instance of storage.BufferedSeriesIterator, which resets the iterator with a new one and invokes the iterator.Next function at L57. Before the fix, the cursor would move to the first sample in CounterSeriesIterator. After that, the engine seeks the cursor to the specified timestamp, and I found the problem in the Seek function.
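
The call order described here (one Next after the reset, then Seek, then collecting points) can be modeled with a toy iterator. The sketch below is Python, not the Thanos or Prometheus Go code. `ListIterator`, `SkippingIterator` and `points_in_window` are hypothetical names. It assumes a Seek contract under which the cursor stays put when the current sample is already at or after the requested timestamp, and a window closed at both ends. The thread does not spell out the actual defect in CounterSeriesIterator's Seek, so the faulty variant shows only one possible failure mode that is consistent with the symptom (one point instead of two).

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in milliseconds, value)


class ListIterator:
    """Toy series iterator with next/seek/at over an in-memory list."""

    def __init__(self, samples: List[Sample]) -> None:
        self.samples = samples
        self.i = -1  # positioned before the first sample

    def next(self) -> bool:
        self.i += 1
        return self.i < len(self.samples)

    def at(self) -> Sample:
        return self.samples[self.i]

    def seek(self, t: int) -> bool:
        """Move to the first sample with timestamp >= t; stay put if already there."""
        if self.i < 0 and not self.next():
            return False
        while self.i < len(self.samples):
            if self.at()[0] >= t:
                return True
            self.i += 1
        return False


class SkippingIterator(ListIterator):
    """Hypothetical faulty variant: always advances before checking a sample."""

    def seek(self, t: int) -> bool:
        while self.next():
            if self.at()[0] >= t:
                return True
        return False


def points_in_window(it: ListIterator, mint: int, maxt: int) -> List[Sample]:
    """Mimic the described call order: Next once, then Seek, then collect."""
    points: List[Sample] = []
    it.next()            # the buffered wrapper advances once after a reset
    ok = it.seek(mint)   # position at the start of the range window
    while ok and it.at()[0] <= maxt:
        points.append(it.at())
        ok = it.next()
    return points


# Two hypothetical samples inside a 30s window, as in the report.
samples = [(15000, 103.0), (30000, 109.0)]
print(len(points_in_window(ListIterator(samples), 10000, 40000)))      # 2
print(len(points_in_window(SkippingIterator(samples), 10000, 40000)))  # 1
```

With only one point left in the window, rate has nothing to compute, so the query returns no data even though the underlying series set is identical on both sides.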

It was not easy to track down the bug. I printed a lot of logs to trace the series samples, but I'm happy that I now understand PromQL and Thanos much more deeply thanks to this issue.

Fix landed - can you confirm that it fixes this exact issue?

Yes, I confirm. I now get the same result from both Thanos and Prometheus. We can close it.
