Thanos, Prometheus and Golang version used
Thanos: improbable/thanos:master-2018-07-18-a412835
Prometheus: v2.3.0
What happened
I tried to query the rate of Kubernetes container CPU usage with the expression `rate(container_cpu_usage_seconds_total{image!="",pod_name!="",container_name!="POD"}[30s])` in the Thanos Query UI, but no data was returned.
However, the same expression returned series when I queried the Prometheus instance directly.
What you expected to happen
The Thanos Query UI should return the same result as the original Prometheus instance.
How to reproduce it (as minimally and precisely as possible):
Set `scrape_interval: 15s` in the Prometheus scrape config, then run the query `rate(container_cpu_usage_seconds_total{image!="",pod_name!="",container_name!="POD"}[30s])`.
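To see why a one-sample difference makes the result disappear entirely, here is a simplified sketch (not the actual PromQL implementation, which also extrapolates to the window boundaries and handles counter resets): `rate()` needs at least two points inside the range window, and with a 15s scrape interval a `[30s]` window normally holds exactly two samples, so losing one of them on the way through Thanos yields no data.

```go
package main

import (
	"errors"
	"fmt"
)

// Point is one scraped sample: timestamp in seconds, counter value.
type Point struct {
	T float64
	V float64
}

// rateOver mimics, in simplified form, what PromQL's rate() requires:
// at least two points inside the range window, then delta of value
// over delta of time.
func rateOver(points []Point) (float64, error) {
	if len(points) < 2 {
		return 0, errors.New("need at least two points to compute a rate")
	}
	first, last := points[0], points[len(points)-1]
	return (last.V - first.V) / (last.T - first.T), nil
}

func main() {
	// 15s scrape interval: a 30s window usually contains two samples.
	window := []Point{{T: 0, V: 100}, {T: 15, V: 130}}
	r, err := rateOver(window)
	fmt.Println(r, err) // 2 <nil>

	// If an iterator bug drops one of the two samples, only a single
	// point is left and rate() returns no data - which is exactly what
	// the Thanos Query UI showed.
	_, err = rateOver(window[:1])
	fmt.Println(err)
}
```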
Full logs to relevant components
There were no related query logs in Thanos Query or the Sidecar.
Anything else we need to know
Here is the Prometheus scrape configuration:
```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    monitor: prometheus
    replica: prom-thanos-s3-0
scrape_configs:
- job_name: prometheus
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: kubernetes-apiservers
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: default;kubernetes;https
    replacement: $1
    action: keep
- job_name: kubernetes-nodes
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
    action: replace
- job_name: kubernetes-cadvisor
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
- job_name: kubernetes-service-endpoints
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: node-exporter
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-state-metrics
    replacement: $1
    action: drop
- job_name: kubernetes-pods
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_label_app]
    separator: ;
    regex: frostmourne
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_namespace]
    separator: ;
    regex: data
    replacement: $1
    action: drop
- job_name: node-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - devops
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.*)
    target_label: node_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_host_ip]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: node-exporter
    replacement: $1
    action: keep
  metric_relabel_configs:
  - source_labels: [node_name]
    separator: ;
    regex: (.*)
    target_label: node
    replacement: $1
    action: replace
- job_name: kube-state-metrics
  scrape_interval: 15s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - kube-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    separator: ;
    regex: (https?)
    target_label: __scheme__
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: kubernetes_namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: kubernetes_name
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_label_k8s_app]
    separator: ;
    regex: kube-state-metrics
    replacement: $1
    action: keep
  metric_relabel_configs:
  - source_labels: [pod]
    separator: ;
    regex: (.*)
    target_label: pod_name
    replacement: $1
    action: replace
  - source_labels: [container]
    separator: ;
    regex: (.*)
    target_label: container_name
    replacement: $1
    action: replace
```
Thanks for reporting. The scrape configuration is not that useful here. What would be useful is the time range you query, because in rc1 downsampling is enabled but it does not always work. Might that be the issue?
@jojohappy Can you share how you tracked down that bug? 0.o Does it fix this issue, or is it something additional you found?
Anyway, great job!
I didn't fully know the query steps in either Prometheus or Thanos, so I first spent time tracing the code path that fetches the result.
Based on the query steps, here is the checklist I compared against the original Prometheus:
- `querier.Select`
- the `storage.SeriesSet` returned by `querier.Select`

Luckily, the `storage.SeriesSet` from the Thanos source was the same as the original Prometheus one, so I continued to trace the evaluation process. I guessed that evaluating `rate` failed because there was not enough data.
So I printed debug logs in the function `eval` at L766 in both Thanos and the original Prometheus. I found that the `Points` at L870 differed from the original Prometheus: Prometheus had two points, Thanos had one. From the checklist above I had already verified that the series returned by both Thanos and Prometheus had two samples within 30s, so I concluded that the problem was located in `matrixIterSlice`, which fetches the samples from the `SeriesIterator` implemented by `CounterSeriesIterator`.
Following this clue, I found that the PromQL engine creates an instance of `storage.BufferedSeriesIterator`, resets it with a new iterator, and invokes `iterator.Next` at L57. Before the fix, this moved the cursor to the first sample in `CounterSeriesIterator`. The engine then seeks the cursor to a specific timestamp, and that is where I found the problem: in the `Seek` function.
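The iterator contract involved can be sketched as follows. This is a minimal in-memory illustration with hypothetical types, not the actual Thanos `CounterSeriesIterator`: the key point is that `Seek(t)` must position the iterator at the first sample with timestamp >= t, *including* the sample the cursor is already on after the engine's initial `Next()`. A `Seek` that unconditionally advances past the current sample silently drops the first point in the window, leaving `rate()` with one sample instead of two.

```go
package main

import "fmt"

// sample is one timestamp/value pair in a series.
type sample struct {
	t int64
	v float64
}

// sliceIterator is a minimal series iterator over in-memory samples.
type sliceIterator struct {
	samples []sample
	i       int // index of current sample; -1 before the first Next/Seek
}

func newSliceIterator(s []sample) *sliceIterator {
	return &sliceIterator{samples: s, i: -1}
}

// Next advances the cursor to the next sample.
func (it *sliceIterator) Next() bool {
	it.i++
	return it.i < len(it.samples)
}

// Seek positions the cursor at the first sample with timestamp >= t.
// Crucially, it first checks the sample the cursor is already on,
// rather than unconditionally advancing past it.
func (it *sliceIterator) Seek(t int64) bool {
	if it.i < 0 {
		it.i = 0
	}
	for it.i < len(it.samples) {
		if it.samples[it.i].t >= t {
			return true
		}
		it.i++
	}
	return false
}

// At returns the sample under the cursor.
func (it *sliceIterator) At() (int64, float64) {
	s := it.samples[it.i]
	return s.t, s.v
}

func main() {
	it := newSliceIterator([]sample{{t: 0, v: 100}, {t: 15, v: 130}})
	it.Next()  // the engine primes the iterator first ...
	it.Seek(0) // ... then seeks to the window start
	t, v := it.At()
	fmt.Println(t, v) // 0 100: the first sample is still there
}
```

With the buggy shape of `Seek`, the same sequence would land on the second sample, and the 30s window would contain only one point.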
It was not easy to track down this bug. I printed a lot of logs to trace the series samples, but I'm happy that this issue gave me a deeper understanding of PromQL and Thanos.
Fix landed - can you confirm that it fixes this exact issue?
Yes, I confirm. I got the same result from both Thanos and Prometheus. We can close it.