Thanos, Prometheus and Golang version used:
thanos 0.11.0, go1.13.7
Object Storage Provider:
s3 / ceph
What happened:
Queries don't time out after the specified --query.timeout=180s and keep running. The metric prometheus_engine_queries goes up and never comes down until the query instance is restarted. The queries come from rule and can run for hours instead of seconds.
It happens on all query instances at the same time, so it's probably related to temporary problems with some stores.
Anything else we need to know:
I have rolled back query to 0.10.1 and kept the rest of the components on 0.11.0, and this is working for me for now.
Looking at the git log, I picked and tested these two commits:
2e3ece187f8fe3a356609a3ee3457162ce7f8545 works
c39ddb2559ecaea4bcd754d8456016ed2e96f67e broken
I suspect this commit is the cause: a354bfbac862ad6eab5af94d10f597079b6ad175
Hi! Sorry for the problems. I can't say anything about this specific problem, but I wouldn't be surprised if it is related to https://github.com/thanos-io/thanos/issues/2213. Thanks for narrowing this down to almost a specific commit. Could you try reverting that commit on top of https://github.com/thanos-io/thanos/commit/c39ddb2559ecaea4bcd754d8456016ed2e96f67e and see if it helps?
Yes, I will try that tomorrow.
So far I was only able to run a354bfbac862ad6eab5af94d10f597079b6ad175, and the problem was present.
I am running v0.11.0 with both a354bfbac862ad6eab5af94d10f597079b6ad175 and c39ddb2559ecaea4bcd754d8456016ed2e96f67e reverted on query now, and it is perfectly stable.
I had missed that the number of goroutines goes up too as they get stuck. So I have dumped them with
/debug/pprof/goroutine.txt
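As an aside, for anyone wanting the same data from their own Go service, here is a minimal sketch of how that endpoint gets exposed: importing net/http/pprof for its side effects registers the /debug/pprof/* handlers on the default mux. The address below is hypothetical; Thanos components already serve these handlers on their HTTP port.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// GET http://localhost:6060/debug/pprof/goroutine?debug=2 then returns
	// a full text dump of every goroutine's stack.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}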
I have spent some time on this and it seems like this is some tricky edge case. All of our tests on slow StoreAPI nodes also pass on the latest master. I have made some adjustments to them in my own branch but couldn't catch anything.
Does it help if you apply the following patch?
diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..bb77bd09 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -443,7 +443,7 @@ func startStreamSeriesSet(
}
func (s *streamSeriesSet) handleErr(err error, done chan struct{}) {
- defer close(done)
+ close(done)
s.closeSeries()
if s.partialResponse {
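For what it's worth, here is a minimal standalone sketch (hypothetical names, not the Thanos code) of what this one-line change does: with defer, anything waiting on done is released only after the rest of handleErr has finished; with a direct close, waiters unblock immediately.

package main

import (
	"fmt"
	"time"
)

func slowCleanup() {
	time.Sleep(50 * time.Millisecond)
	fmt.Println("cleanup finished")
}

// With defer, done is closed only after slowCleanup returns.
func handleErrDeferred(done chan struct{}) {
	defer close(done)
	slowCleanup()
}

// With a direct close, waiters on done unblock before cleanup runs.
func handleErrDirect(done chan struct{}) {
	close(done)
	slowCleanup()
}

func main() {
	d := make(chan struct{})
	go handleErrDeferred(d)
	<-d // observed only after "cleanup finished"

	d = make(chan struct{})
	go handleErrDirect(d)
	<-d // may be observed before cleanup finishes
	time.Sleep(100 * time.Millisecond)
}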
I think the culprit is this https://github.com/thanos-io/thanos/blob/cc6c5b54baf7ea92edef84da475a1e97ab1d21ec/pkg/store/proxy.go#L437
It does not handle context cancelation like the original version did:
https://github.com/thanos-io/thanos/blob/f034581bffbb15324a7608236282e035f2c989d2/pkg/store/proxy.go#L413-L418
What is happening is that many proxies are waiting there with responses ready to be merged, but some of the responses are errors. Then the request fails, the merge is aborted, and the request context is canceled, leaving the rest of the proxies stuck.
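To show the pattern outside of Thanos, here is a minimal self-contained sketch (all names hypothetical) of a producer stuck on an unguarded channel send after its consumer has aborted, and how guarding the send with ctx.Done() lets it exit:

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	recvCh := make(chan int) // stands in for the proxy's recvCh

	go func() {
		// A plain `recvCh <- 42` would block forever once the consumer
		// is gone; selecting on ctx.Done() as well is the fix.
		select {
		case recvCh <- 42:
			fmt.Println("producer: sent")
		case <-ctx.Done():
			fmt.Println("producer: context canceled, exiting cleanly")
		}
	}()

	// The consumer aborts (e.g. the merge failed on another store's
	// error) without ever reading from recvCh.
	cancel()
	time.Sleep(100 * time.Millisecond)
}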
I'm using v0.11.0 with this patch and did not encounter any problems in the last two days.
diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 974a2521..8700e1be 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -436,7 +436,14 @@ func startStreamSeriesSet(
s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
continue
}
- s.recvCh <- rr.r.GetSeries()
+
+ select {
+ case s.recvCh <- rr.r.GetSeries():
+ continue
+ case <-ctx.Done():
+ s.handleErr(errors.Wrapf(ctx.Err(), "failed to receive any data from %s", s.name), done)
+ return
+ }
}
}()
return s
Thank you for your patch! I believe that the problem is a bit more nuanced.
My crude way of reproducing this is by making the following change:
diff --git a/pkg/store/proxy.go b/pkg/store/proxy.go
index 8440de2f..e801d98b 100644
--- a/pkg/store/proxy.go
+++ b/pkg/store/proxy.go
@@ -430,11 +430,13 @@ func startStreamSeriesSet(
}
numResponses++
- if w := rr.r.GetWarning(); w != "" {
- s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
- continue
+ for i := 0; i < 15; i++ {
+ if w := rr.r.GetWarning(); w != "" {
+ s.warnCh.send(storepb.NewWarnSeriesResponse(errors.New(w)))
+ continue
+ }
+ s.recvCh <- rr.r.GetSeries()
}
- s.recvCh <- rr.r.GetSeries()
}
}()
return s
At the very least the backtraces are the same, but that's expected, since we shouldn't send anything more once it has errored out due to partial response being disabled.
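For intuition on why multiplying the sends reproduces the hang: recvCh is a buffered channel, so the extra sends fill the buffer faster than the consumer drains it, and once the merge aborts, the next send blocks for good. A minimal sketch (hypothetical capacity, not the real code):

package main

import "fmt"

func main() {
	ch := make(chan int, 2) // buffered, like recvCh

	// With no reader, sends succeed only until the buffer is full;
	// the next unguarded send would block this goroutine forever.
	for i := 0; i < 3; i++ {
		select {
		case ch <- i:
			fmt.Println("buffered send", i)
		default:
			fmt.Println("buffer full: send", i, "would block with no reader")
		}
	}
}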
(deleted a lot of stuff because I'm still not sure)
Finally came up with a proper test case: https://github.com/thanos-io/thanos/pull/2411 :smile_cat:
Nice. I was trying to write up a test case, but somehow I got the fix first.
I will try 0.12.0 a day or two after the release to confirm that the problem is gone.