Thanos, Prometheus and Golang version used
Thanos: 0.6.0
Prometheus: 2.10.0
What happened
The Thanos store won't start. It attempts to start up but crashes after ~30 seconds. Inspecting the pod shows that the process exited with a non-zero code. The log output with debug enabled is below.
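As an aside, a quick way to confirm whether a crash like this is an OOM kill (just a guess at the cause here; the pod and namespace names are from the manifest below) is to inspect the container's last termination state:

```shell
# An OOM-killed container shows reason "OOMKilled" and exit code 137
# in its last termination state.
kubectl -n monitoring describe pod thanos-store-0 | grep -A 5 "Last State"

# Or pull just the termination reason via jsonpath:
kubectl -n monitoring get pod thanos-store-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```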
level=info ts=2019-08-23T18:55:30.952906789Z caller=main.go:154 msg="Tracing will be disabled"
level=info ts=2019-08-23T18:55:30.952955736Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-08-23T18:55:30.969576127Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=4294967296 maxSizeBytes=8589934592 maxItems=math.MaxInt64
level=debug ts=2019-08-23T18:55:30.969822743Z caller=store.go:144 msg="initializing bucket store"
What you expected to happen
Thanos store to start successfully.
Anything else we need to know
6 HA pairs of Prometheus instances (12 instances total) are uploading metrics to an AWS S3 bucket. The current bucket size is ~750GB. The store pod manifest is below (I removed the obj-store config, AWS IAM config, etc.)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: monitoring
  labels:
    app: thanos-store
spec:
  replicas: 3
  selector:
    matchLabels:
      app: thanos-store
  serviceName: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
    spec:
      containers:
        - name: thanos-store
          imagePullPolicy: Always
          image: "improbable/thanos:v0.6.0"
          args:
            - store
            - --data-dir=/data
            - --log.level=debug
            - --index-cache-size=8GB
            - --chunk-pool-size=20GB
          ports:
            - name: http
              containerPort: 10902
              protocol: TCP
            - name: grpc
              containerPort: 10901
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /metrics
              port: http
          readinessProbe:
            httpGet:
              path: /metrics
              port: http
          resources:
            limits:
              cpu: 2000m
              memory: 32000Mi
            requests:
              cpu: 2000m
              memory: 32000Mi
          volumeMounts:
            - mountPath: /data
              name: storage-volume
  volumeClaimTemplates:
    - metadata:
        name: storage-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "128Gi"
Hi, any update on this issue?
I am also hitting the same issue: the Thanos store gateway gets stuck at "initializing bucket store" when the container starts. No other warnings or errors appear in the log. Any idea why this is happening, or how to find the root cause?
The logs are given below:
level=info ts=2019-09-05T14:37:53.221491945Z caller=flags.go:75 msg="gossip is disabled"
level=info ts=2019-09-05T14:37:53.222294564Z caller=factory.go:39 msg="loading bucket configuration"
level=debug ts=2019-09-05T14:37:53.223374047Z caller=store.go:128 msg="initializing bucket store"
Thanks,
Sorry for the delay!
On startup, the store gateway pulls a portion of the objects into memory, so if you don't have a compactor (do you have one? Is it working?) this can be quite a long and memory-intensive process.
Most likely the store is just OOMing in your case. Give it more memory, time-shard the store gateway (see: https://github.com/thanos-io/thanos/pull/1077), or add a compactor if it is missing (!).
Things which we are planning to do:
Hope that helps (:
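For reference, the time-sharding and compactor advice above might look roughly like this on the command line (a sketch only; the `--min-time`/`--max-time` partitioning flags come from PR 1077, so check that they exist in your Thanos version before relying on them):

```shell
# Run two store gateways, each serving one slice of the bucket's time range.
thanos store --data-dir=/data --objstore.config-file=bucket.yml \
  --min-time=-8w    # recent data: the last 8 weeks only
thanos store --data-dir=/data --objstore.config-file=bucket.yml \
  --max-time=-8w    # historical data: everything older than 8 weeks

# Run a compactor against the same bucket so the store has fewer,
# larger blocks to load at startup.
thanos compact --data-dir=/compact --objstore.config-file=bucket.yml --wait
```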
@anoop2503 I just needed to give the store more time to start up (about 5 minutes in my case). It seems that the more memory I give the store, the less time it takes to start.
Also, we could and should probably be more verbose here at the debug (or info) level, so that users know which blocks we are pulling, just like Prometheus prints which blocks it finds on disk.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.