Thanos, Prometheus and Golang version used:
thanos: 0.12.0
prometheus: 2.17.1
go: 1.13
NB I'm aware of the https://github.com/thanos-io/thanos/releases/tag/v0.12.1 release, however, judging by the fixes in that release, I presume this issue is still present.
Object Storage Provider:
ceph s3
What happened:
A thanos sidecar gets restarted from time to time when a config file has been changed (as a reloaded configmap).
What you expected to happen:
No restarts after such changes.
How to reproduce it (as minimally and precisely as possible):
Run a pod with both prometheus and thanos sidecar containers and update a config file (a configmap).
Full logs to relevant components:
_RESTARTS_
Time kubernetes.pod.name kubernetes.container.name log
April 21st 2020, 15:09:05.046 prometheus-1 thanos level=warn ts=2020-04-21T13:09:05.046664068Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=warn ts=2020-04-21T13:09:05.047028972Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.046979769Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.046947111Z caller=http.go:81 service=http/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.047 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.047056469Z caller=grpc.go:125 service=gRPC/server component=sidecar msg="gracefully stopping internal server"
April 21st 2020, 15:09:05.544 prometheus-1 thanos level=info ts=2020-04-21T13:09:05.544623524Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory"
April 21st 2020, 15:09:05.545 prometheus-1 thanos level=error ts=2020-04-21T13:09:05.545009199Z caller=main.go:212 err="lstat /etc/prometheus/rules/..2020_04_17_06_44_43.813421346/config.yaml: no such file or directory\nbuild hash\ngithub.com/thanos-io/thanos/pkg/reloader.(*Reloader).apply\n\t/build/source/pkg/reloader/reloader.go:266\ngithub.com/thanos-io/thanos/pkg/reloader.(*Reloader).Watch\n\t/build/source/pkg/reloader/reloader.go:178\nmain.runSidecar.func5\n\t/build/source/cmd/thanos/sidecar.go:260\ngithub.com/oklog/run.(*Group).Run.func1\n\t/build/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/asm_amd64.s:1357\nsidecar command failed\nmain.main\n\t/build/source/cmd/thanos/main.go:212\nruntime.main\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/proc.go:203\nruntime.goexit\n\t/nix/store/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee-go-1.13.8/share/go/src/runtime/asm_amd64.s:1357"
April 21st 2020, 15:09:06.551 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.551061208Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
April 21st 2020, 15:09:06.551 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.550946487Z caller=intrumentation.go:48 msg="changing probe status" status=ready
April 21st 2020, 15:09:06.560 prometheus-1 prometheus level=info ts=2020-04-21T13:09:06.560Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:09:06.565 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.565159974Z caller=intrumentation.go:48 msg="changing probe status" status=ready
April 21st 2020, 15:09:06.810 prometheus-1 prometheus level=info ts=2020-04-21T13:09:06.810Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:09:06.810 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.810723101Z caller=reloader.go:157 component=reloader msg="started watching config file and non-recursively rule dirs for changes" cfg=/etc/prometheus/prometheus.yaml.in out=/etc/prometheus-shared/prometheus.yaml dirs=/etc/prometheus/rules,/etc/prometheus/rules/production
April 21st 2020, 15:09:06.810 prometheus-1 thanos level=info ts=2020-04-21T13:09:06.810553588Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"
_OK_
Time kubernetes.pod.name kubernetes.container.name log
April 21st 2020, 15:10:33.506 prometheus-0 prometheus level=info ts=2020-04-21T13:10:33.506Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.773 prometheus-0 thanos level=info ts=2020-04-21T13:10:37.773480987Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"
April 21st 2020, 15:10:37.773 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.773Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.781 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.781Z caller=main.go:788 msg="Loading configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.903 prometheus-0 prometheus level=info ts=2020-04-21T13:10:37.903Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus-shared/prometheus.yaml
April 21st 2020, 15:10:37.904 prometheus-0 thanos level=info ts=2020-04-21T13:10:37.904476442Z caller=reloader.go:289 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yaml.in cfg_out=/etc/prometheus-shared/prometheus.yaml rule_dirs="/etc/prometheus/rules, /etc/prometheus/rules/production"
Anything else we need to know:
sidecar's config snippets:
_Mounts:_
...
/etc/prometheus from config (rw)
/etc/prometheus-shared from config-shared (rw)
/etc/prometheus/rules from rules (rw)
/etc/prometheus/rules/production from production-rules (rw)
_Volumes:_
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-config
Optional: false
rules:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-rules
Optional: false
production-rules:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-production-rules
Optional: false
config-shared:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
_flags:_
--reloader.config-file=/etc/prometheus/prometheus.yaml.in
--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml
--reloader.rule-dir=/etc/prometheus/rules
--reloader.rule-dir=/etc/prometheus/rules/production
Environment:
Kernel Version: 5.3.0-24-generic
OS Image: Ubuntu 18.04.2 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.3
Kubelet Version: v1.14.8
Kube-Proxy Version: v1.14.8
I can confirm we're experiencing this issue with thanos 0.12.1.
Hey @hawran, it looks like when a configuration change happens while the sidecar is in a reload cycle, it crashes.
How frequently have you been observing this? Have you observed this with any of the previous versions?
In the meanwhile, I will investigate more.
> Hey @hawran, it looks like when a configuration change happens while the sidecar is in a reload cycle, it crashes.
> How frequently have you been observing this?

Yes.
A configuration file was changed and deployed to kubernetes (as a configmap), and some "reload" action was expected.
Sometimes it went smoothly, sometimes there were those shutdowns (hence the "_container restart_" alerts).

> Have you observed this with any of the previous versions?

I can trace back similar problems to version 0.10.1.

> In the meanwhile, I will investigate more.

Thank you.
Hello,
We are experiencing the same errors from time to time.
level=warn ts=2020-04-29T15:16:38.644408018Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.644736147Z caller=http.go:81 service=http/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.644769927Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=warn ts=2020-04-29T15:16:38.662903596Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=info ts=2020-04-29T15:16:38.662951644Z caller=grpc.go:125 service=gRPC/server component=sidecar msg="gracefully stopping internal server"
level=info ts=2020-04-29T15:16:38.663748389Z caller=grpc.go:137 service=gRPC/server component=sidecar msg="internal server shutdown" err="build hash: lstat /etc/prometheus/rules/..data_tmp: no such file or directory"
level=error ts=2020-04-29T15:16:38.664335462Z caller=main.go:212 err="lstat /etc/prometheus/rules/..data_tmp: no such file or directory\nbuild hash\ngithub.com/thanos-io/thanos/pkg/reloader.(*Reloader).apply\n\t/go/src/github.com/thanos-io/thanos/pkg/reloader/reloader.go:266\ngithub.com/thanos-io/thanos/pkg/reloader.(*Reloader).Watch\n\t/go/src/github.com/thanos-io/thanos/pkg/reloader/reloader.go:178\nmain.runSidecar.func5\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/sidecar.go:260\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357\nsidecar command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:212\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
After a quick investigation, I think we had occurrences before the 0.12.x release; to be confirmed.
More info (OpenShift client + server versions) from where we succeeded in reproducing it, and from where the configmap is changed (as a JSON file --- oc apply -f configmap.json) containing Prometheus alert rules.
oc version
oc v3.7.2+282e43f
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://openshift.int-a.test1.paas.amadeus.net:8443
openshift v3.7.119
kubernetes v1.7.6+a08f5eeb62
@hawran @ahurtaud Thanks for the further information. I'm gonna mark this as a bug. I think we have to either handle this error gracefully or immediately (?) read the contents of the file before we start the update loop.
Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Guys, any progress on this matter?
@hawran I don't think anyone has had a chance to look into it yet. I'll try my best to check it out, but no promises. So help wanted. (Also added the label)
OK, thank you for update.
Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.
Seeing this quite often too.
I have a test at https://github.com/ebabani/thanos/blob/eb/lstat-error/pkg/reloader/reloader_test.go#L261 to reproduce this issue. Seems timing dependent but I get an error ~50% of the time.
Definitely related to how k8s updates the volume on configmap changes.
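To make the race concrete, here is a minimal, self-contained Go sketch of how the kubelet atomically updates a ConfigMap volume (new timestamped directory, a `..data_tmp` symlink renamed over `..data`, old directory deleted) and why a watcher that resolved a path through the old directory then hits the `lstat ... no such file or directory` error seen in the logs above. All names and paths here are illustrative, not the Thanos or kubelet source:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// simulateConfigMapUpdate mimics the kubelet's atomic writer: the new
// payload goes into a fresh timestamped dir, a ..data_tmp symlink is
// created and renamed over ..data, and the old timestamped dir is removed.
// A reloader that remembered a resolved path through the old dir (or that
// stats ..data_tmp mid-swap) then gets ENOENT.
func simulateConfigMapUpdate() error {
	root, err := os.MkdirTemp("", "cm")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)

	// Initial state of the volume.
	oldDir := filepath.Join(root, "..2020_04_17_06_44_43.813421346")
	os.Mkdir(oldDir, 0o755)
	os.WriteFile(filepath.Join(oldDir, "config.yaml"), []byte("a: 1\n"), 0o644)
	os.Symlink(filepath.Base(oldDir), filepath.Join(root, "..data"))

	// The reloader walked the dir and remembered this resolved path.
	seen := filepath.Join(oldDir, "config.yaml")

	// ConfigMap update: new dir, symlink swap, old dir removed.
	newDir := filepath.Join(root, "..2020_04_21_13_09_05.000000000")
	os.Mkdir(newDir, 0o755)
	os.WriteFile(filepath.Join(newDir, "config.yaml"), []byte("a: 2\n"), 0o644)
	os.Symlink(filepath.Base(newDir), filepath.Join(root, "..data_tmp"))
	os.Rename(filepath.Join(root, "..data_tmp"), filepath.Join(root, "..data"))
	os.RemoveAll(oldDir)

	_, err = os.Lstat(seen) // fails: no such file or directory
	return err
}

func main() {
	fmt.Println("watcher sees:", simulateConfigMapUpdate())
}
```

Because the swap is a rename followed by a delete, there is always a window in which a stale path is gone, which matches the ~50% flake rate of the repro test.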
Repro test is super helpful thanks! Feel free to propose a fix, otherwise we will take a look soon (:
@bwplotka What do you think of logging the error and incrementing the watchErrors metric if the error cause is os.PathError, and otherwise returning it?
https://github.com/thanos-io/thanos/blob/master/pkg/reloader/reloader.go#L221-L223
@ebabani It could be a good strategy, especially since this part is not the main functionality of the sidecar. @bwplotka already mentioned it in the comments.
// TODO(bwplotka): There is no need to get the process down in this case and decrease availability, handle the error in a different way.
@ebabani Are you actively working on it? It'd be awesome to have this fix in the upcoming release in two weeks.
@kakkoyun I'll send a PR today.
Fixed by #2996