Prometheus-operator: Permission denied writing to mount using volumeClaimTemplate

Created on 6 Feb 2018 · 53 comments · Source: prometheus-operator/prometheus-operator

What did you do?

I ran the latest versions of the Prometheus Operator and Kube-Prometheus helm charts configured to use persistent storage on the Prometheus pods using the following storage config:

  storageSpec: 
    volumeClaimTemplate:
      spec:
        selector:
          matchLabels:
            app: k8s-prometheus
        resources:
          requests:
            storage: 20Gi

What did you expect to see?

Volume mount used for storing persistent data.

What did you see instead? Under which circumstances?

The prometheus-kube-prometheus-0 pod keeps crashing with a permission denied error on the mounted volume. If I change the configuration to not use the volumeClaimTemplate it works fine. I have also tried using the Prometheus 2.1 image instead of the default 2.0 used in the helm chart.

This issue looks to be exactly the same as #541 but that looked to have been resolved by setting a securityContext. Inspecting the statefulset JSON I can see that

        "securityContext": {
          "runAsUser": 1000,
          "runAsNonRoot": true,
          "fsGroup": 2000
        },

is already set, so I suspect this problem has not been fully fixed.

Environment

  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.1", GitCommit:"3a1c9449a956b6026f075fa3134ff92f7d55f812", GitTreeState:"clean", BuildDate:"2018-01-04T20:00:41Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.6-4+9c2a4c1ed1ee7e", GitCommit:"9c2a4c1ed1ee7e2e121203aa9a87315633a89eca", GitTreeState:"clean", BuildDate:"2018-01-22T08:23:41Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

This cluster is running on the IBM Cloud

  • Manifests:
Public helm repo used, i.e. helm upgrade --install --recreate-pods kube-prometheus coreos/kube-prometheus --namespace monitoring -f kube-prometheus-values.yaml
  • Prometheus Operator Logs:
level=info ts=2018-02-06T20:33:54.685030752Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2018-02-06T20:33:54.685207496Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2018-02-06T20:33:54.685299129Z caller=main.go:227 host_details="(Linux 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 prometheus-kube-prometheus-0 (none))"
level=info ts=2018-02-06T20:33:54.685377696Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-02-06T20:33:54.693871782Z caller=main.go:499 msg="Starting TSDB ..."
level=info ts=2018-02-06T20:33:54.694022982Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-02-06T20:33:54.700144448Z caller=main.go:386 msg="Stopping scrape discovery manager..."
level=info ts=2018-02-06T20:33:54.700176485Z caller=main.go:400 msg="Stopping notify discovery manager..."
level=info ts=2018-02-06T20:33:54.700193323Z caller=main.go:424 msg="Stopping scrape manager..."
level=info ts=2018-02-06T20:33:54.700211266Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-02-06T20:33:54.700237176Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-02-06T20:33:54.700253366Z caller=notifier.go:493 component=notifier msg="Stopping notification manager..."
level=info ts=2018-02-06T20:33:54.700274951Z caller=main.go:382 msg="Scrape discovery manager stopped"
level=info ts=2018-02-06T20:33:54.700297847Z caller=main.go:396 msg="Notify discovery manager stopped"
level=info ts=2018-02-06T20:33:54.700361658Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."
level=info ts=2018-02-06T20:33:54.700386478Z caller=main.go:418 msg="Scrape manager stopped"
level=info ts=2018-02-06T20:33:54.700413177Z caller=main.go:570 msg="Notifier manager stopped"
level=error ts=2018-02-06T20:33:54.700427801Z caller=main.go:579 err="Opening storage failed mkdir /var/prometheus/data/wal: permission denied"
level=info ts=2018-02-06T20:33:54.700465658Z caller=main.go:581 msg="See you next time!"

Most helpful comment

Some further investigation has highlighted what the problem is here. It looks like by default mounts of this type are created with the following user/group permissions: drwxr-xr-x 4 nobody 42949672 4096 Feb 12 10:41 data

I suspect this is due to https://github.com/coreos/prometheus-operator/blob/master/pkg/prometheus/statefulset.go#L365, which sets Prometheus to run as user ID 1000, group ID 2000. A user with those IDs is not allowed to write to a directory with the permissions shown above. The fix is to update the ownership of the /var/prometheus/data directory on startup to match the user the program is being run as. This has already been done in the official prometheus helm charts - https://github.com/kubernetes/charts/commit/7d5a3ff4b105c695f332b2a8ff360e891477e6e9#diff-97df733ade0fb9ea384f77bf3a393a0a

i.e. the statefulset needs to have

  initContainers:
  - name: "init-chown-data"
    image: "busybox"
    # 1000 is the user that prometheus uses.
    command: ["chown", "-R", "1000:2000", /var/prometheus/data]
    volumeMounts:
    - name: prometheus-kube-prometheus-db
      mountPath: /var/prometheus/data

All 53 comments

What kind of storage volume are you using? Are there possibly permissions set already?

could you also provide the value of kube-prometheus-values.yaml

The type of storage is an NFS filesystem - Name: ibmc-file-bronze, type: ibm.io/ibmc-file (https://console.bluemix.net/docs/containers/cs_storage.html#create). We use the same storage for other images such as Consul and it works without problems.

Complete kube-prometheus-values.yaml file

# exporter-node configuration
deployExporterNode: True

# Grafana
deployGrafana: True

alertmanager:
  ## Alertmanager configuration directives
  ## Ref: https://prometheus.io/docs/alerting/configuration/
  ##
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 5m
      receiver: 'ANS'
    receivers:
      - name: 'null'
      - name: 'ANS'
        slack_configs:
        - api_url: 'https://hooks.slack.com/services/xxxxxx/xxxxxx/xxxxxx'
          text: "{{ range .Alerts }}{{ .Annotations.summary }} {{ .Annotations.description }}\n{{ end }}"
          channel: '#xxxxxxxxxx'
        webhook_configs:
        - url: 'https://ibmnotifybm.eu-gb.mybluemix.net/webhook/webhookincoming/xxxxxxxxxx/xxxxxxxxxxxxxxxx'

  ## External URL at which Alertmanager will be reachable
  ##
  externalUrl: ""

  ## Alertmanager container image
  ##
  image:
    repository: quay.io/prometheus/alertmanager
    tag: v0.9.1

  ingress:
    ## If true, Alertmanager Ingress will be created
    ##
    enabled: false

    ## Annotations for Alertmanager Ingress
    ##
    annotations: {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    fqdn: ""

    ## TLS configuration for Alertmanager Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
      # - secretName: alertmanager-general-tls
      #   hosts:
      #     - alertmanager.example.com

  ## Node labels for Alertmanager pod assignment
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector: {}

  ## If true, the Operator won't process any Alertmanager configuration changes
  ##
  paused: false

  ## Number of Alertmanager replicas desired
  ##
  replicaCount: 2

  ## Pod anti-affinity can prevent the scheduler from placing Alertmanager replicas on the same node.
  ## The default value "soft" means that the scheduler should *prefer* to not schedule two replica pods onto the same node but no guarantee is provided.
  ## The value "hard" means that the scheduler is *required* to not schedule two replica pods onto the same node.
  ## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
  podAntiAffinity: "soft"

  ## Resource limits & requests
  ## Ref: https://kubernetes.io/docs/user-guide/compute-resources/
  ##
  resources: {}
    # requests:
    #   memory: 400Mi

  service:
    ## Annotations to be added to the Service
    ##
    annotations: {}

    ## Cluster-internal IP address for Alertmanager Service
    ##
    clusterIP: ""

    ## List of external IP addresses at which the Alertmanager Service will be available
    ##
    externalIPs: []

    ## External IP address to assign to Alertmanager Service
    ## Only used if service.type is 'LoadBalancer' and supported by cloud provider
    ##
    loadBalancerIP: ""

    ## List of client IPs allowed to access Alertmanager Service
    ## Only used if service.type is 'LoadBalancer' and supported by cloud provider
    ##
    loadBalancerSourceRanges: []

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30903

    ## Service type
    ##
    type: ClusterIP

  ## Alertmanager StorageSpec for persistent data
  ## Ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/storage.md
  ##
  storageSpec: 
    volumeClaimTemplate:
      spec:
        selector:
          matchLabels:
            app: k8s-prometheus
        resources:
          requests:
            storage: 20Gi

  prometheusRules: 
    alertmanager.rules: |-
      groups:
      - name: alertmanager.rules
        rules:
        - alert: AlertmanagerConfigInconsistent
          expr: count_values("config_hash", alertmanager_config_hash) BY (service) / ON(service)
            GROUP_LEFT() label_replace(prometheus_operator_alertmanager_spec_replicas, "service",
            "alertmanager-$1", "alertmanager", "(.*)") != 1
          for: 5m
          labels:
            severity: critical
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync.
        - alert: AlertmanagerDownOrMissing
          expr: label_replace(prometheus_operator_alertmanager_spec_replicas, "job", "alertmanager-$1",
            "alertmanager", "(.*)") / ON(job) GROUP_RIGHT() sum(up) BY (job) != 1
          for: 5m
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: An unexpected number of Alertmanagers are scraped or Alertmanagers disappeared from discovery.
        - alert: AlertmanagerFailedReload
          expr: alertmanager_config_last_reload_successful == 0
          for: 10m
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: Reloading Alertmanager's configuration has failed for {{ $labels.namespace}}/{{ $labels.pod}}.

## If true, create & use RBAC resources
##
rbacEnable: true

prometheus:
  ## Alertmanagers to which alerts will be sent
  ## Ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#alertmanagerendpoints
  ##
  alertingEndpoints: []
  #   - name: ""
  #     namespace: ""
  #     port: 9093
  #     scheme: http

  ## Prometheus configuration directives
  ## Ignored if serviceMonitors are defined
  ## Ref: https://prometheus.io/docs/operating/configuration/
  ##
  config:
    specifiedInValues: true
    value: {}

  ## External URL at which Prometheus will be reachable
  ##
  externalUrl: ""

  ## Prometheus container image
  ##
  image:
    repository: quay.io/prometheus/prometheus
    tag: v2.1.0

  securityContext:
    fsGroup: 2000
    runAsUser: 1000
    runAsNonRoot: true

  ingress:
    ## If true, Prometheus Ingress will be created
    ##
    enabled: false

    ## Annotations for Prometheus Ingress
    ##
    annotations: {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    fqdn: ""

    ## TLS configuration for Prometheus Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
      # - secretName: prometheus-k8s-tls
      #   hosts:
      #     - prometheus.example.com

  ## Node labels for Prometheus pod assignment
  ## Ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector: {}

  ## If true, the Operator won't process any Prometheus configuration changes
  ##
  paused: false

  ## Number of Prometheus replicas desired
  ##
  replicaCount: 1

  ## Pod anti-affinity can prevent the scheduler from placing Prometheus replicas on the same node.
  ## The default value "soft" means that the scheduler should *prefer* to not schedule two replica pods onto the same node but no guarantee is provided.
  ## The value "hard" means that the scheduler is *required* to not schedule two replica pods onto the same node.
  ## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
  podAntiAffinity: "soft"

  ## Resource limits & requests
  ## Ref: https://kubernetes.io/docs/user-guide/compute-resources/
  ##
  resources: {}
    # requests:
    #   memory: 400Mi

  ## How long to retain metrics
  ##
  retention: 24h

  ## Prefix used to register routes, overriding externalUrl route.
  ## Useful for proxies that rewrite URLs.
  ##
  routePrefix: /

  ## Rules configmap selector
  ## Ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/design.md
  ##
  rulesSelector: {}

  ## Prometheus alerting & recording rules
  ## Ref: https://prometheus.io/docs/querying/rules/
  ## Ref: https://prometheus.io/docs/alerting/rules/
  ##
  rules:
    specifiedInValues: true
    value: {}

  service:
    ## Annotations to be added to the Service
    ##
    annotations: {}

    ## Cluster-internal IP address for Prometheus Service
    ##
    clusterIP: ""

    ## List of external IP addresses at which the Prometheus Service will be available
    ##
    externalIPs: []

    ## External IP address to assign to Prometheus Service
    ## Only used if service.type is 'LoadBalancer' and supported by cloud provider
    ##
    loadBalancerIP: ""

    ## List of client IPs allowed to access Prometheus Service
    ## Only used if service.type is 'LoadBalancer' and supported by cloud provider
    ##
    loadBalancerSourceRanges: []

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30900

    ## Service type
    ##
    type: ClusterIP

  ## Service monitors selector
  ## Ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/design.md
  ##
  serviceMonitorsSelector: {}

  ## ServiceMonitor CRDs to create & be scraped by the Prometheus instance.
  ## Ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/service-monitor.md
  ##
  serviceMonitors: []
    ## Name of the ServiceMonitor to create
    ##
    # - name: ""

      ## Service label for use in assembling a job name of the form <label value>-<port>
      ## If no label is specified, the service name is used.
      ##
      # jobLabel: ""

      ## Label selector for services to which this ServiceMonitor applies
      ##
      # selector: {}

      ## Namespaces from which services are selected
      ##
      # namespaceSelector:
        ## Match any namespace
        ##
        # any: false

        ## Explicit list of namespace names to select
        ##
        # matchNames: []

      ## Endpoints of the selected service to be monitored
      ##
      # endpoints: []
        ## Name of the endpoint's service port
        ## Mutually exclusive with targetPort
        # - port: ""

        ## Name or number of the endpoint's target port
        ## Mutually exclusive with port
        # - targetPort: ""

        ## File containing bearer token to be used when scraping targets
        ##
        #   bearerTokenFile: ""

        ## Interval at which metrics should be scraped
        ##
        #   interval: 30s

        ## HTTP path to scrape for metrics
        ##
        #   path: /metrics

        ## HTTP scheme to use for scraping
        ##
        #   scheme: http

        ## TLS configuration to use when scraping the endpoint
        ##
        #   tlsConfig:

            ## Path to the CA file
            ##
            # caFile: ""

            ## Path to client certificate file
            ##
            # certFile: ""

            ## Skip certificate verification
            ##
            # insecureSkipVerify: false

            ## Path to client key file
            ##
            # keyFile: ""

            ## Server name used to verify host name
            ##
            # serverName: ""

  ## Prometheus StorageSpec for persistent data
  ## Ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/storage.md
  ##
  storageSpec: {}
  #   volumeClaimTemplate:
  #     spec:
  #       selector:
  #         matchLabels:
  #           app: k8s-prometheus
  #       resources:
  #         requests:
  #           storage: 20Gi

  prometheusRules: 
    prometheus.rules: |-
      groups:
      - name: prometheus.rules
        rules:
        - alert: PrometheusConfigReloadFailed
          expr: prometheus_config_last_reload_successful == 0
          for: 10m
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}
        - alert: PrometheusNotificationQueueRunningFull
          expr: predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity
          for: 10m
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{
              $labels.pod}}
        - alert: PrometheusErrorSendingAlerts
          expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m]) > 0.01
          for: 10m
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
              $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
        - alert: PrometheusErrorSendingAlerts
          expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m]) > 0.03
          for: 10m
          labels:
            severity: critical
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
              $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
        - alert: PrometheusNotConnectedToAlertmanagers
          expr: prometheus_notifications_alertmanagers_discovered < 1
          for: 10m
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected
              to any Alertmanagers
        - alert: PrometheusTSDBReloadsFailing
          expr: increase(prometheus_tsdb_reloads_failures_total[2h]) > 0
          for: 12h
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
              reload failures over the last four hours.'
            summary: Prometheus has issues reloading data blocks from disk
        - alert: PrometheusTSDBCompactionsFailing
          expr: increase(prometheus_tsdb_compactions_failed_total[2h]) > 0
          for: 12h
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
              compaction failures over the last four hours.'
            summary: Prometheus has issues compacting sample blocks
        - alert: PrometheusTSDBWALCorruptions
          expr: tsdb_wal_corruptions_total > 0
          for: 4h
          labels:
            severity: warning
            where: "[ibm:yp:us-south] Cluster:Test"
          annotations:
            description: '{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead
              log (WAL).'
            summary: Prometheus write-ahead log is corrupted


# default rules are in templates/general.rules.yaml
prometheusRules: 
  general.rules: |-
    groups:
    - name: general.rules
      rules:
      - record: fd_utilization
        expr: process_open_fds / process_max_fds
      - alert: FdExhaustionClose
        expr: predict_linear(fd_utilization[1h], 3600 * 4) > 1
        for: 10m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
            will exhaust in file/socket descriptors within the next 4 hours'
          summary: file descriptors soon exhausted
      - alert: FdExhaustionClose
        expr: predict_linear(fd_utilization[10m], 3600) > 1
        for: 10m
        labels:
          severity: critical
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
            will exhaust in file/socket descriptors within the next hour'
          summary: file descriptors soon exhausted
      - alert: HighCPU
        expr: ((sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"})
          BY (instance, job)) - (sum(node_cpu{mode=~"idle|iowait"}) BY (instance, job)))
          / (sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) BY
          (instance, job)) * 100 > 95
        for: 1m
        labels:
          severity: critical
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: This machine  has really high CPU usage for over 30 seconds
          summary: High CPU Usage
      - alert: PendingPods
        expr: (sum(kube_pod_status_phase{phase="Pending"}) BY (pod)) > 0
        for: 8m
        labels:
          severity: critical
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: Pod has been in pending state for more than 2 minutes
          summary: Pod has been in pending state for more than 2 minutes
      - alert: PodsNotReady
        expr: (sum(kube_pod_status_ready{condition="false"}) BY (pod)) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Pod has has not passed readiness checks for more than 2 minutes
          summary: Pod has has not passed readiness checks for too long
      - record: instance:node_cpu:rate:sum
        expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m]))
          BY (instance)
      - record: instance:node_filesystem_usage:sum
        expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"}))
          BY (instance)
      - record: instance:node_network_receive_bytes:rate:sum
        expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
      - record: instance:node_network_transmit_bytes:rate:sum
        expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
      - record: instance:node_cpu:ratio
        expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance)
          GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
      - record: cluster:node_cpu:sum_rate5m
        expr: sum(rate(node_cpu{mode!="idle"}[5m]))
      - record: cluster:node_cpu:ratio
        expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu))
      - alert: NodeExporterDown
        expr: absent(up{job="node-exporter"} == 1)
        for: 10m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: Prometheus could not scrape a node-exporter for more than 10m,
            or node-exporters have disappeared from discovery
      - alert: NodeDiskRunningFull
        expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0
        for: 30m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: device {{$labels.device}} on node {{$labels.instance}} is running
            full within the next 24 hours (mounted at {{$labels.mountpoint}})
      - alert: NodeDiskRunningFull
        expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
        for: 10m
        labels:
          severity: critical
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: device {{$labels.device}} on node {{$labels.instance}} is running
            full within the next 2 hours (mounted at {{$labels.mountpoint}})
      - alert: HighNumberOfFailedHTTPRequests
        expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
          BY (method) > 0.05
        for: 5m
        labels:
          severity: critical
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
            instance {{ $labels.instance }}'
          summary: a high number of HTTP requests are failing
      - alert: HTTPRequestsSlow
        expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
        for: 10m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
            }} are slow
          summary: slow HTTP requests
      - alert: EtcdMemberCommunicationSlow
        expr: histogram_quantile(0.99, rate(etcd_network_member_round_trip_time_seconds_bucket[5m])) > 0.15
        for: 10m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: etcd instance {{ $labels.instance }} member communication with
            {{ $labels.To }} is slow
          summary: etcd member communication is slow
      - alert: HighNumberOfFailedProposals
        expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
            failures within the last hour
          summary: a high number of proposals within the etcd cluster are failing
      - alert: HighFsyncDurations
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: etcd instance {{ $labels.instance }} fync durations are high
          summary: high fsync durations
      - alert: HighCommitDurations
        expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
        for: 10m
        labels:
          severity: warning
          where: "[ibm:yp:us-south] Cluster:Test"
        annotations:
          description: etcd instance {{ $labels.instance }} commit durations are high
          summary: high commit durations

exporter-kube-controller-manager:
  prometheusRules: 
    exporter-kube-controller-manager.rules: |-
      groups:
      - name: exporter-kube-controller-manager.rules
        rules:
exporter-kube-etcd:
  prometheusRules:
    exporter-kube-etcd.rules: |-  
      groups:
      - name: exporter-kube-etcd.rules
        rules:
exporter-kube-scheduler:
  prometheusRules: 
    exporter-kube-scheduler.rules: |-
      groups:
      - name: exporter-kube-scheduler.rules
        rules:
exporter-kube-state: 
  prometheusRules: 
    exporter-kube-state.rules: |-
      groups:
      - name: exporter-kube-state.rules
        rules:
exporter-kubelets: 
  prometheusRules: 
    exporter-kubelets.rules: |- 
      groups:
      - name: exporter-kubelets.rules
        rules: 
exporter-kubernetes:
  prometheusRules:
    exporter-kubernetes.rules: |-
      groups:
      - name: exporter-kubernetes.rules
        rules:
exporter-node: 
  prometheusRules: 
    exporter-node.rules: |-
      groups:
      - name: exporter-node.rules
        rules:

Some further investigation has highlighted what the problem is here. It looks like by default mounts of this type are created with the following user/group permissions: drwxr-xr-x 4 nobody 42949672 4096 Feb 12 10:41 data

I suspect this is due to https://github.com/coreos/prometheus-operator/blob/master/pkg/prometheus/statefulset.go#L365, which sets Prometheus to run as user ID 1000, group ID 2000. A user with those IDs is not allowed to write to a directory with the permissions shown above. The fix is to update the ownership of the /var/prometheus/data directory on startup to match the user the program is being run as. This has already been done in the official prometheus helm charts - https://github.com/kubernetes/charts/commit/7d5a3ff4b105c695f332b2a8ff360e891477e6e9#diff-97df733ade0fb9ea384f77bf3a393a0a

i.e. the statefulset needs to have

  initContainers:
  - name: "init-chown-data"
    image: "busybox"
    # 1000 is the user that prometheus uses.
    command: ["chown", "-R", "1000:2000", /var/prometheus/data]
    volumeMounts:
    - name: prometheus-kube-prometheus-db
      mountPath: /var/prometheus/data
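
One caveat (an assumption on top of the chart commit linked above, not something shown in this thread): the pod-level securityContext forces runAsNonRoot: true and runAsUser: 1000, and that also applies to init containers, so the chown container would likely need to override it and run as root, roughly:

  initContainers:
  - name: "init-chown-data"
    image: "busybox"
    # chown needs root, so override the pod-level user settings for this container only
    securityContext:
      runAsNonRoot: false
      runAsUser: 0
    # 1000:2000 is the user/group that prometheus runs as here
    command: ["chown", "-R", "1000:2000", "/var/prometheus/data"]
    volumeMounts:
    - name: prometheus-kube-prometheus-db
      mountPath: /var/prometheus/data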

@IBMRob would you like to contribute to the project by applying these changes? Here is the guide explaining how to contribute.

If this were a simple YAML update I'd be more than happy to. Given that the statefulsets are actually generated in Go it might be more of a challenge, as I have no experience with the language.

Honestly the user/group we chose was a mistake, I've been trying to think of how we could migrate from 1000:2000 to nobody:nobody (65534:65534).

Although a bit more complicated what we could do is this:

  • For new statefulsets we just set 65534:65534 by default.
  • For existing statefulsets we keep the user/group settings and insert an init container that changes the existing file permissions to 65534:65534. Then in a subsequent release of the Prometheus operator we remove the init container migration.

How does that sound?

Ultimately I think I might also be ok with breaking this once without a migration, but would have to think about it some more.

I would be happy with that behavior @brancz.

I did try coding a short term fix but it's currently failing the e2e tests :(

We're hitting the same issue as @IBMRob.
Is there already a solution for that?
We tried to patch the statefulset and insert the init container required by the IBM PVC (see Adding non-root user access to NFS file storage)
using the following patch:

kubectl patch sts prometheus-kube-prometheus -n monitoring --patch "$(cat helm/kube-prometheus/patches/patch-sts-kube-prometheus.yaml)"

with

cat helm/kube-prometheus/patches/patch-sts-kube-prometheus.yaml
spec:
  template:
    spec:
      initContainers:
      - name: permissionsfix
        image: alpine:latest
        command: ["/bin/sh", "-c"]
        args:
          - chmod 777 /mount;
        volumeMounts:
        - name: volume
          mountPath: /mount

The patch gets applied but unfortunately it looks like it gets overridden immediately by the prometheuses.monitoring.coreos.com kube-prometheus
servicemonitor.
Is there a way to patch the servicemonitor to include the change above in the statefulset?
Or is there an outlook on when this gets fixed in the prometheus-operator?

Due to recent learnings I believe we should simply not preset any values in the first place, as the kubelet ensures the group of the volume mounted before the pod is started. We should just remove the default values.

@brancz @IBMRob are you working on a fix for this bug? It is an important feature for prometheus-operator to rely on persistent volumes instead of emptyDir{}.

I am not actively working to fix this.

Note that NFS is in many situations a bit wonky and is therefore unsupported by the Prometheus server. I recommend using a different storage provider.

In the next release there will be a new flag on the prometheus operator --disable-auto-user-group, which does not pre-set any of the user or fsgroup, so it should give you total control over that part.

Also the kubelet should be chowning the data volume appropriately already: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#discussion

fsGroup: Volumes that support ownership management are modified to be owned and writable by the GID specified in fsGroup

Choosing a different storage provider isn't really possible in the IBM Cloud Containers service, so we are stuck with NFS.

@brancz are you saying this will not be fixed in the current version? What's the ETA on the next release?

We're currently finishing up some work, but a new release should be out soon.

@brancz what is soon? Days, Weeks, Month? :)

Less than a month, more than a day, my estimation would be some time next week, but it might be the week after as well. Publishing dates is never good though, as it sets expectations and things can easily be delayed, so don't take my word :slightly_smiling_face:

@brancz thanks for the info. What would be behind this new flag --disable-auto-user-group on the prometheus operator? Would it only fix the current bug in the statefulset generation for mounting NFS storage, or would it generally change the user/group management in other kubernetes artifacts?

The point is that the kubelet should be taking care of the appropriate permissions on your volume, if it's not doing that, then there is little we can do in the prometheus operator apart from a hack. Instead I recommend preparing the volume yourself with the appropriate permissions.

The --disable-auto-user-group flag disables the prometheus operator from pre-setting user/group on pods.
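
For anyone wondering where that flag goes once it ships: presumably as an argument on the operator's own Deployment, along these lines (a sketch only; the image tag is a placeholder for whichever release actually carries the flag):

containers:
- name: prometheus-operator
  image: quay.io/coreos/prometheus-operator:vX.Y.Z   # placeholder: the release that ships the flag
  args:
  - --disable-auto-user-group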

@brancz We are pretty interested in getting a working solution for using NFS storage, so we are looking forward to getting this flag pretty soon.

As I said, the Prometheus project does not support NFS as a storage mechanism, if you run into trouble you're on your own. I do not recommend doing this - that said, if you're ok with those terms you can go ahead. The flag as described will land in the next release - although I'm still uncertain whether it will actually solve the permission problems and I believe you will still end up having to set permissions manually at setup of your volume.

The only way we figured out to set permissions manually is to do it at the pod level, which means using the init container approach referenced above. But this way is not doable with the prometheus-operator, as it will re-generate its statefulsets every time you try to patch them.

You can manually provision your volume and prepare the permissions; through the PVC it creates, the Prometheus Operator will always choose the same volume for the same Pod. The SecurityContext allows you to specify the user and group you want your Pod to run as, which lets you predict the permissions your volume will need. This is entirely possible without init containers, and is a one-off thing, which is why init containers are probably not the right solution for this.
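
To make the "prepare the volume yourself" suggestion concrete, a rough sketch (the name, server address and export path are made up; the point is that the NFS export is chowned to the UID/GID the pod runs as before the claim binds to it):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data            # hypothetical name
  labels:
    app: k8s-prometheus            # matches the matchLabels selector in the volumeClaimTemplate above
spec:
  capacity:
    storage: 20Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.0.2.10             # hypothetical NFS server
    path: /exports/prometheus      # export prepared beforehand, e.g. chown -R 1000:2000 on the NFS side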

If you want a solution now and today, you'll need to use a privileged security context, saying that you want to run the container as root.

  securityContext:
    runAsNonRoot: false
    runAsUser: 0
    fsGroup: 0

Overriding the securitycontext is already possible: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec

@brancz we hear you but where to put that part really?

A full example of the above could be

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: persisted
spec:
  replicas: 1
  storage:
    volumeClaimTemplate:
      metadata:
        annotations:
          annotation1: foo
      spec:
        resources:
          requests:
            storage: 1Gi
  securityContext:
    runAsNonRoot: false
    runAsUser: 0
    fsGroup: 0

This is simply the manifest example from the storage user guide plus the security context.

we are working with helm charts

@brancz I have added the securityContext in the storageSpec section of the prometheus-operator's values file, but it is not picked up during deployment:

storageSpec:
    class: ""
    selector:
      matchLabels:
        app: prometheus
    resources: {}
    volumeClaimTemplate:
      spec:
        storageClassName: ibmc-file-gold
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 50Gi
      selector:
        matchLabels:
          app: prometheus
    securityContext:
      runAsNonRoot: false
      runAsUser: 0
      fsGroup: 0

So, I will give up for now and wait for flag in the next release.

@MIllgner the securityContext is a sibling of the storage definition, not within it. I believe this is just a matter of adding support to helm. cc @gianrubio
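
To illustrate the placement (a sketch of the values shape only; as the next comment notes, the chart did not actually pass securityContext through at the time, so this is what a chart change would need to support):

prometheus:
  storageSpec:
    volumeClaimTemplate:
      spec:
        storageClassName: ibmc-file-gold
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 50Gi
  securityContext:        # sibling of storageSpec, not nested inside it
    runAsNonRoot: false
    runAsUser: 0
    fsGroup: 0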

Helm chart doesn't support securityContext for now, @MIllgner do you want to send a PR to us adding this ability?

@gianrubio @brancz The issue is still that the prometheus operator is using 1000:2000 instead of nobody:nobody (65534:65534), as you pointed out above. With that I am still getting the error in the prometheus server:
caller=main.go:582 err="Opening storage failed mkdir /var/prometheus/data/wal: permission denied"
I tried to override the securityContext in multiple places, but did not find any working patch. Is there any update on when the new release that provides the ability to override this user/group setting will be available?

@gianrubio @brancz When I add a pvc for alertmanager and grafana by adding the storageSpec in the values.yaml for the kube-prometheus chart, everything is fine. Two PVCs are created and can be mounted by alertmanager and grafana pod. There is only the permission issue on the prometheus server itself. Any help is appreciated to get this issue fixed.

@IBMRob the initContainer worked beautifully for me on a classic Prometheus container, though I had to use 65534:65534 on v2.1. Complete example here: https://github.com/figaw/gke-prometheus-mvp

@gianrubio @brancz The issue is still that the prometheus operator is using 1000:2000 instead of nobody:nobody (65534:65534), as you pointed out above. With that I am still getting the error in the prometheus server:
caller=main.go:582 err="Opening storage failed mkdir /var/prometheus/data/wal: permission denied"
I tried to override the securityContext in multiple places, but did not find any working patch. Is there any update on when the new release that provides the ability to override this user/group setting will be available?

We experienced the exact same permission problem when running on IBM Cloud. After setting the security context for Prometheus as suggested by @brancz the issue was gone.

For anyone who comes across this issue while looking for why your PersistentVolumeClaim isn't mounting (as I did), if you've recently updated to v0.26.0 of the Prometheus operator, see this issue: alertmanager and prometheus-k8s breaks podsecuritypolicy in 0.26

As @BrianChristie mentioned, this is due to the change in securityContext.

This is expected behavior given that the containers are (correctly) not running as root, and you're trying to write to a mounted NFS volume, which would be mounted as nobody by default. But I don't think the solution is to just run as nobody.

In any other k8s deployment you would need to set the correct fsGroup in the securityContext. In the Prometheus operator you can't just do that unfortunately, hence the issue filed to add back the default securityContext that was there prior to v0.26.

In Prometheus operator you can't just do that unfortunately

Why not? The PrometheusSpec has the securityContext field you can set to whatever you need/want it to be.
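
For completeness, a minimal sketch of setting that field directly on the Prometheus resource (the 65534 values are the nobody UID/GID discussed earlier in this thread, used here purely as an example):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    fsGroup: 65534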

prometheus chart supports chown using initContainer

https://github.com/helm/charts/blob/master/stable/prometheus/values.yaml#L272

I think operator should support this option.
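
For reference, the relevant knob in the stable/prometheus chart looks roughly like this (reproduced from memory, so treat the exact keys as approximate and check the values.yaml linked above):

initChownData:
  ## If false, data ownership will not be reset at startup
  enabled: true
  name: init-chown-data
  image:
    repository: busybox
    tag: latest
    pullPolicy: IfNotPresent
  resources: {}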

@gangseok514, this can be solved only in the new chart location, not here. The Helm code you see in this repository will never be changed again.

@paskal, I don't think the chart change fixes this problem.
initChownData is in the prometheus chart, not prometheus-operator.
prometheus-operator runs prometheus without the prometheus chart.
The initContainer config would have to be added in prometheus-operator's source code.

prometheus:
   ... 
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000

As above, changing the securityContext solves this problem, but why do we have to change the securityContext?
We could solve this problem simply by supporting initChownData.

@gangseok514, again, prometheus-operator here (in coreos) is deprecated and will never be changed again. prometheus-operator migrated to helm/stable and if you want to change it, you can change it only in helm/stable.

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

This is a really old issue. But I still got this with prometheus-operator 0.31.
BTW, it is not a good idea to run this container as root.

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

@gangseok514, again, prometheus-operator here (in coreos) is deprecated and will never be changed again. prometheus-operator migrated to helm/stable and if you want to change it, you can change it only in helm/stable.

@paskal What does this mean? I don't really understand. Is the prometheus-operator not developed anymore because of an alternative? As I understand it, helm/charts doesn't provide any software and/or custom CRDs...

@icy please see here: https://github.com/coreos/prometheus-operator#prometheus-operator-vs-kube-prometheus-vs-community-helm-chart

@icy please see here: https://github.com/coreos/prometheus-operator#prometheus-operator-vs-kube-prometheus-vs-community-helm-chart

Thank you, but I couldn't find any mention of deprecation there. Please correct me if I'm wrong.

It's been quite some time since the chart was moved. https://github.com/coreos/prometheus-operator/blob/master/helm/README.md

It's been quite some time since the chart was moved. https://github.com/coreos/prometheus-operator/blob/master/helm/README.md

Thanks a lot. So it's all about the template/helm chart support. I thought that the operator per se was being deprecated.

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

We're still experiencing the same issue with the latest version of the operator (0.38.0). We're deploying the operator on Azure AKS with Azure managed disks backing the persistent volumes. Is there any progress except for the initContainer workaround?

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

I'm also having this same exact problem.

Since we don't support the helm chart in this repo anymore, closing this issue. Feel free to open a new one if you think it's not valid, thanks!
