Cloud-on-k8s: Elastic Agent writes data into its own container

Created on 19 Feb 2021 · 8 Comments · Source: elastic/cloud-on-k8s

We mount /usr/lib/$K8S_NS/$NAME/agent-data to /usr/share/data, but Elastic Agent uses /usr/share/elastic-agent/data:

https://github.com/elastic/cloud-on-k8s/blob/072c591f69f190d3a8c30789014a517d8a648208/pkg/controller/agent/pod.go#L33

This means that Elastic Agent is writing data into its own container. That includes all binaries it installs as part of any configured packages (Metricbeat and Filebeat, for example). It also means that Agent will lose its identity and runtime state on container restarts.

The intention behind using the hostPath volume was to create a persistent store for Agent identity and runtime state. We are doing the same thing for Beats.

>bug v1.4.1

All 8 comments

This turns out to be non-trivial to fix. HostPath volumes are mounted noexec but Elastic Agent does try to install and execute the managed programs (typically Filebeat and Metricbeat) from its data directory.

Persistent Volumes (I tried the GKE ones) are mounted without the noexec flag, but we don't have a good way of using them from a DaemonSet. Also, I am not sure if we can rely on the mount flags being the same across all persistent volume plugins.

cc @david-kow

The filesystem structure is:
/usr/share/elastic-agent => home --path.home
/usr/share/elastic-agent/data => data directory derived from home
/usr/share/elastic-agent/data/elastic-agent-${hash}/run => runtime state derived from home
/usr/share/elastic-agent/data/elastic-agent-${hash}/install => install location for the programs to execute agent.download.install_path
/usr/share/elastic-agent/data/elastic-agent-${hash}/download => download location for packages to install agent.download.target_directory

The problem is that the install path by default is a subdirectory of the data directory which we want to have on the host via hostPath. But all we need to persist Agent identity across restarts is the runtime state in run.

@david-kow suggested overriding the latter two download settings so that Elastic Agent installs into a directory inside the container, where we can execute, while the data directory stays on the hostPath volume. Unfortunately, Agent still creates a tmp folder below the data directory and then tries to move files from there to the install directory via rename. Because the install directory is now on a different filesystem, the rename fails:

2021-02-22T15:00:30.935Z    ERROR    log/reporter.go:36    2021-02-22T15:00:30Z: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--7.11.0[1ecf5613-accd-4c8c-930e-e432c726a43d]: State changed to FAILED: rename /usr/share/elastic-agent/var/lib/data/tmp/elastic-agent-install004569386/metricbeat-7.11.0-linux-x86_64 /usr/share/elastic-agent/data/install/metricbeat-7.11.0-linux-x86_64: invalid cross-device link              
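The "invalid cross-device link" text in the log above is the message for the EXDEV errno, which rename returns when source and destination are on different filesystems. The following is a minimal Go sketch of how such an error can be detected and handled with a copy-and-remove fallback. It is an illustration only, not Elastic Agent's code: the names isCrossDevice, simulatedRenameError, moveFile and moveDemo are hypothetical, the paths in the simulated error are made up, and the sketch moves a single file whereas the failing step in the log moves a directory.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// isCrossDevice reports whether err wraps EXDEV, the errno whose message is
// "invalid cross-device link".
func isCrossDevice(err error) bool {
	return errors.Is(err, syscall.EXDEV)
}

// simulatedRenameError builds the kind of error os.Rename returns when the
// two paths live on different filesystems. The paths are made up.
func simulatedRenameError() error {
	return &os.LinkError{Op: "rename", Old: "/data/tmp/pkg", New: "/install/pkg", Err: syscall.EXDEV}
}

// moveFile (hypothetical helper) tries a rename first and falls back to
// copy + remove only when the rename fails with EXDEV.
func moveFile(src, dst string) error {
	err := os.Rename(src, dst)
	if err == nil || !isCrossDevice(err) {
		return err
	}
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	info, err := in.Stat()
	if err != nil {
		return err
	}
	// Preserve the permission bits so an executable stays executable.
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, info.Mode().Perm())
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	if err := out.Close(); err != nil {
		return err
	}
	return os.Remove(src)
}

// moveDemo moves a file inside one temporary directory (same filesystem, so
// the plain rename path is taken) and returns the moved file's content.
func moveDemo() (string, error) {
	dir, err := os.MkdirTemp("", "move-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	src := filepath.Join(dir, "src.bin")
	dst := filepath.Join(dir, "dst.bin")
	if err := os.WriteFile(src, []byte("payload"), 0o755); err != nil {
		return "", err
	}
	if err := moveFile(src, dst); err != nil {
		return "", err
	}
	data, err := os.ReadFile(dst)
	return string(data), err
}

func main() {
	fmt.Println("cross-device:", isCrossDevice(simulatedRenameError()))
	content, err := moveDemo()
	fmt.Println(content, err)
}
```

Note that a fallback like this would not help on its own if the destination were mounted noexec; it only addresses the rename across filesystems.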

The only approach I got working was to bind mount on top of the run directory, which unfortunately requires the operator to know the git hash from which Elastic Agent was built because it is part of the directory name:
/usr/share/elastic-agent/data/elastic-agent-84c4d4/run
This has the massive disadvantage that we need to figure out a way to find the hash of the Elastic Agent build ahead of deploy time.

I raised an issue against the Beats/Elastic Agent repository, but a fix in Elastic Agent could take a few releases.

Until then, we can do two things:

  1. Document the limitation and wait for new Elastic Agent releases
  2. Implement a workaround. My only idea for that so far is as follows:

We bind mount the hostPath volume to /usr/share/elastic-agent/data/elastic-agent-${hash}/run
We figure out the hash by running a k8s job/pod to inspect the Elastic Agent Docker container

apiVersion: batch/v1
kind: Job
metadata:
  name: agent-inspect
spec:
  template:
    spec:
      containers:
      - name: agent
        image: docker.elastic.co/beats/elastic-agent:7.11.1
        command: ["ls",  "/usr/share/elastic-agent/data"]
      restartPolicy: Never

We keep that information around in memory or even in a ConfigMap keyed by Elastic Agent version and use it to construct the volume mount.
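As a rough sketch of that idea, the Go snippet below turns the output of the `ls` command from the Job into the run directory to mount over, and caches the result per Elastic Agent version. Everything here is an assumption made for illustration: the names runDirFromListing and runDirCache are hypothetical, the hash is assumed to be lowercase hex as in the elastic-agent-84c4d4 example, and the in-memory map stands in for whatever store (such as a ConfigMap) the operator would actually use.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// agentHome is the Elastic Agent home directory from the layout above.
const agentHome = "/usr/share/elastic-agent"

// versionedDir matches names such as "elastic-agent-84c4d4".
// Assumption: the build hash is lowercase hex.
var versionedDir = regexp.MustCompile(`^elastic-agent-[0-9a-f]+$`)

// runDirFromListing takes the output of `ls /usr/share/elastic-agent/data`
// and returns the run directory below the first elastic-agent-<hash> entry.
func runDirFromListing(listing string) (string, error) {
	for _, name := range strings.Fields(listing) {
		if versionedDir.MatchString(name) {
			return path.Join(agentHome, "data", name, "run"), nil
		}
	}
	return "", fmt.Errorf("no elastic-agent-<hash> directory in listing %q", listing)
}

// runDirCache (hypothetical) remembers the run directory per Agent version
// so the inspection only has to happen once per version.
type runDirCache map[string]string

// lookup returns the cached run directory for version, calling inspect to
// obtain a directory listing only on a cache miss.
func (c runDirCache) lookup(version string, inspect func() (string, error)) (string, error) {
	if dir, ok := c[version]; ok {
		return dir, nil
	}
	listing, err := inspect()
	if err != nil {
		return "", err
	}
	dir, err := runDirFromListing(listing)
	if err != nil {
		return "", err
	}
	c[version] = dir
	return dir, nil
}

func main() {
	cache := runDirCache{}
	// The inspect function stands in for reading the inspection Job's logs.
	// The pairing of this version with this hash is illustrative only.
	dir, err := cache.lookup("7.11.1", func() (string, error) {
		return "elastic-agent-84c4d4\n", nil
	})
	fmt.Println(dir, err)
}
```

The returned path would then be used as the mountPath for the hostPath volume; how the operator reads the Job output is left out of the sketch.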

It's not great but should do the trick. Open to alternative suggestions.

Can we somehow avoid storing the hash information somewhere by running a fancy one-liner script in an init container that would create a symlink from /usr/share/elastic-agent/data/elastic-agent-${hash}/run to somewhere in the mounted volume?

Cross device/filesystem symlinks are not possible afaik

I prototyped a workaround here. There is another idea that uses a custom entrypoint to symlink the run directory to the hostPath volume. It has the advantage that it does not need another Pod to be spun up temporarily, but otherwise it shares the same drawbacks as the Pod-based approach:

  • potentially makes the operator incompatible with future versions of Agent
  • we don't have a good way of retracting the workaround when the issue is fixed in Agent

Alternatively we could just document the limitation and explain how users can mount a hostPath volume themselves to the right path for the time being.

I think I'm leaning towards this. We can even keep this documentation updated with right paths for any future Elastic Agent versions released without the fix.

Closing this: we added documentation and adjusted the recipes. A fix is merged in Elastic Agent and will ship with 7.13. I have raised https://github.com/elastic/cloud-on-k8s/issues/4260 to integrate the Agent fix into ECK's Agent controller.
