We mount /usr/lib/$K8S_NS/$NAME/agent-data to /usr/share/data, but Elastic Agent uses /usr/share/elastic-agent/data.
This means that Elastic Agent is writing data into its own container. That includes all binaries it installs as part of any configured packages (Metricbeat and Filebeat, for example). It also means that it will lose its identity and runtime state on container restarts.
The intention behind using the hostPath volume was to create a persistent store for Agent identity and runtime state. We are doing the same thing for Beats.
This turns out to be non-trivial to fix. HostPath volumes are mounted noexec but Elastic Agent does try to install and execute the managed programs (typically Filebeat and Metricbeat) from its data directory.
Persistent Volumes (I tried the GKE ones) are mounted without the noexec flag, but we don't have a good way of using them from a DaemonSet. Also, I am not sure if we can rely on the mount flags being the same across all persistent volume plugins.
cc @david-kow
The filesystem structure is:
/usr/share/elastic-agent => home (--path.home)
/usr/share/elastic-agent/data => data directory, derived from home
/usr/share/elastic-agent/data/elastic-agent-${hash}/run => runtime state, derived from home
/usr/share/elastic-agent/data/elastic-agent-${hash}/install => install location for the programs to execute (agent.download.install_path)
/usr/share/elastic-agent/data/elastic-agent-${hash}/download => download location for packages to install (agent.download.target_directory)
The problem is that the install path by default is a subdirectory of the data directory which we want to have on the host via hostPath. But all we need to persist Agent identity across restarts is the runtime state in run.
@david-kow suggested overriding the latter two download settings so that Elastic Agent installs into a directory inside the container, where we can execute, while the data directory stays on the hostPath volume. Unfortunately, Agent still creates a tmp folder below the data directory and then tries to move the files from there to the install directory via rename. The install directory is now on a different filesystem, so the rename fails:
2021-02-22T15:00:30.935Z ERROR log/reporter.go:36 2021-02-22T15:00:30Z: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--7.11.0[1ecf5613-accd-4c8c-930e-e432c726a43d]: State changed to FAILED: rename /usr/share/elastic-agent/var/lib/data/tmp/elastic-agent-install004569386/metricbeat-7.11.0-linux-x86_64 /usr/share/elastic-agent/data/install/metricbeat-7.11.0-linux-x86_64: invalid cross-device link
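For reference, the override that was tried would look roughly like the following in the Agent configuration. This is only a sketch: the two setting names come from the directory listing above, the install path is taken from the log line, and the download path is an assumption chosen for illustration. As the log shows, this alone does not work, because the temporary directory stays on the hostPath filesystem.

```yaml
# Sketch of the attempted override (this did NOT resolve the issue).
agent:
  download:
    # Install location inside the container filesystem, as seen in the log above.
    install_path: /usr/share/elastic-agent/data/install
    # Assumed download location inside the container filesystem (illustrative).
    target_directory: /usr/share/elastic-agent/data/download
```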
The only approach I got working was to bind mount on top of the run directory. Unfortunately, this requires the operator to know the git hash from which Elastic Agent was built, because it is part of the directory name:
/usr/share/elastic-agent/data/elastic-agent-84c4d4/run
This has the massive disadvantage that we need to figure out a way to find the hash of the Elastic Agent build ahead of deploy time.
I raised an issue against the Beats/Elastic Agent repository. Until this is fixed in Elastic Agent, which could take a few releases, we can do two things:
We bind mount the hostPath volume to /usr/share/elastic-agent/data/elastic-agent-${hash}/run
We figure out the hash by running a k8s job/pod to inspect the Elastic Agent Docker container
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-inspect
spec:
  template:
    spec:
      containers:
      - name: agent
        image: docker.elastic.co/beats/elastic-agent:7.11.1
        command: ["ls", "/usr/share/elastic-agent/data"]
      restartPolicy: Never
We keep that information around in memory or even in a ConfigMap keyed by Elastic Agent version and use it to construct the volume mount.
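A ConfigMap holding that mapping could look something like this. It is a minimal sketch: the ConfigMap name is hypothetical, and the hash value is the example hash mentioned earlier in this thread, used here purely as a placeholder rather than the verified hash for 7.11.1.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical name, not an existing ECK resource.
  name: elastic-agent-build-hashes
data:
  # Elastic Agent version -> build hash used in the data directory name.
  # Placeholder value; the real one would come from the inspect Job's output.
  "7.11.1": "84c4d4"
```

The operator would then look up the entry for the requested version and build the mount path /usr/share/elastic-agent/data/elastic-agent-${hash}/run from it.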
It's not great, but it should do the trick. Open to alternative suggestions.
Can we somehow avoid storing the hash information somewhere by running a fancy one-liner script in an init container that would create a symlink from /usr/share/elastic-agent/data/elastic-agent-${hash}/run to somewhere in the mounted volume?
Cross device/filesystem symlinks are not possible afaik
I prototyped a workaround here. There is another idea that uses a custom entrypoint to symlink the run directory to the hostPath volume. It has the advantage that it does not need another Pod to be spun up temporarily, but otherwise it shares the same drawback as the Pod-based approach.
Alternatively we could just document the limitation and explain how users can mount a hostPath volume themselves to the right path for the time being.
I think I'm leaning towards this. We can even keep this documentation updated with right paths for any future Elastic Agent versions released without the fix.
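As a rough illustration of what such documentation might show, here is a Pod template fragment that mounts a hostPath volume directly onto the run directory. It is a sketch under assumptions: the namespace "default" and Agent name "my-agent" in the host path are hypothetical, and the hash 84c4d4 is the example from this thread, which must be replaced with the one matching the Agent image in use.

```yaml
# Pod template fragment (sketch, not a complete manifest).
spec:
  containers:
    - name: agent
      image: docker.elastic.co/beats/elastic-agent:7.11.1
      volumeMounts:
        - name: agent-data
          # The hash in this path must match the Agent build in the image above;
          # 84c4d4 is only the example value from this thread.
          mountPath: /usr/share/elastic-agent/data/elastic-agent-84c4d4/run
  volumes:
    - name: agent-data
      hostPath:
        # Hypothetical namespace "default" and Agent name "my-agent".
        path: /usr/lib/default/my-agent/agent-data
        type: DirectoryOrCreate
```

Only the run directory lives on the host in this setup, so the install directory stays on the container filesystem, where binaries can be executed.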
Closing this: we added documentation and adjusted the recipes. A fix has been merged in Elastic Agent and will ship with 7.13. I have raised https://github.com/elastic/cloud-on-k8s/issues/4260 to integrate the Agent fix into ECK's Agent controller.