Beats: [Agent] persistence of configuration

Created on 13 Jan 2020 · 18 Comments · Source: elastic/beats

The agent needs to persist the last configuration retrieved from Fleet. When the agent receives a policy change, it _ACKs_ the change, and Fleet will never send that configuration again. We need to persist this configuration locally.

This is useful in the following use cases:

- When the Agent is restarted for any reason (crash, normal maintenance reboot, etc.).
- When the Agent runs on an ephemeral machine like a laptop.

We could implement this the following way: when the "ActionPolicyChange" is received, and just before ACKing it with the _FleetGateway_, we persist the change to disk if the action was successfully handled by the handler.
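The ordering described above (handle, then persist, then ACK) can be sketched in Go as follows. This is only an illustration under stated assumptions: `PolicyChange`, `Handler`, `Store`, `Acker`, and `process` are hypothetical names, not the Agent's real types or the actual _FleetGateway_ API.

```go
package main

import (
	"errors"
	"fmt"
)

// PolicyChange is a hypothetical, simplified stand-in for "ActionPolicyChange".
type PolicyChange struct {
	ID     string
	Config []byte
}

// Handler applies a policy change to the running system (hypothetical).
type Handler interface{ Handle(PolicyChange) error }

// Store persists the last successfully applied policy change (hypothetical).
type Store interface{ Save(PolicyChange) error }

// Acker acknowledges an action to Fleet (hypothetical).
type Acker interface{ Ack(id string) error }

// process applies the change, persists it, and only then ACKs it.
// If handling or persisting fails, the action is never ACKed.
func process(a PolicyChange, h Handler, s Store, k Acker) error {
	if err := h.Handle(a); err != nil {
		return fmt.Errorf("handle %s: %w", a.ID, err)
	}
	if err := s.Save(a); err != nil {
		return fmt.Errorf("persist %s: %w", a.ID, err)
	}
	return k.Ack(a.ID)
}

// recorder implements all three interfaces and records the call order.
type recorder struct {
	calls    []string
	failSave bool
}

func (r *recorder) Handle(PolicyChange) error {
	r.calls = append(r.calls, "handle")
	return nil
}

func (r *recorder) Save(PolicyChange) error {
	r.calls = append(r.calls, "save")
	if r.failSave {
		return errors.New("disk full")
	}
	return nil
}

func (r *recorder) Ack(string) error {
	r.calls = append(r.calls, "ack")
	return nil
}

func main() {
	r := &recorder{}
	err := process(PolicyChange{ID: "1"}, r, r, r)
	fmt.Println(r.calls, err)
}
```

With this ordering, a failed write leaves the action unACKed, so Fleet would presumably offer it again on the next checkin.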

Now the question arises of what we should persist to disk:

- Persist the configuration: persisting the actual configuration might be simpler; it's a "unique" thing.
- Persist the ActionPolicyChange: persist the whole action, and when we are back online we could replay some or all of the actions.

Now, the way the code is structured means that if we send the action again to the _ActionDispatcher_, we could restore the appropriate state of the system. The only difference would be the ACKer?

Most helpful comment

@ferullo If this is okay with you, I think we could do something like this: when the Agent receives a config and successfully applies it, we persist it to disk. When the Agent starts, it reads the configuration and sends it to Endpoint.

Endpoint is free to keep a local cached version of the configuration. When it receives the configuration from Agent, it can assert that it's the same and just start operating with the local version.

> There's also a chance that we'll receive additional non-Policy tasking/configuration that needs to be persisted across crashes/reboots (e.g. if "host isolation" is implemented as a task and not a part of Policy).

I think this will also be required in the future for the agent, but we can add it later.

ph on 15 Jan 2020

👍2

All 18 comments

Pinging @elastic/ingest (Project:fleet)

elasticmachine on 13 Jan 2020

fyi @michalpristas

ph on 13 Jan 2020

I think I like the idea of replaying the last change on start before pulling anything else. This would also produce an additional ACK which can be ignored on fleet side or used to detect undesired restarts.

michalpristas on 13 Jan 2020

> ACK which can be ignored on fleet side or used to detect undesired restarts

I hadn't thought of that, but would that require the FleetGateway to handle the "reboot"?

ph on 13 Jan 2020

@ph I don't think so. A reboot is fine and can be ACKed before the reboot. After the reboot, the agent would load what is stored as usual, apply the config, and ACK it.
Then it would check in with Fleet, make a diff based on the new config (the reboot was ACKed up front), back up the policy change, apply the changes, and ACK them.

In case of an undesired restart (agent crash) it would be similar: the agent crashes, loads whatever is stored, applies the policy change, and ACKs it. Then it asks for new changes, and so on.

When Fleet sees a recurring ACK of the same event, it can tell that something is wrong with the agent and show a warning. With an unexpected crash, we don't have a way of reporting the event to Fleet, as the whole batch disappears. Fleet will just see another checkin.

michalpristas on 14 Jan 2020

So if I understand correctly, the following will happen:

1. Agent starts.
2. Agent calls checkin => [PolicyChange(actionID: 1)]
3. Agent dispatches [PolicyChange(actionID: 1)]
4. Agent diffs the configuration (all new).
5. Agent applies the configuration.
6. Agent persists [PolicyChange(actionID: 1)] on disk.
7. Agent ACKs actionID: 1.
8. Agent stops (assuming gracefully).
9. Agent restarts.
10. Agent reads the action on disk: [PolicyChange(actionID: 1)]
11. Agent dispatches [PolicyChange(actionID: 1)]
12. Agent diffs the configuration (all new).
13. Agent applies the configuration.
14. Agent calls checkin => [PolicyChange(actionID: 1)]

@michalpristas If this is the model you propose, I think it fits well with our current code and shouldn't require much new code to make it work. My only question is who will be responsible for steps 10-11. I presume the _ActionHandler_ will do the persisting, or would it be the _ActionDispatcher_?
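The restart part of the sequence above (read the persisted action, dispatch it, then check in) could look roughly like this. It is a sketch only: `Action`, `Dispatcher`, `startup`, and the `load`/`checkin` callbacks are hypothetical names, and which component owns this logic is exactly the open question.

```go
package main

import "fmt"

// Action is a hypothetical stand-in for a persisted PolicyChange.
type Action struct {
	ID string
}

// Dispatcher is a hypothetical stand-in for the ActionDispatcher entry point.
type Dispatcher func(Action) error

// startup replays the persisted action, if any, before the first checkin.
// load reports ok=false when nothing has been persisted yet.
func startup(load func() (Action, bool), dispatch Dispatcher, checkin func() []Action) error {
	if a, ok := load(); ok {
		if err := dispatch(a); err != nil {
			return fmt.Errorf("replay %s: %w", a.ID, err)
		}
	}
	for _, a := range checkin() {
		if err := dispatch(a); err != nil {
			return fmt.Errorf("dispatch %s: %w", a.ID, err)
		}
	}
	return nil
}

func main() {
	load := func() (Action, bool) { return Action{ID: "1"}, true }
	dispatch := func(a Action) error {
		fmt.Println("dispatch", a.ID)
		return nil
	}
	checkin := func() []Action { return nil }
	if err := startup(load, dispatch, checkin); err != nil {
		fmt.Println("startup failed:", err)
	}
}
```

Keeping the replay in front of the checkin loop means the same dispatch path serves both persisted and fresh actions; only the ACK behaviour would need to differ.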

ph on 14 Jan 2020

Also, I don't want us to overcomplicate things here. For the first version, we can get away with persisting only the last _good_ policy change and nothing else.

ph on 14 Jan 2020

I think, for starters, persisting the last good policy seems OK.

michalpristas on 14 Jan 2020

👍1

Also, let's keep it unencrypted for now, in a separate file, and protect that data in a second step. We can use the same io.Reader strategy as before.
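A minimal sketch of writing the policy to its own unencrypted file from an `io.Reader` follows. The write-to-temp-then-rename approach is an assumption added here to avoid leaving a truncated file after a crash (rename is atomic on POSIX file systems); the file name `last_policy.yml` and the helpers `save` and `roundTrip` are hypothetical, and this is not necessarily the "io.Reader strategy" already used in the code base.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// save copies r into path through a temporary file and a rename, so a crash
// mid-write leaves the previous file in place instead of a truncated one.
func save(path string, r io.Reader) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), "policy-*.tmp")
	if err != nil {
		return err
	}
	// Clean up the temp file on failure; after a successful rename the
	// file no longer exists under this name and the error is ignored.
	defer os.Remove(tmp.Name())

	if _, err := io.Copy(tmp, r); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip saves data in a fresh temporary directory and reads it back.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "agent-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "last_policy.yml") // hypothetical file name
	if err := save(path, strings.NewReader(data)); err != nil {
		return "", err
	}
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	b, err := io.ReadAll(f)
	return string(b), err
}

func main() {
	got, err := roundTrip("outputs: {}\n")
	fmt.Printf("%q %v\n", got, err)
}
```

Because `save` only depends on `io.Reader`, an encrypting reader or writer could be layered in later without changing the callers.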

ph on 14 Jan 2020

👍1

I am going to take this.

ph on 14 Jan 2020

I think we might want some defensive code here. Let's say we receive a configuration from Fleet and ACK it, but crash before persisting it to disk. With the current ACK strategy, the agent will never receive that configuration again. We will need some way to communicate that "orphan" state back to Fleet. I wonder if we should periodically report the config ID (or hash) that we are currently running.

I am going to create a followup issue for that specific use case.
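To make the hash-reporting idea concrete, here is a small sketch. The choice of SHA-256, the `CheckinRequest` shape, and the `needsResend` check are all assumptions for illustration; nothing here reflects the real checkin payload.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHash returns a hex-encoded SHA-256 digest of the raw configuration.
func configHash(cfg []byte) string {
	sum := sha256.Sum256(cfg)
	return hex.EncodeToString(sum[:])
}

// CheckinRequest is a hypothetical checkin payload carrying the hash of the
// configuration the agent is currently running.
type CheckinRequest struct {
	AgentID    string
	ConfigHash string
}

// needsResend is a hypothetical Fleet-side check: if the reported hash does
// not match the configuration Fleet expects, the policy should be sent again.
func needsResend(req CheckinRequest, expected []byte) bool {
	return req.ConfigHash != configHash(expected)
}

func main() {
	cfg := []byte("outputs: {}")
	req := CheckinRequest{AgentID: "agent-a", ConfigHash: configHash(cfg)}
	fmt.Println(needsResend(req, cfg))
	fmt.Println(needsResend(CheckinRequest{AgentID: "agent-a"}, cfg))
}
```

An agent that crashed before persisting would report an empty or stale hash, which is what lets the "orphan" state be detected.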

ph on 14 Jan 2020

@ferullo Pinging you here, as we also discussed this in the context of Endpoint, which will need it too.

ruflin on 15 Jan 2020

@ruflin @ferullo What is the link between the two, Endpoint and Agent?

ph on 15 Jan 2020

I had expected that Endpoint would save its last good policy in the clear as a user-viewable YAML file, at least to start, as this discussion seems to be concluding that Agent will do for itself.

I think it will be simpler if Endpoint is responsible for saving its config on its own rather than have Agent save it.

- Endpoint will download some artifacts out of band (e.g. malware data science models) and so will already be managing some other configuration state, as well as storing its own non-configuration persisted data (like local copies of events not streamed to Elasticsearch).
- There's also a chance that we'll receive additional non-Policy tasking/configuration that needs to be persisted across crashes/reboots (e.g. if "host isolation" is implemented as a task and not a part of Policy).
- Since we have concluded that Endpoint will be a separately started service from Agent, it seems we should be able to get up and running without needing to connect to Agent to get a copy of our configuration. That way both services can come up and down independently.

Having Endpoint store all of its persisted state seems simpler on our end. I'm not sure if this is the case for Beats, but Endpoint is already going to need to manage on-disk persisted state. Fully owning that won't add new requirements for us.

ferullo on 15 Jan 2020

👍1

@ph I think that would work. Am I right that in that model, from Endpoint's point of view, the config Agent persists and sends on startup would appear to Endpoint as just a new configuration request, akin to a user changing the Policy in Kibana?

CC @brian-mckinney for visibility.

ferullo on 16 Jan 2020