Beats: [Agent] Batch reported events to Fleet

Created on 2 Jan 2020 · 16 Comments · Source: elastic/beats

As noted in https://github.com/elastic/beats/pull/15237#discussion_r362079722, Kibana limits the request body size to 1mb (this is the default size of Hapijs). Because the Agent can report a lot of events, we need to consider splitting a slice of events into batches.
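To make the idea concrete, here is a minimal sketch of size-based batching. It is not the Agent's actual code: the `Event` type, the `splitBySize` helper, and the way size is estimated (marshalled JSON length plus one byte for a separator) are all assumptions for illustration, and the estimate ignores the request envelope.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical stand-in for whatever the Agent reports.
type Event struct {
	Type    string `json:"type"`
	Message string `json:"message"`
}

// splitBySize groups events into batches whose estimated JSON size stays
// at or below limit. The estimate is the marshalled length of each event
// plus one byte for a separating comma; it does not count the surrounding
// request body. An event larger than limit ends up alone in its own batch.
func splitBySize(events []Event, limit int) ([][]Event, error) {
	var batches [][]Event
	var current []Event
	size := 0
	for _, ev := range events {
		raw, err := json.Marshal(ev)
		if err != nil {
			return nil, err
		}
		n := len(raw) + 1
		// Close the current batch if adding this event would exceed the limit.
		if len(current) > 0 && size+n > limit {
			batches = append(batches, current)
			current, size = nil, 0
		}
		current = append(current, ev)
		size += n
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches, nil
}

func main() {
	events := make([]Event, 5)
	for i := range events {
		events[i] = Event{Type: "STATE", Message: fmt.Sprintf("event %d", i)}
	}
	// A tiny limit is used here only to force several batches.
	batches, err := splitBySize(events, 100)
	if err != nil {
		panic(err)
	}
	fmt.Println("batches:", len(batches))
}
```

In practice the limit would be set somewhat below 1mb to leave headroom for the rest of the request body; how much headroom is needed is not stated in this thread.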

Related to https://github.com/elastic/beats/issues/15297

All 16 comments

Pinging @elastic/ingest (Project:fleet)

I wonder if batching will cause a problem with the fact that we use a single endpoint for configuration retrieval and sending events. Let's consider the situation where we use a fixed number of events as the limit for simplicity.

Batch size limit of 10 events.

Happy path

  1. Agent has no local events to report.
  2. Agent retrieves configuration changes.
  3. Agent generates an ACK event.
  4. Agent sends 1 event (ACK event).
  5. Agent waits for the next tick.

Generated events exceed the batch size limit.

  1. Agent has no local events to report.
  2. Agent retrieves configuration changes.
  3. Agent generates an ACK event.
  4. Agent has 10 local events (10 + 1 ACK)
  5. Agent sends 10 events
  6. Agent sends 1 Event. (ACK event of the configuration change)
  7. Agent waits for the next tick.

If the Agent didn't send the ACK Event for the configuration does this mean we will receive the action again at step 5?

@mattapperson @nchaulet @michalpristas ?

@ph yes it's going to send the action again

@nchaulet @mattapperson This is indeed a problem, either we:

  1. Prioritize ACK event.
  2. Different endpoint.

Option 1, prioritizing ACK events, should do the job, no? We should be able to send at least a few thousand events within 1mb, so is it really a problem?

@nchaulet yes I think you are right, prioritizing should do the trick.
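A rough sketch of what "prioritize ACK events" could look like when building a batch. The `Event` type, its `IsACK` flag, and `nextBatch` are hypothetical names, and the count-based limit follows the simplification used earlier in this thread (a batch size limit of 10 events).

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical queued event; IsACK marks action acknowledgements.
type Event struct {
	ID    string
	IsACK bool
}

// nextBatch returns up to limit events with ACKs moved to the front,
// plus the events left over for the next request. limit is assumed to
// be non-negative. The input queue is not modified.
func nextBatch(queue []Event, limit int) (batch, rest []Event) {
	ordered := make([]Event, len(queue))
	copy(ordered, queue)
	// A stable sort keeps arrival order among ACKs and among other events.
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].IsACK && !ordered[j].IsACK
	})
	if limit > len(ordered) {
		limit = len(ordered)
	}
	return ordered[:limit], ordered[limit:]
}

func main() {
	// 10 regular events followed by 1 ACK, as in the scenario above.
	var queue []Event
	for i := 0; i < 10; i++ {
		queue = append(queue, Event{ID: fmt.Sprintf("e%d", i)})
	}
	queue = append(queue, Event{ID: "ack-policy-change", IsACK: true})

	batch, rest := nextBatch(queue, 10)
	fmt.Println("first in batch is ACK:", batch[0].IsACK)
	fmt.Println("left for next request:", len(rest))
}
```

With this ordering the ACK goes out in the first request, so the check-in response at that step no longer needs to repeat the action.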

@ph @nchaulet let's say we have 10 actions resulting out of checkin command.
until next tick we manage to complete 5 of them. so reporter will report
5xACK + 5 other events

will fleet generate remaining actions again or should we postpone sending other events until all ACKs are reported?

@michalpristas Fleet will resend every action until we get the ACK

so i guess we have 2 options. one is to introduce a deduplication layer which filters out these actions until they are acked or some timeframe has passed.
the other is to block sending of events until all ACKs for the latest batch are present in the queue.

i think i like both solutions equally, but the second one might have a slightly negative impact on memory consumption. if we are in the middle of applying a policy change and another policy change comes in, i fear that we might end up with an incorrect diff.
what do you prefer @ph?
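For the first option, a deduplication layer could be sketched roughly as below. Everything here is an assumption for illustration: the `Deduper` type, measuring the "timeframe" in check-in ticks, and identifying actions by a string ID. It is also not safe for concurrent use, which matches the mostly sequential flow but would need a mutex otherwise.

```go
package main

import "fmt"

// Deduper remembers actions that were already handed to the Agent and
// hides resends of them until they are acked or ttlTicks ticks have passed.
type Deduper struct {
	ttlTicks int
	seen     map[string]int // action ID -> tick at which it was last accepted
}

func NewDeduper(ttlTicks int) *Deduper {
	return &Deduper{ttlTicks: ttlTicks, seen: make(map[string]int)}
}

// Filter returns the action IDs that should be executed at this tick,
// dropping those that are still in flight.
func (d *Deduper) Filter(ids []string, tick int) []string {
	var fresh []string
	for _, id := range ids {
		if first, ok := d.seen[id]; ok && tick-first < d.ttlTicks {
			continue // resent by Fleet while still in flight
		}
		d.seen[id] = tick
		fresh = append(fresh, id)
	}
	return fresh
}

// Acked forgets an action once its ACK has been delivered.
func (d *Deduper) Acked(id string) {
	delete(d.seen, id)
}

func main() {
	d := NewDeduper(3)
	fmt.Println(d.Filter([]string{"a", "b"}, 0))      // both are new
	fmt.Println(d.Filter([]string{"a", "b", "c"}, 1)) // only c is new
	d.Acked("a")
	d.Acked("b")
}
```

The memory cost is one map entry per unacked action, which is the trade-off against the second option of holding back events.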

I would be in favor of making the ACK work first without worrying too much about having a limited batch size.

The implementation we have is mostly sequential, meaning in a fetch loop we will do this: Fetch, Parse, Dispatch actions, execute actions and report the ack.

Concerning the two suggestions, deduplication vs tracking ACK, let's try to expand the use case to support future features. Right now the only kind of "Action" is a _policy change_ but soon we will need to support more operations:

  1. Pause
  2. Restart
  3. Send command to a specific process (endpoint)
  4. Queries?

All of the above will also require a reliable ACK for the action. In that case, would it make more sense to also "TRACK" them in memory? If we do this, we say that these ACKs are what the action handlers return, rather than letting the reporter handle them. It doesn't change much in the implementation, but we don't need to add special logic to the reporter.

This means the reporter is indeed just a general-purpose reporter targeting async reporting. Not sure it's the right idea yet, just throwing it out there.

i see no problem in returning ACK for now, but restart might cause some issues: ACKs are lost when the machine is restarted sooner than the restart is ACKed, so we will end up with a new restart action when we ask fleet for next steps, and we might end up in a restart loop.
the same race i think would be possible using the reporter, less likely but still possible.
also we can lose the whole in-memory collection of ACKs, which will just increase the load put on the agent when it needs to perform the whole set of actions again.

what about ACKing immediately in a handler, so e.g restart handler can ACK before restarting and other handlers can ACK at different point when it makes sense. i would suggest behaving like we have 2 endpoints one for checking another one for ACKing. and when acking we would just ignore whatever comes from fleet.
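A sketch of that idea, with each handler deciding when to ACK. The `Acker` interface, the `Action` type, and both handler functions are hypothetical; `memoryAcker` only records IDs in memory and stands in for whatever would actually call Fleet.

```go
package main

import "fmt"

// Acker is a hypothetical interface for delivering an ACK to Fleet.
type Acker interface {
	Ack(actionID string) error
}

// Action is a hypothetical action received from a check-in.
type Action struct {
	ID   string
	Type string
}

// memoryAcker records acked IDs in memory; a real one would do a request.
type memoryAcker struct {
	acked []string
}

func (m *memoryAcker) Ack(id string) error {
	m.acked = append(m.acked, id)
	return nil
}

// handleRestart ACKs first and restarts second, so the ACK cannot be
// lost to the restart itself.
func handleRestart(a Action, acker Acker, restart func() error) error {
	if err := acker.Ack(a.ID); err != nil {
		return fmt.Errorf("ack restart action %s: %w", a.ID, err)
	}
	return restart()
}

// handlePolicyChange applies the change first and ACKs only on success.
func handlePolicyChange(a Action, acker Acker, apply func() error) error {
	if err := apply(); err != nil {
		return fmt.Errorf("apply policy change %s: %w", a.ID, err)
	}
	return acker.Ack(a.ID)
}

func main() {
	acker := &memoryAcker{}
	noop := func() error { return nil }

	if err := handleRestart(Action{ID: "a1", Type: "RESTART"}, acker, noop); err != nil {
		panic(err)
	}
	if err := handlePolicyChange(Action{ID: "a2", Type: "POLICY_CHANGE"}, acker, noop); err != nil {
		panic(err)
	}
	fmt.Println("acked:", acker.acked)
}
```

ACKing before the restart trades one risk for another: if the restart then fails, Fleet already believes the action is done.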

What is the size of each event you send up as an ACK? Could you share an example? How many ACK do you expect to fit into 1mb?

@ruflin an ACK event is something like { "type": "ACTION", "subtype": "AKNOWLEDGED", "action_id": "uuid-uuid-uuid-uuid", "message": "Acknowledged action uuid" } we can fit more than 1000 in 1mb
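As a quick sanity check of that estimate, the example event can be marshalled and divided into 1mb. The `ackEvent` struct and `acksPerBody` helper are made up for this calculation; the field values are copied verbatim from the example above, and a real action ID (for instance a full UUID) would make each event somewhat larger.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ackEvent mirrors the fields of the example ACK event in this thread.
type ackEvent struct {
	Type     string `json:"type"`
	Subtype  string `json:"subtype"`
	ActionID string `json:"action_id"`
	Message  string `json:"message"`
}

// acksPerBody estimates how many copies of ev fit in limit bytes,
// counting one extra byte per event for a separating comma and
// ignoring the rest of the request body.
func acksPerBody(ev ackEvent, limit int) (int, error) {
	raw, err := json.Marshal(ev)
	if err != nil {
		return 0, err
	}
	return limit / (len(raw) + 1), nil
}

func main() {
	ev := ackEvent{
		Type:     "ACTION",
		Subtype:  "AKNOWLEDGED", // copied as written in the thread
		ActionID: "uuid-uuid-uuid-uuid",
		Message:  "Acknowledged action uuid",
	}
	n, err := acksPerBody(ev, 1<<20) // 1 MiB
	if err != nil {
		panic(err)
	}
	fmt.Println("approximate ACK events per 1mb body:", n)
}
```

The example event is a little over 100 bytes, so the result lands in the thousands, consistent with "more than 1000".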

@ruflin They are not too big for now, but normal situations can cause a lot of them to be generated or to need sending at the same time.

> what about ACKing immediately in a handler, so e.g restart handler can ACK before restarting and other handlers can ACK at different point when it makes sense. i would suggest behaving like we have 2 endpoints one for checking another one for ACKing. and when acking we would just ignore whatever comes from fleet.

I like that idea, I think it works for now @nchaulet WDYT?

Yes we can have two endpoints, POST /agents/:id/acks to acknowledge actions and /checkin for agent polling and reporting events.

Can having two endpoints cause more problems? What happens if the POST /acks happens at the same time as the agent polls for checkin?

additional endpoint would be nice,
as we perform actions sequentially it's highly unlikely to have a conflict there. we either ack or pull

but if you return anything not acked i think it's not breaking any contract, it's up to us to deal with duplications
