Following some testing by SIEM (@rylnd and @FrankHassanabad) we've encountered an issue where ES couldn't keep up with the requests made by the Alerting service.
@FrankHassanabad's description of the situation:
The interaction between alerting and task manager that we experienced was: if we bulk turn on our _277_ rules immediately and set max_workers to a value of, say, 300, then the rules all run immediately and we get _300_ searches racing toward ES, which can then cause 429 Too Many Requests errors or miscellaneous issues such as timeouts.
We can stagger the turning on of the rules when we get a bulk request, if we need to from our end, to avoid the 429 issues. However, the other effect we experienced was that if you turn on all _277_ rules slowly over time, everything runs fine because they are slightly spread out during execution. But if you shut down your single Kibana instance for more than the interval (5 minutes in our case), then the next time you restart Kibana with max_workers set to a high number like _300_, it will execute all _277_ as fast as it can and cause 429 Too Many Requests errors, timeouts, etc. It will then repeat this pattern every 5 minutes on the dot, since they're all scheduled for 5 minutes, causing repeated 429 Too Many Requests errors or other miscellaneous issues because the load surges at the same time.
Basically the rules all "clump" together if you shut down for more than 5 minutes rather than spreading them a bit on restart after X amount of time.
Questioning the validity of setting a max_workers that's higher than what the ES cluster can handle, @FrankHassanabad mentions that 429 Too Many Requests might not actually mean that the cluster can't handle that many open requests, but rather that they have been throttled by ES to prevent a spike of calls flooding the system all at once (_rapid fire_). If that's the case, then staggering the requests made by Kibana could address this.
I'm not familiar enough with ES to advise on that, perhaps someone else knows?
Open questions:
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Pinging @elastic/siem (Team:SIEM)
max_workers is effectively a throttle, no? It determines how many tasks can run concurrently.
If the bulk of the task load is (mostly) querying Elasticsearch and (occasionally) indexing signal data, then I think a good starting point would be to set max_workers to match the throughput of the cluster.
A rough measure for something like query throughput is the default search thread pool size for nodes on a cluster, which is
int((# of cpu on a node * 3) / 2) + 1
So if you have 32 CPU nodes, you'll have 49 threads for searches.
A value like 40 might be good in that case for max_workers, leaving some room for other queries. That assumes there's only queries being done. I believe indexing is being done as well.
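As a back-of-envelope illustration of that sizing logic (a sketch only; the 32-CPU node and the headroom choice of 40 are assumptions taken from the example above, not an official formula):

```ts
// Rough version of the sizing described above.
const cpusPerNode = 32;
const searchThreadPoolSize = Math.floor((cpusPerNode * 3) / 2) + 1; // 49 search threads per node
const maxWorkers = 40; // a bit below the pool size, leaving room for other queries and indexing
console.log({ searchThreadPoolSize, maxWorkers });
```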
429s/503s will be due to the throughput of the cluster: Once all the threads in a pool are used on elasticsearch, the requests are queued. If the active requests take a long time, the queue will fill up and start rejecting, which you'll see as 429 or 503 depending on the API.
This assumes all tasks put load on Elasticsearch, which is not true (actions, for instance, won't place much load, with the exception of indexing actions). At some point it may be worth having a different pool of workers for tasks that hit Elasticsearch.
Some unorganized notes/thoughts:
- max_workers is the maximum number of concurrent workers for a single Kibana instance to handle alert and action executors (and in the future other things like reporting). When scaling Kibana horizontally (more server instances), you'll get (# of Kibana instances) * max_workers total workers across the Kibana "cluster".
- The README.md talks about a numWorkers property that you could perhaps use to do some per-type throttling of picking workers to run, but it appears to be vestigial - IIRC we removed support for it at some point in time. That's fine, because it didn't really make sense the way it was described; it would be better to have a per-type max_workers (per Kibana instance)
- Node.js itself should be fine handling the details of running 1000's of concurrent workers, but it appears obvious (now) that if those workers are handling 1000's of ES calls, we're going to kill ES.
- ES returning a 429 is fantastic, but let's find out what it really means. I'm guessing ES has determined it's under a high load and shouldn't take on more queries to run. We obviously want to be running _just under_ the "429 limit"; figuring out how to do that will be interesting.
- Applying throttling in the callCluster we provide to executors seems like it makes sense
- There's also poll_interval, which indicates how often TM looks for new work, set to 3 seconds. IIRC, there is a limit to how low you can get this to run today with the current TM architecture. Some of the alternative architectures (in PRs we never merged) let you get the interval lower. I don't remember the numbers, but I think I've seen sub-second with the alt-architecture PRs, but can only get about 2 seconds with the current architecture.
- Will cron scheduling help with the staggering? You could imagine SIEM "managing" that by perhaps picking random offsets for the seconds value in a cron expression. Or perhaps we could even provide or force a random second offset for minute- and hourly-based cron expressions.
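A rough sketch of that staggering idea, assuming a 6-field cron syntax with a leading seconds field (the helper name is illustrative, not an existing API):

```ts
// Stagger rule schedules by giving each one a random seconds offset,
// so minute-based schedules don't all fire at the same instant.
function withRandomSecondOffset(minuteAndAboveFields: string): string {
  const second = Math.floor(Math.random() * 60); // 0-59
  return `${second} ${minuteAndAboveFields}`;
}

// e.g. "every 5 minutes" becomes something like "37 */5 * * * *"
console.log(withRandomSecondOffset('*/5 * * * *'));
```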
- The README.md talks about a numWorkers property that you could perhaps use to do some per-type throttling of picking workers to run, but it appears to be vestigial - IIRC we removed support for it at some point in time. That's fine, because it didn't really make sense the way it was described; it would be better to have a per-type max_workers (per Kibana instance)
This is indeed vestigial, but it's coming back in 7.7 :)
The message from TM that gets logged when all the workers are in use seems ... not great.
[Task Ownership]: Task Manager has skipped Claiming Ownership \
of available tasks at it has ran out Available Workers. \
If this happens often, consider adjusting the \
"xpack.task_manager.max_workers" configuration
- it's very long
- it's annoying, and so that will lead the customer to update max_workers, which will cause a thundering herd issue as SIEM has seen
- I don't really have a great list of concrete alternatives, but will discuss some ideas below

ideas:
- have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
- have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)
I think I like the first. Default might be 10 minutes, for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
I have a stress test for alerts that uses some other plugins/tools I've built for heartbeats that makes essql calls - in a gist here: https://gist.github.com/pmuellr/29042ba4fd58f8e4c088d3b0a703da2e
One of the interesting things about running this with 100 alerts at a 1 sec interval with 10 workers is that the action execution happens about 30 seconds after the action is scheduled to run in the alert. With 1000 alerts at a 1 sec interval, the delay gets up to 2.5 minutes.
I imagine that's because it basically gets put at the "end of the queue". That seems not great to me; wondering if action tasks should always have priority over alert tasks. That doesn't seem perfect either, for expensive actions. It _sounds_ nice to maybe have a "priority" system here, where I could set my alerts at a lower priority than my actions, but _it's complicated_.
After catching up to master so I could change max_workers, I ran 100 alerts at a 1 sec interval with 100 workers, and it's a lot more lively: there's only about 3 seconds of latency from when actions are scheduled till when they are executed. Kibana is running at 150-200% CPU, 1GB RAM; ES at 80-150% CPU, 1GB RAM.
It seems like there are 2 throughput concerns to consider:
max_workers creates a throughput limit, which may be artificial, as SIEM found in their tests: throttling caused gaps to appear in detection even though the system was capable of scaling beyond that limit. Raising max_workers moved the source of back-pressure from task manager to Elasticsearch. The throughput may even vary with the part of the system being used - are we getting errors from Elasticsearch indexing, or from Elasticsearch queries? We may be hitting limits on one but not the other. This creates a few problems today:
- The max_workers default of 10 as a global throttle is going to be too low for many SIEM systems and likely other deployments. What's the right value? Set too high, it could cripple the Elasticsearch cluster. Set too low, it will leave gaps in detection, which will only get worse as other apps add alerts.
- There's no way to deal with downstream system throughput limits. numWorkers will work at a task type level, but there might be many task types that use Elasticsearch, and they may use it in different ways (indexing vs search, or both in the case of SIEM).
I feel like 10 is way too low. 100 seems about right to me. It may be too big for some use cases (CPU/resource-expensive alerts or actions).
numWorkers doesn't seem great since it doesn't "scale" with max_workers. If max_workers is 10, you could set reporting tasks to numWorkers 5 to limit running 2 reports at a time; but if max_workers gets reset to 100, now you can run 20, which will probably kill your system. That's why "max_workers per type" makes a bit more sense to me.
Regarding numWorkers: my direction of thought was to scrap the old model and shift to a model where Task Type Definitions specify how many concurrent tasks can be executed - so it would have nothing to do with max_workers.
The reason for this is that using numWorkers, as Patrick pointed out, is scaled by max_workers and logically that means you can't say "I want only 2 reporting tasks, but as many Actions as possible".
Another reason is that it makes it hard to only claim as many tasks as you have capacity for - you'd often end up claiming too many and then locking them off from other Kibana instances until you finish processing the preceding tasks. This harmed performance.
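A minimal sketch of what that direction might look like, assuming task type definitions grow a concurrency field of their own (the field and type names here are illustrative, not an existing Task Manager API):

```ts
// Hypothetical shape: each task type declares its own concurrency cap,
// independent of the instance-wide max_workers setting.
interface TaskTypeDefinition {
  title: string;
  maxConcurrency?: number; // hard cap per Kibana instance for this task type
  createTaskRunner: () => { run: () => Promise<void> };
}

const reportingExecuteTask: TaskTypeDefinition = {
  title: 'Reporting: execute job',
  maxConcurrency: 2, // "I want only 2 reporting tasks", regardless of max_workers
  createTaskRunner: () => ({
    run: async () => {
      /* ... generate the report ... */
    },
  }),
};

console.log(reportingExecuteTask.title);
```

Claiming could then skip task types that are already at their cap, which would also help with the over-claiming problem described above.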
I feel like 10 is way too low. 100 seems about right to me. It may be too big for some use cases (CPU/resource-expensive alerts or actions).
Agreed, and I don't think we have any reason to keep it at 10. I think it was only there in the first place to prevent multiple Reporting tasks from running on the same Kibana, and that doesn't actually work anymore (Reporting is aware and has okayed it, agreeing I'd work on it in 7.7).
The message from TM that gets logged when all the workers are in use seems ... not great.
[Task Ownership]: Task Manager has skipped Claiming Ownership \ of available tasks at it has ran out Available Workers. \ If this happens often, consider adjusting the \ "xpack.task_manager.max_workers" configuration
- it's very long
- it's annoying, and so that will lead the customer to update max_workers, which will cause a thundering herd issue as SIEM has seen
- I don't really have a great list of concrete alternatives, but will discuss some ideas below
ideas:
- have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
- have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)
I think I like the first. Default might be 10 minutes, for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
We could definitely collect this and flush on a configurable interval, and in fact, much of the work to do that has already been done as we now track these events internally for the runNow api.
How would you envision describing the case of _Task Manager skipping the polling for tasks when it runs out of Available Workers_? A count of how many cycles were skipped? Not sure that would be helpful...
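For illustration, a minimal sketch of the "one stats line per interval" idea, including a counter for skipped polling cycles. All names and the default interval are hypothetical; this is not the actual Task Manager implementation:

```ts
// In-memory counters flushed as a single log line on a configurable interval.
interface TaskStats {
  executions: number;
  failures: number;
  timeouts: number;
  skippedPollingCycles: number; // polls skipped because no workers were available
}

const stats: TaskStats = { executions: 0, failures: 0, timeouts: 0, skippedPollingCycles: 0 };

function startStatsLogger(log: (msg: string) => void, intervalMs = 10 * 60 * 1000) {
  return setInterval(() => {
    if (stats.executions === 0 && stats.skippedPollingCycles === 0) return; // nothing happened, stay quiet
    log(
      `Task Manager stats: executions=${stats.executions} failures=${stats.failures} ` +
        `timeouts=${stats.timeouts} skipped_polling_cycles=${stats.skippedPollingCycles}`
    );
    stats.executions = stats.failures = stats.timeouts = stats.skippedPollingCycles = 0;
  }, intervalMs);
}

startStatsLogger((msg) => console.log(msg), 60 * 1000); // dev: flush every minute
```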
Some unorganized notes/thoughts:
- Applying throttling in the callCluster we provide to executors seems like it makes sense
This shouldn't be too difficult, but would you envision a lossy throttle (callCluster rejects calls that are over capacity) or backpressure (accumulate requests until you have capacity, slowing down the calling executor)?
- There's also poll_interval, which indicates how often TM looks for new work, set to 3 seconds. IIRC, there is a limit to how low you can get this to run today with the current TM architecture. Some of the alternative architectures (in PRs we never merged) let you get the interval lower. I don't remember the numbers, but I think I've seen sub-second with the alt-architecture PRs, but can only get about 2 seconds with the current architecture.
In my perf tests I was polling every few hundred ms, but the wall I kept hitting was that my tasks were waiting on ES, so TM polled more often than it was freeing up workers.
On Slack @peterschretlen asks:
I wonder could we predict (or adjust) a good max_workers value. ~autoscaling for task manager.
I have thought about this in the past, so brain dumping some questions we'd need ways of answering to achieve this:
For example: could a higher max_workers cause more load that further slows the system, causing it to spin up even more workers?
- Applying throttling in the callCluster we provide to executors seems like it makes sense
This shouldn't be too difficult, but would you envision a lossy throttle (callCluster rejects calls that are over capacity) or backpressure (accumulate requests until you have capacity, slowing down the calling executor)?
backpressure
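For illustration, a minimal sketch of a backpressure-style wrapper around callCluster, using a simple in-memory semaphore (createThrottledCallCluster and maxConcurrent are illustrative names, not an existing Kibana API):

```ts
// Calls are never rejected; the calling executor simply waits until capacity frees up.
type CallCluster = (endpoint: string, params?: Record<string, unknown>) => Promise<unknown>;

function createThrottledCallCluster(callCluster: CallCluster, maxConcurrent: number): CallCluster {
  let active = 0;
  const waiting: Array<() => void> = [];

  const acquire = () =>
    new Promise<void>((resolve) => {
      if (active < maxConcurrent) {
        active++;
        resolve();
      } else {
        waiting.push(() => {
          active++;
          resolve();
        });
      }
    });

  const release = () => {
    active--;
    const next = waiting.shift();
    if (next) next();
  };

  return async (endpoint, params) => {
    await acquire(); // executor slows down here when capacity is exhausted (backpressure, not rejection)
    try {
      return await callCluster(endpoint, params);
    } finally {
      release();
    }
  };
}
```

The design intent is that executors slow down rather than fail, which moves the back-pressure into task manager instead of onto Elasticsearch.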
The message from TM that gets logged when all the workers are in use seems ... not great.
[Task Ownership]: Task Manager has skipped Claiming Ownership \ of available tasks at it has ran out Available Workers. \ If this happens often, consider adjusting the \ "xpack.task_manager.max_workers" configuration
- it's very long
- it's annoying, and so that will lead the customer to update max_workers, which will cause a thundering herd issue as SIEM has seen
- I don't really have a great list of concrete alternatives, but will discuss some ideas below
ideas:
- have TM print some basic stats in one line, every minute, if any tasks have run in that minute; start basic - total number of executions, failures, timeouts, etc
- have TM print some basic stats in one line, every threshold number of executions (100, 1000, not sure)
I think I like the first. Default might be 10 minutes, for dev I'd want to set it to 1 minute, or maybe even 30 seconds.
I think this is a good measure we can take now - regardless of other measures we take, I think we can improve logging and monitoring.
Opened as https://github.com/elastic/kibana/issues/54920
Following a call between Alerting & SIEM, we've decided to move forward on:
We discussed the need for a circuit breaker on the callCluster api provided to Alert Executors, but first we'll investigate the origin of the stress on ES.
We also discussed the need for the ability to limit the number of concurrent instances of an AlertType in a single Kibana (related to #54916 but at AlertType level, rather than just TaskTypeDefinition level), @gmmorris will look into this further.
We also discussed the need for the ability to limit the number of concurrent instances of an AlertType in a single Kibana
If alerts get their own per-alertType max_workers, we may want the same for actions as well. Is there already a discuss issue on this? Seems complicated; there are a bunch of things we max_worker-ize, some will be easier, others harder ... both our own implementation and users tweaking the knobs and dials.
I think the following test will re-create the problem, and it suggests that the issue is the saved object write load from alerting/task manager.
- Set xpack.task_manager.max_workers: 1000 in kibana.yml.
- Run ./auditbeat setup (note: you don't actually have to run auditbeat, just run the setup to create the index and ILM policies). If you are using a dev environment with SSL, use the auditbeat.yml shown below.
- Import the rules (this imports 1000 rules, each making an API call from the browser, so it takes a while).
auditbeat.modules:
- module: file_integrity
  paths:
  - /bin
  - /usr/bin
  - /usr/local/bin
  - /sbin
  - /usr/sbin
  - /usr/local/sbin
- module: system
  datasets:
    - host # General host information, e.g. uptime, IPs
    - package # Installed, updated, and removed packages
    - process # Started and stopped processes
  state.period: 12h
setup.template.settings:
  index.number_of_shards: 1
setup.kibana:
  host: "https://localhost:5601"
  ssl:
    verification_mode: "none"
output.elasticsearch:
  hosts: ["localhost:9200"]
  protocol: "https"
  username: "elastic"
  password: "changeme"
  ssl:
    verification_mode: "none"
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
- add_docker_metadata: ~
Result:
You'll see a lot of entries like the following in the kibana logs about failed task updates:
server log [17:45:51.208] [error][plugins][taskManager][taskManager] Failed to mark Task alerting:siem.signals "75355de0-38b1-11ea-be6b-3999ff8c2ef6" as running: rejected execution of processing of [101718][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[.kibana_task_manager_1][0]] containing [index {[.kibana_task_manager][task:75355de0-38b1-11ea-be6b-3999ff8c2ef6], source[{"migrationVersion":{"task":"7.6.0"},"task":{"taskType":"alerting:siem.signals","retryAt":"2020-01-16T22:55:49.107Z","runAt":"2020-01-16T22:43:23.758Z","scope":["alerting"],"startedAt":"2020-01-16T22:45:49.001Z","state":"{\"alertInstances\":{},\"previousStartedAt\":\"2020-01-16T22:42:23.758Z\"}","params":"{\"alertId\":\"e233cf7a-efb7-4647-b8dc-a9de8934fad3\",\"spaceId\":\"default\"}","ownerId":"kibana:5b2de169-2785-441b-ae8c-186a1936b17d","scheduledAt":"2020-01-16T22:42:21.502Z","attempts":1,"status":"running"},"references":[],"updated_at":"2020-01-16T22:45:49.107Z","type":"task"}]}], target allocation id: pYZI4R-fR0-vEFIyMcw-VQ, primary term: 1 on EsThreadPoolExecutor[name = Peters-MacBook-Pro.local/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1dfe5c6b[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 25129]]: [remote_transport_exception] [Peters-MacBook-Pro.local][127.0.0.1:9300][indices:data/write/update[s]]
server log [17:45:51.234] [error][plugins][taskManager][taskManager] Failed to mark Task alerting:siem.signals "7534e8b0-38b1-11ea-be6b-3999ff8c2ef6" as running: rejected execution of processing of [101721][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[.kibana_task_manager_1][0]] containing [index {[.kibana_task_manager][task:7534e8b0-38b1-11ea-be6b-3999ff8c2ef6], source[{"migrationVersion":{"task":"7.6.0"},"task":{"taskType":"alerting:siem.signals","retryAt":"2020-01-16T22:55:49.107Z","runAt":"2020-01-16T22:43:23.758Z","scope":["alerting"],"startedAt":"2020-01-16T22:45:49.001Z","state":"{\"alertInstances\":{},\"previousStartedAt\":\"2020-01-16T22:42:23.758Z\"}","params":"{\"alertId\":\"f467691f-562c-4585-83cb-403f020a8b91\",\"spaceId\":\"default\"}","ownerId":"kibana:5b2de169-2785-441b-ae8c-186a1936b17d","scheduledAt":"2020-01-16T22:42:21.499Z","attempts":1,"status":"running"},"references":[],"updated_at":"2020-01-16T22:45:49.107Z","type":"task"}]}], target allocation id: pYZI4R-fR0-vEFIyMcw-VQ, primary term: 1 on EsThreadPoolExecutor[name = Peters-MacBook-Pro.local/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1dfe5c6b[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 25031]]: [remote_transport_exception] [Peters-MacBook-Pro.local][127.0.0.1:9300][indices:data/write/update[s]]
The elasticsearch thread pool stats confirm that updates are being rejected:
peterschretlen ~ $ curl -k -u elastic:changeme https://localhost:9200/_cat/thread_pool/write?v
node_name name active queue rejected
Peters-MacBook-Pro.local write 0 0 456
We also discussed the need for the ability to limit the number of concurrent instances of an AlertType in a single Kibana
If alerts get their own per-alertType max_workers, we may want the same for actions as well. Is there already a discuss issue on this? Seems complicated; there are a bunch of things we max_worker-ize, some will be easier, others harder ... both our own implementation and users tweaking the knobs and dials.
I've changed the task specific one to be both Task Types and Alert Types, as I don't feel we'd like those two to be separate solutions. Best to merge these into one discussion that takes both into account.