Graylog2-server: System notification for unused CPU cores

Created on 30 Apr 2018 · 6 Comments · Source: Graylog2/graylog2-server

I just finished a support call with a customer whose seemingly idle systems could not process all ingested messages, so the journal kept filling up.

It was a 32-core machine (64 threads) running graylog-server, but the *_processor settings in graylog-server.conf were left at their defaults. The result was that only 4 of the available CPU threads were working on messages, with the remaining 60 threads sitting idle.

We should not expect customers to know this, but we also should not simply set the *_processor settings automatically, because other systems might be running on the same box and we'd be stealing resources from them, probably making the situation even worse.

This is why I suggest adding a new system notification that is triggered when fewer than $cpu_threads-x threads are used by Graylog (x is to be determined; it is the reserve for OS tasks, the garbage collector, etc.). The notification should link to a documentation page explaining the settings and why changing them is required. There should also be a way to disable the notification (in graylog-server.conf?) for setups where this condition is OK and the constant notification would be annoying.
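A minimal sketch of the proposed trigger condition, written in Python only to illustrate the arithmetic (Graylog itself is a Java application). The function name, the `reserve` parameter (the "x" above) and the `notification_disabled` flag are hypothetical and not part of any existing Graylog API or configuration.

```python
import os


def should_notify(cpu_threads, configured_processors, reserve, notification_disabled=False):
    """Return True if fewer than (cpu_threads - reserve) threads are configured.

    cpu_threads           -- hardware threads visible to the process
    configured_processors -- sum of the *_processor settings (assumed input)
    reserve               -- threads left for the OS, GC, etc. (the "x" in the proposal)
    notification_disabled -- hypothetical opt-out switch for setups where this is OK
    """
    if notification_disabled:
        return False
    # Never let the threshold drop below one thread on very small machines.
    threshold = max(cpu_threads - reserve, 1)
    return configured_processors < threshold


if __name__ == "__main__":
    # os.cpu_count() may return None on some platforms, so fall back to 1.
    threads = os.cpu_count() or 1
    print(should_notify(threads, configured_processors=4, reserve=4))
```

With the numbers from the support call (64 threads, 4 in use) and an assumed reserve of 4, the check fires; it stays silent once 60 or more threads are configured, or when the opt-out is set.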

feature triaged

All 6 comments

I think this gets way too complicated for what we want to do. There are pros and cons to just using all available cores:

Pros:

  • Defaulting to using all available cores is a more deterministic behavior across different machines than using n (where n is either arbitrarily static or a "complex" formula) cores
  • Using all available cores is what users _expect_ ("I am having 64 cores, Graylog is using only 4, get this sh*t fixed!")
  • The operating system does a good job (because it actually is its job) of distributing computing resources between different processes
  • In the age of containers you will most certainly not share the instance with another process
  • Every decent operating system offers tools which allow limiting a container/jail/vm to a cpu (set) and/or to define the process' niceness
  • If all of that fails, you can just limit the number of cores in our config after taking an _informed decision_ to do that

Cons:

  • Having e.g. 64 cores doing busy waiting on a low-volume disruptor might offer great performance, but also wastes a lot of energy

In my operations work before Graylog, I never encountered a situation where using too many cores for a process by default broke things, but using too few did in a lot of cases.

I don't like adding more notifications to the product.

If it is that important it should interface with a monitoring system and not display a bubble in a UI that is potentially not seen.
If it is not important, I'd prefer not to be distracted by it constantly.

Perhaps our defaults are a bit conservative, but it's also not trivial to set them to "all cores", because there is more than one thread count that is important to tune.

+1 for a notification. I'm relatively new to Graylog and would benefit from one-time notifications regarding poor system configuration choices.

I would prefer some tuning help/overview page instead of notifications.
It's hard to decide when to notify. Do we notify for every server in the cluster? What happens when new servers join? Do we need to notify again at some point, even if the notification was dismissed?

I think it's better to pick some sensible defaults and auto tune based on cpu count during start. Then display those choices again on a tuning page.
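One way the start-up auto-tuning could look, as a rough Python sketch of the idea rather than anything Graylog implements. The setting names follow the `*_processors` pattern from graylog-server.conf, but the 5:3:2 weights, the default reserve of 2 threads and the function itself are assumptions chosen purely for illustration.

```python
def auto_tune(cpu_threads, reserve=2, weights=None):
    """Split the available threads across the processor settings by weight.

    The weights are illustrative assumptions, not recommended values.
    """
    if weights is None:
        weights = {
            "processbuffer_processors": 5,
            "outputbuffer_processors": 3,
            "inputbuffer_processors": 2,
        }
    # Keep a reserve for the OS and GC, but budget at least one thread per setting.
    budget = max(cpu_threads - reserve, len(weights))
    total_weight = sum(weights.values())
    # Integer division rounds down; every setting still gets at least one thread.
    return {
        name: max(1, budget * weight // total_weight)
        for name, weight in weights.items()
    }


if __name__ == "__main__":
    for name, value in auto_tune(64).items():
        print(f"{name} = {value}")
```

The returned mapping is what a tuning page could display next to the values actually in effect, so an operator can see both the computed suggestion and any manual override.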

I'm in favor of showing details, for example on the Nodes detail page, with the current settings and suggested changes, but without an active notification.

@jalogisch: I'm in favor of showing details, for example on the Nodes detail page, with the current settings and suggested changes, but without an active notification.

I'd definitely find this useful.
