Systems like Elasticsearch, MySQL and other data processing engines have the concept of a slow log, a log file where the system reports slow operations. Having this in Logstash makes sense in order to improve the feedback users get.
_Concrete details will be updated on demand as this idea gets developed, so take this proposal as an initial dump of ideas that will for sure evolve._
To implement this feature, especially if we just wanted to mix the slow log into the current log, we could simply reuse the logger instance. _However, since we aim for a more general structure_, this solution is discarded.
Who should initially report the slow operation? It should definitely be the component detecting the situation; we'll use the grok filter as an example to describe the solution. For the grok filter, reporting a slow operation should be transparent and as simple as:
report_slow_operation(action, duration, event, context, ….)
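As a hedged sketch of the plugin side (the `SlowLogService` class and argument names here are illustrative assumptions, not the final Logstash API), the grok filter would only measure and report, leaving formatting and context gathering to core:

```ruby
# Hypothetical sketch; SlowLogService and the argument names are
# assumptions, not the final Logstash API.
class SlowLogService
  attr_reader :entries

  def initialize
    @entries = []
  end

  # Receives the action name, the measured duration and free-form context.
  def report_slow_operation(action, duration, context = {})
    @entries << { action: action, duration: duration }.merge(context)
  end
end

slow_logger = SlowLogService.new

start = Time.now
sleep(0.01) # stand-in for the real (slow) grok match
elapsed = Time.now - start

# The filter only reports; everything else stays in core.
slow_logger.report_slow_operation("grok.match", elapsed, pattern: "%{IP}")
```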
But internally in Logstash this is another situation: the log might need to be populated with collateral information such as the current pipeline situation, worker and plugin status, etc. This information should be gathered by other components inside the LS core logic, which more or less mandates a system that lives in core and is able to compose all necessary messages into a proper logger output.
The proposed solution will be:
More details can be gleaned from an initial proof of concept:
_[Include PoC link when available]_.
For the initial version of this component we aim to:
Feature work:
MySQL provides its users with a slow query log consisting of SQL statements that took more than long_query_time seconds to execute and required at least min_examined_row_limit rows to be examined. Execution time is usually reported with microsecond resolution, unless the logs are sent to a table, where the microsecond part is ignored.
Some queries are not logged by default, like administrative statements, replicated statements, etc.
Ref: https://dev.mysql.com/doc/refman/5.6/en/slow-query-log.html
Similar approaches are taken by Percona and MariaDB.
Example of Percona SlowLog
# Time: 130601 8:01:06.058915
# User@Host: root[root] @ localhost [] Id: 42
# Schema: imdb Last_errno: 0 Killed: 0
# Query_time: 7.725616 Lock_time: 0.000328 Rows_sent: 4 Rows_examined: 1543720 Rows_affected: 0
# Bytes_sent: 272 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# QC_Hit: No Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0
SET timestamp=1370073666;
SELECT id,title,production_year FROM title WHERE title = 'Bambi';
Elasticsearch has two types of slow logs, one for searches and one for indexing; both work in a similar fashion. The search slow log allows you to log slow searches (query and fetch phases) into a dedicated log file.
Thresholds can be set for both the query phase and the fetch phase of the execution. Each threshold is mapped to a log level and used as the limit above which the operation is logged at that level.
Search slow log configuration example, [latest 2.3 branch]
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
Both logs log to a dedicated file, one for slow queries and one for indexing.
Specific to the indexing log is that the user gets the option to log part of the _source field; how much is logged can be configured (the default is the first 1000 characters).
Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html
Happy to discuss details.
I like this. We can introduce a SlowLogger class alongside Logger when feature/log4j2-logging lands. We can have it be configured with its own file output and a helper method like report_slow_operation.
This will be a good feature. Thanks for the detailed proposal. Some comments:
To implement this feature, especially if we just wanted to mix the slow log into the current log, we could simply reuse the logger instance. However, since we aim for a more general structure, this solution is discarded.
This should be one of the requirements. Slowlog and normal operating log must be in 2 separate files to be effective. It gets too noisy otherwise.
Build a slow log service able to receive slow log messages, gather context (when chosen to) and output the final log.
+1 for this to be in core and exposing an API for plugins.
Add a slow log feature flag (disabled by default)
My preference is to not have a (yet another) flag for this. I agree slowlog needs to be disabled out of the box, but that can be controlled via the thresholds. This way users need only change a single setting. This is also consistent with how Elasticsearch slowlog works.
I suggest we have 2 types of slowlog:
Before we progress to implementation, I would love to see:
- @talevy brought up feature/log4j2-logging -- how will this feature work with it?
Depending on how one wants to organize the configuration for this, this can all be done in the log4j2.properties file. Simply have a specific logger for anything in the slowlog namespace. This way it would be logged just like any other logger, the difference being that there would be custom logic in the SlowLogger class to determine when it is appropriate to log the desired item.
Moin, thanks @suyograo and @talevy for your feedback. Here are the relevant bits from my current proposal; see https://github.com/purbon/logstash/tree/feature/slow_log for the more detailed WiP branch.
@talevy for sure this will need to be adapted as soon as your log4j feature gets merged.
For now, important parts are:
How to configure settings: using the logstash.yml file, wired in logstash-core/lib/logstash/environment.rb
Setting::Boolean.new("slow_log.pipeline", false),
Setting::Boolean.new("slow_log.plugin", false),
Setting::String.new("slow_log.dir", nil, false),
Thresholds for the plugins could be defined either in the main environment.rb class or specifically in each plugin by providing proper initialization methods; however, from the user's point of view they will always go through the logstash.yml file. I think they should be kept in this file as they are actually Logstash related.
I added feature flags because I think it is important to have an all-or-nothing flag. We will for sure also have threshold flags, but with these global flags we don't have to set imaginary big thresholds and can keep "reasonable" values in there.
At the end of the day we should pick the approach that makes our users' lives easier, which means we should keep intuition and possible misunderstandings in mind.
For sure, as soon as @talevy's feature includes the option to change these settings from the API, that should be added here once his work is merged. It makes sense to me that this is pending his work.
plugins can communicate with the slow_log by using a code snippet like:
slow_logger.log("this is slow", :action => "test")
this log method accepts as many parameters as are passed, so there is no limit to the amount of information passed by the component to the logger. In case the log is disabled, this logger is going to be basically a NullLogger, logging nothing.
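A minimal sketch of that enabled/disabled split (the `NullSlowLogger` and `build_slow_logger` names are hypothetical): plugins always call `log`, and when the feature is off the call is a no-op:

```ruby
# Hypothetical sketch of the NullLogger pattern; names are illustrative.
class NullSlowLogger
  # Accepts any arguments and logs nothing.
  def log(*); end
end

class SlowLogger
  attr_reader :lines

  def initialize
    @lines = []
  end

  # Accepts a message plus as many context entries as the caller passes.
  def log(message, **context)
    @lines << [message, context]
  end
end

# Assumed factory: picks the real logger or the null one from the flag.
def build_slow_logger(enabled)
  enabled ? SlowLogger.new : NullSlowLogger.new
end

logger = build_slow_logger(true)
logger.log("this is slow", action: "test")

disabled = build_slow_logger(false)
disabled.log("never recorded", action: "test") # no-op
```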
plugins should also be able to fetch settings, for example to know thresholds, using a reference like:
setting("plugins.grok.slow_filter_in_sec")
The slow log component has a log method (with aliases if necessary). This method is responsible for actually logging the information and fetching the context as a complement.
def log(*args)
  context_data = fetch_context_info(*args)
  logger.warn(*args, context_data)
end
How this will look in code will, at the end of the day, depend on the logger used, but most probably there will be a specific class per slow log type. For the first version I propose to have:

- PipelineSlowLog to log pipeline slow log information, for example when the pipeline is stalled.
- PluginSlowLog to log plugin slow logs, for example slow grok operations.

Each of these slow logs will be logged to its specific file, or STDOUT if so configured.
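One possible shape for those per-type loggers, as a sketch only (the base class and its constructor are assumptions; the proposal does not fix this layout):

```ruby
require 'stringio'

# Sketch: one slow-log class per area, each writing to its own destination.
class BaseSlowLog
  def initialize(io = $stdout)
    @io = io
  end

  def log(message, context = {})
    @io.puts("[#{self.class.name}] #{message} #{context.inspect}")
  end
end

# Pipeline-level slow events, e.g. a stalled pipeline.
class PipelineSlowLog < BaseSlowLog; end

# Plugin-level slow events, e.g. a slow grok operation.
class PluginSlowLog < BaseSlowLog; end

out = StringIO.new
PluginSlowLog.new(out).log("slow grok", duration_s: 7.2)
```

Injecting the IO object keeps "dedicated file vs. STDOUT" a pure configuration choice.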
what do you think?
@purbon - I think of this as a kind of metric.
I like the idea that the top N slow operations will appear over the web api.
If it is in a file, will this make it harder to 'see' when LS is deployed in Cloud?
Do we really need to see the full history of all slow operations in a file or will top N be enough?
Slow operations, logging and metrics are cross-cutting concerns - a 'capture' statement can occur anywhere in the code.
We did briefly discuss in the Batch Object PR about an alternative calling mechanism for metrics - could this feature pioneer this alternative mechanism and port it back to metrics when the API is settled?
In some ways, I agree with @guyboertje. Since Logstash runs on millions of events, if any one operation is slow we will see it many, many times. I know this is also true of queries in ES, but those tend to happen much less frequently than an event passing through the pipeline. Keeping track of the top N slowest operations is a good place to start.
Why not do both? I see value in both -- expose the top N slowest operations via an API, and append to the slowlog file as we encounter slow operations.
Why not do both?
totally an option, but the question is which one should be implemented first (aka which is more important)?
Both work for me. Sometimes one needs realtime feedback and other times postmortem analysis is required.
@guyboertje good proposal to introduce the top N operations; we might, for example, do:
Makes sense?
If there are no more comments, I would like to keep us moving.
Looking forward to keep this moving 😄
I see value in both, N slowest via API and everything in a file.
@purbon I am curious about the implementation details: how will we capture or detect the slowdown? What are the solutions you are looking at? One idea that comes to my mind is to supercharge the metric collecting methods with a callback object:
clock = metric.timer(:response_time, SlowThreshold.create("this takes a long time", 3))
## slow execution.
clock.stop
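A sketch of what such a callback object could do (this `SlowThreshold` behaviour is my reading of @ph's idea, not an existing API; the `Timer` stand-in is also an assumption): the timer measures, and the threshold object fires only when the elapsed time exceeds its limit.

```ruby
# Hypothetical SlowThreshold callback: fires only when the measured
# duration exceeds the configured limit.
class SlowThreshold
  attr_reader :fired

  def self.create(message, limit_seconds)
    new(message, limit_seconds)
  end

  def initialize(message, limit_seconds)
    @message = message
    @limit = limit_seconds
    @fired = []
  end

  # Invoked by the timer with the measured duration.
  def call(duration)
    @fired << [@message, duration] if duration > @limit
  end
end

# Minimal stand-in for a metric timer that accepts the callback.
class Timer
  def initialize(callback)
    @callback = callback
    @start = Time.now
  end

  def stop
    @callback.call(Time.now - @start)
  end
end

threshold = SlowThreshold.create("this takes a long time", 0.005)
clock = Timer.new(threshold)
sleep(0.01) # slow execution
clock.stop
```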
@ph In my initial thoughts this was not planned. I basically planned to provide a mechanism to:
How this is timed, I thought, would be dependent on each plugin, but we could for sure also provide a uniform mechanism if that makes sense. Open questions for this, to me, now are:

Time.now?

What do you think?
@ph In my initial thoughts this was not planned. I basically planned to provide a mechanism to:
I think we should do an exercise in parallel to see _how a slow operation_ will be captured, since this might influence the architecture/design of the _slow log_ implementation.
I am not sure I would go for a component that is universally responsible for stopping executions, as this might be very tricky to achieve in a general sense.
Given the nature of JRuby, a watchdog feature could be hard to build; I won't go that route for now.
Given the nature of JRuby, a watchdog feature could be hard to build; I won't go that route for now.
basically this is why I'm proposing only the very minimal thing: read threshold + report slow operation. Then how we decide whether something is slow is something that can happen after this interface is settled, unless I'm missing something.
might influence the architecture/design of the slow log implementation.
I would really love to understand your thoughts behind this: how can it influence the implementation if internally this is mostly a call that reports to a logger, in the case of reporting to a file?
@purbon Sorry for taking the issue off track; my curiosity was more about how it will be collected, not where it would be saved or displayed.
@ph a very nice discussion to have; I think we can open a new issue to discuss and explore options there. Makes sense to you? I would name it something around task timing management or so...
@ph and @purbon - all good points.
It would be sad to have to add two statements, 1 for logging the slow activity and 2 for the metric.
I mean the metric object would have to have all the data/content to create a top N type entry anyway; could it not log as well? It's not the metric subsystem that would log, it would be the SlowThreshold class that did the logging.
@guyboertje no need to be sad, we are here to discuss and reach consensus around this feature. Couple questions to try to understand a bit more your thoughts:
@ph and @purbon - I guess I see the slow log function as a SectionTimer class that knows how to send the metric and log to the appropriate logger if the time taken is above a threshold.
That is: it has the threshold value, it has the metrics channel and it has the logger reference.
I would ideally keep the metrics feature away from slowlog. To me, they are 2 separate concerns. Slowlog captures a lot more context than what the monitoring APIs do. Also, slowlog can be expensive, hence they need to be turned on. Metrics is always on. Agree that both are useful for debugging, but complementary ways of looking at "What's going on with my LS".
@purbon as @ph pointed out, how do you intend to capture or record the slowlog for operations in plugins? Can you give us a concrete example? Maybe use the DNS filter as an example...
@suyograo as I pointed to @ph in my comment here I see a few questions:
How to decide an operation was slow? This is out of the scope of this PR, as it will usually involve a specific solution per plugin.
I am not sure why this is out of scope. Isn't this a core part of this design issue? How are you thinking of measuring a slow operation from a plugin point of view? What infra will we need there?
@suyograo Well, the short version is that I see a difference between:
Point 2 I see as a different issue because it can involve different approaches for different kinds of operations, but to provide at least one solution, I think for most of them we could use a monitor approach:
slow_log_monitor(threshold) do
### run the operation to be monitored
end
This will involve having a module that basically does:
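A sketch of what such a module could look like (the module name, the constant used to collect reports, and the reporting mechanism are all assumptions for illustration):

```ruby
# Sketch of a block-based monitor; the reporting side is an assumption.
module SlowLogMonitor
  REPORTS = []

  # Times the yielded block and reports only if it ran past the threshold.
  def self.monitor(threshold_seconds)
    start = Time.now
    result = yield
    elapsed = Time.now - start
    REPORTS << elapsed if elapsed > threshold_seconds
    result
  end
end

SlowLogMonitor.monitor(0.005) do
  sleep(0.01) # the operation to be monitored
end
```

The plugin keeps ownership of the threshold while core owns the timing and reporting, matching the separation of concerns described above.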
Open questions:
These are the questions I can see for now, and why I think that while it might be _simple_ for grok or others, providing a general enough solution for all plugins is tricky and might deserve its very own design issue.
Makes sense to you?
@suyograo - I do agree that reporting a slowlog episode to metrics and reporting it to a logger are different concerns - but I feel that they are facets of one higher concern, reporting.
I see a slowlog "device" as having three higher concerns
I think capture would include techniques seen in benchmark code that attempt to split the signal from the noise.
However figuring out what slow means is not simple.
I see three sources of execution speed variation
We should probably know the event or string size when judging data-based captures.
We should consider self-calibration instead of manual calibration - here I mean, for example, that a slowlog device monitoring a dns lookup function should calibrate itself by periodically sampling a reference (nslookup) or tail a file that gets calibration updates, maybe from a file output on a different reference instance of LS.
I still think the metrics are linked to this slow log idea, as @guyboertje mentioned.
The metric system already collects all the data; sampling/aggregation is really core to what a metric system is, and adding support for it could yield more features.
I can see that the metric system could emit events based on the sampling it does.
This would keep these systems decoupled, and turning it on or off could also be easily implemented in that scenario.
@purbon concerning the external thread: are you thinking of launching one thread per slow_log_monitor call? This could lead to creating a lot of new threads, and I am not sure that would be good for performance.
@ph I'm happy to see other ideas for timing tasks and operations; seriously, this is why I said this deserves its own issue, because the question is not easy to answer in general terms.
I am not sure I see it as a metric, because at the end of the day it follows this pattern.
The big open question here is:
If I understand correctly, your approach is:
While I do understand your concepts, I see this other approach as considerably more complex. I would really love us to keep this initial implementation; later on we can always improve the approach.
If you ask me now, my idea is:
After the initial proposal (see top of the issue) and a few comments and great ideas, this aims to be a summary of the different topics discussed; gaps and missing comments are unintended, so let me know and I will fix them.
The intention is to enable us to move forward and reach consensus.
This is the list of topics in discussion:
My preference is to not have a (yet another) flag for this. I agree slowlog needs to be disabled out of the box, but that can be controlled via the thresholds. https://github.com/elastic/logstash/issues/5733#issuecomment-238412582
The current proposal is to use feature flags plus a definition for each threshold; this way users can control whether a feature is on/off and what slow means for each one, while avoiding the need to set extra-large thresholds just to keep the slow log disabled. Feature flags are common practice in many DB systems, for example.
I like the idea that the top N slow operations will appear over the web api. https://github.com/elastic/logstash/issues/5733#issuecomment-238514678
This was initially not scoped, but it is certainly a very good idea that should be implemented. It should not be hard to provide in an initial scoping.
Timing slow operations might be interfered with by time spent in other operations; how concerned are we with the precision and isolation of these measurements?
The user really cares about the complete time an operation takes, not whether part of it was spent in GC or elsewhere, because at the end of the day they are the user of, let's say, the grok filter.
@purbon
Related https://github.com/elastic/logstash/issues/5733#issuecomment-239445851
In the initial proposal the distribution of responsibilities is:
Timing the operation is something provided to the plugins by logstash-core (for the general use case), but it could certainly be extended by the plugins when necessary. It looks like:
slow_log_monitor(threshold) do
### run the operation to be monitored
end
This is proposed like this because it provides a proper separation of concerns between the plugin (which knows whether something is slow, the thresholds, etc.), i.e. the one that decides something has gone wrong, and the core components, which are responsible for gathering context and reporting.
However there has been an alternative proposal:
https://github.com/elastic/logstash/issues/5733#issuecomment-238514678, https://github.com/elastic/logstash/issues/5733#issuecomment-238882890, https://github.com/elastic/logstash/issues/5733#issuecomment-239457705
Interesting notes:
The metric system already collects all the data; sampling/aggregation is really core to what a metric system is, and adding support for it could yield more features.
I can see that the metric system could emit events based on the sampling it does.
This would keep these systems decoupled, and turning it on or off could also be easily implemented in that scenario.
@ph
We should consider self-calibration instead of manual calibration - here I mean, for example, that a slowlog device monitoring a dns lookup function should calibrate itself by periodically sampling a reference (nslookup) or tail a file that gets calibration updates, maybe from a file output on a different reference instance of LS.
@guy
In terms of responsibilities this will mean:
To summarize, the coupling to metrics is highlighted because of:
I think some of these features can be implemented without interfering with the metrics module, which has another objective.
We can add logic, if agreed, to the slow log module to:
Adding self-calibration is something that looks very powerful, but it might require lots of research to get it right; I would make it a second-phase feature.
+1 on self-calibration being second phase.
@purbon - consider slow_log over a batch. This will have an automatic averaging effect.
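@guyboertje's batch idea can be sketched as follows (a toy illustration of the averaging effect, not proposed API; the method name is made up): timing the whole batch once and dividing by its size smooths out one-off spikes such as GC pauses.

```ruby
# Sketch: time a whole batch and report the average per-event cost,
# so a single outlier event does not immediately trip the threshold.
def batch_average_seconds(batch)
  start = Time.now
  batch.each { |event| yield event }
  (Time.now - start) / batch.size.to_f
end

# Trivial per-event work as a stand-in for a filter operation.
avg = batch_average_seconds([1, 2, 3, 4]) { |e| e * 2 }
```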
After having reached consensus on the different ideas discussed in this issue, this will be the initial scope of deliverables:
This only includes work to be done at the core level; once this is in, making plugins report slow operations will be necessary, and the first obvious candidate will be the grok filter.
I personally feel this feature is premature.
I suggest we let metrics settle in, see what works and what does not, see how our users react to and use it, and then re-assess the need for a slow logger.
One good proposal by guy was the idea to introduce some degree of top-N reporting in the web API. I am wondering what might be the best URL for it. I have several ideas, but am not sure which might be the best:
- _node/monitoring/slowlog
- _node/slowlog

The initial proposal for the slowlog data to be shown in the API is:
All these resources will be grouped by logger and include all relevant information. For example,
{
"loggers.slow.plugins": {
"filters.grok.execution": 12,
"output.elasticsearch.store": 9
}
}
to report top N slow operations for the plugins logger.
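The counts behind such a response could be maintained with something like the following sketch (the `TopNSlowOps` class and its method names are assumptions for illustration):

```ruby
# Sketch: count slow occurrences per operation and expose the top N.
class TopNSlowOps
  def initialize
    @counts = Hash.new(0)
  end

  # Called each time an operation is reported as slow.
  def record(operation)
    @counts[operation] += 1
  end

  # Returns the N operations with the most slow occurrences.
  def top(n)
    @counts.sort_by { |_, count| -count }.first(n).to_h
  end
end

ops = TopNSlowOps.new
12.times { ops.record("filters.grok.execution") }
9.times  { ops.record("output.elasticsearch.store") }
```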
@ph @suyograo @guyboertje what do you think?
Some initial experimentation on how to represent this information is:

the goal was to reduce the amount of information shown; what do you think?
re: https://github.com/elastic/logstash/issues/5733#issuecomment-240093470
- Thresholds will be defined:
- At the settings (logstash.yml) file, affect all plugins.
I think it is more appropriate to only enable this configuration in the settings file. Maybe we can add
runtime APIs to allow updating this at runtime so users could potentially introduce a specific threshold to monitor any unexpected slowness.
- Per plugin, as a configuration option (overrides global thresholds).
Configuration options in the pipeline definition feel like they should relate to how the plugin operates rather than any general logstash operational features
@andrewvc brought up a related concern in the PR; I copy & paste it here to keep the discussion going:
I don't think we should be putting slowlog thresholds in the main plugin config. That sort of stuff should be in the settings.yml. In my mind the plugin config is for defining what the pipeline does, and should not include settings related to performance. While it is true that some plugins today do have some performance related settings included we should try to minimize these, not add to them.
@talevy I like the idea of adding an option to change this settings at runtime via the API, something similar to what #5871 is doing, right?
@suyograo, @talevy and @andrewvc brought up valid concerns about enabling plugins to have thresholds defined in their config. As you were the one who brought that idea in, I would like to know what you think.
Even if we agreed previously to have both, I am not completely sold on that and see their arguments as good points, especially once #5871 gets merged and we can change settings at runtime.
@purbon yeah, adding an option to the logging api would be great
@purbon I am ok with not having an option to specify thresholds in plugins. We can evaluate later
closing in favor of #6128