Logstash: The Logstash Slow log

Created on 5 Aug 2016  ·  43 Comments  ·  Source: elastic/logstash

Description of the problem

Systems like Elasticsearch, MySQL and other data processing engines have the concept of a slow log, a log file where the system reports slow operations. Having this in Logstash makes sense in order to improve the feedback users get.

The proposed system should be able to:

  • Be configurable. A user should be able to change a config variable to decide what slow means for a given component.
  • Be integrated with all plugin flavors in Logstash, meaning it should be possible to report slow operations from inputs, codecs, filters and outputs.
  • Also report from core components, e.g. the pipeline being stalled.
  • Be logged to a different file than the regular logs.
  • Provide as much context as possible, meaning:

    • The component (plugin, pipeline, etc.), the action, duration, event (if applicable), context (configuration?).

  • The slow log could also include context information useful for debugging problems.

Solution proposed

_Concrete details will be updated on demand as this idea gets developed, so take this proposal as an initial dump of ideas that will certainly evolve._

To implement this feature, especially if we just wanted the slow log mixed into the current log, we could simply reuse the logger instance; _however, since we aim for a more general structure_, this solution is discarded.

Who should initially report the slow operation? It should definitely be the component that detects the situation; we'll use the grok filter as an example to describe the solution. For the grok filter, reporting a slow operation should be transparent and as simple as:

report_slow_operation(action, duration, event, context, ….)

But internally in Logstash the situation is different: the log might need to be populated with collateral information such as the current pipeline state, worker and plugin status, etc. This information should be gathered by other components inside the Logstash core logic, which more or less mandates a system that lives in core and is able to compose all the necessary messages into a proper logger output.

The proposed solution will be:

  • A component that lives inside logstash-core.
  • Is available to plugins, but also to core components.
  • Knows how to receive a message from one of these components and act accordingly.
  • Is able to gather contextual information when necessary.

More details can be gleaned from an initial proof of concept:
_[Include PoC link when available]_.

Pros and Cons

  • Easier to grab extended context from the Logstash agent and pipelines, providing easier-to-debug logs for end users.
  • Easy interface for the component consumers.
  • Proper single responsibility, giving us components that are easier to test and maintain.
  • More complex to implement than the straightforward solution.
  • Needs to scale properly to multiple producers of content (slow operations).

Work packages

For the initial version of this component we aim to:

  • Make the current settings file available to plugins.
  • Build a slow log service able to receive slow log messages, gather context (when chosen to) and output the final log.
  • Enable plugins to report slow operations. The first candidate should be the grok filter.
  • Have good unit and integration tests for the feature.
  • Add a slow log feature flag (disabled by default).

Future work:

  • When the work on the new logging gets merged, it would be nice to exploit the feature there that enables users to change settings on demand from the API. It should be possible to:

    • Enable/Disable Slow-Log.

    • Change slow-log thresholds, for example what slow means for the grok filter.

  • Analyse whether more components/operations could be added. Should they go in different log files?
  • Analysing slow logs could be difficult, especially if they grow a lot. To provide an easy debugging experience, it would be nice to provide ways to visualize this data.

What others are doing

MySQL and related projects

MySQL provides its users with a slow query log consisting of SQL statements that took more than long_query_time seconds to execute and required at least min_examined_row_limit rows to be examined. Execution time is usually reported with microsecond resolution, unless the logs are sent to a table, in which case the microsecond part is ignored.
Some queries are not logged by default, like administrative statements, replicated statements, etc.

Ref: https://dev.mysql.com/doc/refman/5.6/en/slow-query-log.html

Similar approaches are taken by Percona and MariaDB.

Example of Percona SlowLog

# Time: 130601  8:01:06.058915
# User@Host: root[root] @ localhost []  Id:    42
# Schema: imdb  Last_errno: 0  Killed: 0
# Query_time: 7.725616  Lock_time: 0.000328  Rows_sent: 4  Rows_examined: 1543720  Rows_affected: 0
# Bytes_sent: 272  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
SET timestamp=1370073666;
SELECT id,title,production_year FROM title WHERE title = 'Bambi';

Elasticsearch

Elasticsearch has two types of slow logs, one for searches and one for indexing; both work in a similar fashion. The search slow log allows you to log slow searches (query and fetch phases) into a dedicated log file.
Thresholds can be set for both the query phase of the execution and the fetch phase. Each threshold is tied to a log level, and operations exceeding the threshold are logged at that level.

Search slow log configuration example, [latest 2.3 branch]

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

Both logs write to a dedicated file, one for slow queries and one for indexing.

Specific to the indexing log, the user gets the option to log part of the _source field; how much is logged can be configured (the default is the first 1000 characters).
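
For comparison, the indexing slow log thresholds follow the same pattern; a sketch from memory of the 2.x-era settings, including the _source truncation option (the documentation referenced below is authoritative):

index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.source: 1000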

Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html

Happy to discuss details.

Labels: design, discuss, logging, improvements


All 43 comments

I like this. We can introduce a SlowLogger class alongside Logger when feature/log4j2-logging lands. We can have this be configured with its own file output and a helper method like report_slow_operation.

This will be a good feature. Thanks for the detailed proposal. Some comments:

To implement this feature, especially if we just wanted the slow log mixed into the current log, we could simply reuse the logger instance; however, since we aim for a more general structure, this solution is discarded.

This should be one of the requirements. Slowlog and normal operating log must be in 2 separate files to be effective. It gets too noisy otherwise.

Build a slow log service able to receive slow log messages, gather context (when chosen to) and output the final log.

+1 for this to be in core and exposing an API for plugins.

Add a slow log feature flag (disabled by default)

My preference is to not have a (yet another) flag for this. I agree slowlog needs to be disabled out of the box, but that can be controlled via the thresholds. This way users need only change a single setting. This is also consistent with how Elasticsearch slowlog works.

I suggest we have 2 types of slowlog:

  • Plugin specific slowlog which deals with event processing slowness that can be controlled per plugin.
  • Pipeline specific slowlog which deals with pipeline specific slowness (metrics, persistent queues being slow, etc)

Before we progress to implementation, I would love to see:

  1. What do the various configs for slowlog look like?
  2. How can users configure the slowlog (API, file based)?
  3. Can we provide an example of how plugins interact with slowlog. What does the API interface look like?
  4. @talevy brought up feature/log4j2-logging -- how will this feature work with it?
  5. It would be interesting to see what an actual slowlog entry might look like, for example for grok?

@talevy brought up feature/log4j2-logging -- how will this feature work with it?

Depending on how one wants to organize the configuration for this, this can all be done in the log4j2.properties file. Simply have a specific logger for anything in the slowlog namespace. This way it would be logged just like any other logger, the difference being that there would be custom logic in the SlowLogger class to determine when it is appropriate to log the desired item.
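
For illustration, a hypothetical log4j2.properties sketch of that idea: a dedicated appender plus a logger for a slowlog namespace, with additivity disabled so slowlog entries do not leak into the regular log (all names and paths here are invented, not actual Logstash configuration):

appender.slowlog.type = File
appender.slowlog.name = slowlog_file
appender.slowlog.fileName = logs/logstash-slowlog.log
appender.slowlog.layout.type = PatternLayout
appender.slowlog.layout.pattern = [%d{ISO8601}][%-5p][%c] %m%n

logger.slowlog.name = slowlog
logger.slowlog.level = warn
logger.slowlog.appenderRef.slowlog.ref = slowlog_file
logger.slowlog.additivity = false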

Moin, thanks @suyograo and @talevy for your feedback. Here are the relevant bits from my current proposal; see https://github.com/purbon/logstash/tree/feature/slow_log for the more detailed WiP branch.

@talevy for sure this will need to be adapted as soon as your log4j feature gets merged.

For now, important parts are:

Settings

Settings are configured using the logstash.yml file, registered in logstash-core/lib/logstash/environment.rb:

 Setting::Boolean.new("slow_log.pipeline", false),
 Setting::Boolean.new("slow_log.plugin", false),
 Setting::String.new("slow_log.dir", nil, false),

Thresholds

Thresholds for the plugins could be defined either in the main environment.rb class or specifically in each plugin by providing proper initialization methods; however, from the user's point of view they will always go through the logstash.yml file. I think they should be kept in this file as they are actually Logstash related.
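
For illustration, a hypothetical logstash.yml sketch combining the flags above with one per-plugin threshold (the threshold key borrows the plugins.grok.slow_filter_in_sec reference used below; none of these keys are final):

slow_log.plugin: true
slow_log.dir: "/var/log/logstash/slowlog"
plugins.grok.slow_filter_in_sec: 5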

Others

I added feature flags because I think an all-or-nothing flag is important. We will certainly also have threshold settings, but if we add these global flags we don't have to set imaginary big thresholds and can keep "reasonable" values in there.

At the end of the day we should pick the approach that makes our users' lives easier, which means we should keep intuition and possible misunderstandings in mind.

For sure, as long as @talevy's feature includes the option to change these settings from the API, this should be added here when his work is merged. It makes sense to me that this part is pending his work.

Plugin interaction

Plugins can communicate with the slow_log by using a code snippet like:

slow_logger.log("this is slow", :action => "test")

This log method accepts as many parameters as are passed; this way there is no limit to the amount of information passed by the component to the logger. If the slow log is disabled, this logger is going to be basically a NullLogger, logging nothing.

Plugins should also be able to fetch settings, for example to know thresholds, using a reference like:

 setting("plugins.grok.slow_filter_in_sec")

Slow log

The slow log component has a log method (with aliases if necessary). This method is responsible for actually logging the information and fetching the context as a complement.

 def log(*args)
   context_data = fetch_context_info(*args)
   logger.warn(*args, context_data)
 end

How this will look in code will ultimately depend on the logger used, but most probably there will be specific classes per slow log type. For the first version I propose to have:

  • PipeLineSlowLog to log pipeline slow log information, for example when the pipeline is stalled.
  • PluginSlowLog to log plugin slow logs, for example slow grok operations.

Each of these slow logs will log to its own specific file, or to STDOUT if so configured.
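
A minimal sketch of how the feature flag could select between a real slow log and the NullLogger mentioned above, assuming hypothetical class names and the slow_log.* settings from earlier in this comment:

class NullSlowLog
  def log(*); end # slow log disabled: drop everything silently
end

def build_plugin_slow_log(settings)
  return NullSlowLog.new unless settings.get("slow_log.plugin")
  # one dedicated file per slow log type, under the configured directory
  PluginSlowLog.new(File.join(settings.get("slow_log.dir"), "plugins.log"))
end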

what do you think?

@purbon - I think of this as a kind of metric.

I like the idea that the top N slow operations will appear over the web api.

If it is in a file, will this make it harder to 'see' when LS is deployed in Cloud?

Do we really need to see the full history of all slow operations in a file or will top N be enough?

Slow operations, logging and metrics are cross-cutting concerns - a 'capture' statement can occur anywhere in the code.

We did briefly discuss, in the Batch Object PR, an alternative calling mechanism for metrics - could this feature pioneer that alternative mechanism and port it back to metrics when the API is settled?

In some ways, I agree with @guyboertje. Since Logstash runs on millions of events... if any one operation is slow, we will see it many, many times. I know this is also true of queries in ES, but those tend to happen much less frequently than an event passing through the pipeline. Keeping track of the top N slowest operations is a good place to start.

Why not do both? I see value in both -- expose the top N slowest operations via an API, and append slowlog entries to the file as we encounter them.

Why not do both?

Totally an option, but the question is which one should be implemented first (aka which is more important)?

Both work for me. Sometimes one needs realtime feedback and other times postmortem analysis is required.

@guyboertje good proposal to introduce the top N operations; we might like, for example, to:

  • Keep the log with all the slow operations and their metadata.
  • Keep a top N structure in memory, so it can be queried from the API and reported in real time without having to read the log file.

Makes sense?

If there are no more comments, I would like to keep us moving.

Looking forward to keeping this moving 😄


I see value in both, N slowest via API and everything in a file.

@purbon I am curious about the implementation details: how will we capture or detect the slowdown? What solutions are you looking at? One idea that comes to my mind is to supercharge the metric-collecting methods with a callback object:

clock = metric.timer(:response_time, SlowThreshold.create("this takes a long time", 3))
## slow execution.
clock.stop
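
A minimal sketch of what that callback object could look like, assuming the hypothetical metric.timer API above and a slow logger injected at creation time (the extra argument is my addition):

class SlowThreshold
  def self.create(message, threshold_in_sec, slow_logger)
    new(message, threshold_in_sec, slow_logger)
  end

  def initialize(message, threshold_in_sec, slow_logger)
    @message, @threshold, @slow_logger = message, threshold_in_sec, slow_logger
  end

  # Invoked by the timer with the measured duration when the clock stops.
  def call(duration_in_sec)
    return if duration_in_sec <= @threshold
    @slow_logger.log(@message, :duration_in_sec => duration_in_sec)
  end
end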

@ph In my initial thoughts this was not planned. I basically planned to provide a mechanism to:

  • Read a threshold from some kind of settings.
  • Emit an event to the slow log.

How this is timed, I thought, would depend on each plugin, but we could for sure also provide a uniform mechanism if that makes sense. The open questions here for me now are:

  • Is this timer only responsible for collecting execution time, or is it also responsible for stopping the execution of the slow component (this could be tricky)?
  • If it is only responsible for collecting time, is this really something we should provide a wrapper for, or can the user simply use Time.now?
  • I am not sure I would go for a component that is universally responsible for stopping executions, as this might be very tricky to achieve in a general sense.

What do you think?

@ph In my initial thoughts this was not planned. I basically planned to provide a mechanism to:

I think we should do an exercise in parallel to see _how a slow operation_ will be captured, since this might influence the architecture/design of the _slow log_ implementation.

I am not sure I would go for a component that is universally responsible for stopping executions, as this might be very tricky to achieve in a general sense.

A watchdog feature, given the nature of JRuby, could be hard to have; I won't go that route for now.

A watchdog feature, given the nature of JRuby, could be hard to have; I won't go that route for now.

Basically this is why I'm proposing only the very minimal thing: read threshold + report slow operation. How we decide whether something is slow can be settled after this interface is set up, unless I'm missing something.

might influence the architecture/design of the slow log implementation.

I would really love to understand your thoughts behind this. How can it influence the implementation if internally this is mostly a call that reports to a logger, in the case of reporting to a file?

@purbon Sorry for taking the issue off track; my curiosity was more about how it will be collected, not where it would be saved or displayed.

@ph this is a very nice discussion to have; I think we can open a new issue to discuss and explore options there. Does that make sense to you? I would name it something around task timing management or so...

@ph and @purbon - all good points.

It would be sad to have to add two statements, one for logging the slow activity and one for the metric.
I mean, the metric object would have to have all the data/content to create a top N type entry anyway - could it not log as well? It's not the metric subsystem that would log; it would be the SlowThreshold class that did the logging.

@guyboertje no need to be sad, we are here to discuss and reach consensus around this feature. A couple of questions to try to understand your thoughts a bit more:

  • Do we see the slow log as highly coupled with metrics? For me they are actually two _related_, but different, components; let's say one thing is to gather metrics and another is to report that something was slowish (aka it's always grok's fault).
  • I do however see the top N as something we could add to the metrics framework, perhaps as a concept similar to the periodic poller we already have? @ph what do you think?

@ph and @purbon - I guess I see the slow log function as a SectionTimer class that knows how to send the metric and log to the appropriate logger if the time taken is above a threshold.
That is: it has the threshold value, it has the metrics channel, and it has the logger reference.
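
A minimal sketch of that SectionTimer idea, assuming hypothetical metrics and slow-logger interfaces (neither is an actual Logstash API):

class SectionTimer
  def initialize(threshold_in_sec, metrics, slow_logger)
    @threshold = threshold_in_sec # what "slow" means for this section
    @metrics = metrics            # the metrics channel
    @slow_logger = slow_logger    # the slow log reference
  end

  # Times the block, always reports the metric, logs only above threshold.
  def time(name)
    started = Time.now
    result = yield
    elapsed = Time.now - started
    @metrics.report_time(name, elapsed)
    if elapsed > @threshold
      @slow_logger.log("slow section", :name => name, :duration_in_sec => elapsed)
    end
    result
  end
end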

I would ideally keep the metrics feature away from slowlog. To me, they are 2 separate concerns. Slowlog captures a lot more context than what the monitoring APIs do. Also, slowlog can be expensive, hence it needs to be turned on explicitly, while metrics are always on. I agree both are useful for debugging, but they are complementary ways of looking at "what's going on with my LS".

@purbon as @ph pointed out, how do you intend to capture or record slowlog data for operations in plugins? Can you give us a concrete example? Maybe use the DNS filter as an example...

@suyograo as I pointed out to @ph in my comment here, I see a few questions:

  • How to decide an operation was slow? This is out of the scope of this PR, as it will _usually_ involve a specific solution per plugin.
  • How to communicate that an operation has been slow? This is certainly in the scope of this PR and has been explained in the plugin interaction section in this issue's comments.

How to decide an operation was slow? This is out of the scope of this PR, as it will usually involve a specific solution per plugin.

I am not sure why this is out of scope. Isn't this a core part of this design issue? How are you thinking of measuring a slow operation from a plugin point of view? What infra will we need there?

@suyograo Well, the short version: I see a difference between:

  1. communicating that an operation has been slow [explained in this issue]
  2. measuring how long it takes a given operation to run.

Point 2 I see as a different issue because it can involve different approaches for different kinds of operations, but to provide at least one solution, I think for most of them we could use a monitor approach:

     slow_log_monitor(threshold) do
       ### run the operation to be monitored
     end

This will involve having a module that basically does the following (see the sketch after this list):

  • Wrap the operation to be monitored.
  • Start another thread as a timer.
  • If the timer detects the threshold has been breached, report this to the main monitor class.
  • The monitor class has the responsibility of holding a reference to the slow log and deciding whether to report.
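
A minimal sketch of that module, assuming the hypothetical slow_logger interface from earlier; the timer thread fires only if the wrapped operation is still running once the threshold elapses (a real implementation would need to handle the finish/report race and probably pool timer threads):

def slow_log_monitor(threshold_in_sec, slow_logger)
  finished = false
  timer = Thread.new do
    sleep(threshold_in_sec)
    unless finished
      slow_logger.log("operation breached threshold",
                      :threshold_in_sec => threshold_in_sec)
    end
  end
  result = yield # run the operation to be monitored
  finished = true
  timer.kill     # stop the timer early if the operation finished in time
  result
end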

Open questions:

  • Do we want the slow log monitor to also be a kind of timeout? I would not go that way, as it really mixes things; for me the slow log monitor should only be doing the timing and reporting.
  • What happens when an operation that is monitored also has a timeout mechanism? Here we get into the land of _seconds_ being counted in the wrong bucket; do we worry about time spent inside the timeout module being part of the slow log accounting, and the other way around?
  • If we enter the _timeout_ land, the open question is how to terminate the operation without causing damage.

These are the questions I can see for now, and why I think that, while it might be _simple_ for grok or others, providing a general enough solution for all plugins is tricky and might deserve its very own design issue.

Does that make sense to you?

@suyograo - I do agree that reporting a slowlog episode to metrics and reporting it to a logger are different concerns - but I feel that they are facets of one higher concern, reporting.

I see a slowlog "device" as having three higher concerns

  • capture
  • threshold crossing logic (at this moment in time, is this really slow? is it n% above my moving average for example)
  • reporting (initially log, enhance later to send metric)

I think capture would include techniques seen in benchmark code that attempts to split the signal from the noise.

However figuring out what slow means is not simple.

I see three sources of execution speed variation

  • when doing external IO
  • when the data in an event causes a slow down (string manipulation (csv, grok, dissect), json parsing/generation, UTF8 encoding)
  • when the JVM or the OS interrupts our execution.

We should probably know the event or string size when judging data-based captures.

We should consider self-calibration instead of manual calibration - here I mean, for example, that a slowlog device monitoring a dns lookup function should calibrate itself by periodically sampling a reference (nslookup) or tail a file that gets calibration updates, maybe from a file output on a different reference instance of LS.

I still think the metrics are linked to this slow log idea, as @guyboertje mentioned.

The metrics already collect all the data; sampling/aggregation is really core to what a metrics system is, and adding support for it could yield more features.

I can see that the metrics could emit events based on the sampling they do, which would keep these systems decoupled; turning this on or off could also be easily implemented in that scenario.

@purbon concerning the external thread: are you thinking of launching one thread per slow_log_monitor call? This could lead to creating a lot of new threads; I am not sure it would be good for performance.

@ph I'm happy to see other ideas for timing tasks and operations; seriously, this is why I said this deserves its own issue, because the question is not easy to answer in general terms.

I am not sure I see it as a metric, because at the end of the day this follows this pattern:

  1. One threshold is defined somewhere.
  2. An operation is launched.
  3. If this threshold is reached, the system should report an event stating that operation A is slow.

The big open question here is:

  1. How to properly time an operation, and what degree of interference we aim for.

If I understand it correctly, your approach is:

  1. We wrap an operation with a timer that reports to the central metrics component.
  2. Metrics decide when to report that an operation has been slow. This might require different logic for each kind of _operation_.

While I do understand your concepts, I see this other approach as considerably more complex. I would really love us to keep this initial implementation; later on we can always improve the approach.

If you ask me now, my idea is:

  1. The user really cares about the complete time an operation runs, not whether it is spent in GC or elsewhere, because at the end of the day they are the user of, let's say, the grok filter.
  2. I see the concern about how to approach the timing of these operations right, i.e. agreeing on the degree of "interference".

Current state of the proposal

After the initial proposal (see top of the issue) and a few comments and great ideas, this aims to be a summary of the different topics discussed; gaps and missing comments are unintended, let me know and I will fix them.

The intention is to enable us to move forward and reach consensus.

This is the list of topics in discussion:

How to enable/disable the feature and configure the different thresholds.

My preference is to not have a (yet another) flag for this. I agree slowlog needs to be disabled out of the box, but that can be controlled via the thresholds. https://github.com/elastic/logstash/issues/5733#issuecomment-238412582

The current proposal is to use feature flags plus a definition for each threshold; this way users can control whether a feature is on/off and what slow means for each one, while also avoiding the need to set extra-large thresholds just to make sure the slow log is disabled. Feature flags are common practice in many DB systems, for example.

Adding an API endpoint to report the top N slow operations for a pipeline.

I like the idea that the top N slow operations will appear over the web api. https://github.com/elastic/logstash/issues/5733#issuecomment-238514678

This was initially not scoped, but it is certainly a very good idea that should be implemented. It should not be hard to provide in an initial scoping.

What degree of interference we accept when timing slow operations.

Timing slow operations might be interfered with by time spent in other operations; how concerned are we with the precision and isolation of these measurements?

The user really cares about the complete time an operation runs, not whether it is spent in GC or elsewhere, because at the end of the day they are the user of, let's say, the grok filter.

@purbon

Related https://github.com/elastic/logstash/issues/5733#issuecomment-239445851

Distribution of responsibilities, who does what and when.

In the initial proposal the distribution of responsibilities is:

  • The plugin decides when an operation has been slow.
  • The plugin communicates this situation to a core component responsible for:

    • Gathering core context (pipeline status, …)

    • Appending to a slow log file, one per component.

Timing the operation is something provided to the plugins by logstash-core (for the general use case), but it could certainly be extended by the plugins when necessary. It looks like:

 slow_log_monitor(threshold) do
   ### run the operation to be monitored
 end

This is proposed this way because it provides a proper separation of concerns between the plugin (which knows whether something is slow, the thresholds, etc.), i.e. the one that decides something has gone wrong, and the core components, responsible for gathering context and reporting.

However there has been an alternative proposal:

Thinking of the slow log as a kind of metric.

https://github.com/elastic/logstash/issues/5733#issuecomment-238514678, https://github.com/elastic/logstash/issues/5733#issuecomment-238882890, https://github.com/elastic/logstash/issues/5733#issuecomment-239457705

Interesting notes:

The metrics already collect all the data; sampling/aggregation is really core to what a metrics system is, and adding support for it could yield more features.
I can see that the metrics could emit events based on the sampling they do, which would keep these systems decoupled; turning this on or off could also be easily implemented in that scenario.

@ph

We should consider self-calibration instead of manual calibration - here I mean, for example, that a slowlog device monitoring a dns lookup function should calibrate itself by periodically sampling a reference (nslookup) or tail a file that gets calibration updates, maybe from a file output on a different reference instance of LS.

@guy

In terms of responsibilities this would mean:

  • The plugin times an operation.
  • If the operation is slow, it reports this event to the metrics system.
  • The metrics system contains the logic necessary to sample and aggregate. This might differ per kind of operation or “device”.
  • The metrics system outputs to the slowlog component when necessary.

To summarize, the coupling to metrics is highlighted because of:

  • [@ph, @guy] Per the nature of Logstash, where lots of events pass through the pipeline, there could be situations where a slow operation reports lots of slow events.

    • Top N and sampling/aggregation approaches.

  • The definition of rigid thresholds, taking into account the diverse ecosystems that Logstash runs in, might not be the best solution.

    • Self calibration of thresholds.

I think some of these features can be implemented without interfering with the metrics module, which has another objective.

We can add logic, if agreed, to the slow log module to:

  • Not overflow the slow log with a lot of possibly repeated information.
  • Report the top N slow operations to the web API.

Adding self-calibration looks very powerful, but it might require lots of research to get right; I would make it a second-phase feature.

+1 on self-calibration being second phase.

@purbon - consider slow_log over a batch. This will have an automatic averaging effect.

Deliverables and Phases

Having reached consensus on the different ideas discussed in this issue, this will be the initial scope of deliverables:

Phase 1

  • Build an independent slow log component with the responsibility to report these operations to a log appender.
  • The central (logstash-core) slow log component will:

    • Receive messages from plugins.

    • Output to a log file.

    • One per component, aka pipeline, plugins, etc…

    • Gather context; in this phase context will include:

    • Pipeline status, from the same pipeline.

    • Statistical information about the last N operation times. At least the average will be included, leaving room for more if possible.

    • Gather Top N slow log operations.

    • This information should be made available to the web API in the form of an endpoint.

    • Top N operations will be ranked:

      • Based on total execution time.

      • Per slow log, so one ranking for plugins, one for the pipeline, and so on.

    • Provide helpers to time operations.

  • Thresholds will be defined:

    • In the settings (logstash.yml) file, affecting all plugins.

    • Per plugin, as a configuration option (overriding the global thresholds); see the sketch at the end of this phase's list.

  • Plugins will:

    • Time the operations.

    • Time the complete operation execution, from beginning to end.

    • Use timing methods provided as core helpers.

    • If the threshold has been reached, report the operation as slow.

    • If the threshold is not defined (null), the message will be sent to the null logger.

This only includes work to be done at the core level; once this is in, making plugins report slow operations will be necessary, and the first obvious candidate will be the grok filter.
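
For illustration, a per-plugin override as described in the thresholds item above could look like this in a pipeline definition (the slow_filter_in_sec option name is hypothetical, borrowed from the settings reference earlier in this issue):

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    slow_filter_in_sec => 2   # hypothetical override of the global threshold
  }
}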

Phase 2 and the future

  • For the logstash-core slow log component

    • Slow log output based on aggregations and on how often a given type of slow operation has been reported [slow operations overflow control].

    • This work relates to a better understanding of how to represent slow log data for users.

    • Self calibration and sampling references for threshold definition.

    • Extended statistics for the execution time.

    • Extended top N structure, aggregated by plugin.

  • For settings management:

    • Integration with the web api to change the threshold at runtime.

  • Metrics -> Monitoring:

    • Assess whether and how this could be integrated with the metrics / monitoring component to make the latter more actionable for slow operations.

  • UI, data digestion:

    • Improve web api topN endpoint with the new structure.

    • Build more endpoints that enable easier slow log digestion:

    • Statistics per slow log type.

    • Top N operation inspection.

    • ….

I personally feel this feature is premature.

  • The architecture for this is non-trivial, and as the discussion moves forward we can already see the scope exploding. I am concerned about the development cost versus the actual benefits for users.
  • We already spent significant effort on the metrics architecture and have yet to gather real-world usage experience and user feedback for it.
  • Has this been an expressed concern by users, and won't metrics help with some of these "slow" concerns?

I suggest we let metrics settle in, see what works and what does not, see how our users react to it and use it, and then re-assess the need for a slow logger.

One good proposal by guy was the idea to introduce some degree of topN reporting in the web API. I am wondering what might be the best URL for it. I have several ideas, but I'm not sure which might be the best:

  1. _node/monitoring/slowlog
  2. _node/slowlog

Option 1 is namespaced because we might like to add other monitoring resources in the future, for example pipeline status.

The initial proposal for slowlog data to be shown in the API is:

  • Top N slow operations.
  • Most frequent slow operations.

All these resources will be grouped by logger and include all relevant information. For example,

{
    "loggers.slow.plugins": {
        "filters.grok.execution": 12,
        "output.elasticsearch.store": 9
    }
}

to report top N slow operations for the plugins logger.

@ph @suyograo @guyboertje what do you think?

Some initial experimentation on how to represent this information is:

_[Screenshot: screen shot 2016-09-09 at 16 10 31]_

The goal was to reduce the amount of information shown; what do you think?

re: https://github.com/elastic/logstash/issues/5733#issuecomment-240093470

  • Thresholds will be defined:

    • In the settings (logstash.yml) file, affecting all plugins.

I think it is more appropriate to only enable this configuration in the settings file. Maybe we can add runtime APIs to allow updating it at runtime, so users could potentially introduce a specific threshold to monitor any unexpected slowness.

- Per plugin, as a configuration option (overriding the global thresholds).

Configuration options in the pipeline definition feel like they should relate to how the plugin operates rather than to any general Logstash operational features.

@andrewvc brought up a related concern in the PR; I copy & paste it here to keep the discussion going:

I don't think we should be putting slowlog thresholds in the main plugin config. That sort of stuff should be in the settings.yml. In my mind the plugin config is for defining what the pipeline does, and should not include settings related to performance. While it is true that some plugins today do have some performance-related settings, we should try to minimize these, not add to them.

@talevy I like the idea of adding an option to change these settings at runtime via the API, something similar to what #5871 is doing, right?

@suyograo, @talevy and @andrewvc brought up two valid concerns about enabling plugins to have thresholds defined in their config. As you were the one who brought that idea in, I would like to know what you think.

Even if we previously agreed to have both, I am not completely sold on that and see their arguments as good points, especially once #5871 gets merged and we can change settings at runtime.

@purbon yeah, adding an option to the logging api would be great

@purbon I am OK with not having an option to specify thresholds in plugins. We can evaluate later.

closing in favor of #6128
