Logstash: Pipeline initialization vs Persistent Queue initialization

Created on 14 Sep 2016 · 20 comments · Source: elastic/logstash

Currently, queue initialization is done in Pipeline#initialize, both in master and in the feature/java_persistence branch.

This leads to a few potential problems:

  • Config reloading creates new Pipeline instances and thus would close and reopen the PQ - this seems suboptimal?
  • OTOH, config reloading might create a situation where the new config is improper for the existing persisted events.
  • Queue settings are pretty much decided at Pipeline initialization, which can lead to problems in tests, for example, where we might not want a disk-based PQ.

I think we should decouple the Queue initialization from the Pipeline and pass a Queue instance to the Pipeline constructor, giving us the choice of reusing an existing instance or not. This would also ease testing.
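
A minimal sketch of the proposed decoupling (illustrative class names, not the actual Logstash classes): the queue is built outside the pipeline and injected through the constructor, so a config reload can hand the same queue to a fresh Pipeline instance.

```ruby
# Hypothetical sketch; PersistedQueue and Pipeline here are stand-ins,
# not real Logstash classes.
class PersistedQueue
  attr_reader :path

  def initialize(path)
    @path   = path
    @events = []
  end

  def push(event)
    @events.push(event)
  end

  def pop
    @events.shift
  end
end

class Pipeline
  attr_reader :queue

  # The queue is passed in rather than created inside #initialize.
  def initialize(config, queue)
    @config = config
    @queue  = queue
  end
end

queue        = PersistedQueue.new("/tmp/ls_queue")
old_pipeline = Pipeline.new("config A", queue)
# On reload, a fresh Pipeline reuses the existing queue instance:
new_pipeline = Pipeline.new("config B", queue)
old_pipeline.queue.equal?(new_pipeline.queue)  # => true
```

In tests, the same constructor can simply receive an in-memory queue instead of a disk-backed one.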

discuss

All 20 comments

Config reloading creates new Pipeline instances and thus would close and reopen the PQ - this seems suboptimal?

until we think about separate input/filter/output pipeline fingerprinting, any reload should invalidate the queue, since the reload will only happen if the pipeline changes. If it changes, sending persisted data to a different filter section doesn't make sense, IMHO.

That said, how does the normal logstash restart work? In my mind, the pipeline starts and, based on the fingerprint of the config (hash of the config text, something else?) it opens the queue at a certain path. If the path exists it opens the queue, otherwise creates a new one for that pipeline ID.
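
The fingerprint-to-path mapping described above could look roughly like this (illustrative Ruby; `queue_path_for` is a made-up helper, not actual Logstash code): hash the config text and derive the queue directory from it, so a restart with an unchanged config reopens the same queue while a changed config gets a fresh one.

```ruby
require "digest"

# Hypothetical helper: map a pipeline config to a queue directory
# via a hash of the config text.
def queue_path_for(config_text, data_dir)
  fingerprint = Digest::SHA256.hexdigest(config_text)[0, 12]
  File.join(data_dir, "queue", fingerprint)
end

same  = queue_path_for("input { beats {} }", "/var/lib/logstash")
again = queue_path_for("input { beats {} }", "/var/lib/logstash")
other = queue_path_for("input { tcp {} }",   "/var/lib/logstash")

same == again  # => true  (unchanged config maps to the same queue)
same == other  # => false (changed config maps to a new queue)
```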

You can revert the paradigm and have the pipeline ask a "queue factory manager" for a queue given a pipeline config, but it seems more complex, IMO.

NOTE: this is a dup of https://github.com/elastic/logstash/issues/5888, one of them should be closed.

I think we should give users the flexibility of either reusing a non-empty persisted queue on startup or creating a new one, probably defaulting to creating new queues, leaving the previous one intact, and providing the right tooling to re-feed an old queue. And/or maybe just be able to spin up multiple pipelines and attach existing or new queues to each pipeline.

For example, a config A is running and needs to be changed. A new config B is created from config A. LS is shut down and restarted with config B using a fresh queue. Now, older persisted events exist in the old queue that was associated with config A. Config A could be run as a second pipeline, but without its inputs section, to flush out that old queue (and the filters and output could also be updated if needed).

since the discussion is starting to take place here I will copy @guyboertje's original description:

This is a preliminary discussion aimed to capture what technicalities we should consider when a config is reloaded.

When the config is compatible with the events currently in the queue, e.g.:

  • close the queue before the reload starts - this is part of normal pipeline shutdown, but is anything more needed?
  • reopen the queue when the pipeline starts again - this is also normal, but is anything more needed?

When the config is incompatible with the events currently in the queue.

  • rename current queue folder and start a new queue etc.

I don't think there is a way to programmatically figure out beforehand if a "config is compatible with the events currently in the queue" ... I guess we can only figure out whether a config has changed or not, by associating a config hash with a queue.

We could have a flush mode that starts any pipeline but doesn't start the inputs, just for the purpose of emptying the queue.
Also, assuming that the config hash isn't hardcoded in the queue files but only on the queue directory name, if a user wants to reuse the queue with a different config he/she only needs to change the directory name to match the new id, which could be printed during config test (-t), for example.
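
The manual "rename the directory to reuse the queue" operation described above could be sketched like this (illustrative Ruby under the stated assumption that the config hash lives only in the directory name; `rehash_queue_dir` is a made-up helper):

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Hypothetical helper: rename an existing queue directory so its name
# matches the new config's hash, letting the new config pick it up.
def rehash_queue_dir(data_dir, old_config, new_config)
  old_dir = File.join(data_dir, Digest::SHA256.hexdigest(old_config)[0, 12])
  new_dir = File.join(data_dir, Digest::SHA256.hexdigest(new_config)[0, 12])
  FileUtils.mv(old_dir, new_dir)
  new_dir
end

Dir.mktmpdir do |data|
  old_cfg = "filter { mutate {} }"
  new_cfg = "filter { mutate {} } output { stdout {} }"

  # Simulate an existing on-disk queue for the old config:
  old_dir = File.join(data, Digest::SHA256.hexdigest(old_cfg)[0, 12])
  FileUtils.mkdir_p(old_dir)
  File.write(File.join(old_dir, "page.0"), "persisted events")

  new_dir = rehash_queue_dir(data, old_cfg, new_cfg)
  File.exist?(File.join(new_dir, "page.0"))  # old events now live under the new hash
end
```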

This kind of operation should be rare, as a graceful shutdown should flush all events, same with reloading.

That said, I don't have anything against more complex wiring options, just trying to keep things simple for "v1".

Agree for keeping things simple.

  • I like the idea of the flush mode! 👍
  • if the default is to start with a new queue, why don't we just provide the flexibility to configure the queue path in the config, which would result in using a specific queue instead of creating a fresh new one? This would avoid having to deal with associating config fingerprints with queues.

why don't we just provide the flexibility to configure the queue path in the config which would result in using a specific queue and not creating a fresh new queue?

My thinking is that by default logstash starts with a config and creates a new queue; the config changes, the queue is emptied, logstash starts with the new config and a new queue; it crashes, restarts with the same config, so it loads the queue for that config hash, and so on.
Nothing here requires the user to come up with a name/location for the queues of the two configs if the mapping happens by default through the config hash.

As for the queue path in the config, I'm not a big fan of adding this kind of configuration to the config DSL; maybe in the settings.yml? The downside of settings.yml is the same problem: the need to map pipeline configs/ids to queues.

@jsvd but with "the config changes, the queue is emptied, logstash starts with the new config and a new queue" I don't think we can assume that upon config change the queue is/should be emptied.

I don't think we should see a non-empty queue as an abnormal situation. It is just a situation where the outputs are slower, and having a non-empty queue could be very normal - think peak periods, for example. In that respect, changing the config and wanting those changes applied to that non-empty queue should be a very normal operation too.

The idea of specifying a queue in the config (in the input and filter sections, for example) is that this is where the wiring is done. If you put it in the settings, you will have to ID the input/filter sections and associate a queue with an input/filter section. It feels more natural to do it there, I believe. Note that for most configurations this is not required.

I understand it makes it easier, but the config has always been the description of the pipeline stages, for me it's strange to have this one setting section in the dsl.

If we do it, we should move other settings like pipeline id (defaults to uuid), batch size/delay (defaults to global value), workers and reload settings:

settings {
   "id" => "main"
   "batch.size" => 400
   "reload.automatic" => false
   "queue.path" => "/../../"
}
input {
   beats {}
}
output {
   elasticsearch {}
}

Also, if a user has a pipeline config spread over 10 files, I guess the queue path setting can be put in any of them and logstash throws an error if more than one queue path declaration is made?

If we do it, we should move other settings like pipeline id (defaults to uuid), batch size/delay (defaults to global value), workers and reload settings

Agree. And I think it would make more sense to have it there, especially in the context of enabling multiple pipelines.

After discussing this with @suyograo and @jsvd here's what we propose for the first iteration:

To keep this simple and pretty much in-line with how users deal with external queues like Redis or Kafka for handling back pressure today, Logstash will simply point to the same queue path by default but it will be a globally configurable setting. We don't need to worry about multiple pipelines now since this is not yet a feature.

This means that by default, any kind of restart (after crash, after graceful shutdown or after a quick shutdown) will just use the same queue as before the restart.

It will be possible to globally configure a new queue path using the settings file, which will default to:

path.data: LOGSTASH_HOME/data
path.queue: path.data/queue

Users will be able to specifically change the queue path, for example:

path.queue: /mnt/ls_queue

Please bear in mind that multiple queue folders may consume all the disk if the user has not thought about or was not aware of this when specifying a size that is a large slice of the disk pizza.

@guyboertje good point - I think we can defer this discussion to when we tackle queue limits handling? This will only be a problem when specifying an absolute byte-size limit, whereas when using a disk-space percentage it will be fine.

@colinsurprenant True. Do you mean % of available disk space at the time of queue creation?

@guyboertje not sure I understand how useful % of available disk space at the time of queue creation is? I was thinking about not allowing writing (blocking + backpressure) when at more than x% of disk space. Obviously you don't want to have to compute the % of disk space available on every write operation, but this kind of limit is somewhat fuzzy, so I guess we can come up with a deferred mechanism for computing it which does not impact write performance.
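
One way to sketch that deferred mechanism (illustrative Ruby; `DiskUsageGuard`, its parameters, and the threshold are all made up for this example): cache the measured usage ratio and only re-probe the filesystem when the cache is older than a refresh interval, so the per-write check stays cheap.

```ruby
# Hypothetical sketch of a deferred disk-usage check.
class DiskUsageGuard
  def initialize(limit_ratio:, interval: 5.0, probe:)
    @limit_ratio = limit_ratio
    @interval    = interval
    @probe       = probe      # callable returning the current used-space ratio
    @cached      = @probe.call
    @checked_at  = Time.now
  end

  # Cheap per-write check; only re-probes when the cached value is stale.
  def writable?
    if Time.now - @checked_at >= @interval
      @cached     = @probe.call
      @checked_at = Time.now
    end
    @cached < @limit_ratio
  end
end

usage = 0.50
# interval: 0.0 forces a re-probe on every call, just for this demo.
guard = DiskUsageGuard.new(limit_ratio: 0.80, interval: 0.0, probe: -> { usage })
guard.writable?   # true while usage is below the 80% limit
usage = 0.90
guard.writable?   # false once usage crosses the limit
```

With a real interval (say 5 seconds), the writer pays only a `Time.now` comparison per write, and the fuzziness of a slightly stale reading is acceptable for this kind of limit.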

@colinsurprenant - OK I understand. But that is a bit different from declaring the actual size upfront.

The ES use case is different because there is no real intention/training/advice to leave previous copies of ES data lying around.

In our case, while it may not be directly advised, it seems reasonable that people will have older (failed config) copies of queues on their machines. There is also the "challenge" of leaving enough space available for the DLQ file(s).

Not being too ops savvy: in production, is there a chance that more disk space can be added dynamically by the ops people while LS is running, without an LS restart? If so, then a %free check during runtime would work better than a static %free calculation for the queue size at startup. OTOH, the logfile, the DLQ and the PQ are all competing for disk space. I'm not sure how log4j handles out-of-disk-space, but we need to make sure the PQ and any DLQ implementation handle out-of-disk-space safely.

I think all possible queue limits/thresholds based on

  • max number of events
  • max total queue size
  • max % disk usage

are all valid.

Personally I think that in a backpressure/overflow use case, using max % disk space is what makes the most sense. If you reach this %, regardless of what actually fills the disk, you stop writing and avoid a potential write-corruption scenario upon no space left on the device.

For the other limits, max number of events or max queue size, there is no magic and there is only so much the app can do to avoid a no-space-left-on-device error; I would argue that this is more an operations concern.

But then, personally, if a hard size limit were required for some use case, I would probably want to ALSO enable the max % disk space limit, to avoid a no-space-left-on-device error.

all these options are possible and combinable too...
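
Combining the three limit types listed above could look like this (illustrative Ruby; `QueueLimits` and its fields are invented for this sketch): any configured limit that trips causes the queue to refuse the write and exert backpressure.

```ruby
# Hypothetical sketch: nil means "limit not configured".
QueueLimits = Struct.new(:max_events, :max_bytes, :max_disk_ratio,
                         keyword_init: true) do
  # Returns false (apply backpressure) if any configured limit is reached.
  def accept?(event_count:, queue_bytes:, disk_ratio:)
    return false if max_events     && event_count >= max_events
    return false if max_bytes      && queue_bytes >= max_bytes
    return false if max_disk_ratio && disk_ratio  >= max_disk_ratio
    true
  end
end

limits = QueueLimits.new(max_events: 1_000_000,
                         max_bytes: 8 * 1024**3,
                         max_disk_ratio: 0.85)

limits.accept?(event_count: 10, queue_bytes: 1024, disk_ratio: 0.40)  # => true
limits.accept?(event_count: 10, queue_bytes: 1024, disk_ratio: 0.90)  # => false
```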

I think we should leave the decision about what to do with queue contents to the configuration author. They may want to process old queue content with the new configuration, or create a new queue for the new configuration and drain the old queue with another configuration, or leave the old queue content on disk for later processing (Logstash can display a warning if there are events in a queue without an assigned pipeline):

settings {                    # global configuration, maybe in settings.yml
  queue.path => "/../../.."   # queue disk path
  queue.max_size => "NNN"     # maximum queue size in number of events ("NNN"),
                              # disk size ("NNNGiB"), or volume size percentage ("NN%")
  pipeline.max_running_workers => N # max simultaneously running pipeline workers
  pipeline.batch_size => NNN  # default pipeline batch size
  #...
}

pipeline {                    # production pipeline configuration
  settings {
    name => "pipeline.v35"
    workers => N              # if omitted, use global pipeline.max_running_workers
    batch_size => NNN         # if omitted, use global pipeline.batch_size
    # ...
  }
  input {                     # new events are assigned to this pipeline
    some_input_plugin {
      #...
      max_queue_size => NN    # percentage of queue.max_size for controlling
                              # backpressure on this input
    }
  }
  filter {
    #...
  }
  output {
    some_output_plugin {
      #...
    }
  }
}

pipeline {                    # legacy pipeline configuration
  settings {
    name => "pipeline.v34"
    drain_queue_and_stop => true # don't start input plugins, drain queue and
                                 # then stop and destroy pipeline
  }
  filter {
    #...
  }
  output {
    some_output_plugin {
      #...
    }
    # or
    pipeline {
      name => "pipeline.v35"   # re-assign events to this pipeline
    }
    # or
    null {}                    # discard events
  }
}

As a bonus, we could have multiple pipelines for different event types/inputs, which could simplify large configurations and decouple the filter and output stages when the target can have long outages (for example, when communicating over a WAN). That way, a reliable setup that puts an external persistent queue (Kafka etc.) on Logstash's input and output could be replaced with just the Logstash persistent queue.

I believe we can close this issue at this point, as the PQ integration WRT the pipeline & reloading is settled, at least heading for GA (target 5.4). If we feel we need to revisit this, we should create a new issue.
Thanks for your input @prehor - definitely good hints WRT upcoming multiple-pipelines support.
