Logstash: [DISCUSS] PQ disk size usage APIs

Created on 18 Jan 2018 · 18Comments · Source: elastic/logstash

This discuss issue purpose is to untangle the requests around PQ disk usage APIs versus what we have now and reach consensus moving forward.

What we have now

Currently PQ disk usage stats are part of the _node/stats/pipelines API. For example curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' provides:

...
      "queue" : {
        "events" : 0,
        "type" : "persisted",
        "capacity" : {
          "queue_size_in_bytes" : 255,
          "page_capacity_in_bytes" : 67108864,
          "max_queue_size_in_bytes" : 1073741824,
          "max_unread_events" : 0
        },
        "data" : {
          "path" : "/Users/colin/dev/src/elasticsearch/logstash/data/queue/main",
          "free_space_in_bytes" : 33588469760,
          "storage_type" : "hfs"
        }
      }
...

queue_size_in_bytes calls Queue.getPersistedByteSize() and provides the actual number of bytes written to each pages. This number is consistent with what du on Linux would provide but may show a different number from ls because of the mmaped file sparse allocation. See this discussion for more details. We agreed that this is the correct number we want to provide as opposed to simply summing the live pages configured page_capacity.
page_capacity_in_bytes is not very relevant from a monitoring POV but provides valuable information for performance related diagnostics.
max_queue_size_in_bytes is the configured queue.max_bytes. What is interesting here is the relation between queue_size_in_bytes and max_queue_size_in_bytes which can provides a % fullness.

What is proposed

6508 proposes 2 things:
1. Introduce a page level API _node/stats/queue
2. replace queue_size_in_bytes with queue_size_on_disk which is a percent value in relation to max_queue_size_in_bytes
6884 proposes for PQ:
1. Disk allocated for PQs
2. Disk used by PQs

Discussion

(Introduce a page level API _node/stats/queue) : since queue(s) are per pipeline(s) I believe we need to keep that under the pipelines {} namespace.
(replace queue_size_in_bytes with queue_size_on_disk which is a percent value) : I don't see a problem introducing a new percent field if it makes it easier for downstream UI. (@ycombinator ?)
(Disk allocated for PQs) : Not sure what we mean here? There is the max_queue_size_in_bytes which is the max bytes on disk PQ will use, there is queue_size_in_bytes which is the current bytes on disk PQ is using. Note that new bootstrap check and Queue opening check will be done regarding these and the disk free space, see #8978
(Disk used by PQs) : I believe this is covered with queue_size_in_bytes?

For items (3) and (4) maybe the intent is about an aggregate view of all configured pipelines+queues ? @acchen97 ?

Anything missing? Comments? Suggestions?

discuss persistent queues

Source

colinsurprenant

Most helpful comment

For (3) and (4), I feel just exposing the current queue size (queue_size_in_bytes) and max queue size (max_queue_size_in_bytes) should be sufficient for the API. The percentage full metric should just be calculated...

Ha, true enough! I'm good with the UI calculating the % from these two metrics.

ycombinator on 20 Jan 2018

👍2

All 18 comments

Also @jordansissel @andrewvc you may want to look at this.

colinsurprenant on 18 Jan 2018

(replace queue_size_in_bytes with queue_size_on_disk which is a percent value) : I don't see a problem introducing a new percent field if it makes it easier for downstream UI. (@ycombinator ?)

I just checked and the UI is currently not even using queue_size_in_bytes so replacing it will certainly not break anything. If/when queue_size_on_disk is implemented, the UI can start using it. So 👍 from the UI POV.

replace queue_size_in_bytes with queue_size_on_disk which is a percent value in relation to max_queue_size_in_bytes

Three questions here:

queue_size_in_bytes refers to the absolute number of bytes of event data in the queue, right? And max_queue_size_bytes refers to the capacity of the queue?
I can definitely see the percent value being very useful to someone monitoring their Logstashs (e.g. alert me when it's above 85%). But I'm less sure if a user would ever care about the absolute queue size (what is currently queue_size_in_bytes). Is there a use case for keeping queue_size_in_bytes around? I looked at #6508 but it doesn't say why queue_size_in_bytes is not useful to the user. I'd love to understand the rationale a bit more.
Can we throw the word percent in the name of the metric somewhere, like queue_size_on_disk_percent or something like that? It'd remove any ambiguity about the units. Then again, not a huge deal if we don't want to do this.

ycombinator on 19 Jan 2018

👍1

@ycombinator

I just checked and the UI is currently not even using queue_size_in_bytes so replacing it will certainly not break anything. If/when queue_size_on_disk is implemented, the UI can start using it. So 👍 from the UI POV.

IMO we shouldn't remove queue_size_in_bytes as it represents the current on-disk byte size of the queue.

The Suyog's suggestion for queue_size_on_disk was to add a % value of queue_size_in_bytes over max_queue_size_in_bytes and I think this is a valuable metric - but we should choose another name like queue_percent_full of something.

queue_size_in_bytes refers to the absolute number of bytes of event data in the queue, right?

Yes.

And max_queue_size_bytes refers to the capacity of the queue?

The maximum queue capacity, if set in the config (a queue can be unbounded). If/when queue_size_in_bytes reaches max_queue_size_bytes the pipeline starts to back pressure.

But I'm less sure if a user would ever care about the absolute queue size (what is currently queue_size_in_bytes).

Good point - for me it boils down to also providing the "real" underlying numbers so that users of our API have a choice with their monitoring use-case. As much as a % is easy to reason about, I can totally see some users wanting to see the "real" queue size in bytes too. Not sure why we should not provide it.

Can we throw the word percent in the name of the metric somewhere

+1 ^^ queue_percent_full ?

colinsurprenant on 19 Jan 2018

The maximum queue capacity, if set in the config (a queue can be unbounded). If/when queue_size_in_bytes reaches max_queue_size_bytes the pipeline starts to back pressure.

So what will queue_percent_full report if the user is using an unbounded queue? Always 0?

Good point - for me it boils down to also providing the "real" underlying numbers so that users of our API have a choice with their monitoring use-case. As much as a % is easy to reason about, I can totally see some users wanting to see the "real" queue size in bytes too. Not sure why we should not provide it.

So then we'll provide queue_percent_full (for the relative / % value) and queue_size_in_bytes (for the absolute / "real" value)?

queue_percent_full ?

Sounds great. Thanks!

ycombinator on 19 Jan 2018

So what will queue_percent_full report if the user is using an unbounded queue? Always 0?

Good question. I assume queue_percent_full does not make sense in that situation. I could be set to 0.

So then we'll provide queue_percent_full (for the relative / % value) and queue_size_in_bytes (for the absolute / "real" value)?

+1 I think that make sense.

colinsurprenant on 19 Jan 2018

@colinsurprenant thanks for working on this.

(Introduce a page level API _node/stats/queue) : since queue(s) are per pipeline(s) I believe we need to keep that under the pipelines {} namespace.

Agreed. This should be pipeline level so we can maintain that granularity. If we need something aggregated across the entire node, we can aggregate at ES/KB layer prior to rendering in UI.

For (3) and (4), I feel just exposing the current queue size (queue_size_in_bytes) and max queue size (max_queue_size_in_bytes) should be sufficient for the API. The percentage full metric should just be calculated at the ES/KB layer, similar to what we do for other metrics in stack monitoring. @ycombinator would love you thoughts on this as well.

acchen97 on 19 Jan 2018

For (3) and (4), I feel just exposing the current queue size (queue_size_in_bytes) and max queue size (max_queue_size_in_bytes) should be sufficient for the API. The percentage full metric should just be calculated...

Ha, true enough! I'm good with the UI calculating the % from these two metrics.

ycombinator on 20 Jan 2018

👍2

Just chiming in to say this all makes sense so far to me.

andrewvc on 22 Jan 2018

👍1

Ok so far I believe we are agreeing on:

keeping the current queue.capacity API as-is:

        "capacity" : {
          "queue_size_in_bytes" : 255,
          "page_capacity_in_bytes" : 67108864,
          "max_queue_size_in_bytes" : 1073741824,
          "max_unread_events" : 0
        },

introducing a UI-side only queue_percent_full which will be calculated from queue_size_in_bytes and max_queue_size_in_bytes.

Unless I am missing something, I think the only work to do in that respect is on the UI side? If this is the case then we can close this issue and create a new UI issue that we can link in #8936

colinsurprenant on 25 Jan 2018

👍1

@colinsurprenant I'm good with your last comment. One thing, though, I'm not seeing any of those existing fields in the x-pack monitoring indices (I just double-checked by creating a pipeline with a persistent type queue and deleting my monitoring indices for a clean slate). So I think that work will need to be done as well before the UI work can be done?

ycombinator on 25 Jan 2018

@ycombinator good question - I am not too familiar with the x-pack stuff actually.

colinsurprenant on 25 Jan 2018

@ycombinator if the field now exists in the monitoring API, then we might have to update the metrics payload that's sent via monitoring in LS x-pack. I'm not sure whether additional logic is required on the ES receiving end through the monitoring endpoint. Perhaps we can first create an issue for adding another PQ graph in the monitoring UI. A new graph in the advanced tab with two lines for current and max usage could make sense, similar to the JVM Heap graph we have today. Curious on your thoughts as well.

@andrewvc thoughts on investigating this on the LS x-pack side?

acchen97 on 25 Jan 2018

Perhaps we can first create an issue for adding another PQ graph in the monitoring UI. A new graph in the advanced tab with two lines for current and max usage could make sense, similar to the JVM Heap graph we have today.

@acchen97 Just want to double check my understanding first:

Said chart would show PQ usage per node
We'd calculate this by summing up PQ usage (queue_size_in_bytes and max_queue_size_in_bytes) for all pipelines that report it, which will presumably be only those pipelines that use persistent queues.

Correct?

ycombinator on 25 Jan 2018

@ycombinator I am currently looking at adding these fields into x-pack monitoring.

colinsurprenant on 31 Jan 2018

Sweeeet, thanks @colinsurprenant!

ycombinator on 31 Jan 2018

👍1

After analyzing the current x-pack monitoring stuff and discussing with @ycombinator it seems we are basically only missing exposing the queue_size_in_bytes and max_ queue_size_in_bytes in the x-pack monitoring payload, per pipeline. I will create a PR for this shortly.

this is the current logstash_stats.pipelines field nested document:

{
  "id": "main",
...
  "queue": {
    "type": "memory",
    "events_count": 0
  },
...
}

I will simply also add queue_size_in_bytes and max_ queue_size_in_bytes in the queue object.

colinsurprenant on 31 Jan 2018

I added the said fields in x-pack monitoring, waiting for review. Once this is merged I believe we'll be able to finally close this.

colinsurprenant on 1 Feb 2018

changes in x-pack are merged and will be part of 6.3. Closing this.
@ycombinator do we have a specific issue for the UI work related to these changes?

colinsurprenant on 2 Feb 2018

👍1

Was this page helpful?

0 / 5 - 0 ratings