Loki: Per-tenant max websocket connections.

Created on 20 Jan 2020  ·  12 Comments  ·  Source: grafana/loki

Is your feature request related to a problem? Please describe.

When running Loki in multi-tenancy mode, we currently have no way to limit the number of websocket connections opened by a tenant. We should be able to apply such a limit, because websocket connections are a resource that a single tenant can exhaust.

Describe the solution you'd like

I would like a way to set a global per-tenant limit (not per ingester or per querier) on the maximum number of open websocket connections. The user should receive a 4xx response when they try to exceed the limit. This is also a good way to let users know that they might be leaking websockets.

We recently added a global push limit, which seems to use the ring code; I think we should be able to reuse that piece here.

/cc @pracucci @slim-bean @sandlis

Describe alternatives you've considered

The alternative is per-ingester/per-querier limits: simpler, but less interesting.

component/loki keepalive kind/feature

All 12 comments

I would like a way to set a global limit (not per ingester nor querier) per tenant on the maximum connection of websocket opened.

Thinking back to when I looked at live tailing a few months ago: if I remember correctly, a tenant opens a websocket to a querier, and the querier then live tails from the ingesters (via gRPC). If that's correct, where should the limit apply? Should we limit the number of websocket requests hitting the queriers?

Yes, we should limit at the queriers.

The ingestion rate global limit (which you mention in the PR description) uses the ring only to count the number of healthy distributors, and then configures a local per-distributor rate limiter based on that count (each distributor's local limit = global limit / current number of healthy distributors).

This strategy works given two assumptions:

  1. Requests are evenly distributed across the pool of replicas (i.e. distributors)
  2. The global limit is orders of magnitude larger than the pool size (e.g. 1K QPS over 10 distributors)
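The per-replica computation described above can be sketched roughly as follows. This is a minimal illustration, not Loki's actual code: `perReplicaLimit` is a hypothetical helper, and it uses integers because a connection limit is a whole number (a rate limiter would more likely use a fractional rate).

```go
package main

import "fmt"

// perReplicaLimit is a hypothetical helper: it splits a tenant's global
// limit evenly across the replicas currently considered healthy.
func perReplicaLimit(globalLimit, healthyReplicas int) int {
	if healthyReplicas <= 0 {
		// No healthy replica is known; fall back to the global limit
		// (an assumption made for this sketch).
		return globalLimit
	}
	// Integer division: the remainder is dropped.
	return globalLimit / healthyReplicas
}

func main() {
	// Assumption 2 holds: the limit is much larger than the pool size.
	fmt.Println(perReplicaLimit(1000, 10)) // 100

	// Assumption 2 is violated: the limit is smaller than the pool size,
	// so every replica ends up with a local limit of zero.
	fmt.Println(perReplicaLimit(5, 10)) // 0
}
```

The second call shows why the second assumption matters: once the global limit is close to or below the number of replicas, an even split leaves each replica with little or nothing to enforce.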

That being said, what's a reasonable websocket limit to set per tenant? 10? 100? 1K? 10K?

Between 5 and 30, depending on the tenant.

Got it. Then the strategy used for the ingestion rate global limit doesn't work. We need to start brainstorming how to do it here.

@mattmendick I think we need to assign this to someone. Maybe @sandlis (not sure if you have some bandwidth on your side)? Let's maybe get a design doc and link it here.

I can have a look at it. Ingesters are the only components that have all the live tailing requests in one place. The simplest solution I can think of is for the ingester to expose an RPC call that returns the number of live tailing connections it holds for a tenant; queriers would then use it to decide whether the limit is being crossed. I know it's not a sophisticated solution, but given its simplicity I think we can consider it. I will also see whether I can think of a more sophisticated solution.
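A rough sketch of that idea, under stated assumptions: `TailCounter`, `TailersCount`, `fakeIngester`, `checkTailLimit` and `ErrTooManyTails` are all hypothetical names, and the in-memory `fakeIngester` stands in for a real gRPC client. The sketch also assumes that each live tail opens one stream to every ingester, so it takes the largest per-ingester count rather than the sum.

```go
package main

import (
	"errors"
	"fmt"
)

// TailCounter is a hypothetical stand-in for the proposed ingester RPC.
type TailCounter interface {
	TailersCount(tenantID string) (int, error)
}

// fakeIngester keeps per-tenant tail counts in memory for illustration.
type fakeIngester struct{ tails map[string]int }

func (f fakeIngester) TailersCount(tenantID string) (int, error) {
	return f.tails[tenantID], nil
}

// ErrTooManyTails is returned when a tenant is already at its limit.
var ErrTooManyTails = errors.New("max concurrent tail requests limit exceeded")

// checkTailLimit asks every ingester how many tails it holds for the tenant
// and rejects the new request if the largest count has reached the limit.
func checkTailLimit(ingesters []TailCounter, tenantID string, limit int) error {
	active := 0
	for _, ing := range ingesters {
		n, err := ing.TailersCount(tenantID)
		if err != nil {
			return err
		}
		if n > active {
			active = n
		}
	}
	if active >= limit {
		return fmt.Errorf("%w: %d active, limit %d", ErrTooManyTails, active, limit)
	}
	return nil
}

func main() {
	ingesters := []TailCounter{
		fakeIngester{tails: map[string]int{"tenant-a": 5}},
		fakeIngester{tails: map[string]int{"tenant-a": 4}},
	}
	fmt.Println(checkTailLimit(ingesters, "tenant-a", 5)) // rejected
	fmt.Println(checkTailLimit(ingesters, "tenant-a", 6)) // <nil>
}
```

In a querier, a check like this would run before the websocket is upgraded, and a rejection would be mapped to a 4xx response.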

If one live tail requires multiple connections to ingesters, we should count that as one. Your solution could work, yes, but be aware there will be a race.

Yes, but I think we could live with that, assuming that in the worst case the race would allow at most 1 or 2 extra websocket connections over the configured limit.
What do you think? Am I missing something else?
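To make the race concrete, here is a small deterministic simulation of the check-then-act pattern. It is only a sketch: `admit` and `overshoot` are hypothetical helpers, and the model assumes that every racing querier reads the same count before any of them registers its new tail.

```go
package main

import "fmt"

// admit models a querier's decision based on the count it observed.
func admit(observed, limit int) bool {
	return observed < limit
}

// overshoot returns how many connections end up above the limit when
// `racing` queriers all observe the same stale count and then act on it.
func overshoot(active, limit, racing int) int {
	observed := active // every querier reads the count before anyone registers
	for i := 0; i < racing; i++ {
		if admit(observed, limit) {
			active++ // the tail is registered after the check
		}
	}
	if active > limit {
		return active - limit
	}
	return 0
}

func main() {
	fmt.Println(overshoot(4, 5, 2)) // 1: both are admitted, one slot was free
	fmt.Println(overshoot(4, 5, 3)) // 2
	fmt.Println(overshoot(3, 5, 2)) // 0: two slots were free
}
```

In this model the excess is bounded by the number of requests racing at the same instant minus the free slots, which is small when a tenant opens only a handful of tails.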

Yeah if possible we should instrument this.

@sandlis Will you work on a design doc first to make sure we're on the same page or is this straightforward enough to reference the design in your comment above?

It is straightforward enough and should not take much time to build. I will work on it and open a PR soon. Any other approach would be complex: queriers can't talk to each other, so some other component would have to keep a count of open websocket connections, which is added overhead.
As Marco rightly pointed out, we also can't go with the global average approach, given how small the limit is.
