PostgREST should have a way to perform a SELECT DISTINCT query.
I suggest a new reserved query word 'distinct'.
select distinct e.g. /mytable?distinctSELECT DISTINCT ON(....) e.g. /mytable?distinct=name would be SELECT DISTINCT ON(name) * FROM mytableNote that the columns that you are distinct on will also need to have order applied.
This has certainly been a frequently requested feature. My only hesitation is making "distinct" another reserved word, meaning it would block any filters on a column that happened to be called "distinct." However, like "select," this one is a sql reserved word too so it makes sense that we reserve it as a special url parameter.
I hesitate to try messing with the url parsing and query generation part of the code myself, that's more @steve-chavez's domain. Perhaps he can estimate the difficulty of this feature.
Not sure if this should be implemented, by allowing distinct to be applied to any column unrestricted clients could potentially ddos a database.
I've bumped into a slow distinct query in postgresql a while ago and solved it by using a group by instead of distinct, remember distinct generating a more expensive seq scan, I don't have the details anymore but a quick googling suggest the problem could still persist:
https://dba.stackexchange.com/questions/93158/how-to-speed-up-select-distinct
https://dba.stackexchange.com/questions/63268/postgresql-being-slow-on-count-distinct-for-dates
As mentioned in one of the answers, you can mostly avoid doing distinct with better database-design or better queries.
So in general I think distinct is a expensive operation(similar to group by) and it should be better exposed behind a view or a stored procedure.
@steve-chavez this is a good point against this. PostgrREST should expose only safe/fast operations
For what it's worth, DISTINCT is just syntactic sugar for GROUP BY and I think in modern versions of postgres it's even implemented using the same code. (And DISTINCT ON is a postgres extension from way back that's a bit of a performance hack). So @steve-chavez's conclusion is correct.
As mentioned in one of the answers, you can mostly avoid doing distinct with better database-design or better queries.
I have many situations where I've wished for postgrest to have a distinct operator.
e.g. a table that contains events (timestamp, device, event_type, data), I want the latest event for each device subject to filters such as timestamp/event_type. So how would you allow a query to be written that accomplishes this? with my proposal it would be http://myapp.com/api/events?timestamp=lt.2017-08-08&event_type=eq.alarm&order=timestamp.desc&distinct=device
The query would have to be behind a stored procedure/view, it's definitely necessary to use distinct for some cases, but as I've mentioned before exposing distinct unrestricted poses a threat for the database, a similar argument has been made for group by support in https://github.com/begriffs/postgrest/issues/158#issuecomment-223765083.
The query would have to be behind a stored procedure/view,
When it's behind a view/stored procedure then you can no longer support the arbitrary filters that I love about postgrest.
it's definitely necessary to use distinct for some cases,
In that case I think it's a requirement that we include the functionality.
but as I've mentioned before exposing distinct unrestricted poses a threat for the database
If you are afraid of performance issues, then perhaps we can only allow distinct when the jwt claim (or other setting?) permits it?
However I think there are already performance questions around postgrest: for a large table, the default route with no filters will be a huge response that may slow the database down to build.
You can create a stored procedure that returns setof events and apply filters like in a table see the tests, the same for a view you can create it as in: create view distinct_devices as select distinct device, timestamp.. and do regular filters on that.
You can create a stored procedure that returns setof events and apply filters like in a table see the tests, the same for a view you can create it as in: create view distinct_devices as select distinct device, timestamp.. and do regular filters on that.
But the filtering needs to happen before the distinct operation
There will always be a fight between the two opposing needs expose more sql features vs protecting the database, striking the balance is hard. in this particular thread i am with @steve-chavez for now.
No if features like this http://www.databasesoup.com/2015/02/why-you-might-need-statementcostlimit.html make it into the core, we might relax a bit and expose more of the sql, in that situation i would go with implementing the groupby (while at the same time, having the default setting for cost limit to a low value).
should we close this for now? i don't see it going further in the immediate future.
This is probably my most wanted feature; and it certainly gets requested a lot in the chat room.
Perhaps we could have a toggle in the config file allow_distinct for the performance concerned?
@ruslantalpa I also think that the idea of the cost limit is a good one, though is not that reliable a max limit could be worked out, that also is a must for other features like multiple requests in a transaction, I made a related comment on "Customize query timeouts" #249, maybe we should continue there with the discussion and see if we can implement that feature independently.
@daurnimator That toggle could be easily abused by the users, I don't think it should be added to the config, this phrase summarizes my reasoning:
A well-designed system makes it easy to do the right things and annoying (but not impossible) to do the wrong things.
Users should be thinking twice before exposing expensive operations as distinct, as it is now, PostgREST should not make it easy for users to make those kind of mistakes.
@daurnimator Have you had a look at pREST? It exposes more SQL directly, at the cost of potential performance issues, so may satisfy your DISTINCT needs via its GROUP BY support.
Just chiming in to request this enhancement be implemented. Distinct and Group-by will make quite a few views unnecessary in our case.
If one's really careful and uses the format function, this can be done with a proc and dynamic sql.
Example for DISTINCT:
create function projects(dist name) returns setof projects as $_$
begin
return query execute format($$
select distinct on(%I) * from projects
$$, dist);
end; $_$ language plpgsql;
Then you can use all of pgrst filters normally:
-- filtering
curl "localhost:3000/rpc/projects?id=eq.1&dist=name&"
-- selecting
curl "localhost:3000/rpc/projects?select=id,name&dist=name"
Still, exposing DISTINCT to clients should be thought twice.
I've thought about this and came up with a solution.
The idea is this. group by and function calling in select operate on the "virtual table" returned by the where filters. In context of public apis, you never want to return more then a couple of thousand rows, so considering this, alowing group by on a few thousand rows poses no risk. The problem then becomes how do you make views that do not allow returning a lot of rows at the same time. And it's done like this
first you have a function like this one
create or replace function validate(
valid boolean,
err text,
details text default '',
hint text default '',
errcode text default 'P0001'
) returns boolean as $$
begin
if valid then
return true;
else
RAISE EXCEPTION '%', err USING
DETAIL = details,
HINT = hint,
ERRCODE = errcode;
end if;
end
$$ stable language plpgsql;
and you define the view like so
create view test.protected_books as
select id, title, publication_year, author_id
from private.books
where
validate(
(current_setting('request.path', true) != '/protected_books') or -- do not check for filters when the view is embedded
( -- check at least one filter is set
-- this branch is activated only when request.path = /protected_books
(current_setting('request.get.id', true) is not null) or
(current_setting('request.get.publication_year', true) is not null) or
(current_setting('request.get.author_id', true) is not null)
),
'Filter parameters not provided',
'Please provide at least one of id, publication_year, author_id filters'
)
;
This is the way :grin:. The first parameter of validate can be any boolean expression and you can make arbitrary complex check on the incoming requests, for example on a time series table one could check that both start/end dates are provided and the diff is no more then a month. For this to work, the request.get needs to be made available in postgrest :), i've only implemented it my commercial version along with support for group by, function calling in select like max/sum ... and window functions, though not released yet, only on my computer and it seems to work great.
So the general idea would be to protect an endpoint against unwanted, unlimited requests, by throwing an error for those.
I guess you could achieve that in a function called through pre-request (http://postgrest.org/en/latest/configuration.html#pre-request). If you throw an error there, that should be good. You would of course need the query data there as well - but maybe those could be passed in as an argument (e.g. of type json) to the pre-request function. I think this approach would be a bit more flexible regarding the "checking of query string filters", because you can look for filters where you don't know the name beforehand as well. I imagine this is not as easy with the request.get.... approach.
the request.get needs to be made available in postgrest
If that one refers to the url query string, then it would be better as request.query_string.<key> or maybe request.qs.<key>. I can see support in postgrest for that, though I'm not sure if the value(eq.some, lt.123) would be useful for other means than to check if the value is set. Unless we somehow change the pgrst prefix to the pg operator lt.123 to < 123(not worth it probably).
support for group by, function calling in select like max/sum
Support for aggregates looks cool. Personally I would prefer the pre-request approach that Wolfgang mentioned, that way the VIEWs are not tangled with pgrst logic. Looks like that could be done by conditionally checking the path in the pre-request and then checking if the filters are applied.
Come to think of it, the required filter approach looks like an extended version of pg-safeupdate.
pre-request was a design mistake imo. There's rarely a valid use-case for it that can't be implemented better in the proxy.
Having the conditions checked in the view definition, with static boolean logic (and not in an unrelated imperative function) is exactly the point, a self contained "smart view".
Query string as opposed to separate parameters can work (the query can be split up by a db function) but i don't know which way is better/easier to use, haven't thought about it
about the values, you are talking as if the programing env in pg is some crippled thing, it's trivial to write a function to split the lt.123 value and let the consuming function decide what to do with that (since it knows the context)
pre-requestwas a design mistake imo. There's rarely a valid use-case for it that can't be implemented better in the proxy.
How about checking a custom header for validity based on data in the DB? That's exactly what I am doing for a subdomain-based multi-tenancy setup. Proxy sets the subdomain as a header and redirects the request. If that client/subdomain doesn't exist, I need to throw. I wouldn't know how to do that without pre-request and only proxy-based. So definitely not a design mistake, but very much needed.
Having the conditions checked in the view definition, with static boolean logic (and not in an unrelated imperative function) is exactly the point, a self contained "smart view".
That's for sure a positive, I agree. Although you can turn that around as well: If you need multiple endpoints to be limited in a very similar way, you can reduce code duplication a lot by putting that in a pre-request function instead of the VIEW.
But anyway: When setting the query string parameters as variables, both approaches can be taken.
Query string as opposed to separate parameters can work (the query can be split up by a db function) but i don't know which way is better/easier to use, haven't thought about it
I think it was just a different name instead of the .get. part, that Steve was suggesting, but the idea was still to split the query string up into parameters already.
about the values, you are talking as if the programing env in pg is some crippled thing, it's trivial to write a function to split the
lt.123value and let the consuming function decide what to do with that (since it knows the context)
It will not be as trivial, though, once you start adding nested logic, like or=(a.eq.A,b.eq.B,...) and more complex stuff.
I thought about adding the already parsed filters for a second, but I doubt that will be any less complex.. - because this would mean we had to pass the whole logic tree in... I'm not sure whether that would be worth it, tbh, - but if we wanted to have the logic tree available, passing this as a JSON argument to pre-request seems much more likely to solve that..
Given the complexity of implementing something that really works well not only for a couple of cases, but also in general, I wonder, whether we should maybe take a completely different route to solve the original problem:
What do you guys think about adding some kind of limit on the expected cost of the query? So if we were to do something roughly along the lines of:
BEGIN;
SET request.x=... -- all kinds of variables, like we currently do (no need for .get. / .qs.)
EXPLAIN read_query;
SET pgrst.estimated_cost=<from the explain>;
SELECT pre_request_function(); -- can use the estimated cost here to throw if you'd like
read_query; -- or use the estimated cost in the query implicitly: VIEW definition throwing like Ruslan suggested.
We could even provide a sane default for pgrst.max_estimated_cost and throw immediately if we wanted to. Of course with a nice error message explaining how to increase/disable the allowed cost if needed. ;)
@wolfgangwalther Yeah, I believe that approach would also be worth exploring. Check this related issue: https://github.com/PostgREST/postgrest/issues/249#issuecomment-334629208 (some ideas around the pg_plan_filter module).
@wolfgangwalther Yeah, I would also prefer that approach. Check this related issue: #249 (comment) (some ideas around the
pg_plan_filtermodule).
Oh, great - thanks for the link. Having different limits imposed for different roles is a great idea. Doing something like that in a pre-request function with the suggested pgrst.estimated_cost would also be easily possible, I guess. I think we're better off implementing this ourselves (by just providing that variable) then using pg_plan_filter. The problems with the per-user settings you're describing in the other thread are one thing - but also it's a lot less flexible, because we would now expect users to install that extension. If installing this extension would not be expected, we would be back at where we are right now: We can't make sure to execute "safe" querys only per default- so we would, again, be stopped from adding anything like DISTINCT / GROUP BY to the query syntax.
I agree about the extension shortcoming. Some other ideas:
EXPLAIN read_query;
SET pgrst.estimated_cost=<from the explain>;
AFAICT, the EXPLAIN result cannot be used as an expression in SQL. Maybe it's possible to execute an anonymous plpgsql DO BLOCK, obtain the result inside and then set it to our GUC?
Ideally we would enforce that the cost doesn't exceed a threshold. This could be an additional config option. If this config is enabled then we can allow group by/distinct.
I wouldn't know how to do that without pre-request
i just showed how, a few comments above with that validate function, it takes a bool as a parameter ... which can be anything, any function call or even a subquery. You just reached for the familiar "check it in a function" and didn't consider that the logic can be embeded in a view definition
you can reduce code duplication a lot by putting that
again, same thing, put that logic in a function and call the same function in the definition of each view so the "condition", even though in a function, is colocated in the view
we had to pass the whole logic tree in... I'm not sure whether that would be worth it
the parameters are in an array in apiRequest var which is available in the middleware ...
It will not be as trivial, though, once you start adding nested logic, like or=(a.eq.A,b.eq.B,...) and more complex stuff.
While that is true, the thing to remember here is we are trying to make sure a view is not called without a "main" filter column, the one that filters out the most of the rows and that column/condition will not be behind a or usually, i.e. you don't need to parse the or get parameter. Just because a solution does not solve every possible user case it's not worth implementing (especially since it solves most of the use cases)
query cost
the main point is to not allow long running queries, you are complicating things with explain, most of the time you just want to "kill" the request after a certain threshold, sort of like max_execution_time in php, that is how it's implemented in every stack, you don't need to estimate (since estimates can be wrong and you can not estimate the /rpc path). This is already possible by setting it per user or even providing it as a parameter in the connection string, i.e. every request that goes through postgrest can be set to terminate after say 1s (and it returns a custom error) and the proxy can check that error and ban an ip if it detects more then say 3 in a row ...
Given the complexity of implementing something that really works well not only for a couple of cases
It's not complex, it's 1 line. It does not need to work in all the cases it needs to work in most of the cases (and it does, only or parameter is a bit harder to parse, not trivial but not rocket science, and not postgrest needs to parse it but user code in the db )
The big picture is being missed here, the "get" parameters need to be in the context, not to solve the "protect view" problem, they just need to be there so that it enables the database code to make decisions based on them, it's the same reasoning as to why headers, path, method are in the transaction context.
AFAICT, the EXPLAIN result cannot be used as an expression in SQL. Maybe it's possible to execute an anonymous plpgsql DO BLOCK, obtain the result inside and then set it to our GUC?
I think we have two choices:
parse the explain "offline" in haskell and then run the SET from there. Needs another round-trip to the server.
put that thing in a DO BLOCK and do it at once. You can't return results from the DO BLOCK, so you might have to run the EXPLAIN twice (because we are already running it for the estimated count, aren't we?).
Once we have a pg_postgrest extension to be able to define FUNCTIONs, we could put the EXPLAIN in a function that returns the result as JSON and sets the estimation variable as a side-effect.
I wouldn't know how to do that without pre-request
i just showed how, a few comments above with that validate function, it takes a bool as a parameter ... which can be anything, any function call or even a subquery. You just reached for the familiar "check it in a function" and didn't consider that the logic can be embeded in a view definition
You cited me out of context. You said it could be better done in the proxy. I said, I have no idea how to do that in the proxy alone.
The "check-function-in-view-definition" solution, obviously does not work for tables that are exposed directly. Following your suggestion I would need to create a VIEW for every table in my schema. Now, because VIEWs don't support row level security and are still only available as "security definer" in postgres, I would need to replicate all the row level securities in the VIEW definition as well.
It's just a couple of lines in pre-request.
query cost
the main point is to not allow long running queries, you are complicating things with explain, most of the time you just want to "kill" the request after a certain threshold, sort of like
max_execution_timein php, that is how it's implemented in every stack, you don't need to estimate (since estimates can be wrong and you can not estimate the /rpc path). This is already possible by setting it per user or even providing it as a parameter in the connection string, i.e. every request that goes through postgrest can be set to terminate after say 1s (and it returns a custom error) and the proxy can check that error and ban an ip if it detects more then say 3 in a row ...
You're right about the /rpc - estimating won't work there. But neither will "filtering the main columns", because you only filter on the result of the function call and that long running query has already been executed. If you do the RPC way, you will most likely pass in those values to restrict the resultset by function arguments, so you already have everything in place right now.
Estimates can be wrong, yes. But your approach to allow queries with certain filters set, while dis-allowing queries without those filters, is nothing else than a very crude estimation of the expected resultset. There is no difference conceptually, except that the postgres estimates are way more sophisticated and a lot more dynamic than the static check of filters. That seems like a big plus to me.
It makes a lot of sense to have a time limit for requests in place as well as another safe-guard - but if I understood correctly, then that is already possible with postgres tools, yes? I remember reading in the other thread (it might have actually been you writing this), that this would just kill the connection to the server, but the query would still be running on the server to completion? Maybe I missed something here or confused different concepts.
[...] the "get" parameters need to be in the context, not to solve the "protect view" problem, they just need to be there so that it enables the database code to make decisions based on them, it's the same reasoning as to why headers, path, method are in the transaction context.
I sort of agree with that in general. Still, I think, it makes sense to try to provide this metadata about the request in way, that makes it easy to use. And parsing the whole query string is not "easy to use". We are not just providing the plain HTTP request as well to let the user parse it, but we are splitting it up in headers, path, method, etc. already. So that's basically the same thing.
You cited me out of context. You said it could be better done in the proxy.
Yes that's true, I was just showing a easy solution for your specific problem that can be implemented directly in the db and does not require an additional roundtrip like the pre_request. It can also be done in a robust way in the proxy. On proxy startup you read all the subdomains an a map (this is lua ... you can connect to the db and run queries) and then for every request you check the map in memory without going to the db (+ some logic to update the map when a subdomain is added/removed). This way, not only you eliminate the pre_request roundtrip that is executed for EVERY http request, you also eliminate the additional query being executed against the subdomains table (though this complexity in the proxy is better to avoid in the start while the "condition in view" is fast enough).
I would need to create a VIEW for every table in my schema
That is the right way to use postgrest, api schema should be just views and functions, to have the separation between the model and the api and to have the freedom and power to implement custom logic in the db that is related to the api (but not the data model) , as in smart views, instead of triggers, computed columns ...
Now, because VIEWs don't support row level security
That is not true, RLS works when going through views, it's just the current_user is changing and it's always been that way
just kill the connection to the server, but the query would still be running
what you read is if the http client drops the connection then this happens, nothing to do with what i said above about query time limit.
but we are splitting it up in headers, path, method,
isn't it what I said originally? have every get parameter in a separate var?
But neither will "filtering the main columns", because you only filter on the result of the function call
You are conflating two things here, in that context i was debating query estimate vs query time limit approach, not related to "smart view that require filters", as you said, rpc already do the main filtering based on their arguments.
To reiterate, there are 2 distinct capabilities being talked about here, that both enable some safety
a) provide get parameters (which in turn enable views to become smart and protect themselves against select with no filters)
b) max_execution (works now) vs max_cost (hard to implement, requires a roundtrip and does not work on rpc) approach - which is a safeguard against expensive queries. If the max_cost would be a PG setting just like max_execution then great but it's not, it's just a lot of complexity, 25% performance drop for all requests (because of yet another roundtrip, just like pre_request).
security/resilience is a team job (proxy/postgrest/db) and trying to do everything in postgrest is the wrong direction
[...] This way, not only you eliminate the pre_request roundtrip that is executed for EVERY http request [...]
I don't think pre_request is a round-trip - I doubt the result of that call is used at all, or is it? So I would expect the actual query to be sent right after. Maybe I just misunderstand what a round-trip is in this case. If we have to wait for the pre-request to finish, before we can send the query... isn't every SET command for a variable a round-trip as well? In that case setting all those request.qs.<key> variables would be a lof of overhead.
What am I missing?
what you read is if the http client drops the connection then this happens, nothing to do with what i said above about query time limit.
Ok, great. So a query time limit already works with postgres. Perfect.
but we are splitting it up in headers, path, method,
isn't it what I said originally? have every get parameter in a separate var?
Yes, and I am saying: Given the structure with nested filters, this might not be enough. This is just a half-baked way and forces the users to parse parts of the logic tree themselves. If they need to do that anyway, then we could just pass the unparsed query string. Not much of a difference. I still don't have a better suggestion, though ;)
You are conflating two things here, in that context i was debating
query estimatevsquery time limitapproach, not related to "smart view that require filters", as you said, rpc already do the main filtering based on their arguments.
Ok. We agree that the only generic thing we can do to prevent long running RPCs would be to limit execution time. But at the end of the day the user will be responsible to prevent this from happening inside the RPC.
To reiterate, there are 2 distinct capabilities being talked about here, that both enable some safety
a) provide get parameters (which in turn enable views to become smart and protect themselves against select with no filters)
b) max_execution (works now) vs max_cost (hard to implement, requires a roundtrip and does not work on rpc) approach - which is a safeguard against expensive queries. If the max_cost would be a PG setting just like max_execution then great but it's not, it's just a lot of complexity, 25% performance drop for all requests (because of yet another roundtrip, just like pre_request).
I don't think it needs to be either max_execution or max_cost. It could be both. So we have three capabilities for protection:
a) max_execution time. Works now, but we don't have any defaults, yet. I have not looked into this in detail, so I don't know whether we can actually set any defaults from PostgREST or whether that's something the user would have to do. The only thing that will work on RPCs as well.
b) provide get parameters to enable smart views. As I said, those "smart views" are essentially just another way of cost estimation. More static, but possibly more performant, if we can't make the explain estimation work without a round-trip. Even without using smart views, just providing the get parameters for the sake of completeness, might be worth it as well. Users will find other use cases for that.
c) max_cost. IF we can make it work without a performance penalty, this seems to be better than "smart views", because the estimation changes when the data (stats) changes. I would not be very keen on tracking the data in my tables to figure out whether my smart views still provide enough protection or whether I would need to limit it more. Main advantage, however: This will work for queries to tables as well, not just views. This is absolutely critical, if we want to support distinct/group by/aggregation, because we sure don't want to make a difference between tables and views to allow this query syntax, right?
I think the best way forward would be to:
find out whether we can set any defaults for query execution time on the database side by default (e.g. through config option) to have a fallback protection in place
make the query string available, so you can use it in smart views if you want to. or in any other use case. Would still need to figure out the best way to do this.
implement some form of max cost estimation. If we can only make it work without a round-trip by having a postgres FUNCTION defined, that returns the explain output and extracts the estimated cost, then we need to do that. This would be the first step towards that pg_postgrest extension that was mentioned elsewhere already. We can pass the "max cost limit config option" to this function as well, so if the estimated cost exceeds that, the error can be thrown in that function and we don't need another round-trip for checking that value.
If we can set this up, we should have enough protection to implement distinct/groupby/aggregation/window functions in the query syntax.
I think we have two choices:
parse the explain "offline" in haskell and then run the
SETfrom there. Needs another round-trip to the server.put that thing in a
DO BLOCKand do it at once. You can't return results from theDO BLOCK, so you might have to run theEXPLAINtwice (because we are already running it for the estimated count, aren't we?).Once we have a
pg_postgrestextension to be able to defineFUNCTIONs, we could put theEXPLAINin a function that returns the result as JSON and sets the estimation variable as a side-effect.
Actually, @steve-chavez, that DO BLOCK should work. We only need the estimated row count right now, correct? We could set this as another GUC and then just return that in our main query. We don't even need to return the whole EXPLAIN to haskell... could this, in the end, actually be a performance improvement, because we are sending less data back to haskell? Probably not much, and only if we need the count at all.
I don't think pre_request is a round-trip - I doubt the result of that call is used at all, or is it?
It is. It's not about using the result (though it's used, remember you are throwing an error in your own application), it's about sending the query and waiting for the reply from the db that it was executed with no errors. Only after that the main query follows.
The SET queries are sent all at once with queries being split using ; (though it would be nicer in a single SELECT statement since it's possible)
find out whether we can set any defaults for query execution time on the database side by default (e.g. through config option)
don't add another config option when it's just a matter of adding a parameter to the db-uri connection string
This is just a half-baked way and forces the users to parse parts of the logic tree themselves
You are saying that as if that is the situation in 100% of the cases (when using the get params for protecting views) when in reality it's less then 10% in practice, maybe even less then 1%. There is no logic tree parsing, the user just checks if request.get.column is set, where column is from the root table. He does not need to check request.get.childtable.column (if this is what you mean by parse tree) since the records from the childtable are already guaranteed to be filtered by the keys from the parent table. I doubt there are lots of apps where the ui absolutely needs the main filtering condition on the root table to start with a root or or and (which is the only thing hard to parse). Just as a test, how many requests do you have where you use or on the root table and that is the main filtering condition? I'll take a "half-baked" solution that solves 99% of the use-cases and let the users do the work for the remaining 1%.
i just looked at how the new explain feature works since it seems @wolfgangwalther is suggesting to somehow use that capability.
First, it a separate roundtrip to the db (but it's worth to avoid the expensive count query). Second, there is a bug.
The docs say
To help with these cases, PostgREST can get the exact count up until a threshold and get the planned count when that threshold is surpassed which seems to imply postgrest does NOT execute the exact count after a threashold, but in fact it does, it executes the exact count every time, then checks the tableTotal and then performs the explain to get the estimated count (though it already has the exact count)
checking again to see if i missed something
EDIT: oh i see, there is some logic to change the count query in case of estimated ... but does adding a limit to the count query actually make it faster by a lot?
Actually, @steve-chavez, that DO BLOCK should work
@wolfgangwalther I was thinking something like this.
BEGIN;
SELECT current_setting('pgrst.cost', true);
DO $$DECLARE cost json;
BEGIN
EXECUTE 'EXPLAIN (FORMAT JSON) select * from projects' INTO cost;
PERFORM set_config('pgrst.cost', cost->0->'Plan'->>'Plan Rows', true);
END$$;
SELECT current_setting('pgrst.cost', true);
COMMIT;
current_setting
-----------------
(1 row)
current_setting
-----------------
1200
(1 row)
So yes, looks like that works for getting the EXPLAIN output.
So yes, looks like that works for getting the EXPLAIN output.
And it's just one additional line, I guess, to add the estimated count as pgrst.count as well.
max_execution_time
I guess the equivalent here would be statement_timeout. Which we could set on a set local for each transaction(also mentioned the idea here https://github.com/PostgREST/postgrest/issues/249#issuecomment-641498592).
max_cost (hard to implement, requires a roundtrip and does not work on rpc)
@ruslantalpa Why would that not work for function calls? pg functions have the COST(also ROWS) settings.
The SET queries are sent all at once with queries being split using ; (though it would be nicer in a single SELECT statement since it's possible)
Not sure if it's more performant, but the settings part could also be done with SELECT set_config(..), set_config(..), set_config(..).
It can also be done in a robust way in the proxy.
On proxy startup you read all the subdomains an a map (this is lua ... you can connect to the db and run queries)
team job (proxy/postgrest/db) and trying to do everything in postgrest is the wrong direction
You obviously would like to do many things on proxy since you have/sell the subzero stack :) . However with that way of thinking we wouldn't have the request.path setting for example - since for a long time the answers for this were "do a proxy_set_header on nginx to get the missing http variables". Without that there would be no "smart view" approach.
So in general I'm favor of doing the most on the db and leave the proxy as a last resort.
Not to say I'm opposed to adding the query string settings. It makes sense to add them to have a more complete http context. We need to consider that POST/PATCH also have query string vars though. I would be fine in adding them raw as they are. If later we provide a parsed version they could go in another setting(like request.qs_parsed.<key>).
It is. It's not about using the result (though it's used, remember you are throwing an error in your own application), it's about sending the query and waiting for the reply from the db that it was executed with no errors. Only after that the main query follows.
If we really do it like this - why?
Shouldn't it be enough to just send all of the following in one batch?
SET xx...;
SET yy...;
SELECT pre_request();
SELECT * FROM mytable; -- full query
And if that errors anywhere, we still get the same error thrown?
Why would we need to wait for a single one of those statements to finish before sending the others?
max_cost (hard to implement, requires a roundtrip and does not work on rpc)
@ruslantalpa Why would that not work for function calls? pg functions have the COST(also ROWS) settings.
The fact that CREATE FUNCTION has a COST parameter is exactly for the reason, why it does not work for us: You can give an arbitrary COST setting to the function, because the planner (and EXPLAIN) can't look inside the function.
See this example:
ROLLBACK;
BEGIN;
CREATE OR REPLACE FUNCTION test() RETURNS void
LANGUAGE plpgsql COST 1000 AS $$
BEGIN
RETURN;
END$$;
CREATE OR REPLACE FUNCTION test2() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
PERFORM test();
RETURN;
END$$;
EXPLAIN (FORMAT JSON) SELECT test2();
No matter what you change the COST 1000 for test() to - the expected cost of SELECT test2() will always be the same.
So if we were to do EXPLAIN SELECT rpc_call(), we would always get similiar costs back - no matter how much the function would actually be doing. Also since the estimated row count doesn't make sense on RPCs (for the same reason), we should not run the EXPLAIN at all for RPC calls.
i just looked at how the new explain feature works since it seems @wolfgangwalther is suggesting to somehow use that capability.
First, it a separate roundtrip to the db (but it's worth to avoid the expensive count query).
I think we can avoid this extra round trip to the database, with the DO BLOCK suggested by Steve, if we set the estimated row count as pgrst.count and just change the "count query" to either SELECT COUNT(*) ... or SELECT current_setting('pgrst.count', true) (or a combination of both for the "exact up to the limit, else estimated" case.
Why would that not work for function calls? pg functions have the COST(also ROWS) settings.
I was thinking (did not check) because you can not estimate how much time would take to execute a function that has pg_sleep(random()) * 100 + 1
You obviously would like to do many things on proxy since you have/sell the subzero stack
So basically in the same thread where I explain and give concrete ideas of how to safely implement group by (for which yours and mine initial response was bad idea, and functionality which i already have in my pgrst) you basically accuse me of trying to somehow hold postgrest back ... nice going.... I am not selling the code in the proxy (only), i am also selling the hosting service and my custom more advanced postgrest version. If my motivation would have been as you try to imply, i would simply not give you any pointers here on how to add functionality that i already have.
If we really do it like this - why?
Because that's how the postgres wire protocol works (certainly the hasql lib does so).
I think we can avoid this extra round trip to the database, with the DO BLOCK suggested by Steve
That work as it is now, as in you set the estimated count along with the other env vars but this blocks you from implementing prepared statements in the future because if you want to do that you can no longer have statement1; statement2; ... in the same call to the db
If we really do it like this - why?
Because that's how the postgres wire protocol works (certainly the hasql lib does so).
Ok, I see. I am in no way familiar with the hasql lib. Just skimmed the docs about it and found this (as an explanation to the "sql" parameter for some function call):
Possibly a multi-statement query, which however cannot be parameterized or prepared, nor can any results of it be collected.
This would explain, why we don't send both the SET part and the actual query together, because we still want to collect the results. But even with this limitation, we should still be able to put all the SET, the pre_request() and the DO BLOCK stuff together in one multi-statement (we don't care about any results here other than errors, which will still be thrown for sure, no?) and then have the actual query in a single statement.
Do you think that would work?
This would explain, why we don't send both the
SETpart and the actual query together, because we still want to collect the results. But even with this limitation, we should still be able to put all theSET, thepre_request()and theDO BLOCKstuff together in one multi-statement (we don't care about any results here other than errors, which will still be thrown for sure, no?) and then have the actual query in a single statement.
We don't even need to use multi-statement capabilities here. We can just put the SET stuff and the pre_request inside that DO BLOCK as well. Now - in terms of hasql - we only have one statement, the DO BLOCK, to execute.
That work as it is now, as in you set the estimated count along with the other env vars but this blocks you from implementing prepared statements in the future because if you want to do that you can no longer have
statement1; statement2; ...in the same call to the db
I'm sorry I can't follow here, yet. I don't have a clear picture of how prepared statements would be implemented, but I don't see any glaring issues that would come along with that. Is there maybe another thread where some design considerations for using prepared statements have already been made?
I think the general "request parameter restrictions" idea is great. As discussed before it's not a perfect solution, but it would give the flexibility to solve various use cases.
One use case is to restrict api users from doing select=*. This is a good safeguard for tables/views that have big text columns and getting all of them on a single request impacts performance(I've seen postgrest consumes large memory here).
So a pre-request can be used for preventing this globally:
create or replace function global_param_restrictions() returns void as $$
declare
req_select text = nullif(current_setting('request.param.select', true), '');
begin
if req_select = '*' or req_select is null then
RAISE EXCEPTION 'select=* not allowed'
USING HINT = 'Add specific columns to the select parameter';
end if;
end $$ language plpgsql;
(creating views that separates big columns is another way, but it's nice to have this alternative)
Other use cases for this could be disabling certain operators(we got this somewhere on gitter chat), some parsing on pg would be needed for this(unless make it easy and expose them in other way). This also reminds me of a previous request we got for exposing the regex operator(2016 :o, time flies..)
Regarding pre-request perf, there might be a way to make it better by including it on the main query as shown on https://github.com/PostgREST/postgrest/issues/1670#issuecomment-737311266. Can be tackled later.
Or "smart views". Declarative, but the disadvantage is that it's unwieldy to use them for queries through regular SQL:
table protected_books ;
ERROR: P0001: Filter parameters not provided
DETAIL: Please provide at least one of id, publication_year, author_id filters
HINT:
CONTEXT: PL/pgSQL function validate(boolean,text,text,text,text) line 6 at RAISE
LOCATION: exec_stmt_raise, pl_exec.c:3803
That would fail even for superusers, it'd be nice to enable the restriction only for certain roles.
Parameter restrictions can also work for tables, with policies. Same idea as for views:
create policy pgrst_select_policy
on clients
for select
to postgrest_test
using (
validate(current_setting('request.param.id', true) is not null, 'id filter must be present')
);
Unlike views, these can be queried regularly through SQL by using roles with bypassrls or superuser.
Regarding this issue. Originally it was only about distinct, group by kinda snuck by and we followed along. In general, I feel that for analytical queries the user should reach for SQL and expose VIEWs(with aggregates/group by).
So, how about closing this one after:
request.param.<key> GUC(simple implementation here: https://github.com/steve-chavez/postgrest/commit/e780889a4ddb3bb9de1ee50c71737409e388c290)?distinct query parameterThoughts?
I'm sorry I can't follow here, yet. I don't have a clear picture of how prepared statements would be implemented, but I don't see any glaring issues that would come along with that.
@wolfgangwalther Regarding preparing multiple statements separated with ;, that's indeed not possible:
PQprepare
The function creates a prepared statement named stmtName from the query string, which must contain a single SQL command
(From: https://www.postgresql.org/docs/current/libpq-exec.html)
Also tried doing that with Hasql and it erred with a message like unable to prepare multiple statements.
Not sure if that clears a previous doubt you may had. But in any case prepared statements are already implemented.
So, how about closing this one after:
脌dding the
request.param.<key>GUC(simple implementation here: steve-chavez@e780889)Alowing the
?distinctquery parameter
I'm very much against that in this form. I don't think this is done enough, to justify allowing ?distinct.
No matter whether it's hidden behind a config flag or not: once ?distinct is possible, I have to protect all my endpoints. In a complex schema this might be nearly impossible. Sure, looking at one table/view only, the smart view / smart table approach might be possible. But there will quickly be tables where either you don't know (yet) how to do it, because you don't know how the data will be distributed, or this is going to be really complex.
This makes the feature essentially useless, if you care about the ddos problem.
I am not opposed to providing the request params as a GUC, so people can use them if they want. But we need a way to make sure those queries are safe without complex logic. PostgREST should be kind of easy to use in a safe manner.
I still think the cost limit via EXPLAIN in a DO block is the right way to do it. Here's why:
It will work on tables, views and RPCs out of the box. Yes, RPCs as well - but only if they are inlined into the main query. RPCs that return SETOF <table> and are not inlined are expected to be very slow, so probably easy to ddos without any aggregation anyway...
It does not need to harm performance for regular queries: We can just decide to run the cost estimation query only if any aggregation (distinct, group by, ...) is used.
And even if the DO block is used, it can be sent together with the SELECT set_config and the pre-request. Yes it will obviously take a little more time, but it is not another network round-trip.
Regarding this issue. Originally it was only about
distinct,group bykinda snuck by and we followed along. In general, I feel that for analytical queries the user should reach for SQL and expose VIEWs(with aggregates/group by).
If we had a proper safe-guard in place, I couldn't think of a reason (other than the effort needed to implement it of course) to limit it to distinct. For analytical queries, it could be very useful, because not every aggregation would have to be pre-made in a view.
I'm sorry I can't follow here, yet. I don't have a clear picture of how prepared statements would be implemented, but I don't see any glaring issues that would come along with that.
@wolfgangwalther Regarding preparing multiple statements separated with
;, that's indeed not possible:
[...]
Not sure if that clears a previous doubt you may had. But in any case prepared statements are already implemented.
Yes, I know. The reason why I wasn't able to follow the comment I wrote about was simple: it started from wrong assumptions. As you already stated, we are (and were) already using prepared statements, so we are in extended query mode already. Prepending our main query with other queries without another network round-trip is not possible anyway. So nothing we're losing here.
I still don't understand in detail how hasql works with those prepared statements, though.
Assuming the statement is not prepared yet, I expected something like the following is happening:
BEGIN ...;
SET LOCAL ...; -- set_config...
SELECT pre_request();
PREPARE ...;
EXECUTE ...;
Is that about right? Do you think it would be possible to run the pre_request / the DO block between PREPARE and EXECUTE?
This would be great, because then we could run the cost limit EXPLAIN on the prepared query as well!
And once we solved those problems, I would like to think about whether ?distinct=... is the best way to do this. This would be another argument name lost as keyword. But maybe we could also add some annotation/hint to a column in ?select=... to include it in the distinct clause.
So, how about [...]:
* 脌dding the `request.param.<key>` GUC(simple implementation here: [steve-chavez@e780889](https://github.com/steve-chavez/postgrest/commit/e780889a4ddb3bb9de1ee50c71737409e388c290)) * Alowing the `?distinct` query parameterThoughts?
I'd like to extend your suggestion here slightly, with the goal of getting something done, without making things too complicated for the start while keeping it extensible. Here we go:
Add the request.param.<key> GUC, no question
Add the ability to add DISTINCT to the query (without going into detail how to do that, yet)
Only allow DISTINCT, if it is explicitly enabled via config option. Allow to set up the config option globally or per route similar to those suggested in https://github.com/PostgREST/postgrest/issues/1661#issuecomment-743388070. This will avoid the "once enabled, I have to secure all my endpoints" problem.
This config option could be named something like db.cost-check.<route>. Key here is: The value is a string holding a FUNCTION name (similar to the other hooks suggested in that other thread). This function returns BOOLEAN and will be added like this to the main query:
WITH ...
SELECT ...
FROM (...) _postgrest_t
WHERE cost_check()
The query planner should be smart enough (I hope.. we have to test that!) to run this function first, before running any potentially very expensive query. This function can return either true and the query will succeed or false to skip this query. We should be able to catch the "no-rows returned at all" in haskell, right? I think it would be better to not require an error to be raised from that function, so that using SQL functions is also possible. Of course there can be a custom error thrown as well.
Now everyone can set up their own cost-checks. Those could be very similar to the suggested "smart views" or completely different. In an internal setting, somebody could just return a plain true to allow all those requests. Somebody could come up with a more sophisticated query as well - maybe one that runs an explain to estimate the cost? Just an idea here ;).
We could give examples for those kind of functions in the docs.
To allow true cost estimation via explain in those functions, this can be extended later: Depending on the cost_check function's signature, different information can be passed into the function. We start with no arguments and can read the request.param... GUCs. We could extend that to a signature roughly like this:
CREATE FUNCTION cost_check_explained (query TEXT, VARIADIC params TEXT[]) RETURNS BOOLEAN
This function could get the query as the first argument and the params as remaining arguments (to be able to still pass in them in in a parametrized way). We would still have to work out the details of all of this, but this concept would not need any DO blocks before the query etc.
One other benefit of that is: You can use the same checks that were proposed in the "smart view" approach - while keeping your views easily queryable from SQL.
@wolfgangwalther I like the direction we're going :+1:
I had a similar idea for passing the generated query through a request.generated_query GUC in https://github.com/PostgREST/postgrest/issues/1378#issuecomment-522392509, but now that we have prepared statements, your function idea would be better(a GUC wouldn' work because of the placeholders).
Actually... explaining prepared statements already works:
explain (format json) execute myps(1);
[
{
"Plan": {
"Node Type": "Index Only Scan",
"Parallel Aware": false,
"Scan Direction": "Forward",
"Index Name": "projects_pkey",
"Relation Name": "projects",
"Alias": "projects",
"Startup Cost": 0.15,
"Total Cost": 8.17,
"Plan Rows": 1,
"Plan Width": 4,
"Index Cond": "(id = 1)"
}
}
]
...
And it's also working on:
Perhaps we could avoid passing the generated query somehow...
Add the ability to add DISTINCT to the query (without going into detail how to do that, yet)
Only allow DISTINCT, if it is explicitly enabled via config option. Allow to set up the config option globally or per route similar to those suggested in #1661 (comment). This will avoid the "once enabled, I have to secure all my endpoints" problem.
No rush for distinct. The points you've raised against it are fair and square. We can see how to enable DISTINCT after we do the cost restriction, which is great because it'd be transparent for the user, unlike restricting the request parameters.
And it's also working on:
Wait a sec - what's happening here for real? Is this preparing an explain statement or is this explaining a prepared statement?
I think this is the former, so it's not the same thing you did here: explain (format json) execute myps(1);.
Perhaps we could avoid passing the generated query somehow...
You know.... we can.
I just learned about current_query() and also pg_prepared_statements. Not sure if the latter is of any use for us, though.
prepare myps as select current_query(), $1 as a, $2 as b;
execute myps('hello', 'world');
This returns the execute including arguments for me. I can't believe it, I will have to test this with an RPC in PostgREST :D.
So... we're not getting the arguments for real, it seems.
When I call this:
create or replace function show_query() returns text
language sql as $$
select current_query();
$$;
through PostgREST, it returns:
"\n WITH pgrst_source AS (WITH pgrst_ignored_body AS (SELECT $1::text) SELECT \"public\".\"show_query\"() AS pgrst_scalar)\n \n SELECT\n null::bigint AS total_result_set,\n pg_catalog.count(_postgrest_t) AS page_total,\n (json_agg(_postgrest_t.pgrst_scalar)->0)::character varying AS body,\n nullif(current_setting('response.headers', true), '') AS response_headers,\n nullif(current_setting('response.status', true), '') AS response_status\n FROM (SELECT \"pgrst_source\".* FROM \"pgrst_source\" ) _postgrest_t;"
(note the $1::text)
And this:
create or replace function explain_the_world() returns json
language plpgsql as $$
declare
explained json;
begin
execute 'explain (format json) ' || current_query() into explained;
return explained;
end$$;
returns:
{"hint":null,"details":null,"code":"42P02","message":"there is no parameter $1"}
So there is a difference. When I run EXECUTE ... directly from SQL, then this execute statement is returned by current_query(). However in hasql's case, the execute is not sent as an sql statement, but directly via the frontend/backend protocol - that's why we're still getting the prepared query without arguments.
So if we can call the cost_check function just like this:
WHERE cost_check($1, $2, $3, ...) -- with all the parameters used in the query
Then we could pass those on to the explain statement.
The pg_prepared_statements would help in that case, because:
create or replace function show_prepared() returns text
language sql as $$
select pg_prepared_statements from pg_prepared_statements where statement=current_query()
$$;
This returns the row from prepared statements for the current query. This means that we could either
db-prepared-statements=false just explain current_query() directlydb-prepared-statements=true get the name for the prep statement from pg_prepared_statements through current_query() and then call it via explain execute <name>(args...), if we pass those into the function