I measured that PostgREST 0.3.2.0 uses memory proportional to 20x the size of a JSON payload during bulk inserts. This doesn't seem reasonable.
Note to self: try using weigh in tests to measure the space used.
Is this still a thing?
I remember it being a thing in #360 but that thread says it was fixed in 3.0
I'm seeing similar problems with CSV uploads as well.
@maximsch2 would you be interested in helping us track down the problem? The issue that Ruslan linked to includes information about how to run the binary with profiling enabled. If you're feeling brave you could do a similar thing and include a graph of memory allocation.
I have a proposal for avoiding the json payload processing in INSERT and clearing this out, I'll use the articles table as a reference.
Currently when receiving a payload like [{"id": 11, "body": "uno"}, {"id": 12, "body": "dos"}] we process this in Haskell to check if the json is an array/object to adequately use pg json_populate_recordset/json_populate_record functions.
This part can be done in pure SQL:
WITH payload AS(
-- This json would be replaced by $1
SELECT '[{"id": 11, "body": "uno"}, {"id": 12, "body": "dos"}]'::json AS json_data
),
vals AS(
-- With this we make sure that the query works for json arrays or a single json object
SELECT json_data AS val FROM payload WHERE json_typeof(json_data) = 'array'
UNION ALL
SELECT json_build_array(json_data) AS val FROM payload WHERE json_typeof(json_data) = 'object'
)
INSERT INTO articles(id, body) SELECT id, body from json_populate_recordset(null::articles, (select val from vals));
Now the other thing that the Haskell processing does is making sure the json has the same columns, not sure but I'm guessing the original purpose of this was to somehow protect against unintentional null values in a bulk insert? Anyway I think this would be better handled by the db not null contraints.
However this part also handles the <cols> specifying in 脤NSERT INTO table(<cols>) SELECT <cols> FROM ... and that cannot be dynamically specified in a SQL statement, this is also necessary for handling column privileges.
So, my proposal is to have an additional INSERT mode that will be triggered when the columns querystring parameter is specified like this:
POST /articles?columns=id,body
[{"id": 11, "body": "uno"}, {"id": 12, "body": "dos"}]
If there are additional columns in the json payload they will be ignored, this mode will guarantee minimum memory usage for big payloads from pgrst end.
For anyone that's interested in this being implemented, you can sponsor the development on Patreon or you can contact me directly(email in profile).
Most helpful comment
I have a proposal for avoiding the json payload processing in
INSERTand clearing this out, I'll use the articles table as a reference.Currently when receiving a payload like
[{"id": 11, "body": "uno"}, {"id": 12, "body": "dos"}]we process this in Haskell to check if the json is anarray/objectto adequately use pgjson_populate_recordset/json_populate_recordfunctions.This part can be done in pure SQL:
Now the other thing that the Haskell processing does is making sure the json has the same columns, not sure but I'm guessing the original purpose of this was to somehow protect against unintentional null values in a bulk insert? Anyway I think this would be better handled by the db not null contraints.
However this part also handles the
<cols>specifying in脤NSERT INTO table(<cols>) SELECT <cols> FROM ...and that cannot be dynamically specified in a SQL statement, this is also necessary for handling column privileges.So, my proposal is to have an additional
INSERTmode that will be triggered when thecolumnsquerystring parameter is specified like this:If there are additional columns in the json payload they will be ignored, this mode will guarantee minimum memory usage for big payloads from pgrst end.
For anyone that's interested in this being implemented, you can sponsor the development on Patreon or you can contact me directly(email in profile).