Postgrest: Bulk upsert

Created on 10 Aug 2015  路  28Comments  路  Source: PostgREST/postgrest

Hi,

Is there a way to insert or update (like pg 9.5 upsert) some rows.

I have a large CSV file with some duplicate content.

Is there a possibility to merge data into a table ?

My idea (if not already developed) is to :

  • Create a _temporary_ table (same structure than my _target_)
  • Fill my _temporary_ table
  • Merge or upsert (for pg 9.5) each rows
  • Merge content from _temporary_ to _target_
  • Delete _temporary_

My goal is to :

  • Avoid duplicates
  • Avoid table locks
  • Be ready for high availabity

Regards,

enhancement

Most helpful comment

Bulk PUT would imply you're making the request to the same URL you'd read the resulting data from, because the semantics of PUT are "take this representation and put it at this URL, so that if I do further GET requests at that URL I get this representation back". So a bulk upsert PUT would mean including the identifiers in the url in an id=in.10,20,30,... parameter to match these semantics. The size of the URL would grow with the number of records, which would not be good.

All 28 comments

You can upsert a single record like this. But not yet for bulk inserts.

How might the HTTP interface look? Here are some options.

Approach 1

If I am posting a CSV and intend an upsert maybe I could include the header

Prefer: resolution=merge-duplicates

This would switch it into upsert mode. Without the header duplicates would raise an error. Pretty straightforward, although I don't like how I just made up the resolution and merge-duplicates stuff. It's better to draw from a well known standard.

Approach 2

Use PUT for bulk upsert like it is used for single upsert. Seems OK except what url should we use for the put? REST semantics say that you PUT to the url which identifies exactly the resource which is updated (the url which you could later send a GET to retrieve the updated resource). Upserting a single element like /foo?id=eq.1 is fine since the url identifies it. But /foo identifies potentially many more records than were upserted. Would you PUT four items to e.g. /foo?id=in.[1,24,101,504]? Seems awkward.

Upsert could be done by a PATCH for me according to HTTP verbs.
I think would be ideal to send CSV (for bulk update) and JSON (for single update)

Can you explain a bit mote your situation?
Why would you (clients using your service) sending large csv files over http to be inserted in the db on a regular basis?
Is it possible that this is a one time thing and it would better be done localy with a custom script?

In fact, I want to buil and ETL engine using postgrest.

I have to grab some data around the web, and regullarly push them into a psql database

Since you are the one controlling all the aspects of this system (manage db, collect data, insert data) i think you would better try to collect/insert data in the db using your custom scripts (that maybe can use the COPY command from PG to bulk insert csv). Postgrest is more suited for systems where multiple (web) clients need to access a database. In your instance, you are the only client of the system and you do not need to go over http to insert data. If however later, you want to expose this data to multiple clients (js frontend, ios apps, ...) then postgrest will be a great choice.

I have already tried

  • COPY (from postgresql)
  • pgloader

Why am I switching to postgresql ? It use REST standard to push data

The main advantage of using this method is to keep structure, under _what is behind_ control

I have plenty of web site, and postgrest seems to be suitable for this since all my fonts are controlling data structure

Did you find a way to make it work with pgloader instead?

Is single-row upsert no longer supported in postgrest? The anchor seems gone for http://postgrest.com/api/writing/#upsertion

It was removed because upsert was kind of a hack until 9.5, but now that there is support for it in the database, it can be resurected

What's wrong with hacks that do the job?

Hey, what do you guys think about this?

Single record JSON

For PostgreSQL version >= 9.5.0: every POST generates upsert syntax by default. If you want to prevent upsert, you _explicitly_ include a Prefer: resolution=ignore-duplicates header.

For pre 9.5.0: like today, POST becomes a SQL insert. Like today, if there's already a record, regular DB error forwarding will naturally follow.

Multi record JSON / Bulk CSV

For 9.5.* we merge duplicates by default.
If the Prefer: resolution=ignore-duplicates is set, we'll abort the whole transaction if any of the records already existed.

For pre 9.5.0, no changes.

I'm concerned about making a potentially destructive action the default. Maybe it's safer to have it fail on duplicates by default. Also changing default behavior based on db version could cause surprises when people upgrade their db.

@begriffs albeit the definition of "destructive" varies from case to case, I must agree to your point from the POLA standpoint.

For me it's ok to modify the earlier proposal by flipping around the defaults.
That is, let's do that a Prefer: resolution=merge-duplicates header must be present in a POST request to switch the current INSERT behaviour into UPSERT.

Should the database not support UPSERT, let's just fail naturally.

How about that?

Yeah sounds good.

Why introduce UPSERT in POST context and complicate things, why not use PUT as before?

Because technically PUT is supposed to be applied to exactly the url from which a GET would retrieve the resource. For a bulk upsertion I'm not sure what this url should be.

A successful PUT of a given representation would suggest that a subsequent GET on that same target resource will result in an equivalent representation being sent in a 200 (OK) response. (RFC7231)

Otherwise PUT seems better in every way than some made-up header with a POST. Check out the RFC and let me know if maybe I'm interpreting it too strictly.

A few comments:
1) I think it'd be awesome to, at the very least, re-add support for upsert for individual records using the put representation. (Or post, I'll go into various considerations there below)
2) Bulk upsert would also be great...though a bulk one feels more like a post request to me than a put request.
3) I think it'd be a good idea to have a header for whatever possibility that specified what should happen on conflict-and it should take advantage of ON CONFLICT DO UPDATE or DO NOTHING or just error completely.

So what might this look like:
1) A PUT request would have to specify the resource exactly in the url, would only take one record to upsert, and would always perform ON CONFLICT DO UPDATE so let's presume we have a table with a unique constraint on columns id1 and id2 and then some value columns, a put to /endpoint? id1=eq.id1val&id2=eq.id2val with a payload of {val1:val1val, val2:val2val,...} would perform the following sql:

insert into endpoint (id1, id2, val1, val2...) 
values(id1val, id2val, val1val, val2val...)
on conflict (id1,id2) do update 
set val1= val1val,
    val2 = val2val
    ...;

if we're not in pg 9.5, then it will return an error. This one shouldn't be modifiable by headers as it should be idempotent and it seems to me that if you allow modification by headers, it's not really that idempotent, but I might be misinterpreting the standard. (You could also allow a

2 & 3) A POST request seems like the most natural solution for bulk upserts, and POST 's could also be modifiable in terms of their behaviour with prefer headers (for one or many row inserts), and I think the headers should follow Postgres' internal syntax as that's the most natural way ie Prefer: on-conflict=do-update for upsert; Prefer: on-conflict=do-nothing for do the other inserts in a bulk insert but skip duplicates and no header would mean the default behavior for Postgres (no on conflict clause at all) (could also include a Prefer: on-conflict=do-error as a way of explicitly setting that and rolling back the entire transaction if one record of a bulk insert didn't work).

requests specifying the do-update or do-nothing in versions of pg before 9.5 would throw an error.

If one felt that we really needed multi-row put requests, one could specify all of the unique combinations of the unique keys in the query string ie PUT to /endpoint?endpoint{id1,id2}=in.{id1val1,id2val1},{id1val2,id2val2},{id1val3,id2val3}... and a PUT to the base resource or to endpoint?endpoint{id1,id2}=in.payload{id1,id2} could do a redirect to the proper PUT endpoint before performing the upsert. but this seems like it's getting unnecessarily complex and I think treating the bulk upsert as a bit different than a single row upsert makes a lot of sense. The other thing that militates against bulk PUTs is this part of the RFC:

If the target resource does not have a current representation and the
PUT successfully creates one, then the origin server MUST inform the
user agent by sending a 201 (Created) response. If the target
resource does have a current representation and that representation
is successfully modified in accordance with the state of the enclosed
representation, then the origin server MUST send either a 200 (OK) or
a 204 (No Content) response to indicate successful completion of the
request.

because created or modified loses meaning when you have more than one record and the thing can be in both states. Now I don't know that we'll be able to send the 201 or 200/204 differentially given that I think Postgres returns INSERT for either case on upsert. Not sure if that means we should just do a POST for all of it or not though.

Anyway, that's all I got. hope it's helpful. I'd really love to see this implemented soon, I'm not a Haskell-er though so, can't help on that side, but happy to help with any of the Postgres bits.

@djk447 thanks for the detailed suggestion. I propose we bring back single-item PUT with the semantics you specified. It's definitely useful and well defined.

Also I think bulk PUTs are possible using multipart http responses. It packs multiple responses into one (headers response codes and all) divided by a special boundary sequence of characters. There's some relevant code back in ancient history: 70d33445db2e974e33ffbef9be52796bb981def5

But how about we deal with singular PUT first.

I would vote against that, i would suggest going with POST + header
Even the resulting sql query kind of suggests it's the right way.
Going with PUT means you are thinking "let's update some records and if they are missing, insert them"
Going with POST+header the thinking is "let's insert some records and if they already exist, update them"

The resulting query would be INSERT .... ON CONFLICT.... so at the core, you are doing an insert.
Also, the code needed for this is minimal, capture the "onconflict" header value and based on that, generate the tail of the insert query

I actually think the POST one makes more sense as well. The INSERT .... ON CONFLICT.... syntax makes the difference for me. Postgres views this as an insert first, so let's treat it that way in the HTTP, it's also simpler to implement the bulk upsert from and I think an on-conflict header of some sort comports with the Postgres side very nicely.

That said, this would be the "dumb" version of ON CONFLICT where you don't specify the conflict clause. I think it's easier/faster to implement this this way. Eventually a PUT request could be for the smarter version where you specify which index/unique constraint you want the ON CONFLICT to act on, which could have some benefits. But I think it'd be better to do that later and could be a separate feature.

how are folks doing this currently? making custom functions and then posting to /rpc/funcname? Or perhaps using INSTEAD OF triggers?

Regarding the issue, I'd personally like to see full control over the ON CONFLICT and UPDATE/DO NOTHING clauses, either as query parameters or headers.

Bulk PUT would imply you're making the request to the same URL you'd read the resulting data from, because the semantics of PUT are "take this representation and put it at this URL, so that if I do further GET requests at that URL I get this representation back". So a bulk upsert PUT would mean including the identifiers in the url in an id=in.10,20,30,... parameter to match these semantics. The size of the URL would grow with the number of records, which would not be good.

OK, revisiting this issue because @elimisteve brought it up in the chat. What about this approach:

  • Support PUT on routes when the client supplies eq filters for the primary key. That is we detect the request must match at most one row and allow it. If that test fails we return a 405 method not allowed.
  • Have POST accept Prefer: resolution=merge-duplicates as discussed above.

@begriffs was this ever implemented? I could really use upsert syntax.

I could, too, starting very soon!

I understand why this feature would be useful for you, but implementing it is not high on my personal list of priorities. If anybody wants to give it a try and send a pull request by all means go for it and I'm happy to help review your work.

There are just other things in my personal life I'd prefer to be working on right now over implementing this feature.

Is there an non-bulk upsert feature in place that I missed? Or is this issue really about upsert?

Was this page helpful?
0 / 5 - 0 ratings