Presto: Transaction-ish delete/insert in S3

Created on 9 Aug 2016 · 4 comments · Source: prestodb/presto

On S3, Presto can insert into a Hive table/partition without moving files around. #5818 introduces support for transaction-ish delete followed by insert. I want to understand two things:

  • How does Hive do INSERT INTO or INSERT OVERWRITE on S3?
  • How do you want to see Presto support transaction-ish delete followed by insert on S3?

    • The transaction-ish delete followed by insert will not provide any traditional SQL transaction guarantee beyond read-uncommitted.

    • The benefits transaction-ish delete followed by insert aims to provide are:

      • You never see duplicate data.

      • The window during which you can't see any data is very short.

    • The way this feature is going to work on HDFS:

      • BEGIN TRANSACTION

      • DELETE FROM t WHERE pk='value'

        • Presto remembers that the partition is deleted, but does not tell the metastore about it.

      • INSERT INTO t ....

        • Presto writes files into a temporary location, as it always does.

        • As the query is about to finish, the Presto coordinator receives a list of all files created (in temporary directories) and all the partitions that need to be created.

        • Presto remembers which partitions were DELETEd, ADDed, REPLACEd, or INSERTed INTO. For the latter three, Presto remembers where the corresponding temporary directories are.

      • COMMIT

        • Presto moves the temporary files to their final destination.

        • Specifically for REPLACEd partitions, the old directory is renamed first so that the rename of the new directory does not cause a conflict.

        • Presto calls metastore APIs to record any table/partition changes.

    • Now, this process on HDFS involves a bunch of directory moves, and I know S3 people don't like that.
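The commit sequence above can be sketched as follows. This is a minimal, illustrative Python sketch, not Presto's actual implementation: `TransactionalCommit`, `metastore.apply`, and the partition object are all hypothetical names, and local `shutil.move` stands in for HDFS renames.

```python
import os
import shutil
import uuid

class TransactionalCommit:
    """Illustrative sketch of the HDFS commit sequence described above.
    Partition actions are only remembered during the transaction; the
    directory moves and metastore calls happen at COMMIT time."""

    def __init__(self, metastore):
        self.metastore = metastore   # hypothetical metastore client
        self.actions = []            # list of (kind, partition, temp_dir)

    def record_delete(self, partition):
        # DELETE: remember only; the metastore is not told yet.
        self.actions.append(("DELETE", partition, None))

    def record_replace(self, partition, temp_dir):
        # INSERT wrote files into temp_dir; remember where they are.
        self.actions.append(("REPLACE", partition, temp_dir))

    def commit(self):
        for kind, partition, temp_dir in self.actions:
            target = partition.location
            if kind in ("DELETE", "REPLACE") and os.path.exists(target):
                # Rename the old directory aside first so that moving
                # the new directory into place cannot conflict with it.
                shutil.move(target, target + ".old." + uuid.uuid4().hex)
            if temp_dir is not None:
                # Move the temporary files to their final destination.
                shutil.move(temp_dir, target)
            # Only now is the metastore told about the change.
            self.metastore.apply(kind, partition)
```

The key property is that nothing externally visible changes until `commit()`, which keeps the window with missing data short.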

The only thing I can think of that will work on S3 without introducing file moves:

  • Presto can use non-conventional directory names, i.e. use a different directory for new partitions in different queries, even for the same table. For example, xxx/<hash of query id and table name>/pk1=x/pk2=y. This way, Presto can implement inserts without moving files. However, any tool that assumes conventional directory names and does not talk to the metastore will not be able to read tables/partitions generated by Presto.
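The naming scheme proposed above could look like the sketch below. The hash choice and function name are assumptions for illustration, not what Presto does.

```python
import hashlib

def partition_location(base, query_id, table, partition_values):
    """Return a per-query partition directory of the form
    base/<hash of query id and table name>/pk1=x/pk2=y (illustrative)."""
    # Different queries hash to different directories, even for the
    # same table, so files can be written directly to their final
    # location and never need to be moved.
    digest = hashlib.sha256(f"{query_id}:{table}".encode()).hexdigest()[:16]
    suffix = "/".join(f"{k}={v}" for k, v in partition_values)
    return f"{base}/{digest}/{suffix}"
```

Readers would have to resolve locations through the metastore, since the path no longer follows the conventional table/partition layout.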

Is this going to be useful? Do you have better proposals?

stale

All 4 comments

@haozhun thanks for putting this together. Here are my two cents.

How does Hive do INSERT INTO or INSERT OVERWRITE on S3?

We currently run EMR Hive 1.0. An INSERT OVERWRITE first deletes the S3 path at the beginning of the query; then a job writes the data to HDFS in a temp dir, after which another job writes that data directly to S3 (no S3 temp files are used). With INSERT, a job writes the data to HDFS in a temp dir and a subsequent job reads that and writes it directly to S3.

Regarding the use of temp files: in general we don't like S3 temp files being used during writes, as S3 temp directories are a mess to manage (especially on versioned buckets) and they incur additional costs, so we prefer direct writes to S3.

With some of the tools we have, similar to your proposal, whenever new data is written we create a new batchid directory on S3, write the data to that directory, and update the metastore to point to this new batchid (the batchid is a unix timestamp). This solves the temp-file problem and also avoids the potential eventual-consistency issues that can arise (for a discussion of some of these issues please take a look at this; there is a section on "s3 insert overwrite"). All the tools we use, like Hive, Pig, Presto, Spark, etc., support talking to the Hive metastore to get the location information, so it's working well for us so far.
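A minimal sketch of this batchid scheme, with illustrative names (the timestamp 1470789481 below matches the example path given later in the thread): each write lands in a fresh batchid directory, and the metastore is then repointed at it, so no S3 temp files or moves are needed.

```python
import time

def new_batch_location(table_location, partition_values=(), now=None):
    """Build a fresh batchid location for a write, e.g.
    s3n://hive/database.db/table/batchid=1470789481 for an
    unpartitioned table. Hypothetical helper, not an actual tool API."""
    # The batchid is a unix timestamp, so every write gets a new directory.
    batchid = int(now if now is not None else time.time())
    segments = [f"{k}={v}" for k, v in partition_values]
    segments.append(f"batchid={batchid}")
    return table_location.rstrip("/") + "/" + "/".join(segments)
```

After the data is written under the returned path, the table or partition location in the metastore is updated to point there; readers that consult the metastore never see a half-written directory.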

@nezihyigitbasi I have a few questions on your response.

  • The slide you linked to talks about having a batchid subpartition. That is not exactly the same as creating a new directory with batchid in it and then having the metastore point the partition there. Can you show me some examples of the full paths of files on S3?
  • You talked about how EMR Hive 1.0 does insert overwrite. Deleting directories at the beginning of a query sounds horrible and, at the same time, unnecessary. Why does it not defer the delete until near the end of the query?

    • With the scheme you talked about (writing temporary files to HDFS), the directory of the original data in S3 does not need to be deleted until the temporary files in HDFS have been fully populated.

    • The reason we are implementing this transaction-ish delete/insert is that many use cases we see at Facebook replace partitions weekly/daily. Running a delete query and then an insert query is unacceptable to them, because dependent interactive queries and/or pipelines cannot tolerate being unrunnable for an extended period.

  • We have tables doing both. For non-partitioned tables, a batchid directory gets created and the table location points to that. Other tables have a batchid partition. An example for a non-partitioned table: s3n://hive/database.db/table/batchid=1470789481; for a partitioned table, the partition (dateint=20160506, hour=9, batchid='merged_20160506T101705_1') points to s3n://hive/database.db/table/dateint=20160506/hour=9/batchid=merged_20160506T101705_1.
  • I don't know why they implemented it this way, but you are right that deletion can wait until job 1 is done. I also agree that it's nice to minimize the period of time where data is unavailable.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
