aws-cli: s3api select-object-content: support the OFFSET clause with LIMIT

Created on 11 Mar 2019 · 12 comments · Source: aws/aws-cli

Hi,

It would be nice if the SQL syntax for selecting records from S3 objects supported the OFFSET clause (in conjunction with LIMIT). This way users could parallelize the processing of large S3 objects by handling subranges in separate processes, as sketched below.
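
For illustration, a minimal sketch of what that parallelization could look like. Note this is hypothetical: S3 Select does not support OFFSET today, so these calls would fail as-is, and the bucket name, key, and batch size are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: S3 Select does NOT support OFFSET today.
# If it did, each background worker could fetch its own slice of
# records from the same object in parallel.
BUCKET=bucket
KEY=large-file.json
BATCH=1000

for i in 0 1 2 3; do
  aws s3api select-object-content \
    --bucket "$BUCKET" --key "$KEY" \
    --expression-type SQL \
    --expression "SELECT * FROM S3Object LIMIT $BATCH OFFSET $((i * BATCH))" \
    --input-serialization '{"JSON": {"Type": "LINES"}}' \
    --output-serialization '{"JSON": {}}' \
    "part-$i.json" &        # run each slice in its own process
done
wait   # block until all slices are downloaded
```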

Labels: closing-soon, feature-request, response-requested, service-api

All 12 comments

@EugeniuZ - Thanks for your post. This is an interesting idea. Could you please provide an example of how you'd like this to work and the expected output?

Example of usage:
aws s3api select-object-content --bucket bucket --key large-file.json --expression-type SQL --expression "SELECT * FROM S3Object LIMIT 1000 OFFSET 1000" --input-serialization '{"JSON": {"Type": "LINES"}}' --output-serialization '{"JSON": {}}' out.txt

Expected output would be lines 1001-2000 extracted from the large-file.json file.
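
Separately, a byte-range workaround may already cover the parallelization part of this request: the underlying SelectObjectContent API has a ScanRange parameter (for JSON LINES and uncompressed CSV objects). It splits the object by bytes rather than records, but S3 only processes records that start inside the given range, so non-overlapping ranges partition the object cleanly. A sketch, assuming a CLI version recent enough to expose `--scan-range`:

```bash
# Sketch of the ScanRange workaround: scan only the first 1 MiB of the
# object. A record that starts inside the range is returned in full even
# if it extends past End, so adjacent ranges do not duplicate records.
aws s3api select-object-content \
  --bucket bucket --key large-file.json \
  --expression-type SQL \
  --expression "SELECT * FROM S3Object" \
  --input-serialization '{"JSON": {"Type": "LINES"}}' \
  --output-serialization '{"JSON": {}}' \
  --scan-range '{"Start": 0, "End": 1048576}' \
  out-part0.txt
```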

I would also love to have the OFFSET functionality

Thanks for the feedback. Marking it as a feature request pending additional review.

@EugeniuZ and @chrispruitt - Please post this feature on the AWS Developer forums so our service team can collaborate with you directly. If you can post the forum link here I can forward the link to the service team to help speed up the communication. Thanks.

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

Hi, is there any chance that this will be implemented?

Is there a SKIP or OFFSET option in s3api select-object-content?

Vote for reopening. This feature would be very beneficial.

> Vote for reopening. This feature would be very beneficial.

It's hard to figure out how not to charge us for skipped data :-)

> It's hard to figure out how not to charge us for skipped data :-)

I'm not sure I get your comment right :).

The pricing model is a combination of the amount of data "scanned" and the amount of data "returned", not only the amount of data "returned".

So imagine that you want to get records 9,000,000,000-9,000,000,100. Today you have to transfer all 9,000,000,100 records over the network and use just the last 100, discarding the rest. You still pay for scanning 9,000,000,100 records and also for returning 9,000,000,100 rows.

But if you could do something like
`SELECT * FROM S3Object o LIMIT 100 OFFSET 9000000000`, the amount of data scanned would stay the same (so the same amount of money for Amazon); you would just pay less for data "returned". The benefit is that you get the result far, far faster and can run the query much more frequently.
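
To make the scanned-versus-returned distinction concrete, a back-of-the-envelope calculation. The 100-byte record size and the decimal GB conversion are assumptions for illustration only; real rates should be taken from the current S3 Select pricing page.

```bash
# Rough illustration of how OFFSET would change billing: scanned bytes
# are identical, only the returned bytes shrink.
awk 'BEGIN {
  rec_bytes  = 100                              # assumed average record size
  scanned_gb = 9000000100 * rec_bytes / 1e9     # scanned either way
  ret_now_gb = scanned_gb                       # today: every row comes back
  ret_off_gb = 100 * rec_bytes / 1e9            # with OFFSET: only 100 rows
  printf "scanned:        %.1f GB in both cases\n", scanned_gb
  printf "returned today: %.1f GB\n", ret_now_gb
  printf "with OFFSET:    %.6f GB\n", ret_off_gb
}'
```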

Better late than never to this party. I need this to process sections of data from S3, and I don't want to risk jobs timing out while skipping over a large number of records I don't need just to get to the next batch that I do. Without an OFFSET or SKIP option, we're looking at scrapping this migration and keeping the process in-house.
