Hi,
It would be nice if the SQL syntax for selecting records from S3 objects supported the OFFSET clause (in conjunction with LIMIT). That way users could parallelize the processing of large S3 objects by handling subranges in separate processes.
@EugeniuZ - Thanks for your post. This is an interesting idea. Could you please provide an example of how you'd like this to work and the expected output?
Example usage:

aws s3api select-object-content \
    --bucket bucket --key large-file.json \
    --expression-type SQL \
    --expression "SELECT * FROM S3Object LIMIT 1000 OFFSET 1000" \
    --input-serialization '{"JSON": {"Type": "LINES"}}' \
    --output-serialization '{"JSON": {}}' \
    out.txt
Expected output would be the line range [1001-2000] extracted from the large-file.json file.
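To give a sense of the parallelization this would enable, here is a rough sketch, assuming OFFSET were accepted by S3 Select (it currently is not); the chunk size, worker count, and output file names below are arbitrary placeholders:

# hypothetical: each worker extracts its own non-overlapping subrange of large-file.json
CHUNK=1000
for i in 0 1 2 3; do
  aws s3api select-object-content \
    --bucket bucket --key large-file.json \
    --expression-type SQL \
    --expression "SELECT * FROM S3Object LIMIT $CHUNK OFFSET $((i * CHUNK))" \
    --input-serialization '{"JSON": {"Type": "LINES"}}' \
    --output-serialization '{"JSON": {}}' \
    "part-$i.json" &
done
wait

Each backgrounded call would write one slice of the object, and the slices could then be processed by separate processes.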
I would also love to have the OFFSET functionality
Thanks for the feedback. Marking it as a feature request pending additional review.
@EugeniuZ and @chrispruitt - Please post this feature on the AWS Developer forums so our service team can collaborate with you directly. If you can post the forum link here I can forward the link to the service team to help speed up the communication. Thanks.
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
Hi, is there any chance that this will be implemented?
Is there a SKIP or OFFSET option in s3api select-object-content?
Vote for reopening. This feature would be very beneficial.
it's hard to figure out how not to charge us for skipped data :-)
I'm not sure I get your comment right :).
The pricing model is based on a combination of the amount of data "scanned" and the amount of data "returned", not only the data "returned".
So imagine that you want to get records 9,000,000,000-9,000,000,100. What you have to do today is transfer all 9,000,000,100 records over the network and use just the last 100, discarding the rest. You still pay for processing 9,000,000,100 records and also for returning 9,000,000,100 rows.
But if you could do something like

select * from S3Object o limit 100 offset 9000000000

the amount of data scanned would stay the same (so the same amount of money for Amazon), you would just pay less for data "returned" ... and the benefit is that you could get the result much, much faster and run the query much more frequently.
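To make the difference concrete, here is a sketch of today's workaround versus the proposed syntax (the /dev/stdout outfile trick and the exact row numbers are just illustrative assumptions):

# today: return everything and discard the first 9,000,000,000 rows client-side;
# every one of those rows still counts as data "returned"
aws s3api select-object-content \
    --bucket bucket --key large-file.json \
    --expression-type SQL \
    --expression "SELECT * FROM S3Object" \
    --input-serialization '{"JSON": {"Type": "LINES"}}' \
    --output-serialization '{"JSON": {}}' \
    /dev/stdout | tail -n +9000000001 | head -n 100

# proposed: skip server-side, so only the 100 wanted rows are returned
# --expression "SELECT * FROM S3Object o LIMIT 100 OFFSET 9000000000"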
Better late than never to this party. I need this to be able to process sections of data from S3, and I don't want to risk those jobs running out of time skipping over a large number of records I don't need processed just to get to the next batch that I do. Without an OFFSET or SKIP option, we're looking at scrapping this migration and keeping the process in-house.