Aws-sdk-java-v2: [Not a bug/feature] Streaming large S3 files

Created on 22 Jul 2019 · 5 Comments · Source: aws/aws-sdk-java-v2

Hello,

I am trying to process large CSV files stored in S3 from Lambda functions, and I have a general question about how to use the SDK for such cases.

I tried processing a ~4GB file with the below code:

    S3Client client = S3Client.create();

    // Open the object as a stream instead of loading it into memory.
    ResponseInputStream<GetObjectResponse> responseInputStream = client.getObject(
        GetObjectRequest.builder()
            .bucket(s3Entity.getBucket().getName())
            .key(s3Entity.getObject().getKey())
            .build(),
        ResponseTransformer.toInputStream());

    int linesRead = 0;
    // try-with-resources closes the reader and the underlying S3 stream.
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(responseInputStream, StandardCharsets.UTF_8))) {
      while (reader.readLine() != null) {
        linesRead++;
      }
    } catch (IOException e) {
      logger.log("IO Exception: " + e);
    }
    logger.log("Lines read: " + linesRead);

This works fine since it streams the records and doesn't actually bring everything into memory in one go. I verified this by running the Lambda with 1 GB of memory, and it worked fine. However, it consumes only about 100 MB of memory and takes around 100 seconds (the execution time of the Lambda function, which contains only the above code). Memory and time vary; these are roughly the median values.

This is not a deal breaker, but I am trying to understand whether there is a better way to do this, basically using more memory to save on time. I tried using the async client but couldn't create a stream-based setup for it.

I also tried varying the Lambda's memory and increasing the buffer size of the BufferedReader, but couldn't get conclusive results (maybe because the internal stream is HTTP-based and not as slow as a disk-based stream?).

Any pointers here would be helpful. All the S3 examples in the documentation focus on working with files.

AWS Java SDK version used: V2 (2.7.8)
JDK version used: 8 [exact version unknown because of Lambda's abstraction]
Operating System and version: N/A
Region: [US-WEST-2] [In case it is of any help]

guidance

deepankerk

All 5 comments

If all you're doing is counting the number of lines in a file, an alternative is to use:

    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(responseInputStream, StandardCharsets.UTF_8))) {
      linesRead = (int) reader.lines().count();
    }

If this is still too slow, you can attempt to parallelize it via reader.lines().parallel().count() (I'm unsure of the implementation of Stream#count, so this may or may not make a difference).

jhg023 on 23 Jul 2019

Hello,

Thanks for replying. I actually intend to read each line and do some processing on it (for example, updating Redis or DynamoDB). The line-counting example was only meant to demonstrate that I need to access the file line by line.

I am thinking of doing my processing in a thread pool, which will parallelise things to some extent. Something like:

    String line = reader.readLine();
    submitRunnabletoPool(line); // Not an actual method

Obviously this will be bottlenecked on the downstream store/Redis as well, but do you see any cons to this approach?

deepankerk on 23 Jul 2019
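
For readers following along, the thread-pool idea described above might look roughly like the sketch below. It is not the poster's actual code: the class name LineFanOut, the method processLines, and the handler callback are all hypothetical, and a StringReader stands in for the S3 stream so the example runs on its own. It assumes a bounded queue with CallerRunsPolicy, so that a slow downstream store throttles reading instead of filling memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class LineFanOut {

    /**
     * Reads every line from the reader and hands it to the handler on a
     * fixed-size worker pool. Returns the number of lines read.
     * (Hypothetical helper, not part of the AWS SDK.)
     */
    static long processLines(BufferedReader reader, int workers, int queueCapacity,
                             Consumer<String> handler)
            throws IOException, InterruptedException {
        // Bounded queue: when it is full, CallerRunsPolicy makes the reading
        // thread run the task itself, which pauses reading until it finishes.
        ExecutorService pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
        long linesRead = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                final String current = line;
                pool.execute(() -> handler.accept(current));
                linesRead++;
            }
        } finally {
            // Stop accepting work and wait for queued lines to be handled.
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
        return linesRead;
    }

    public static void main(String[] args) throws Exception {
        AtomicLong handled = new AtomicLong();
        // Stand-in for the BufferedReader wrapped around the S3 response stream.
        BufferedReader reader = new BufferedReader(new StringReader("a,1\nb,2\nc,3\n"));
        long read = processLines(reader, 4, 100, line -> handled.incrementAndGet());
        System.out.println("read=" + read + " handled=" + handled.get());
    }
}
```

One trade-off to keep in mind with this sketch: because the reading thread can end up running a task itself when the queue is full, a slow handler can pause reads from the stream for as long as that task takes. Exceptions thrown by the handler are also not propagated to the caller here and would need separate handling.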

Do note that S3 will close idle connections after 6 seconds so if you are reading a line and then processing you need to make sure to keep doing reads on the socket at least once per ~6 seconds.

spfink on 23 Jul 2019

👍3
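
One possible way to avoid holding a single long-lived connection open while processing (not something suggested in this thread, so treat it as an assumption) is to fetch the object in byte ranges, each over its own short request. The sketch below only computes the HTTP Range header values; the class ByteRanges and the method split are hypothetical names. Each resulting string could presumably be passed to GetObjectRequest.Builder#range, with the object size taken from a HeadObject call. Note that ranges cut through CSV lines, so a line spanning two chunks has to be stitched back together by the caller.

```java
import java.util.ArrayList;
import java.util.List;

public class ByteRanges {

    /**
     * Splits an object of objectSize bytes into HTTP Range header values of
     * at most chunkSize bytes each. Range bounds are inclusive.
     * (Hypothetical helper, not part of the AWS SDK.)
     */
    static List<String> split(long objectSize, long chunkSize) {
        if (objectSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("objectSize must be >= 0 and chunkSize > 0");
        }
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < objectSize; start += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            long end = Math.min(start + chunkSize, objectSize) - 1;
            ranges.add("bytes=" + start + "-" + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-byte object in 4-byte chunks.
        System.out.println(split(10, 4)); // prints [bytes=0-3, bytes=4-7, bytes=8-9]
    }
}
```

Fetching several ranges in parallel is also one way to trade memory for time, at the cost of the extra line-boundary handling.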

Thanks for the insight @spfink. I was able to get pretty decent speeds using runnables as I mentioned above. Please do correct me if the approach is off somehow.

Closing this since nothing else is pending here.

deepankerk on 29 Jul 2019

@spfink since you mentioned that the connection will close after 6 seconds if idle, is there a way to configure this duration?