Aws-sdk-java-v2: Simpler / Easier mechanism to read S3 Object content as a String (like v1 getObjectAsString)

Created on 30 Nov 2017 · 26Comments · Source: aws/aws-sdk-java-v2

The v1 SDK has a nice method for just grabbing the content of an S3 object as a String. It would be nice if the v2 SDK had this functionality.

Expected Behavior

String result = s3.getObjectAsString(request);

Current Behavior

Need to write a fair bit of code:

String result;
try (ResponseInputStream<GetObjectResponse> response = client.getObject(request)) {
    result = IOUtils.toString(response);
} catch (IOException e) {
    throw new UncheckedIOException(e);
}

Possible Solution

Steps to Reproduce (for bugs)

Context

Porting code from v1 to v2. Ran into it. Had to write some custom code to handle the stream instead. Was slightly annoyed by the inconvenience more than anything else.

Your Environment

OpenJDK v9
2.0.0-preview-5

Source

plombardi89

Most helpful comment

@millems: Sorry for the lack of expedient response, I was at Kubecon all last week so my attention was elsewhere. I think the s3.getObjectBytes(request).asString() design is probably fine given IDE code completion. For non-IDE users... that may end up being an annoying API docs search or result in people writing custom code. But how many people actually work on Java outside of a code completing editor? I dunno.

My vote is a yes for what it's worth.

plombardi89 on 11 Dec 2017

👍3

All 26 comments

Thanks for the report! The easiest solution would just be for us to add a StreamingResponseHandler for it like we have for files/streams:

String result = s3.getObject(request, 
                             StreamingResponseHandler.toString(StandardCharsets.UTF_8));

If it was there, do you think you would have found it?

millems on 30 Nov 2017

I likely would miss it if it were toString(charset). There's too much ambiguity in Java about what toString(charset) should do... some people use it for pretty printing an objects internal state and others use it for meaningful biz logic. Perhaps asString(charset) would be preferable?

Similarly I'm not sure I would have intuitively looked at the documentation for StreamingResponseHandler. I feel like getObjectAsString() method is a better solution because it's such a common task and I would expect the SDK to make it simple and easy. The StreamingResponseHandler solution would still likely have me googling or stackoverflowing for the answer.

plombardi89 on 1 Dec 2017

Thanks!

We're trying to balance the number of methods on our clients, and getObjectAsString opens us up to getObjectAsByteArray, getObjectAsInteger, etc.

We already have overloads for File and OutputStream on the client, so adding a separate method for String isn't entirely unfounded. We'll talk about some solutions internally and get back to this.

millems on 1 Dec 2017

Talked about it a bit with a co-worker. How do you feel about if we added it on the result of getObject and used that as our extension point for all our future conversion operations?

String result = s3.getObject(request).asString(StandardCharsets.UTF_8);
byte[] result = s3.getObject(request).asByteArray();
ByteBuffer result = s3.getObject(request).asByteBuffer();
s3.getObject(request).transferTo(myOutputStream);
s3.getObject(request).writeTo(myFile);

millems on 1 Dec 2017

👍3

I like it! Hopefully it gets added.

plombardi89 on 1 Dec 2017

🎉1

I personally don't like this approach for a couple of reasons.

Discoverability

I think this actually hurts discoverability quite a bit. By far the most discoverable location for these conveniences is the client. We should put the most common use cases directly on the client and gather other conveniences in a common location. This approach is still valid as long as we continue to have a file overload on the client and it can call down to inputStream.writeTo(file).

Disregarding the top level client methods, I still don't think this is that much more discoverable than static methods on the interface. We use static methods (builders, factory methods, etc) very heavily in V2 and customers should expect to go there first when exploring a new API. If they start typing out StreamingResponseHandler in search of a builder or something, IntelliJ will suggest the factory methods on the interface. And once customers know they are there then it's their one stop shop for any conveniences we decide to add in the future. Now hanging them off the InputStream may be just as discoverable if not a little moreso since we return an object to them and they can explore what they can do with that object. I wouldn't necessarily expect an input stream to have these methods on them though so I don't know if I'd look for them or just happen to notice them if I'm looking for something else. To me it's not an intuitive place to look for methods of this nature (more on why that is below). My major point is that it's actually unclear if this is more or less discoverable then the current approach.

Semantics of method

My main grip with this approach is the semantics/behavior of these methods. InputStream is a mutable thing that can only be used once. This opens up a lot of questions and ambiguity about the behavior of these methods. If the customer reads from the input stream ahead of time (to peek ahead or something to figure out how to process it) then the asXXXX methods will not be able to process any data they've already read unless they reset (which may not be possible unless we buffer it or something). If they call the asXXX methods multiple times what happens? For the bytes/string one it may be reasonable to memoize it in the input stream and just return the same thing, or throw an exception saying it can only be called once. If there is an exception during the read what do further calls to asXXX do? Attempt to re-read the socket again if it's still open? Memoize the exception and rethrow that? Throw another exception? Due to the nature of InputStreams I think there is a lot of potential ambiguity and complexity in the implementations of these types of methods.

Functionality

Right now these conveniences are built on top of the StreamingResponseHandler which builds in more functionality then input streams can provide. Most notable potential for retries. By putting these methods on the InputStream we pretty much remove the ability to do retries. Timeouts are another one, we can enforce the client execution timeout on the streaming response handler, we can't really do that with the input stream approach.

Consistency

Async does not return an InputStream obviously so those conveniences would need to be in a different location. Right now really the only option is on the AsyncResponseHandler interface which is the async equivalent to StreamingResponseHandler. Having both sets of convenience methods in the same logical location improves consistency and I think discoverability in the long run.

shorea on 4 Dec 2017

Discoverability: Java 9 is adding InputStream.transferTo(OutputStream). If they don't already expect there to be convenience methods on InputStream today, over time they'll start to expect them there. People definitely may be able to find or expect static methods on StreamingResponseHandler if they're really used to the SDK always having static methods on our parameter types as well (similar to request objects). People who aren't experienced with the SDK won't think of that, though.

Semantics: The semantics definitely are a bit ambiguous. The only question that's raised is "does this close the input stream?". The answer for InputStream.transferTo(OutputStream) is "no", but that's actually bad for us: they NEED to close it and people may definitely think they can just do client.getObject(foo).transferTo(outputStream) and think they're fine.

Functionality: I didn't think about losing retries. That's a good argument for at least keeping them on the StreamingResponseHandler.

Consistency: Yep, keeping consistency with async will always be tricky.

@shorea It's not clear to me what your proposal is. Are you suggesting we add s3.getObjectAsString(request) as an alias for s3.getObject(request, StreamingResponseHandler.asString())?

millems on 4 Dec 2017

If they don't already expect there to be convenience methods on InputStream today, over time they'll start to expect them there.

Then let the JDK handle that, I think we can provide a better, more consistent experience that provides more of the functionality that SDK customers expect. Worst case is we are just duplicating methods that will be added to InputStream in the future.

The only question that's raised is "does this close the input stream?

Yeah that's definitely not good. I mean Apache will auto close after all data is read but not every HTTP impl might do that. Another plus for StreamingResponseHandler as they never have to touch the InputStream and we will close it for them.

@shorea It's not clear to me what your proposal is. Are you suggesting we add s3.getObjectAsString(request) as an alias for s3.getObject(request, StreamingResponseHandler.asString())?

Not necessarily. I think that all implementations should be implementations of StreamingResponseHandler and accessible via static factory methods on the interface (we can add proper implementations too if that helps discoverability). For very common use cases we add conveniences to the client. I think we should be conservative about adding new conveniences and only do so when there is a clear need for it. So for now I'd say just add a factory method on the interface and not add it to the client.

shorea on 4 Dec 2017

I think that all implementations should be implementations of StreamingResponseHandler and accessible via static factory methods on the interface (we can add proper implementations too if that helps discoverability).

👍

For very common use cases we add conveniences to the client.

👍

kiiadi on 4 Dec 2017

Another thing I like about the static methods is that we can have different typed responses. For example, s3.getObject(request, StreamingResponseHandler.asString()) could return ResponseString<GetObjectResponse> that allows us to include additional data if we ever want. Eg. for output streams: how much data was written?

I have two things that I think we should still think about addressing.

Discoverability

I think we'd be kidding ourselves if we said people will always find it there, and we shouldn't need people to google to find it.

_Option 1_: We add s3.getObjectAsString and s3.getObjectAsBytes. We have evidence on this thread that this is something customers want, and that customers won't find it as a static method. We wouldn't add it to any other streaming methods right now. It doesn't make sense for lex, for example.

_Option 2_: We could have the static methods for people who want all of the advantages they provide (retries, auto-closing) and have convenience methods on the ResponseInputStream to convert to the other types supported by the static methods (eg. ResponseString ResponseInputStream.toResponseString()). The convenience methods would always close the stream. We'd unfortunately lose the retries on those methods, but it gives customers another chance to find it.

Simple Methods

We've added the fluent style of doing things that makes simple requests easier (eg. s3.listObjects(r -> r.bucketName("foo"))), but we didn't add them to the streaming methods for now because it would be messy in the 2-arg method forms.

After having used the S3 client recently, I've found it's jarring that I can't do it for the 1-arg streaming methods: s3.getObject(r -> r.bucketName("foo").key("bar")) because it doesn't compile.

Could we also consider adding these to the 1-arg style methods?

String response = s3.getObjectAsString(r -> r.bucketName("foo").key("bar")).string();

String response = s3.getObject(GetObjectRequest.builder().bucketName("foo").key("bar").build(),
                               StreamingResponseHandler.asString())
                    .string();

millems on 4 Dec 2017

I'm concerned about conditionally adding it for services that "it makes sense for". Depending on the data you have in your S3 object, getObjectAsString may very well throw an exception so it isn't universally applicable even in S3. If we are going to add conveniences like this, I'd like them to be added for all services. I would argue that we should just add getObjectAsBytes() which is universally applicable and going to a string is super trivial from there, and if they really want it then they can go to the factory method on StreamingResponseHandler.

Don't like Option 2 very much as there are now multiple ways to do the same thing and the advantages/disadvantages are not immediately clear.

I don't think we have conclusive evidence that this is not discoverable though. This particular customer is coming from V1 which has the getObjectAsString method so the expectation was there would be a similar method in V2 (which is something we could mention in a migration guide). I'm curious if a fresh user of V2 or someone not familiar with the V1 s3 client would be able to find it.

Simple methods

I am actually not a fan of having the lambda on the client interface itself (did we ship this already?). I'd rather keep that in the builder and the customer must at least do the top level the traditional way. If we don't contain this pattern to builder sub properties I think we are going to have a tough time staying consistent and making sure every usage of an immutable object has the two overloads. It also doubles the number of methods in our client so don't like that either. Having them call build at least once hammers home the fact that it's immutable as well.

shorea on 4 Dec 2017

I like getObjectAsBytes or something similar to get a byte[].

Would you return byte[] or ResponseBytes<GetObjectResponse> so they can get the response as well as the byte array?

The latter would still give them access to the response, and we could even do an asString(Charset) on the ResponseBytes object.

millems on 4 Dec 2017

I don't have a strong preference, either approach sounds good.

shorea on 4 Dec 2017

I could get on board with this approach - I don't like that there's a different method name, but I can't think of a good alternative that gives us the flexibility to put the bytes into some other structure.

👍

kiiadi on 4 Dec 2017

Talked a little about it in person (minus Andrew). To summarize, we have three methods:

ResponseInputStream<GetObjectResponse> getObject(GetObjectRequest)
ResponseBytes<GetObjectResponse> getObjectBytes(GetObjectRequest)
T getObject(GetObjectRequest, StreamingResponseHandler<T>)

Where (1) and (2) just use static methods on the StreamingResponseHandler to call (3).

(1) is the "simple" streaming version of the method (where the response type in java 9 will have transferTo(OutputStream) for copying)
(2) is the "simple" in-memory version of the method (where the response type has asString(Charset) for in-memory conversions)
(3) is the standard version that can do conversions on data streams with automatic SDK retry guarantees.

We still need to talk about:

Where do we support simple methods?
What do we do with the current s3.getObject(request, OutputStream/File) overloads on the client?

...but we can talk about them separately.

millems on 4 Dec 2017

👍1

👍 I like it.

1) Do you mean the Consumer things that take a builder or the no-arg?
2) I don't believe we have one that takes an OutputStream. File should remain though right? That's the more common one for S3 at least.

shorea on 4 Dec 2017

I guess it's a question of do we do:

s3Client.getObject(r -> r.bucket("foo").key("bar"), FILE);

s3Client.getObject(r -> r.bucket("foo").key("bar")).transferTo(FILE)

I think given the commonality of transferring to a file, not just S3 - Polly this would be a common use-case as well potentially, we should probably keep it as a top-level concern.

kiiadi on 4 Dec 2017

Big 👍 for keeping it top level. More discoverable and gives us retries and timeouts. Don't like the lambdas though ;)

shorea on 4 Dec 2017

I can get behind the latest comment by @millems

Per the above, definitely have file as part of getObject(request, file) not as transferTo(file).

spfink on 4 Dec 2017

Another reason for top level is symmetry with upload scenarios where the file has to be an additional parameter. Per discussion we should probably add a convenience for bytes in streaming inputs as well, would just delegate to RequestBody.of

shorea on 4 Dec 2017

Honestly, I'd rather have s3.getObject(..., OutputStream) than s3.getObject(..., Path).

Writing to an output stream is impossible with the latter, but writing to a path with the former is super easy: s3.getObject(..., Files.newOutputStream(Path)).

That also lets the customer handle the details: Do we override the file if it already exists? What if we want to append? What if we want it to fail?

millems on 4 Dec 2017

Given that InputStream will have a transferTo (even if it doesn't close the stream), I'd still prefer to have File/Path at the top level as that is what most customers are going to be looking to use.

spfink on 4 Dec 2017

That's fair. We can't do retries with OutputStream, so it would just be the same as transferTo with automatic closing.

It would need to be s3.getObject(..., Supplier<OutputStream>) to include retries, and that is complicated enough to tuck away in the StreamingResponseHandler's static methods if someone needs it.

If someone wants to append or override the file, we could provide ways of doing that via StreamingResponseHandler implementations.

millems on 4 Dec 2017

@plombardi89: What do you think about all of this? If we had the following methods:

ResponseInputStream<GetObjectResponse> getObject(GetObjectRequest)
ResponseBytes<GetObjectResponse> getObjectBytes(GetObjectRequest)
T getObject(GetObjectRequest, StreamingResponseHandler<T>)
GetObjectResponse getObject(GetObjectRequest, Path)

Do you think you may have been able to find s3.getObjectBytes(request).asString() without going to google?

millems on 4 Dec 2017

👍1

My vote is a yes for what it's worth.

plombardi89 on 11 Dec 2017

👍3

Thanks for the confirmation!

millems on 31 Jan 2018

Was this page helpful?

0 / 5 - 0 ratings