Guava: RFE: Add a public ByteStreams.toByteArray(InputStream in, int expectedSize)

Created on 29 Oct 2016 · 13Comments · Source: google/guava

1) rename the current package-private ByteStreams.toByteArray(InputStream in, int expectedSize)

2) add public ByteStreams.toByteArray(InputStream in, long expectedSize) that does exactly this:
https://github.com/google/guava/blob/e102fa4a86dad34f375e06c8ce0f52ea264588c3/guava/src/com/google/common/io/Files.java#L159-L175

3) except that this line:
https://github.com/google/guava/blob/e102fa4a86dad34f375e06c8ce0f52ea264588c3/guava/src/com/google/common/io/Files.java#L172
should be:
return expectedSize <= 0

4) remove package-private Files.readFile(InputStream in, long expectedSize)

rationale for 2):

it is impossible to efficiently read an input stream of known size to a byte array, but sometimes it is needed. example a): efficiently reading a binary file after checking that its header is valid. only options are: reading the header, closing the file, open it again and read it fully with Files.readFile(...). or: reading the header then passing the stream to inefficient ByteStreams.toByteArray(InputStream in). example b): reading an InputStream of a zip file's ZipEntry, where again the uncompress length is known ahead of time.

both examples a) and b) are from real code i am wirting.

rationale for 3):

ZipFile entry returns -1 rather than 0 for unknown lengths:
https://docs.oracle.com/javase/8/docs/api/index.html?java/util/zip/ZipEntry.html
Given that 0 is a valid length, that makes total sense. so it also makes sense that expectedSize should handle <0 values (and 0 too) as unknown lengths.

(rationales for 1) and 4) are obvious.)

package=io status=will-not-fix type=addition

Source

Lanchon

Most helpful comment

i also want to comment on this:

In most other cases, people either:

...

...

know the exact length (in which case they can use ByteStreams.readFully)

ByteStreams.readFully is not an option. it will throw if the file is shorter than expected, but it will silently fail (misbehave) if the file length was increased under its feet, potentially returning a byte array that is not representative of any file state.

what is needed is a performance hint to a robust helper, not some brittle alternative that you cannot justify choosing solely on the grounds of increased performance. i will keep rolling my code for this as guava does not provide useful help for this case.

the fact that you actually need this yourself and do not share it is frustrating.

Lanchon on 30 Jun 2018

👍2

All 13 comments

Is there any appetite for these changes?
If so, I don't mind submitting a PR.

mlmitch on 14 Mar 2018

well, the way i presented this in the OP it would take inside of 5 minutes to implement. but a year and a half later, it is still pending. so i'd say no, no appetite.

meanwhile, my use case relies on this hack: https://github.com/DexPatcher/multidexlib2/blob/v2.2.2/src/main/java/com/google/common/io/ByteStreamsHack.java

Lanchon on 15 Mar 2018

I don't know if this will make you feel better or worse, but we dedicated a couple engineer hours to discussing this over the past couple days, and we'll probably spend a couple more on it in the next week or so. I can't speak for everyone, by my main reservation is that ZipFile seems to be the main use case: In most other cases, people either:

won't have opened the stream yet (in which case they can implement ByteSource and get this behavior for free once we make the appropriate optimization there)
won't care about allocations, or will care about allocations enough to read the file in chunks
know the exact length (in which case they can use ByteStreams.readFully)

cpovirk on 15 Mar 2018

👍1

I'll get a bit more specific than I was in #3054.
This is a stripped down version of what we want to do with a public ByteStreams::toByteArray.

public byte[] decrypt(byte[] data, Cipher cipher){
  return decryptAndOrUncompress(data, false, data.length, cipher);
}

public byte[] decryptAndUncompress(byte[] data, int finalLength, Cipher cipher){
  return decryptAndOrUncompress(data, true, finalLength, cipher);
}

private byte[] decryptAndOrUncompress(byte[] data, int estimatedLength, Cipher cipher, boolean isCompressed){
  ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(data);
  CipherInputStream cipherInputStream = new CipherInputStream(byteArrayInputStream, cipher);
  InputStream dataInputStream = isCompressed ? new InflaterInputStream(cipherInputStream) : cipherInputStream;
  return ByteStreams.toByteArray(dataInputStream, estimatedLength);
}

The incoming data arrays do not come from disk and vary in size up to multiple megabytes.
This happens on our main data flow, and the allocations have become a noticeable hot spot.

The estimated length interface is very helpful for the decrypt method.
data.length provides a good estimate for the final size but is probably wrong because of padding added during the encryption process.

By this description, I seem to fall into "won't have opened the stream yet (in which case they can implement ByteSource and get this behavior for free once we make the appropriate optimization there)".
However, this seems like a lot of overhead for reading an InputStream to a byte[].

mlmitch on 16 Mar 2018

Thanks for the details. While I would still recommend the ByteSource approach in most cases, it does feel wrong to tell you to implement its sizeIfKnown method to return a size that's known to likely be wrong, so it's probably the wrong approach for your case. The size issue also explains why readFully isn't a good fit for you.

cpovirk on 16 Mar 2018

@cpovirk FWIW, even for ZipFile, the ByteSource approach works: ZipFile has a getInputStream(ZipEntry) method that returns an InputStream for reading that entry's content, so a ByteSource wrapping a ZipFile + ZipEntry can be created.

cgdecker on 16 Mar 2018

I don't know if this will make you feel better or worse, but we dedicated a couple engineer hours to discussing this over the past couple days...

thank you. definitely better.

Lanchon on 19 Mar 2018

Are there any updates on this ticket?

mlmitch on 27 Mar 2018

Based on the responses I got when we discussed this internally earlier this year, I'm going to close this. As @cpovirk said above, the main issue with this is that the use cases for this are very limited, especially given that ZipFile (and even the decryption use case above, though it might not be ideal there) can be handled using ByteSource.

For what it's worth, I will note that I made some changes in 25.0 (d0d5bd73d86efa5f3e4989ca589c9c7fc0b232b2) that should improve the performance of ByteStreams.toByteArray in general... it was doing some pretty suboptimal stuff.

cgdecker on 27 Jun 2018

@cgdecker You marked #3054 as a duplicate since it was contained in this issue.

Can you reconsider making
static byte[] toByteArray(InputStream in, long expectedSize)
public in ByteStreams.java?

It is a static utility method that provides useful functionality as is. I can't think of a reason for it not to be public.

mlmitch on 28 Jun 2018

too bad you consider something that both you and i need and use (you use it internally) as being of no use. IMHO implementing ByteSource to do the work of what should be a static method call is obscure, verbose, error prone, and plain ridiculous. and although i aplaude the efforts on ByteStreams.toByteArray, its performance will never come close to allocating an array once of the expected size. thanks!

Lanchon on 30 Jun 2018

btw, you already implemented all of what i proposed here in 6cd80a9443ab2d8f15c8bff60b289e3b5d94a9d0 (including deleting Files.readFile) except for the check

     // some special files may return size 0 but have content, so read 
     // the file normally in that case 
     return expectedSize <= 0 
         ? ByteStreams.toByteArray(in) 
         : ByteStreams.toByteArray(in, expectedSize);

and making the new method ByteStreams.toByteArray(InputStream in, long expectedSize) public.

so i rest my case: you actually needed all this, and i need it too.

Lanchon on 30 Jun 2018

i also want to comment on this: