Azure-sdk-for-java: File upload fails for files larger than 200-500 MB

Created on 7 Dec 2020 · 17 Comments · Source: Azure/azure-sdk-for-java

azure-storage-blob SDK version=12.7.0
Java Version:- OpenJDK11

To upload a file to Azure Storage we use the code below, but it fails for files larger than roughly 200-500 MB.

        BlobClient blobClient = AzureHelper.getBlobContainerClient(
                AzureHelper.getBlobServiceClient(serviceEndpoint, account, key), container)
                .getBlobClient(destFile);

        bin = prepareEncryptStream(encryptionFlag, is, parametersMap);

        blobClient.upload(bin, length, true);

        BlobProperties prop = blobClient.getProperties();
        responseMap.put(HybridConstants.PARAM_FILE_SIZE, String.valueOf(prop.getBlobSize()));

        byte[] contentMd5 = prop.getContentMd5();

        String md5Value = RepositoryHelper.getInstance().convertMD5ByteToHexEncodedString(contentMd5);

I think this is caused by the blobClient.upload method behaving asynchronously.

Can anyone help me sort out this large-file upload issue?
Please also note that after uploading a file, we read getContentMd5() for the uploaded file from the blob properties.
So if we adopt another approach such as BlobRequestOptions, how do we get the ContentMd5() of a file uploaded in chunks, to confirm that it matches the MD5 value of a single upload?

So we want a solution that handles large files and also provides the content-MD5 of the file.

Client Storage customer-reported question

harisingh-highq

Most helpful comment

Ah, so storage has 2 concepts of md5

transactionalMd5 - this is the md5 of the data sent in the request (so for a chunked upload, it would be the md5 of the chunk) and the service will validate this md5 to make sure there were no corruptions over the network, but the service will not store this md5. The SDK will actually compute this for you if you set BlobParallelUploadOptions.computeMd5.
BlobHttpHeaders.contentMd5 - this is file based (just stored with the file and not really checked by the service) and returned when you call getProperties. This is the dummy md5 that @rakhmvi pointed out.

@harisingh-highq I think the problem in your code is that you are closing the input stream before passing it to the client. I think you need to keep the IS open and reset it before passing it to the SDK.

gapra-msft on 9 Dec 2020

👍2

All 17 comments

@harisingh-highq

Thank you for reporting this issue.

Could you please paste the stack trace or describe the error you are encountering? The upload method should be able to handle large uploads by chunking the data you provide.

gapra-msft on 7 Dec 2020

Hi @gapra-msft
In my case I use code similar to @harisingh-highq's to upload buffered stream data to Azure. I've noticed that files larger than 300 MB have an empty MD5 value in their properties. This value is mandatory in my case. Could you suggest how to populate the MD5?
Kind regards

rakhmvi on 8 Dec 2020

👍1

Hi @gapra-msft
FYI, larger files are uploaded successfully with the above code, but as @rakhmvi commented, getContentMd5() returns null for files larger than approximately 255 MB, and no exception is thrown.

This content-MD5 value is mandatory in our case.
I also checked this case: if I upload a file in the Azure portal and edit it after upload, its content-MD5 value changes as well, which is the right and expected behavior.

So for larger files (> 256 MB), how do we get the getContentMd5() value from the blob's metadata?

Can you please help us with this issue?
Thanks

harisingh-highq on 8 Dec 2020

@harisingh-highq and @rakhmvi Thank you for clarifying the issue.

According to the REST docs, Get Blob Properties only returns the contentMd5 in the following situations:

If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity.
In version 2012-02-12 and newer, Put Blob sets a block blob’s MD5 value even when the Put Blob request doesn’t include an MD5 header.

So the service will compute the md5 if the blob is small enough to fit in a single put blob request (which I think used to be around 256MB). It looks like you will have to compute the md5 yourself if the file is any larger and you need to get the md5 back.

gapra-msft on 8 Dec 2020

Hi @gapra-msft, thanks for responding so quickly on this thread. @harisingh-highq, @rakhmvi and I are all working on the same project, and we are stuck at this stage, fetching the MD5 for larger files.

I have one question about your last comment: are you suggesting that for large files (> 256 MB) we should calculate the MD5 ourselves and pass it in the upload call?

That seems strange, because we generally use MD5 to make sure that data is not corrupted in storage. If I pass the MD5 during upload and the data is later corrupted in storage, then checking the MD5 in the metadata returns the same MD5 I sent originally. So how can we detect whether the file is corrupted?

niravravalhighq on 9 Dec 2020

Hi @gapra-msft , thank you for your efforts.

Azure calculates the MD5 value for a file whose size is less than the maxSingleUploadSize value of ParallelTransferOptions (default 256 MB). If the size of the data is less than or equal to this value, it is uploaded in a single put rather than broken up into chunks. For a single put, Azure compares the MD5 value passed via metadata with the one it calculates on the Azure side. If the values differ, the file is not stored.
However, for a file larger than maxSingleUploadSize (default 256 MB), any value (even a dummy one) can be passed in the metadata, and the file is stored successfully with that MD5.

@gapra-msft Could you confirm or clarify the above?

rakhmvi on 9 Dec 2020
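
The size threshold described in the comment above can be pictured with a small sketch. It is not SDK code: the class and method names are hypothetical, and the 256 MB single-upload limit and 4 MB block size are assumed values for illustration only (the thread itself notes that the real limit has changed over time). The point is simply that data at or below the threshold goes up in one request, while anything larger is split into blocks.

```java
// Illustration only: not SDK code. Names and sizes are assumptions.
final class UploadPlan {
    static final long MAX_SINGLE_UPLOAD_SIZE = 256L * 1024 * 1024; // assumed 256 MB threshold
    static final long BLOCK_SIZE = 4L * 1024 * 1024;               // assumed 4 MB block size

    /** True when the data fits in a single put request. */
    static boolean isSinglePut(long length) {
        return length <= MAX_SINGLE_UPLOAD_SIZE;
    }

    /** Number of requests that carry data: 1 for a single put, otherwise one per block. */
    static long dataRequests(long length) {
        if (isSinglePut(length)) {
            return 1;
        }
        return (length + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        long small = 100L * 1024 * 1024;
        long large = 300L * 1024 * 1024;
        System.out.println("100 MB single put: " + isSinglePut(small));
        System.out.println("300 MB single put: " + isSinglePut(large)
                + ", blocks: " + dataRequests(large));
    }
}
```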

Hi @gapra-msft
Thank you for your quick response.

Apart from @niravravalhighq's comment, we are also stuck on the MD5 calculation logic on our side.

To calculate the MD5 of an input stream, we need to read that stream during the calculation.
As far as I know, once an input stream has been read we can't read it again, because it is then empty.

This is our MD5 calculation logic:

        public String getFileChecksumForInputStream(InputStream is, byte[] byteArray) throws IOException {
            MessageDigest digest = null;
            StringBuilder sb = new StringBuilder();
            try {
                digest = MessageDigest.getInstance("MD5");
            } catch (NoSuchAlgorithmException e) {
                logger.error(ExceptionUtilityHelper.accessExceptionStackTrace(e));
            }
            if (digest != null) {
                int bytesCount = 0;
                // Read the input stream and feed it to the message digest
                while ((bytesCount = is.read(byteArray)) != -1) {
                    digest.update(byteArray, 0, bytesCount);
                }
                is.close();
                byte[] bytes = digest.digest();
                // Convert the digest bytes to a lowercase hex string
                for (int i = 0; i < bytes.length; i++) {
                    sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
                }
            }
            String sbString = sb.toString();
            sb.setLength(0);
            return sbString;
        }

After running the above code, I pass the same input stream to the Azure upload API as follows:

        blobClient.upload(is, length, true);

It throws an exception because the stream is already closed and we are trying to read it a second time.

Note: We have a strict requirement not to store the input stream in a local temp file. We work directly with the request input stream.

Can you please help us solve this: how can we use the same input stream both for the MD5 calculation and as the upload API parameter?

Thank you. Your help would be appreciated.

harisingh-highq on 9 Dec 2020
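
One way around reading the stream twice is to compute the digest while the data is being consumed, so the stream is read exactly once and never buffered whole. Below is a minimal sketch using the JDK's `java.security.DigestInputStream`. The `consume` method is a hypothetical stand-in for whatever reads the stream (for example the upload call); it is not an SDK method, and the class name is illustrative. The sketch assumes a single forward read: if the consumer rewinds and re-reads part of the stream, those bytes would be hashed twice and the digest would no longer match.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SinglePassMd5 {
    /** Hypothetical stand-in for the uploader: drains the stream, returns the byte count. */
    static long consume(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    /** Reads the stream once via consume() and returns the hex MD5 of everything read. */
    static String md5WhileConsuming(InputStream source)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream digesting = new DigestInputStream(source, md5)) {
            consume(digesting); // in real code this would be the upload call
        }
        // The digest is complete only after the consumer has reached end of stream.
        return toHex(md5.digest());
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(md5WhileConsuming(in));
    }
}
```

The design choice here is to hash as a side effect of the one read that has to happen anyway, which keeps memory use at the size of the read buffer and needs no temp file.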

Ah, so storage has 2 concepts of md5

transactionalMd5 - this is the md5 of the data sent in the request (so for a chunked upload, it would be the md5 of the chunk) and the service will validate this md5 to make sure there were no corruptions over the network, but the service will not store this md5. The SDK will actually compute this for you if you set BlobParallelUploadOptions.computeMd5.
BlobHttpHeaders.contentMd5 - this is file based (just stored with the file and not really checked by the service) and returned when you call getProperties. This is the dummy md5 that @rakhmvi pointed out.

gapra-msft on 9 Dec 2020
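
The difference between the two values can be seen without any SDK. The sketch below uses only the JDK: it splits a payload into chunks and computes an MD5 per chunk (what a transactional MD5 would cover for each request) alongside the MD5 of the whole payload (what one would store as contentMd5). The chunk size and class name are arbitrary choices for illustration. Values are Base64-encoded, which is the encoding used by the Content-MD5 HTTP header.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

final class ChunkVersusWholeMd5 {
    /** Base64-encoded MD5 of the given bytes. */
    static String base64Md5(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        return Base64.getEncoder().encodeToString(digest);
    }

    /** One MD5 per chunk, in order; the last chunk may be shorter than chunkSize. */
    static List<String> perChunkMd5(byte[] data, int chunkSize) throws NoSuchAlgorithmException {
        List<String> result = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int end = Math.min(offset + chunkSize, data.length);
            result.add(base64Md5(Arrays.copyOfRange(data, offset, end)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = "hello world".getBytes(StandardCharsets.UTF_8);
        System.out.println("whole:  " + base64Md5(payload));
        System.out.println("chunks: " + perChunkMd5(payload, 4));
    }
}
```

None of the per-chunk values equals the whole-payload value, which is why validating each request does not by itself give a stored MD5 for the complete blob.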

👍2

@gapra-msft
Thank you for your valuable response.

FYI, even if I remove the close() call and reset() the stream before passing it to the SDK, it still does not work.

It throws the exception below.

Setting mark(0) does not work either, because once the stream has been read we can't read it again just by resetting its pointer.

2nd way:

I also tried copying the input stream to a ByteArrayOutputStream for a later read, but that loads the full file into memory,
so for a large file stream it causes memory issues.

In the above code, baos.toByteArray() loads the whole stream into JVM memory.

3rd way:
I tried PipedOutputStream and PipedInputStream, which synchronize reading and writing in parallel, but with this solution it gets stuck at the
pout.write(byteBuffer.array(), 0, len); line after reading some bytes.

        BlobClient blobClient = AzureHelper.getBlobContainerClient(
                AzureHelper.getBlobServiceClient(serviceEndpoint, account, key), container)
                .getBlobClient(destFile);

        PipedOutputStream pout = new PipedOutputStream();
        PipedInputStream pin = new PipedInputStream(pout, 1024*4*6);

        bin = prepareEncryptStream(encryptionFlag, is, parametersMap, pin);

        ReadableByteChannel channel = Channels.newChannel(bin);

        MessageDigest plainDigest =MessageDigest.getInstance("MD5");
        int readBufferSize = 1024*4; //Read 4096 bytes in one channel.read 
        java.nio.ByteBuffer byteBuffer = java.nio.ByteBuffer.allocate(readBufferSize);//4KB default buffer
        int len;
        ByteArrayOutputStream baos = new ByteArrayOutputStream(readBufferSize);
        while((len=channel.read(byteBuffer))>=0)
        {
            int retryCnt = 0;
            if(len!=0){
                /*plain input stream hash value calculation*/
                plainDigest.update(byteBuffer.array(), 0, len);
                pout.write(byteBuffer.array(), 0, len);
            }
            byteBuffer.clear();
        }
        if(plainDigest !=null){
            byte[] plainBytes = plainDigest.digest();
            StringBuilder plainHash = new StringBuilder();
            for (int i = 0; i < plainBytes.length; i++) {
                plainHash.append(Integer.toString((plainBytes[i] & 0xff) + 0x100, 16).substring(1));
            }
            responseMap.put(HybridConstants.PARAM_ENCRYPTED_FILE_HASH_VALUE, plainHash.toString());
        }

        BufferedInputStream bis = new BufferedInputStream(pin, readBufferSize);

        blobClient.upload(bis, length, true);

Can you please suggest a way to calculate the MD5 and upload the file at the same time?
Thank you

harisingh-highq on 10 Dec 2020
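
The hang in the third approach is consistent with how piped streams work: `PipedOutputStream.write` blocks once the pipe's buffer (24 KB in the snippet above) is full, until another thread reads from the connected `PipedInputStream`. In that snippet nothing reads `pin` until the write loop has finished, so the writer waits indefinitely. Below is a minimal sketch of the usual pattern, with the reader running on its own thread. The `drain` method is a hypothetical stand-in for the upload call, and the class name is illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.security.MessageDigest;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PipedDigestDemo {
    /** Hypothetical stand-in for the upload: reads to end of stream, returns the byte count. */
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /** Hashes source while copying it into a pipe that a second thread consumes. */
    static long copyThroughPipe(InputStream source, MessageDigest digest) throws Exception {
        PipedOutputStream pout = new PipedOutputStream();
        PipedInputStream pin = new PipedInputStream(pout, 24 * 1024);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Start the reader first so the writer never waits forever on a full buffer.
            Callable<Long> reader = () -> drain(pin);
            Future<Long> consumed = executor.submit(reader);
            byte[] buf = new byte[4096];
            int n;
            try {
                while ((n = source.read(buf)) != -1) {
                    digest.update(buf, 0, n);
                    pout.write(buf, 0, n);
                }
            } finally {
                pout.close(); // closing the writer signals end of stream to the reader
            }
            return consumed.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[100_000]; // larger than the pipe buffer on purpose
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        long count = copyThroughPipe(new ByteArrayInputStream(data), md5);
        System.out.println("bytes consumed: " + count);
    }
}
```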

Hi @gapra-msft ,

transactionalMd5 - this is the md5 of the data sent in the request (so for a chunked upload, it would be the md5 of the chunk) and the service will validate this md5 to make sure there were no corruptions over the network, but the service will not store this md5. The SDK will actually compute this for you if you set BlobParallelUploadOptions.computeMd5.

As BlobParallelUploadOptions has no setComputeMd5 in version 12.7 of the SDK, does SDK version 12.7 guarantee that data cannot be corrupted during the upload process?
Kind regards

rakhmvi on 11 Dec 2020

@gapra-msft
I have a question about the content-MD5 value set in the metadata using setHttpHeadersWithResponse, as below:

        blobClient.setHttpHeadersWithResponse(
                new BlobHttpHeaders()
                        .setContentMd5(Hex.decodeHex(dummyString))
                        .setContentType("application/octet-stream"),
                requestConditions)
            .subscribe(response ->
                System.out.printf("Set HTTP headers completed with status %d%n", response.getStatusCode()));

Please note the important questions below:

Suppose I am able to set the content-md5 value in the blob metadata with the above code. If I then edit the file manually in Azure Storage Explorer, will the content-md5 value in the blob properties change accordingly (for larger files > 256 MB)?
If I pass the content-md5 during upload and the data is later corrupted in storage, then checking the MD5 in the metadata returns the same MD5 I sent originally. So how can we detect whether the file is corrupted?
(We should get a real-time md5 value to ensure data integrity.)
Also, as @rakhmvi asked above, does the service validate the content-md5 we set against the actual content-md5 to ensure there is no corruption during the upload process?

harisingh-highq on 11 Dec 2020

Thanks for your questions,
@rakhmvi

_As BlobParallelUploadOptions has no setComputeMd5 in12.7 version of sdk, does 12.7 versions sdk guarantee that data can not be corrupted during the upload process?_
No, the SDK did not guarantee no corruptions during a network transfer, but we exposed this feature so customers can be sure their data was transferred successfully over the network.

@harisingh-highq

_Suppose if I am able to set content-md5 value successfully in blob metadata with above code, then if I will edit file manually from azure storage explorer, the content-md5 value will be changed in blob properties accordingly?(For larger file > 256MB size)_
If your large file has some content-md5 and you modify the file, the service will not modify the content-md5 accordingly. It is up to the person updating the file to call setHttpHeaders and update the content-md5. Alternatively, you can also set the BlobHttpHeaders.contentMd5 when calling upload or commitBlockList (so it doesn't have to be a separate network request).

_Here if I pass content-md5 while upload, and something corruption happened on storage in future and then If I check metadata of MD5 then I will get same MD5 which I sent in past so how can we detect that file is corrupted or not?
(We should get real time md5 value to ensure data integrity)_
The BlobHttpHeaders.content-md5 value is just a convenient place to store your data's expected md5 value. The value stored on the service provides no guarantees that it is the md5 of the data in storage. To detect if the file is corrupted, you will have to download the file, compute the md5 and compare it to the expected md5 of the data.
_Also as @rakhmvi asked in above question like is it validate set content-md5 and actual content-md5 to ensure there is no corruption during upload process?_
I think I answered above. The computeMd5 parameter is to ensure successful transfer over the wire and is not stored by the service.

As for your code snippets that fail, I think you need to call mark with a value large enough that you can still reset after reading to the end of the stream. When you say mark(0), I think you mean that you want to come back to position 0. How mark works is that the mark is set at the stream's current position, and the int passed is how many bytes you can read before you can no longer reset back to the mark. I hope that makes sense.

gapra-msft on 11 Dec 2020
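
The mark and reset behavior described above can be sketched with a `BufferedInputStream`. This is a JDK-only illustration with a hypothetical class name: `mark(readLimit)` records the current position and allows up to `readLimit` bytes to be read before `reset()` returns to it. Note that the stream has to keep everything read after the mark in memory, so for a large file this costs roughly as much as copying it into a byte array.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class MarkResetDemo {
    /**
     * Hashes the rest of the stream, then rewinds it to where it started.
     * readLimit should be larger than the number of bytes remaining in the stream.
     */
    static byte[] md5ThenRewind(BufferedInputStream in, int readLimit)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        in.mark(readLimit); // mark the CURRENT position, valid for readLimit bytes
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            md5.update(buf, 0, n);
        }
        in.reset(); // back to the marked position; fails if the mark was invalidated
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] digest = md5ThenRewind(in, data.length + 1);
        // The stream can now be read again from the start.
        System.out.println("digest bytes: " + digest.length + ", first byte after reset: " + (char) in.read());
    }
}
```

In the usage above the limit is set to one more than the data length, so that the final read that detects end of stream still falls within the limit.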

👍1

@gapra-msft
Thank you for your continuous response.

Finally, we would like to let you know that we are now able to calculate the content-md5 during file upload in another way and set it in the metadata successfully.

Following your previous comment, we are worried about data integrity in the Azure service: it works fine for smaller files, but there is a limitation for larger files.

The BlobHttpHeaders.content-md5 value is just a convenient place to store your data's expected md5 value. The value stored on the service provides no guarantees that it is the md5 of the data in storage.

(For smaller files < 256 MB, the content-md5 value is in sync with the current data of the blob. For larger files > 256 MB, it is not.)

I also have a question about our final solution, shown below:

As you can see in the highlighted area, we used BlobAsyncClient for the Azure blob client connection
and BlobOutputStream to write the file for upload.

So is there any difference between BlobAsyncClient and BlobClient in terms of performance, concurrency, network glitches, multithreading, async behavior, upload time, etc.?

Can you please clarify this? I look forward to hearing from you.
Thank you.

harisingh-highq on 15 Dec 2020

@harisingh-highq

Great to hear that you got your solution working!

**_Now as per your previous comment, we're worried about data integrity issue in azure service that it works fine in case of smaller files but found limitation in case of larger files.

The BlobHttpHeaders.content-md5 value is just a convenient place to store your data's expected md5 value. The value stored on the service provides no guarantees that it is the md5 of the data in storage.

(For smaller file < 256MB, content-md5 value is in sync with real time data of blob. But in case of larger files > 256MB, content-md5 value is not in sync with real time data of blob.)_**

I would like to clarify the service behavior. The service will automatically compute the md5 of data that is uploaded in a single put request (around 256MB but this limit has increased recently) since the service can be sure that the information there is the entire content of the blob, and the blob is essentially immutable. However, for larger blobs when a user calls multiple stage blocks and commit block list, the content of the blob can change, requiring recalculation of the MD5. This is expensive, and the storage service does not support this functionality.

_So is there any difference between BlobAsyncClient v/s BlobClient in term of performance, concurrency, n/w glitch, multi thread, async behavior, upload time diff etc?_
In general, the BlobClient simply wraps the BlobAsyncClient and blocks on the respective call to make it synchronous. In Reactor (async operations) there are nice ways to handle concurrency and multi threaded scenarios, and you can use it if you are familiar with/comfortable learning the style of programming.

But the BlobOutputStream API is inherently a sync API that uses an async buffered upload under the hood, whether you get there from a BlobClient or a BlobAsyncClient. The following two snippets get the same BlobOutputStream. The second is the recommended way, only because OutputStream is a sync operation.

        BlobOutputStream.blockBlobOutputStream(blobAsyncClient, blobOptions, Context.NONE);
        blobClient.getBlockBlobClient().getBlobOutputStream(blobOptions);

gapra-msft on 15 Dec 2020
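
Since BlobOutputStream is an ordinary `OutputStream`, one way to hash data while it is being written is to wrap the destination in the JDK's `java.security.DigestOutputStream`. The sketch below is an assumption about how such a solution could look, not the code used in this thread: a `ByteArrayOutputStream` stands in for the BlobOutputStream, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

final class DigestWhileWriting {
    /**
     * Copies source to destination in one pass and returns the MD5 of the bytes written.
     * The caller remains responsible for closing both streams.
     */
    static byte[] copyAndDigest(InputStream source, OutputStream destination)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        DigestOutputStream out = new DigestOutputStream(destination, md5);
        byte[] buf = new byte[8192];
        int n;
        while ((n = source.read(buf)) != -1) {
            out.write(buf, 0, n); // writes to destination and updates the digest
        }
        out.flush();
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        // ByteArrayOutputStream is a stand-in for the real destination stream.
        ByteArrayOutputStream destination = new ByteArrayOutputStream();
        byte[] digest = copyAndDigest(new ByteArrayInputStream(data), destination);
        System.out.println("written: " + destination.size()
                + " bytes, md5 (Base64): " + Base64.getEncoder().encodeToString(digest));
    }
}
```

The resulting digest bytes are what one would then store as the blob's contentMd5, for example through setHttpHeaders as discussed earlier in the thread.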

Hi @gapra-msft
Following our conversation above, we are now able to upload large files successfully with MD5 calculation.

We are now facing a new issue with file upload through Apache HTTP Server, only for large files.
Please refer to #18700.

Could you please help us sort out this issue?

harisingh-highq on 20 Jan 2021

Hi @harisingh-highq

Thanks for posting this new issue. I can take a look at it and respond on that thread. Since this issue seems to have been resolved, could we close it?

gapra-msft on 20 Jan 2021

Hi @gapra-msft
Yes, you can close this issue as this issue is specific to contentMD5 calculation for large file upload case and it is resolved now.
Thanks a lot

harisingh-highq on 5 Feb 2021

👍1
