Azure-storage-azcopy: AzCopy run inside a cronned script outputs 0-byte blobs when piped

Created on 24 May 2019 · 15 Comments · Source: Azure/azure-storage-azcopy

Which version of the AzCopy was used?

azcopy version 10.1.2

Which platform are you using? (ex: Windows, Mac, Linux)

Ubuntu 18.04.2

The problem

  • When a bash script containing azcopy is run via cron, azcopy is set to copy something from Blob Storage, and the result is piped (to gzip / tar / dd / etc.), the Blob is not read properly (0 bytes are returned). Additionally, if the SAS token includes Write permission, the remote Blob in the Storage Account gets overwritten and becomes a 0-byte-long one.

  • When a bash script containing azcopy is run via cron and azcopy is set to copy something from Blob Storage and the result is not piped, the problem does not occur.

  • When a bash script containing azcopy is run manually, the problem does not occur in either case (pipe or no pipe).

Details

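Consider the following script:

#!/bin/sh
# Case 1: download straight to a local path (works under cron).
azcopy cp "path/blob?sas_token" /target/dir
# Case 2: download to stdout and pipe it onward (0 bytes under cron).
azcopy cp "path/blob?sas_token" | dd of=/target/dir

  • Running it manually via ./script.sh results in both files being written properly.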

  • Running it via cron (0 0 * * * root /usr/local/sbin/script.sh >> /var/log/output.log 2>&1) results in the first file being written properly, while the second one is written with a length of 0 bytes. If the SAS token includes Write permission, the remote Blob in the Storage Account gets overwritten and becomes a 0-byte-long one.

Have you found a mitigation/solution?

Unfortunately not.

All 15 comments

Thanks for reporting this.

I guess it's worth noting that in a very similar environment (although not identical - see below), copying data to Storage Account (not from) in a "piped cron" works perfectly: azcopy version 10.1.1 + Ubuntu 16.04.6

0 0 * * * root /usr/local/sbin/script.sh >> /var/log/output.log 2>&1 
#!/bin/sh
tar cf - directory --absolute-names | pigz | azcopy cp "path/blob?sas_token"

Thanks @schybbkoh for the detailed description!

It looks like the tool was somehow tricked into thinking there's data coming in from stdin. We'll investigate and update this thread accordingly.

Experiencing the same thing with a script that copies data from Azure to stdout which is piped to another command, then the whole script invoked from a Java app. I created a simple Python/shell script (running on Ubuntu 18.04) that does something similar:

testazcopy.sh

#!/usr/bin/env bash
azcopy copy 'https://mystorageaccount.blob.core.windows.net/container/foo.txt?sas_token' | cat > foo.txt

testazcopy.py

from subprocess import Popen, PIPE
Popen("./testazcopy.sh", stdin=PIPE, shell=True)

Running the Python script will not only emit a zero byte "foo.txt", but it will also truncate the file in the storage account to zero bytes as well. I think this is because azcopy sees stdin as a pipe and thinks it should do a PUT rather than a GET.

Not sure if this explains the problem within the context of cron though.
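
One way to probe the hypothesis above is to log what is attached to stdin when the script starts, once from an interactive shell and once from cron. The sketch below is a diagnostic only and says nothing about how azcopy itself decides; `stdin_kind` is a hypothetical helper name, and the `/dev/stdin` checks assume a Linux system such as the Ubuntu hosts mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report what is attached to stdin (file descriptor 0).
stdin_kind() {
    if [ -t 0 ]; then
        echo "terminal"          # interactive shell
    elif [ -p /dev/stdin ]; then
        echo "pipe"              # e.g. Popen(..., stdin=PIPE) or "cmd | script"
    elif [ -f /dev/stdin ]; then
        echo "file"              # e.g. "script < some_file"
    else
        echo "other"             # e.g. /dev/null or a closed descriptor
    fi
}

# Log the result to stderr so it lands in the cron log without polluting stdout.
echo "stdin is: $(stdin_kind)" >&2
```

Placing these lines at the top of script.sh and comparing the cron log against a manual run would show whether the two environments really differ in the way suspected.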

Hi @ampedandwired, thanks for your feedback! I've logged this item for further investigation.

Just sharing a related anecdote. I somehow got a bunch of files corrupted today, in production. Uploaded files across all environments of Azure Storage using azcopy and then synced back to our central SharePoint-synced server, so not exactly sure at which stage this happened. Definitely didn't use cron. But ended up with hundreds of files with correct size and content-type but zeroes inside! Couldn't have faked it myself if I wanted to! And even more files were completely fine. Thankfully, one dev had a recent backup of the entire folder and I used it to restore. Couldn't figure out what it was though. I used v10 but not sure which build as I already updated azcopy.exe from the official online location. 🤪

Thanks @rsheptolut for the insights!

Is it possible that one of the jobs got cancelled mid-way? There's a known issue (which we are solving) that cancellation is not cleaning up the in-flight transfers correctly.

(Tagged with needs-retest not because it exactly needs retesting by the project team, but really just to mean, "needs active investigation".) First step is to hear back from @rsheptolut re Ze's question above.

@zezha-msft Yes. I closed the window, probably while running the azcopy sync CLOUD -> LOCAL command, at one point.

This is a script that I run manually after changing a couple of files; it syncs changes in my local folder to Azure Storage. Then on another PC I run the opposite to sync Azure Storage to a local dir, which then automatically syncs with SharePoint via OneDrive sync. Or it can go the other way around when someone else changed files and synced them to SharePoint. We have a lot of "security"-induced workflows like this: we have to use an intermediate PC through RDP because of the corporate firewall. And I might have run those scripts a couple of times in different combinations to make sure everything is synced.

At some point I definitely remember getting tired of waiting and closing the cmd window, to resume later. The expectation there was that I can cancel it at any time and it will leave the files in a consistent state: all files intact, some just not synced yet, but I can always rerun the script (the script simply runs azcopy sync for several containers) to finish the sync. I closed the script because for some reason it started doing hundreds of file transfers, even though I only changed a couple of files. Dates may have changed, but not contents or md5. For some reason, I was assuming that azcopy sync looks at md5 (which I put there with put-md5 for all files) to determine if the file is updated, rather than simply relying on dates. Would be cool if azcopy started relying on md5 instead, but I digress.

Another reason for closing it was that I'm temporarily on a 4 Mbit connection now and needed to take my internet back urgently for something: it's impossible to do anything on a poor connection when you have azcopy running, as it somehow exhausts the entire connection. And then I might or might not have rerun the script to finish the sync: I was pretty tired that night. That's the entire story and the reasoning. Sorry that I don't remember the exact proper repro steps instead, and thank you for looking into it!

Hi @rsheptolut, thanks for the detailed explanation!

Ah yes, if the tool is killed while running, the inflight transfers would not complete, and the destination files could be left in a corrupt state.

That's a great insight though. We'll think about how we can make the experience better. Perhaps we could keep the incomplete files with a suffix ".partial", and rename them to the real name once download completes.
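
Until something like that exists in the tool, a script can approximate the ".partial" idea on its own side. The following is a minimal sketch, assuming the download command writes the blob to stdout; `download_atomically` is a hypothetical helper name and not part of azcopy.

```shell
#!/bin/sh
# Hypothetical helper: write a command's stdout to "<dest>.partial" and
# rename it to <dest> only if the command exits successfully.
# Usage: download_atomically DEST COMMAND [ARGS...]
download_atomically() {
    dest=$1
    shift
    tmp="$dest.partial"
    if "$@" > "$tmp"; then
        # Rename within the same directory, so readers never see a half-written file.
        mv -- "$tmp" "$dest"
    else
        # The command failed: discard the incomplete file.
        rm -f -- "$tmp"
        return 1
    fi
}
```

If the script is killed outright mid-transfer, the cleanup branch never runs, but the leftover file still carries the ".partial" suffix and the real destination name is never left holding incomplete data.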

@zezha-msft Something like that would be neat!

Actually though, it would make sense if stdin for cron was a pipe. Perhaps we should use a flag for pipe up/download so we don't get confused by this. Thoughts, @zezha-msft?
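
If the tool really is keying off a piped stdin, one thing a script could try in the meantime is to detach stdin explicitly before invoking it. This is an assumption that follows from the hypothesis in this thread and has not been verified against azcopy; `without_stdin` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: run a command with stdin attached to /dev/null,
# so the command cannot see an inherited pipe on file descriptor 0.
without_stdin() {
    "$@" < /dev/null
}

# Intended use (not verified against azcopy; the path and SAS token are the
# placeholders from the original report):
#   without_stdin azcopy cp "path/blob?sas_token" | dd of=/target/dir
```

The redirection only changes what the command inherits as stdin; its stdout still flows into the rest of the pipeline as before.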

Closing due to age of issue. Please let us know if you are still experiencing difficulties, we can reopen if needed.

Personally I'm not experiencing difficulties anymore since I've moved to a more reliable cloud provider ;) Whether the issue still occurs or got fixed in one of the commits since it was reported, I do not know. If anyone reading this struggles with it, please do comment.

@shybbko sorry to hear that 😢
