BuildTools had a DownloadFromAzure task. This task is useful for folks, so it should be ported into Arcade.
Required for how ToF infra – ToF regularly downloads huge amounts of logs from blob storage and had done a lot of work to improve perf on DownloadFromAzure. To quote Tomas Rylek:
We went to great lengths to apply parallelization to speed up our downloads / uploads using the MaxClients parameter of UploadToAzure / DownloadFromAzure (please note that we鈥檙e routinely uploading and downloading tens of thousands of items in the ProjectN ToF infra) and I鈥檓 wondering whether such a thing can be easily replicated by msbuild-side parallelization of the DownloadFile task.
@jonfortescue , can you add some more details to this issue? Links to where this is needed / how this is used would help.
@chcosta Hopefully that helps!
That's generally useful. I'm also interested in where it's actually being used in the ToF infra, who the people that care are, and where else the task is being used so that whoever is doing the port can make an informed design decision if they decide they need / want to change anything.
@trylek can probably add more context on that.
@chcosta I think the problem has two aspects - correctness and performance. The DownloadFile API I found in today Arcade SDK that seems to be a loose equivalent of BuildTools "DownloadFromAzure" constructs a new instance of HttpClient for each file download. AFAIK this is not recommended and we hit issues in our ToF infra where some instances of the DownloadFromAzure task were intermittently failing with the bug
E:\LabController\work\Reporting.1554\ToFOnHelix\OrchestrateMultiLegRun.proj(106,9): error : System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: Only one usage of each socket address (protocol/network address/port) is normally permitted 13.88.144.240:443
which disappeared after I modified the DownloadFromAzure task to use a single HttpClient instance for all file downloads within the task.
On the performance side, emitting the results report for a single ToF leg typically involves downloading thousands of files from Azure; for composite reports aggregating results for a larger number of legs it can easily end up downloading tens of thousands of files. I'm worried that it would be hard to match internal parallelization in DownloadFromAzure just by msbuild-side parallelization of the individual DownloadFile task executions.
@chcosta, @jonfortescue Please let me know if you need additional info from me to move forward with this PR which I believe to be general goodness and a prerequisite for onboarding Project N Infrastructure to Arcade SDK. I hoped to be able to start prototyping the switch-over last weekend during my stay in Redmond but I had to focus on other tasks and it also wouldn't make much sense before we reach an agreement on this issue.
In general I believe this is the only real blocker for our switch-over to Arcade SDK. There are two more minor issues I'm aware of - the lack of ability to explicitly specify ContentType for the individual items in UploadToAzure and a nice-to-have regarding parallelization of large file uploads to Azure but these are of lesser priority and generally not blocking.
Thanks a lot
Tomas
@trylek - sorry, I can't find your PR. I presume you're looking at https://github.com/dotnet/arcade/blob/master/src/Microsoft.DotNet.Arcade.Sdk/src/DownloadFile.cs? Do you think your proposed changes should be incorporated into the existing DownloadFile.cs?
@markwilkie Sorry about the confusion - I was under the impression that the discussion on this task somewhat petered out so that there was no agreement whether porting the DownloadFromAzure task is desirable in general. If you agree this is fine, I can easily prepare the porting PR - I am worried that it would be harder to "fix" DownloadFile as appropriate and it also wouldn't address the parallelization problem I described above.
@trylek - yes, adding another "downloady" task should be done cautiously as it adds confusion as to which one is supported, or which is best to use. Do you know of any particular technical reason we shouldn't settle one one implimintation?
@markwilkie That sounds reasonable to me. I tried to explain why I believe DownloadFile in its current form to be insufficient in my above response and in our earlier e-mail exchange:
1) Having each file download construct its own HttpClient instance is a discouraged practice and it is known to cause non-deterministic failures upon download of a large number of files.
2) Having an msbuild task that only supports downloading a single file doesn't allow for internal parallelism within the task - spinning up the task many times from msbuild is likely to have much higher overhead and worse performance characteristics.
In light of this fact I would assume we can basically consider two options:
a) Generalize the DownloadFile to support download of multiple files, allowing for sharing of the same HttpClient by the individual file downloads.
b) Replace the DownloadFile task with the [more general] DownloadFromAzure task.
In practice both boil down to the same thing (modulo potential name variations). Please let me know which approach you find the most appropriate - I guess that, without agreeing on that first, it doesn't make much sense to try producing an actual PR.
So far I'd suggest improving DownloadFile.cs
@tmat, @weshaggard, @ericstj, @natemcmaster, @chcosta, @mmitche, @jcagme - any objections to this work before @trylek creates a PR?
MSBuild already has a built-in task for downloading files. Seems like we should use that one if possible. It was added in MSBuild 15.8 (.NET Core SDK 2.1.400+)
https://github.com/Microsoft/msbuild/blob/vs15.8/src/Tasks/DownloadFile.cs
@natemcmaster Hmm, according to my review of its source code, the built-in msbuild task has exactly the same two problematic characteristics as the existing Arcade DownloadFile task - it constructs a new HttpClient instance for each individual download and it doesn't support internal parallelization across multiple file downloads.
it constructs a new HttpClient instance for each individual download and it doesn't support internal parallelization across multiple file downloads.
Hey, @jeffkl @AndyGerlicher would take a PR to MSBuild to fix this? 鈽濓笍
We're definitely open to any PR on DownloadFile. I implemented it as a pretty basic task expecting more powerful functionality in the future. Any changes made now won't be available until the next version of Visual Studio is released which is due out next year last year. But don't let that discourage the contribution!
Thanks @jeffkl for your quick and supportive response. I guess that it might be a reasonable compromise to implement the necessary improvements to DownloadFile both in the Arcade SDK and in the msbuild repo for now. Once the new mbuild ships, we should be able to just delete the Arcade SDK version and cleanly switch over to the stock msbuild task.
OK, so let's proceed towards an initial design of the DownloadFile change. If the task is about to optionally support multiple file downloads, I believe it would be the easiest to do it as follows:
We add an optional extra parameter to the task,
public ITaskItem[] Files;
The semantics will be as follows:
1) When the Files parameter is empty ("legacy" behavior), DownloadFile will just pull down the one file specified by the "Uri" parameter to the location specified by "DestinationPath".
2) When Files is not empty, the task spins up a parallel loop over the array, downloading the individual files using the same HttpClient instance constructed upfront. The ITaskItem will support the following metadata:
4) By default, ItemSpec will specify the "relative file name" - this will be appended to "Uri" (using the / character) and to DestinationPath (using Path.Combine) to yield the source URI's and destination paths. Caveat: this would require a somewhat more tricky definition and implementation if we wanted to let Uri contain the "?" query part (e.g. in the presence of SAS tokens).
5) Uri / DestinationPath - these would work as item-specific overrides to the "global" Uri and DestinationPath. In the presence of both, they should be probably concatenated in a manner similar to (3).
Thanks
Tomas
fyi @ericstj, @weshaggard, and @tmat (others are already on this issue)
@trylek , just checking in to see where this falls on your priority list.
@chcosta: Well, this likely depends on the priority of migration to Arcade in general. Based on initial Mark's e-mails I was under the impression that most teams were expected to switch to Arcade by the end of 2018, that was why I started the initial discussions around my trip to Redmond in October. Frankly speaking, seeing how complicated it would be to reach consensus and get anything moving even in such a simple matter as DownloadFromAzure, I have for now resigned and returned to my primary feature work on the cross-plat AOT compiler.
Looks like the next step is to make sure we've got the right requirements/contract and then go from there.
Most helpful comment
Thanks @jeffkl for your quick and supportive response. I guess that it might be a reasonable compromise to implement the necessary improvements to DownloadFile both in the Arcade SDK and in the msbuild repo for now. Once the new mbuild ships, we should be able to just delete the Arcade SDK version and cleanly switch over to the stock msbuild task.