Conan: [feature request] support concurrent uploading of multiple packages

Created on 31 Aug 2018 · 15 comments · Source: conan-io/conan

I've set up my Jenkins CI server to build multiple variants of my Conan package for multiple operating systems, compilers and build types. Basically, the Jenkins master first distributes build tasks to various nodes (Android, iOS, Linux, macOS, Windows) that build packages for different build types (debug and release), possibly with additional options that we use. After every node has completed successfully, the built packages are collected on the _upload_ node (using Jenkins' stash mechanism) and uploaded to the Artifactory CE server*.

This upload takes quite a long time, even though the server is in the local network. While uploading, Conan prints messages like this:

Uploading package 18/126: 2eaedbc5ad229d9273445b93769c9ea68c9c6a90
Compressing package...                                                
Requesting upload permissions...                                      
Requesting upload permissions...Done!                                 

Uploading conanmanifest.txt                                           

Uploading conaninfo.txt                                               

Uploading conan_package.tgz                                           

Uploading package 19/126: 344a09a718e938372f9956231c2138516dcd122e
Compressing package...                                                
Requesting upload permissions...                                      
Requesting upload permissions...Done!                                 

Uploading conanmanifest.txt                                           

Uploading conaninfo.txt                                               

Uploading conan_package.tgz                                           

As you can see, it requests upload permission for each package separately and performs package compression sequentially. Would it be possible to use multiple thread workers to compress packages in parallel and utilize all available CPU cores? Also, would it be possible to request upload permission once for multiple packages and then upload multiple files in a single request? I think this strategy would drastically improve upload performance, especially when your Conan server is not on the local network.
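For illustration, here is a minimal sketch of what parallel compression could look like with Python's `concurrent.futures`; the function names and folder layout are hypothetical, not Conan's actual internals:

```python
# Hypothetical sketch, not Conan's actual code: compress package folders
# in parallel before uploading. zlib releases the GIL while compressing,
# so a thread pool does give real parallelism here.
import os
import tarfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from multiprocessing import cpu_count

def compress_package(package_folder, out_dir):
    """Create <package_id>.tgz in out_dir from the package folder."""
    pkg_id = os.path.basename(package_folder)
    tgz_path = os.path.join(out_dir, pkg_id + ".tgz")
    with tarfile.open(tgz_path, "w:gz") as tgz:
        tgz.add(package_folder, arcname=".")
    return tgz_path

def compress_all(package_folders, out_dir, workers=None):
    with ThreadPoolExecutor(max_workers=workers or cpu_count()) as pool:
        futures = [pool.submit(compress_package, p, out_dir)
                   for p in package_folders]
        for future in as_completed(futures):
            print("Compressed:", future.result())  # re-raises on error
```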

*The reason I do it this way, instead of letting each node upload its packages directly to Artifactory, is the following: I don't want _broken_ packages uploaded to my server, i.e. packages that have been built successfully on some platforms but not on others. If that happened, a user on a failing platform would download the recipe from the server, Conan would try to build the binaries locally, and that would fail, wasting the user's time and nerves.

Labels: high, ux, help wanted, medium queue, feature, look into

All 15 comments

This would be the complement of https://github.com/conan-io/conan/issues/1012

I think this feature request makes sense. We have to check the complexity, but I'd say it should be easier than https://github.com/conan-io/conan/issues/1012, so it might make sense to start with this one.

Some tips:

  • Parallelize after the recipe has been uploaded.
  • Number of concurrent threads to parallelize should be configurable in conan.conf, can be defaulted to cpu_count().
  • Maybe the most complicated thing would be to have a coherent output, the concurrent progress bars, etc.

I am not assigning a milestone yet; this is not an immediate priority and the next release is already full, but for a contributor willing to invest in an intermediate feature, this would be a good one.

At the moment, my workaround is to use `conan upload --skip-upload --all` to force compression on the Jenkins nodes that actually built the packages, which enforces at least some level of parallelism. Each node still compresses its packages sequentially, but the final step that performs the upload is now much faster because it no longer has to compress anything. Even better, this approach works around an annoying bug in Jenkins.

However, if there were a way to obtain upload permission once and then do a batch upload, that would definitely improve upload speed because of much less overhead. It would also enable parallel upload of packages.

Maybe the most complicated thing would be to have a coherent output, the concurrent progress bars, etc.

About that, have you seen the concurrent progress bars of the docker pull command?

Is it possible to combine asyncio with Conan?

@uilianries asyncio is Python 3.4+ only; I don't think we want to wait until Python 2 is deprecated for this feature. Furthermore, async programming is a different paradigm, more oriented to event-driven programming; I'd say this case calls for traditional parallelism.

@DoDoENT Yes, docker pull output is basically what we want, we might use it as a reference, thanks for the pointer.

However, if there were a way to obtain upload permission once and then do a batch upload, that would definitely improve upload speed because of much less overhead. It would also enable parallel upload of packages.

There is one check_credentials() call per package upload, which might be optimized (cc/ @lasote), but I am not sure why doing it once would improve speed or enable parallel upload. Parallelization can be done with this step repeated for every binary (different binaries may even have different permissions), or with the check extracted and performed once before launching the parallel upload; either way, the real effort is in the parallelization itself.

However, if there were a way to obtain upload permission once and then do a batch upload, that would definitely improve upload speed because of much less overhead. It would also enable parallel upload of packages.

As memsharded commented, the permissions checking shouldn't be an issue. Do you have some traces or data showing the delays in it?

As memsharded commented, the permissions checking shouldn't be an issue. Do you have some traces or data showing the delays in it?

No, I don't; the measurement was very subjective. I only noticed that every package first waits for upload permissions and then starts its upload, while every other package waits for that one to finish, and I thought there might be a way to parallelize that.

... but I am not sure why doing it once would improve speed or enable parallel upload.

I meant that the accumulated upload time would be reduced: there would be less per-package overhead if permissions were obtained for the entire batch at once, and less waiting in the queue if multiple workers performed concurrent uploads of the binaries.

I will change the name of the issue so that it will better describe what I meant.

Perfect renaming, thanks.

Yes, I don't think the permissions check is the blocker; it is the compression and transfer, and parallelizing those should alleviate the issue. Maybe we need to check that Artifactory actually allows concurrent uploads of different package binaries, but I think it does.

I'm having a look at this issue, and parallel uploading should be easy to implement, just a little refactor in the CmdUpload class. My proposal follows these guidelines (recap from @memsharded's message):

  • parallel must be optional (opt-out?), and the number of threads configurable (defaulting to cpu_count()).
  • start uploading binaries only after their recipe has been uploaded
  • when working in parallel, logs and output will be shown in a _fancy way_ (TBD) after all tasks have completed

Option 1: full parallelization

  • parallelize across references and binaries
  • confirmation step (if needed): once we start to show the progress bars (#3536) we cannot request user input through the console, so in the parallel run we need to ask for confirmation for each package before starting the upload process... or just show a single confirm-all message. What do you think?

Option 2: parallelize each reference

  • one reference after another, parallelize binaries for each reference
  • confirmation step: as it is now, before each recipe we ask the user for confirmation (if needed)
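For concreteness, a minimal sketch of Option 2, assuming hypothetical `upload_recipe`/`upload_binary` callables (not Conan's real API): each recipe is uploaded first, and only then are its binaries uploaded concurrently.

```python
# Sketch of Option 2: one reference after another; for each reference
# the recipe goes first, then its binaries are uploaded in parallel.
# upload_recipe and upload_binary are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count

def upload_reference(ref, binaries, upload_recipe, upload_binary,
                     workers=None):
    upload_recipe(ref)  # binaries must not start before this completes
    with ThreadPoolExecutor(max_workers=workers or cpu_count()) as pool:
        futures = [pool.submit(upload_binary, ref, b) for b in binaries]
        for future in futures:
            future.result()  # propagate any upload failure

def upload_all(references, upload_recipe, upload_binary):
    # references: iterable of (ref, [binary_package_ids])
    for ref, binaries in references:
        upload_reference(ref, binaries, upload_recipe, upload_binary)
```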

Maybe we can vote: ❤️ for option 1, and 🎉 for option 2.

@memsharded, did you check that Artifactory allows concurrent uploads for the same recipe?

Yes, opt-out, and threads configurable (there is already a conf entry for cpu_count(); see the illustrative sketch below), so I will use it rather than create a new one.

The output has to be interactive, to show progress, not only _after_.

Strong yes for Option 1, this is more important than the parallelization only at the reference level.

I'd probably gather user confirmations for all references, before starting the process, but keep the current granularity in which the user can decide which packages are uploaded.
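As an illustration of reusing that conf entry, here is a sketch that reads cpu_count from conan.conf with a fallback to the machine's core count; the `[general]` section and key name follow conan 1.x conventions but should be treated as assumptions:

```python
# Illustrative only, not Conan's real config code: read the existing
# cpu_count setting from conan.conf, defaulting to the machine's cores.
import configparser
import os
from multiprocessing import cpu_count

def upload_workers(conf_path=os.path.expanduser("~/.conan/conan.conf")):
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(conf_path)  # silently ignores a missing file
    try:
        return parser.getint("general", "cpu_count")
    except (configparser.Error, ValueError):
        return cpu_count()
```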

The output has to be interactive, to show progress, not only _after_

I think that we cannot combine progress bars with output at the same time. We have to wait for all the progress bars to finish before showing any additional output. Maybe I didn't explain myself well.
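To illustrate the constraint: docker-pull-style output redraws a fixed block of lines with ANSI escape codes, so any ordinary log line printed in between would corrupt the block. A minimal sketch, assuming an ANSI-capable TTY:

```python
# Minimal sketch of docker-pull-style concurrent progress lines: move
# the cursor up N lines with ANSI codes and redraw the whole block.
# Any unrelated print() in between would corrupt the redraw region,
# which is why bars and free-form output do not mix.
import sys
import time

def render(progress):  # progress: list of (name, percent) tuples
    for name, pct in progress:
        bar = "#" * (pct // 5)
        sys.stdout.write("\r\x1b[K%-8s [%-20s] %3d%%\n" % (name, bar, pct))
    sys.stdout.flush()

packages = ["pkg_a", "pkg_b", "pkg_c"]
render([(p, 0) for p in packages])
for step in range(1, 11):
    time.sleep(0.2)
    sys.stdout.write("\x1b[%dA" % len(packages))  # cursor up N lines
    render([(p, step * 10) for p in packages])
```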

An update on this: I've been implementing a threading module (here is my branch) and I've ended up re-inventing concurrent.futures with some syntactic sugar 😕

Things learned so far:

  • concurrent.futures should be enough without the need for asyncio, and we can start using it right now given the backport that exists (py2 here)
  • requests.Session() has some concerns about multithreading: https://github.com/requests/requests/issues/1871, so they recommend one session per thread (see the session-per-thread sketch after this list)
  • There is also a requests-futures library, in case we just want to parallelize upload/download.
  • There is a lot of state flowing from one Conan class to another:

    • not picklable, so we cannot use processes to isolate state

    • using threads we can share state, but we cannot be sure the data is read-only, so we might introduce concurrency issues.
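A small sketch of the one-session-per-thread recommendation using threading.local(); the upload URLs and file names are placeholders:

```python
# One requests.Session per worker thread via threading.local(), since
# sessions are not guaranteed to be thread-safe (requests #1871).
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

_local = threading.local()

def get_session():
    # Lazily create a Session the first time each thread needs one
    if not hasattr(_local, "session"):
        _local.session = requests.Session()
    return _local.session

def upload_file(url, path):
    with open(path, "rb") as f:
        response = get_session().put(url, data=f)
    response.raise_for_status()
    return url

# Placeholder URLs/paths, for illustration only
uploads = [("https://example.com/repo/a.tgz", "a.tgz"),
           ("https://example.com/repo/b.tgz", "b.tgz")]
with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(lambda item: upload_file(*item), uploads):
        print("Uploaded", done)
```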

The roadmap to walk this path has to be really ambitious:

  1. [hard] A major refactor to provide free functions with simple typed arguments at certain parts of the execution flow, ideally without side effects or user interaction, or at least without writing to shared resources (database, registry file, CLI, ...)
  2. [easy] Wrapper over concurrent.futures to opt out of parallelism. Creating the concept of a dummy-future that gets executed in the main thread as soon as it is created should be enough (see the sketch after this list).
  3. [mid] Express Conan internal tasks in terms of futures
  4. [mid] Handle concurrent output to user cli, progress bars,...
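A sketch of what the dummy-future wrapper in step 2 could look like: an executor-compatible object that runs each task immediately in the calling thread, so callers can opt out of parallelism without changing their code. Names are hypothetical:

```python
# Sketch of roadmap step 2: an executor-like wrapper for opting out of
# parallelism. With parallel=False every "future" is resolved in the
# main thread at submit() time, but callers keep the same Future API.
from concurrent.futures import Future, ThreadPoolExecutor

class ImmediateExecutor:
    """Runs tasks synchronously, returning already-completed Futures."""
    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

    def shutdown(self, wait=True):
        pass  # nothing to clean up

def make_executor(parallel=True, workers=None):
    return ThreadPoolExecutor(max_workers=workers) if parallel \
           else ImmediateExecutor()
```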

linking #5502
