Fmriprep: freesurfer recon-all -parallel option?

Created on 19 Jul 2019 · 4 Comments · Source: nipreps/fmriprep

I can't find this in older issues, but shouldn't all of the recon-all calls include the new -parallel switch prior to -openmp? I don't see it being used in the log files, and it might speed things up further...

https://surfer.nmr.mgh.harvard.edu/fswiki/ReleaseNotes

Parallelization: a new flag was introduced which enables two forms of compute parallelization that significantly reduces the runtime. As a point of reference, using a new-ish workstation (2015+), the recon-all -all runtime is just under 3 hours. When the -parallel flag is specified at the end of the recon-all command-line, it will enable 'fine-grained' parallelized code, making use of OpenMP, embedded in many of the binaries, namely affecting mri_em_register and mri_ca_register. By default, it instructs the binaries to use 4 processors (cores), meaning, 4 threads will run in parallel in some operations (manifested in 'top' by mri_ca_register, for example, showing 400% CPU utilization). This can be overridden by including the flag -openmp <num> after -parallel, where <num> is the number of processors you'd like to use (ex. 8 if you have an 8 core machine). Note that this parallelization was introduced in v5.3, but many new routines were OpenMP-parallelized in v6. The other form of parallelization, a 'coarse' form, enabled when the -parallel flag is specified, is such that during the stages where left and right hemispheric data is processed, each hemi binary is run separately (and in parallel, manifesting itself in 'top' as two instances of mris_sphere, for example). Note that a couple of the hemi stages (eg. mris_sphere) make use of a tiny amount of OpenMP code, which means that for brief periods, as many as 8 cores are utilized (2 binaries running code that each make use of 4 threads). In general, though, a 4 core machine can easily handle those periods. Be aware that if you enable this -parallel flag on instances of recon-all running through a job scheduler (like a cluster), it may not make your System Administrator happy if you do not pre-allocate a sufficient number of cores for your job, as you will be taking cycles from other cores that may be running jobs belonging to other cluster users.
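For reference, the flag ordering described in the release notes above (-parallel at the end of the command line, optionally followed by -openmp and a thread count) can be sketched as a small command builder. This is only an illustration of a standalone recon-all call, not how fMRIPrep invokes FreeSurfer; the helper name `recon_all_command` and the subject ID `sub-01` are hypothetical.

```python
def recon_all_command(subject_id, n_threads=None):
    """Build a standalone recon-all argument list with -parallel enabled.

    Hypothetical helper for illustration only. Per the release notes,
    -parallel goes at the end of the command line and -openmp <num>,
    if given, follows it.
    """
    cmd = ["recon-all", "-s", subject_id, "-all", "-parallel"]
    if n_threads is not None:
        if n_threads < 1:
            raise ValueError("n_threads must be a positive integer")
        # Override the default of 4 OpenMP threads described in the notes.
        cmd += ["-openmp", str(n_threads)]
    return cmd


if __name__ == "__main__":
    # "sub-01" is a placeholder subject ID.
    print(" ".join(recon_all_command("sub-01", 8)))
```

The list form can be passed to `subprocess.run` as-is, which avoids shell quoting issues.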

documentation

All 4 comments

We do our own coarse parallelization of recon-all that doesn't suffer from some of the limitations of the -parallel flag. The release notes get at some of the reasons it can be problematic at the end of that block. Using it would complicate how we account for the number of threads FreeSurfer uses, and could thus actually inhibit overall concurrency.

Thanks! In that case, is there any consensus on when you max out on benefit for --nthreads and --omp-nthreads? That is, assuming unlimited resources, where do you start getting diminishing returns, and is there a ratio of nthreads to omp-nthreads that makes sense (i.e., 1:1, 2:1, etc.)?

--omp-nthreads stops improving significantly at 8, so our default formula is max(1, min(8, nthreads - 1)). We generally don't recommend changing it unless you have a desire to profile and see if the optimal is actually a little different. The nthreads - 1 is to allow one small job at a time to run while large OMP jobs are running, which can shorten the queue quite a lot.
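As a quick sanity check, the formula above can be written out in Python. This is just a sketch of the arithmetic as stated in this comment, not a copy of fMRIPrep's source; the function name `default_omp_nthreads` is made up for illustration.

```python
def default_omp_nthreads(nthreads):
    """Default --omp-nthreads as described above: max(1, min(8, nthreads - 1)).

    Illustrative name; this mirrors the formula in the comment, not
    fMRIPrep's actual code.
    """
    # Cap at 8 (where gains flatten out), leave one thread free for a
    # small job, and never go below 1.
    return max(1, min(8, nthreads - 1))


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 9, 16, 32):
        print(f"nthreads={n:>2} -> omp-nthreads={default_omp_nthreads(n)}")
```

So with 9 or more threads the value sits at the cap of 8, and below that it is always one less than nthreads (with a floor of 1).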

I don't really know what the upper bound of nthreads is. It's going to depend on your memory constraints, which will affect how many large jobs you can run simultaneously, but I'd guess after 3-4x omp-nthreads, you end up with a significant amount of idle time.

Another consideration is whether to let us parallelize several subjects in one fMRIPrep call, in which case throwing more cores and RAM will tend to pay off, or to submit one subject per call, in which case constraining each to an integer fraction of system resources and letting the batch queue do the rest is probably optimal. From the perspective of minimizing idleness, I suspect one or two subjects per call is best, but we haven't done a thorough investigation here.
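To make the "integer fraction of system resources" idea concrete, here is a rough sketch of how one might split a node's cores evenly across several one-subject fMRIPrep calls. The even split, the helper name `per_call_threads`, and the 32-core example are all assumptions for illustration; memory limits, which the comments above note will also matter, are not modeled.

```python
def per_call_threads(total_cores, n_calls):
    """Split cores evenly across concurrent one-subject calls.

    Hypothetical helper: returns (nthreads, omp_nthreads) for each call,
    using the max(1, min(8, nthreads - 1)) default quoted in this thread.
    """
    if n_calls < 1 or total_cores < n_calls:
        raise ValueError("need at least one core per call")
    # Integer fraction of the machine; leftover cores stay idle.
    nthreads = total_cores // n_calls
    omp_nthreads = max(1, min(8, nthreads - 1))
    return nthreads, omp_nthreads


if __name__ == "__main__":
    # Example: an assumed 32-core node running 4 calls side by side.
    nthreads, omp_nthreads = per_call_threads(32, 4)
    print(f"--nthreads {nthreads} --omp-nthreads {omp_nthreads}")
```

Whether this beats a single multi-subject call is exactly the open question raised above, so treat the numbers as a starting point for profiling rather than a recommendation.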

Thank you for the explanation. This is probably the clearest insight into these two switches; it might be worth adding the formula and your mention of diminishing returns to the documentation.
