Fmriprep: --nthreads 1 --omp-nthreads 1 apparently leading to deadlock

Created on 5 Sep 2019 · 29Comments · Source: nipreps/fmriprep

From https://github.com/poldracklab/fmriprep/issues/1738#issuecomment-528358469:

In my limited testing so far, this does seem to prevent the error from popping up, but also seems to cause fmriprep to just stall at [Node] Running "raw_sources" ("nipype.interfaces.utility.wrappers.Function") for about 18 hours now.

From https://github.com/poldracklab/fmriprep/issues/1738#issuecomment-528382919:

Okay, my fault. I still had --nthreads 1 --omp-nthreads 1 in my command when testing that and it seems to interact poorly. Now with that removed it does seem to run and does get past the previous error location. I'm going to run it a few more times to ensure that it continues getting past that portion of the processing each time.

This function is found here:

def _bids_relative(in_files, bids_root):
    from pathlib import Path
    if not isinstance(in_files, (list, tuple)):
        in_files = [in_files]
    in_files = [str(Path(p).relative_to(bids_root)) for p in in_files]
    return in_files

Which should not cause a stall.

Source

effigies

Most helpful comment

If the task terminates and you have an HTML report that reports no errors, then you're set. But I agree, we should print a clear completion message.

effigies on 9 Sep 2019

👍2

All 29 comments

It sat on [Node] Running "raw_sources" ("nipype.interfaces.utility.wrappers.Function") for about 18 hours before I killed it. When I ran the same with -vvvv it was stuck issuing a bunch of messages like
190905-14:21:51,684 nipype.workflow DEBUG: No resources available so there definitely seems to be a loop it's stuck in.

mangstad on 5 Sep 2019

Yeah, that will be the execution loop. Here's the section:

        stats = (len(self.pending_tasks), len(jobids), free_memory_gb,
                 self.memory_gb, free_processors, self.processors)
        if self._stats != stats:
            tasks_list_msg = ''


            if logger.level <= INFO:
                running_tasks = [
                    '  * %s' % self.procs[jobid].fullname
                    for _, jobid in self.pending_tasks
                ]
                if running_tasks:
                    tasks_list_msg = '\nCurrently running:\n'
                    tasks_list_msg += '\n'.join(running_tasks)
                    tasks_list_msg = indent(tasks_list_msg, ' ' * 21)
            logger.info(
                '[MultiProc] Running %d tasks, and %d jobs ready. Free '
                'memory (GB): %0.2f/%0.2f, Free processors: %d/%d.%s',
                len(self.pending_tasks), len(jobids), free_memory_gb,
                self.memory_gb, free_processors, self.processors,
                tasks_list_msg)
            self._stats = stats


        if free_memory_gb < 0.01 or free_processors == 0:
            logger.debug('No resources available')
            return

Before you see the "No resources available" forever, there should be something indicating Currently running or at least [MultiProc] Running X tasks, and Y jobs ready. Do you see that at all?

effigies on 5 Sep 2019

     [Node] Running "raw_sources" ("nipype.interfaces.utility.wrappers.Function")
190905-14:21:13,654 nipype.workflow DEBUG:
     Progress: 1792 jobs, 1/1/26 (done/running/ready), 1/1790 (pending_tasks/waiting).
190905-14:21:13,654 nipype.workflow DEBUG:
     Tasks currently running: 1. Pending: 1.
190905-14:21:13,678 nipype.workflow INFO:
     [LegacyMultiProc] Running 1 tasks, and 26 jobs ready. Free memory (GB): 249.80/250.00, Free processors: 0/1.
                     Currently running:
                       * fmriprep_wf.single_subject_NDARINVZZZNB0XC_wf.func_preproc_ses_baselineYear1Arm1_task_mid_run_01_wf.func_derivatives_wf.raw_sources
190905-14:21:13,679 nipype.workflow DEBUG:
     No resources available

And then repeat no resources available forever.

mangstad on 5 Sep 2019

Thanks for that. That truly is strange. And you're using a singularity environment? Could you provide your full command?

effigies on 5 Sep 2019

singularity run /tmp/fmriprep-1.4.1.simg $PWD --fs-license-file=/home/slab/license.txt --participant-label=sub-NDARINVZZZNB0XC --nthreads 1 --omp-nthreads 1 -vvvv --use-plugin /tmp/fmriprep/workaround.yml output participant

where the workaround.yml file is what was suggested in the other issue thread.

mangstad on 5 Sep 2019

👍1

/cm/shared/singularity/bin/singularity run \
    --cleanenv \
    -B /data:/data -B /data1:/data1 -B /data2:/data2 -B /data3:/data3 -B /cm/shared:/cm/shared \
    /cm/shared/singularity/images/fmriprep-1.4.1.simg \
    --fs-license-file /cm/shared/freesurfer-6.0.1/license.txt \
    --participant_label 000 \
    --output-spaces MNI152NLin6Asym:res-2 anat func fsaverage \
    --n_cpus 1 \
    --nthreads 1 \
    --omp-nthreads 1 \
    --mem-mb 8192 \
    --use-plugin /data/ddrucker/workaround.yml \
    --use-syn-sdc \
    --fs-no-reconall \
    -w  /data/ddrucker/testdatafordaniel/work \
    /data/ddrucker/testdatafordaniel \
    /data/ddrucker/testdatafordaniel/derivatives \
    participant

dmd on 5 Sep 2019

Okay. I think I have an idea of what's going on, which is that your working directories are subdirectories of your input dataset (bids_root) in that function. This is a case that is not very well handled by nipype, which is designed with the rough expectation that inputs/outputs will be on slower, more permanent storage, and working directories on fast, more volatile storage, like a scratch filesystem.

To test this, could you specify your working directories (-w) to some separate path?

@dmd, if you want to keep this in that filesystem, something like: -w /data/ddrucker/fmriprep-work should be sufficiently separated.

@mangstad Any -w that isn't a subdirectory of $PWD ought to do.

Unfortunately, working directories refer to input and output files with absolute paths, so you can't just move them and pick up where you left off. But you should definitely delete or move them out of the input dataset before starting again.

effigies on 5 Sep 2019

Will try that. Do you think this is only the --nthreads 1 hanging issue, or is this potentially also the problem with the file error in our other issue #1738

mangstad on 5 Sep 2019

I have a suspicion that it could be the root of both, but I'm frankly not positive that it's really causing either issue. It would certainly be convenient, though.

effigies on 5 Sep 2019

Doesn't seem to fix the issue. I was running the command in /tmp/fmriprep, so this time I ran singularity run /tmp/fmriprep-1.4.1.simg $PWD --fs-license-file=/home/slab/license.txt --participant-label=sub-NDARINVZZZNB0XC --nthreads 1 --omp-nthreads 1 -vvvv --use-plugin /tmp/fmriprep/workaround.yml -w /tmp/fmriprep_work output participant
and still get the same hanging on raw_sources with repeating no resources available messages.

mangstad on 5 Sep 2019

Okay. Could you try:

singularity run \
    --cleanenv \
    /tmp/fmriprep-1.4.1.simg \
    /tmp/fmriprep output participant \
    --fs-license-file=/home/slab/license.txt \
    --participant-label=sub-NDARINVZZZNB0XC \
    --nthreads 1 --omp-nthreads 1 \
    -vvvv \
    --use-plugin /tmp/fmriprep/workaround.yml \
    -w /tmp/fmriprep_work

The --cleanenv should make sure there's no pollution from other Python installations. I've also reordered the positional arguments to all precede flag arguments, just because we sometimes run into issues where those cause problems, though I doubt that's significant here. And could you post the full output to a gist?

effigies on 5 Sep 2019

Oh, and try running it from $HOME, just in case there's something about $PWD being the data directory that's causing problems.

effigies on 5 Sep 2019

Hmm, that seems to have done it
https://gist.github.com/mangstad/ca237e5e2e51a4d923437ac7f06a5e64
Definitely gets beyond the raw_sources at least. Killed it, but it was running mcflirt at a later node by then.

mangstad on 5 Sep 2019

Okay. It it seems likely to have been a path resolution error caused by some combination of PWD and nipype working directory. Hopefully @oesteban's recent work in nipype will make this less fragile in 1.5.0+, but we'll need to see.

@dmd Any luck on your side?

effigies on 5 Sep 2019

No luck. This is what I have now, running from $HOME:

/cm/shared/singularity/bin/singularity run \
    -B /data:/data -B /data1:/data1 -B /data2:/data2 -B /data3:/data3 -B /cm/shared:/cm/shared \
    --cleanenv \
    /cm/shared/singularity/images/fmriprep-1.4.1.simg \
    /data/ddrucker/testdatafordaniel \
    /data/ddrucker/testdatafordaniel/derivatives \
    participant \
    --fs-license-file /cm/shared/freesurfer-6.0.1/license.txt \
    --participant_label 000 \
    --output-spaces MNI152NLin6Asym:res-2 anat func fsaverage \
    --n_cpus 1 \
    --nthreads 1 \
    --omp-nthreads 1 \
    --mem-mb 8192 \
    --use-plugin /data/ddrucker/workaround.yml \
    --use-syn-sdc \
    --fs-no-reconall \
    -vvvv \
    -w  /data/ddrucker/work

and I'm still stuck in the 'No resources available' loop.

dmd on 5 Sep 2019

I think the no resources available is maybe just a constant monitoring message. I continue to get it repeatedly when a process is running, presumably just because nipype can't start another process. But check to see what step of the processing you're actually at that it is popping up. When I scrolled back up the output I saw that I was no longer stuck on the raw_sources node, but it was running mcflirt or something after that.

mangstad on 5 Sep 2019

@dmd Can you post your log as a gist?

effigies on 5 Sep 2019

Here you go, all 42203 lines of it:

https://gist.githubusercontent.com/dmd/786ec447126f0d35a91d11d45e0e71ff/raw/47d294f27aad26b666bf4f26d948d4c87bd36872/gistfile1.txt

dmd on 6 Sep 2019

👍1

That actually looks fine. You're 30 minutes into T1w-MNI normalization, which takes ~30 minutes with 8 cores. If you check top or htop, you should see an antsRegistration process fully utilizing one core.

effigies on 6 Sep 2019

Ok, it finished after another few days. Now I just need to figure out how to make it take a reasonable amount of time, I guess.

dmd on 7 Sep 2019

I would recommend not touching omp-nthreads and just set nthreads and mem-gb to the limits you want to impose on the workflow as a whole. With 8GB, you should be able to use 4-8 threads effectively.

effigies on 7 Sep 2019

Should we continue using workaround.yml?

dmd on 7 Sep 2019

If you're running into BrokenProcessPool errors, then yes. If not, it exposes you to indefinite hanging if a process is killed by the kernel for requesting too much RAM.

effigies on 7 Sep 2019

@effigies how do I know fmriprep completed successfully? It ran after fixing the above issues. It eventually stopped over the weekend, but there's no like "fmriprep finished for sub-XXXXX" or similar message. My standard out and error were going to different places unfortunately, so not sure how things line up, but the last error was the same as in issue #1587 so it seems not critical. I do have html summary in the subject's folder and it says no errors to report. Guess I was just expecting the equivalent of a big flashing sign saying it completed.

mangstad on 9 Sep 2019

If the task terminates and you have an HTML report that reports no errors, then you're set. But I agree, we should print a clear completion message.

effigies on 9 Sep 2019

👍2

We seem to be succeeding now if we (1) DO use the workaround.yml referenced above, and do NOT use omp-nthreads, e.g.:

/cm/shared/singularity/bin/singularity run \
    -B /data:/data -B /data1:/data1 -B /data2:/data2 -B /data3:/data3 -B /cm/shared:/cm/shared \
    --cleanenv \
    /cm/shared/singularity/images/fmriprep-1.4.1.simg \
    /data/ddrucker/testdatafordaniel \
    /data/ddrucker/testdatafordaniel/derivatives \
    participant \
    --fs-license-file /cm/shared/freesurfer-6.0.1/license.txt \
    --participant_label 000 \
    --output-spaces MNI152NLin6Asym:res-2 anat func fsaverage \
    --n_cpus 4 \
    --nthreads 4 \
    --mem-mb 8192 \
    --use-plugin /data/ddrucker/workaround.yml \
    --use-syn-sdc \
    -vvvv \
    -w  /data/ddrucker/work

dmd on 9 Sep 2019

Glad to hear it. Just so you know, --nthreads and --n-cpus are the same option. We initially used --nthreads, but people got it confused with --omp-nthreads (because not all threaded applications are OpenMP...) so we moved to --n-cpus. And for some reason we have underscored versions of a lot of our options, so --n_cpus also works.

I'm going to go ahead and close this. I think the remaining todos are:

#1737
#1760
#1761
#1762
#1763

@dmd @mangstad Please open new issues if there are things that I haven't addressed. And if you have the time and inclination to submit a PR, we'd be happy for the help!

effigies on 9 Sep 2019

Chris,
This got a bit fragmented, and it's now unclear to me whether the workaround.yml thing:
`plugin: LegacyMultiProc plugin_args: {maxtasksperchild: 1, memory_gb: 250, n_procs: 16, raise_insufficient: false}

is still required in 1.5.0, or if that was addressed in some way.

dmd on 18 Sep 2019

@dmd I don't believe workaround.yml was necessary for your case. However, it can sometimes still be necessary if BrokenProcessPool errors are hit with the standard MultiProc plugin.

Edit: I'll note that these errors are much less common since we've upgraded the Docker container to use Python 3.7, which has better memory handling options, so the default MultiProc is almost always the better choice over LegacyMultiProc.

effigies on 18 Sep 2019

Was this page helpful?

0 / 5 - 0 ratings