Fmriprep: --nthreads 1 --omp-nthreads 1 apparently leading to deadlock

Created on 5 Sep 2019  路  29Comments  路  Source: nipreps/fmriprep

From https://github.com/poldracklab/fmriprep/issues/1738#issuecomment-528358469:

In my limited testing so far, this does seem to prevent the error from popping up, but also seems to cause fmriprep to just stall at [Node] Running "raw_sources" ("nipype.interfaces.utility.wrappers.Function") for about 18 hours now.

From https://github.com/poldracklab/fmriprep/issues/1738#issuecomment-528382919:

Okay, my fault. I still had --nthreads 1 --omp-nthreads 1 in my command when testing that and it seems to interact poorly. Now with that removed it does seem to run and does get past the previous error location. I'm going to run it a few more times to ensure that it continues getting past that portion of the processing each time.

This function is found here:

def _bids_relative(in_files, bids_root):
    from pathlib import Path
    if not isinstance(in_files, (list, tuple)):
        in_files = [in_files]
    in_files = [str(Path(p).relative_to(bids_root)) for p in in_files]
    return in_files

Which should not cause a stall.

Most helpful comment

If the task terminates and you have an HTML report that reports no errors, then you're set. But I agree, we should print a clear completion message.

All 29 comments

It sat on [Node] Running "raw_sources" ("nipype.interfaces.utility.wrappers.Function") for about 18 hours before I killed it. When I ran the same with -vvvv it was stuck issuing a bunch of messages like
190905-14:21:51,684 nipype.workflow DEBUG: No resources available so there definitely seems to be a loop it's stuck in.

Yeah, that will be the execution loop. Here's the section:

        stats = (len(self.pending_tasks), len(jobids), free_memory_gb,
                 self.memory_gb, free_processors, self.processors)
        if self._stats != stats:
            tasks_list_msg = ''


            if logger.level <= INFO:
                running_tasks = [
                    '  * %s' % self.procs[jobid].fullname
                    for _, jobid in self.pending_tasks
                ]
                if running_tasks:
                    tasks_list_msg = '\nCurrently running:\n'
                    tasks_list_msg += '\n'.join(running_tasks)
                    tasks_list_msg = indent(tasks_list_msg, ' ' * 21)
            logger.info(
                '[MultiProc] Running %d tasks, and %d jobs ready. Free '
                'memory (GB): %0.2f/%0.2f, Free processors: %d/%d.%s',
                len(self.pending_tasks), len(jobids), free_memory_gb,
                self.memory_gb, free_processors, self.processors,
                tasks_list_msg)
            self._stats = stats


        if free_memory_gb < 0.01 or free_processors == 0:
            logger.debug('No resources available')
            return

Before you see the "No resources available" forever, there should be something indicating Currently running or at least [MultiProc] Running X tasks, and Y jobs ready. Do you see that at all?

     [Node] Running "raw_sources" ("nipype.interfaces.utility.wrappers.Function")
190905-14:21:13,654 nipype.workflow DEBUG:
     Progress: 1792 jobs, 1/1/26 (done/running/ready), 1/1790 (pending_tasks/waiting).
190905-14:21:13,654 nipype.workflow DEBUG:
     Tasks currently running: 1. Pending: 1.
190905-14:21:13,678 nipype.workflow INFO:
     [LegacyMultiProc] Running 1 tasks, and 26 jobs ready. Free memory (GB): 249.80/250.00, Free processors: 0/1.
                     Currently running:
                       * fmriprep_wf.single_subject_NDARINVZZZNB0XC_wf.func_preproc_ses_baselineYear1Arm1_task_mid_run_01_wf.func_derivatives_wf.raw_sources
190905-14:21:13,679 nipype.workflow DEBUG:
     No resources available

And then repeat no resources available forever.

Thanks for that. That truly is strange. And you're using a singularity environment? Could you provide your full command?

singularity run /tmp/fmriprep-1.4.1.simg $PWD --fs-license-file=/home/slab/license.txt --participant-label=sub-NDARINVZZZNB0XC --nthreads 1 --omp-nthreads 1 -vvvv --use-plugin /tmp/fmriprep/workaround.yml output participant

where the workaround.yml file is what was suggested in the other issue thread.

/cm/shared/singularity/bin/singularity run \
    --cleanenv \
    -B /data:/data -B /data1:/data1 -B /data2:/data2 -B /data3:/data3 -B /cm/shared:/cm/shared \
    /cm/shared/singularity/images/fmriprep-1.4.1.simg \
    --fs-license-file /cm/shared/freesurfer-6.0.1/license.txt \
    --participant_label 000 \
    --output-spaces MNI152NLin6Asym:res-2 anat func fsaverage \
    --n_cpus 1 \
    --nthreads 1 \
    --omp-nthreads 1 \
    --mem-mb 8192 \
    --use-plugin /data/ddrucker/workaround.yml \
    --use-syn-sdc \
    --fs-no-reconall \
    -w  /data/ddrucker/testdatafordaniel/work \
    /data/ddrucker/testdatafordaniel \
    /data/ddrucker/testdatafordaniel/derivatives \
    participant

Okay. I think I have an idea of what's going on, which is that your working directories are subdirectories of your input dataset (bids_root) in that function. This is a case that is not very well handled by nipype, which is designed with the rough expectation that inputs/outputs will be on slower, more permanent storage, and working directories on fast, more volatile storage, like a scratch filesystem.

To test this, could you specify your working directories (-w) to some separate path?

@dmd, if you want to keep this in that filesystem, something like: -w /data/ddrucker/fmriprep-work should be sufficiently separated.

@mangstad Any -w that isn't a subdirectory of $PWD ought to do.

Unfortunately, working directories refer to input and output files with absolute paths, so you can't just move them and pick up where you left off. But you should definitely delete or move them out of the input dataset before starting again.

Will try that. Do you think this is only the --nthreads 1 hanging issue, or is this potentially also the problem with the file error in our other issue #1738

I have a suspicion that it could be the root of both, but I'm frankly not positive that it's really causing either issue. It would certainly be convenient, though.

Doesn't seem to fix the issue. I was running the command in /tmp/fmriprep, so this time I ran singularity run /tmp/fmriprep-1.4.1.simg $PWD --fs-license-file=/home/slab/license.txt --participant-label=sub-NDARINVZZZNB0XC --nthreads 1 --omp-nthreads 1 -vvvv --use-plugin /tmp/fmriprep/workaround.yml -w /tmp/fmriprep_work output participant
and still get the same hanging on raw_sources with repeating no resources available messages.

Okay. Could you try:

singularity run \
    --cleanenv \
    /tmp/fmriprep-1.4.1.simg \
    /tmp/fmriprep output participant \
    --fs-license-file=/home/slab/license.txt \
    --participant-label=sub-NDARINVZZZNB0XC \
    --nthreads 1 --omp-nthreads 1 \
    -vvvv \
    --use-plugin /tmp/fmriprep/workaround.yml \
    -w /tmp/fmriprep_work

The --cleanenv should make sure there's no pollution from other Python installations. I've also reordered the positional arguments to all precede flag arguments, just because we sometimes run into issues where those cause problems, though I doubt that's significant here. And could you post the full output to a gist?

Oh, and try running it from $HOME, just in case there's something about $PWD being the data directory that's causing problems.

Hmm, that seems to have done it
https://gist.github.com/mangstad/ca237e5e2e51a4d923437ac7f06a5e64
Definitely gets beyond the raw_sources at least. Killed it, but it was running mcflirt at a later node by then.

Okay. It it seems likely to have been a path resolution error caused by some combination of PWD and nipype working directory. Hopefully @oesteban's recent work in nipype will make this less fragile in 1.5.0+, but we'll need to see.

@dmd Any luck on your side?

No luck. This is what I have now, running from $HOME:

/cm/shared/singularity/bin/singularity run \
    -B /data:/data -B /data1:/data1 -B /data2:/data2 -B /data3:/data3 -B /cm/shared:/cm/shared \
    --cleanenv \
    /cm/shared/singularity/images/fmriprep-1.4.1.simg \
    /data/ddrucker/testdatafordaniel \
    /data/ddrucker/testdatafordaniel/derivatives \
    participant \
    --fs-license-file /cm/shared/freesurfer-6.0.1/license.txt \
    --participant_label 000 \
    --output-spaces MNI152NLin6Asym:res-2 anat func fsaverage \
    --n_cpus 1 \
    --nthreads 1 \
    --omp-nthreads 1 \
    --mem-mb 8192 \
    --use-plugin /data/ddrucker/workaround.yml \
    --use-syn-sdc \
    --fs-no-reconall \
    -vvvv \
    -w  /data/ddrucker/work

and I'm still stuck in the 'No resources available' loop.

I think the no resources available is maybe just a constant monitoring message. I continue to get it repeatedly when a process is running, presumably just because nipype can't start another process. But check to see what step of the processing you're actually at that it is popping up. When I scrolled back up the output I saw that I was no longer stuck on the raw_sources node, but it was running mcflirt or something after that.

@dmd Can you post your log as a gist?

That actually looks fine. You're 30 minutes into T1w-MNI normalization, which takes ~30 minutes with 8 cores. If you check top or htop, you should see an antsRegistration process fully utilizing one core.

Ok, it finished after another few days. Now I just need to figure out how to make it take a reasonable amount of time, I guess.

I would recommend not touching omp-nthreads and just set nthreads and mem-gb to the limits you want to impose on the workflow as a whole. With 8GB, you should be able to use 4-8 threads effectively.

Should we continue using workaround.yml?

If you're running into BrokenProcessPool errors, then yes. If not, it exposes you to indefinite hanging if a process is killed by the kernel for requesting too much RAM.

@effigies how do I know fmriprep completed successfully? It ran after fixing the above issues. It eventually stopped over the weekend, but there's no like "fmriprep finished for sub-XXXXX" or similar message. My standard out and error were going to different places unfortunately, so not sure how things line up, but the last error was the same as in issue #1587 so it seems not critical. I do have html summary in the subject's folder and it says no errors to report. Guess I was just expecting the equivalent of a big flashing sign saying it completed.

If the task terminates and you have an HTML report that reports no errors, then you're set. But I agree, we should print a clear completion message.

We seem to be succeeding now if we (1) DO use the workaround.yml referenced above, and do NOT use omp-nthreads, e.g.:

/cm/shared/singularity/bin/singularity run \
    -B /data:/data -B /data1:/data1 -B /data2:/data2 -B /data3:/data3 -B /cm/shared:/cm/shared \
    --cleanenv \
    /cm/shared/singularity/images/fmriprep-1.4.1.simg \
    /data/ddrucker/testdatafordaniel \
    /data/ddrucker/testdatafordaniel/derivatives \
    participant \
    --fs-license-file /cm/shared/freesurfer-6.0.1/license.txt \
    --participant_label 000 \
    --output-spaces MNI152NLin6Asym:res-2 anat func fsaverage \
    --n_cpus 4 \
    --nthreads 4 \
    --mem-mb 8192 \
    --use-plugin /data/ddrucker/workaround.yml \
    --use-syn-sdc \
    -vvvv \
    -w  /data/ddrucker/work

Glad to hear it. Just so you know, --nthreads and --n-cpus are the same option. We initially used --nthreads, but people got it confused with --omp-nthreads (because not all threaded applications are OpenMP...) so we moved to --n-cpus. And for some reason we have underscored versions of a lot of our options, so --n_cpus also works.

I'm going to go ahead and close this. I think the remaining todos are:

  • #1737
  • #1760
  • #1761
  • #1762
  • #1763

@dmd @mangstad Please open new issues if there are things that I haven't addressed. And if you have the time and inclination to submit a PR, we'd be happy for the help!

Chris,
This got a bit fragmented, and it's now unclear to me whether the workaround.yml thing:
` plugin: LegacyMultiProc plugin_args: {maxtasksperchild: 1, memory_gb: 250, n_procs: 16, raise_insufficient: false}

is still required in 1.5.0, or if that was addressed in some way.

@dmd I don't believe workaround.yml was necessary for your case. However, it can sometimes still be necessary if BrokenProcessPool errors are hit with the standard MultiProc plugin.

Edit: I'll note that these errors are much less common since we've upgraded the Docker container to use Python 3.7, which has better memory handling options, so the default MultiProc is almost always the better choice over LegacyMultiProc.

Was this page helpful?
0 / 5 - 0 ratings