So, I'm trying to run fMRIPrep on 63 participants, using SLURM on our HPC. The errors I get are somewhat different for each subject: one time it failed on bold_transform, another time it failed on an 'os.rename' of a file during another part. This is seemingly random and changes on each run. However, every single time it comes down to a 'file not found' error in the working directory, for a JSON file specifically (named _[hashcode].json). From digging around for similar errors, there seems to be some suggestion that it's a problem with threading, such that a process on one thread begins before a necessary prior process has finished on another thread.
Is there a solution to this other than single threading (which I am trying now on a per-process basis using --omp-nthreads=1, and I will report back the results)?
Thanks a lot for your help.
I can attach example output logs once this coming run fails - I accidentally removed the logs and work-dir before submitting this issue.
--omp-nthreads=1 should have no effect, since that affects multithreading within a node, while this problem would be a race condition in the scheduler. --n-cpus 1 would serialize the workflow.
When you have the logs, please also submit your full command. See the bug report template for useful information to include.
I'm not sure if this would do anything to resolve the error, but I've been running on an HPC with SLURM and I find that the SINGULARITY_TMPDIR and SINGULARITY_CACHEDIR variables need to be set before the singularity exec command that runs fMRIPrep via Singularity.
Alternatively, you can use the sbatch command to run fMRIPrep separately for each participant in an array. I'd recommend running one participant per node, at least initially, to see what the completion time is with your HPC setup.
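A minimal sketch of the array approach, assuming a hypothetical participants.txt with one participant label per line. The resource values are placeholders, and the `participant_for_task` helper is illustrative rather than part of SLURM or fMRIPrep.

```shell
#!/bin/bash
#SBATCH --array=1-63          # one array task per participant (63 in this thread)
#SBATCH --nodes=1             # one participant per node
#SBATCH --cpus-per-task=8     # placeholder; adjust to your cluster
#SBATCH --mem=32G             # placeholder; adjust to your cluster

# Hypothetical helper: print the Nth line (1-based) of a participant list.
participant_for_task() {
    local list_file=$1 task_id=$2
    sed -n "${task_id}p" "$list_file"
}

# SLURM sets SLURM_ARRAY_TASK_ID inside each array task.
# participants.txt is an assumed file with one label per line (e.g. 85107).
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f participants.txt ]; then
    subject=$(participant_for_task participants.txt "$SLURM_ARRAY_TASK_ID")
    echo "Array task $SLURM_ARRAY_TASK_ID will process sub-$subject"
    # The singularity/fMRIPrep call for "$subject" would go here.
fi
```

Submitting this file once with sbatch starts all tasks; each task reads a different line of the list, so no two tasks process the same participant.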
--edited after I read the full title of the issue again and realized you are using singularity
Thanks both, it's running now (I just got some very strange error with 1 subject but others are still running, see here for 1st subject error: https://pastebin.com/3efCztHu)
My fmriprep/singularity command is as follows:
singularity run --cleanenv \
/home/users/hassan.bassam/images/fmriprep-20.2.1.simg \
--skip_bids_validation \
--participant-label sub-$1 \
--omp-nthreads 1 \
--mem 32000 --bold2t1w-dof 6 \
--medial-surface-nan \
--output-spaces anat func fsaverage fsaverage6 fsaverage5 \
MNI152NLin2009cAsym:res-native MNI152NLin6Asym:res-native \
--fs-license-file /home/users/hassan.bassam/license.txt --notrack \
--use-syn-sdc --write-graph --work-dir /home/users/hassan.bassam/nifti/work \
/home/users/hassan.bassam/nifti/Nifti /home/users/hassan.bassam/fmriprepped participant
Just to be clear, my actual folder containing the BIDS-validated data is in nifti/Nifti.
I'm running everything on my home directory as you can see - I have plenty of disk quota on that (2TB).
I'm sorry for the lack of clarity - a bit careless by me - I am running singularity and running each participant on a single node each using a job array with SLURM.
I have not bound any folders for Singularity so far, as that does not seem necessary when everything is contained within $HOME.
@pcamach2 - I should export those as environment variables in the lines preceding the singularity call? Can I use any directory for these or do they need to be related to the fmriprep working/output directory in some way?
see here for 1st subject error: https://pastebin.com/3efCztHu)
That's a templateflow error. @mgxd @oesteban Does it mean anything to you?
Glad to see that it is running again!
Any directory will work, in my experience, as long as you're declaring the work-dir in your singularity run command. If you have a persistent home directory with a bashrc or bash_profile on your HPC, you may be able to export these environment variables there and not worry about them. I do not have this option currently, so I'm not sure how --cleanenv would behave with this method. To keep better track of which cache and tmp files I can safely remove after processing, I make directories with the participant ID and session ID in the name within a parent singularity_tmp directory. The cache and tmp directory variables can then be set to these respective folders preceding singularity run (e.g. SINGULARITY_TMPDIR=/path/to/singularity_tmp/subject_session_stmp SINGULARITY_CACHEDIR=/path/to/singularity_tmp/subject_session_scache). This may not be the ideal way to do things, but it has worked for me so far.
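The per-participant directory layout described above could be sketched roughly as follows. The `setup_singularity_dirs` helper and all paths are hypothetical; the `_stmp` and `_scache` suffixes only mirror the example names given above.

```shell
#!/bin/bash
# Hypothetical helper: create per-participant Singularity tmp and cache
# directories under a parent folder, and export the variables so that a
# singularity process started afterwards inherits them.
setup_singularity_dirs() {
    local parent=$1 label=$2
    export SINGULARITY_TMPDIR="$parent/${label}_stmp"
    export SINGULARITY_CACHEDIR="$parent/${label}_scache"
    mkdir -p "$SINGULARITY_TMPDIR" "$SINGULARITY_CACHEDIR"
}

# Example (paths are placeholders):
#   setup_singularity_dirs /path/to/singularity_tmp sub-01_ses-01
#   singularity run --cleanenv ...
```

Keeping one pair of directories per participant and session makes it easy to see which folders belong to a finished run and can be deleted.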
see here for 1st subject error: https://pastebin.com/3efCztHu)
That's a templateflow error. @mgxd @oesteban Does it mean anything to you?
That is a weird one - it is complaining about a template directory that isn't even used.
Were you running 2 (or more) subjects concurrently? If the TemplateFlow skeleton was updated between processes, that could explain it. Otherwise, can you try rerunning that subject to see if it recurs?
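One way to test the concurrency hypothesis, sketched under assumptions: give each job its own copy of an already-populated TemplateFlow cache and point the container at that copy. The `private_templateflow_home` helper and the scratch paths are hypothetical. The sketch relies on TemplateFlow honouring the TEMPLATEFLOW_HOME variable and on Singularity passing SINGULARITYENV_-prefixed variables into the container.

```shell
#!/bin/bash
# Hypothetical helper (not part of TemplateFlow or fMRIPrep): copy an
# already-populated TemplateFlow cache into a job-private directory and
# print the path of the copy.
private_templateflow_home() {
    local shared=$1 private=$2
    mkdir -p "$private"
    # "shared/." copies the directory contents, including dotfiles.
    cp -R "$shared/." "$private/"
    printf '%s\n' "$private"
}

# Example use inside one job (all paths are placeholders):
#   tf_home=$(private_templateflow_home "$HOME/.cache/templateflow" "/scratch/$USER/templateflow/sub-$1")
#   export SINGULARITYENV_TEMPLATEFLOW_HOME=/templateflow
#   singularity run --cleanenv -B "$tf_home:/templateflow" ...
```

If the error disappears with private copies, that would point towards the shared cache; if it persists, the cause is more likely elsewhere (for example the filesystem itself).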
Thanks again all. The templateflow errors are new (likely because I wasn't asking for explicit output spaces in previous runs as posted above). It seems to have caused issues on around half of the 63 subjects I ran...
Running fMRIPrep on sub-85107
/usr/local/miniconda/lib/python3.7/site-packages/bids/layout/validation.py:46: UserWarning: The ability to$
warnings.warn("The ability to pass arguments to BIDSLayout that control "
Traceback (most recent call last):
File "/usr/local/miniconda/bin/fmriprep", line 6, in <module>
from fmriprep.cli.run import main
File "/usr/local/miniconda/lib/python3.7/site-packages/fmriprep/cli/run.py", line 4, in <module>
from .. import config
File "/usr/local/miniconda/lib/python3.7/site-packages/fmriprep/config.py", line 96, in <module>
from templateflow import __version__ as _tf_ver
File "/usr/local/miniconda/lib/python3.7/site-packages/templateflow/__init__.py", line 31, in <module>
update(local=True, overwrite=False, silent=True)
File "/usr/local/miniconda/lib/python3.7/site-packages/templateflow/conf/__init__.py", line 52, in update
return _update_s3(TF_HOME, local=local, overwrite=overwrite, silent=silent)
File "/usr/local/miniconda/lib/python3.7/site-packages/templateflow/conf/_s3.py", line 20, in update
retval = _update_skeleton(skel_file, dest, overwrite=overwrite, silent=silent)
File "/usr/local/miniconda/lib/python3.7/site-packages/templateflow/conf/_s3.py", line 59, in _update_sk$
current_files = [s.relative_to(dest) for s in dest.glob("**/*")]
File "/usr/local/miniconda/lib/python3.7/site-packages/templateflow/conf/_s3.py", line 59, in <listcomp>
current_files = [s.relative_to(dest) for s in dest.glob("**/*")]
File "/usr/local/miniconda/lib/python3.7/pathlib.py", line 1093, in glob
for p in selector.select_from(self):
File "/usr/local/miniconda/lib/python3.7/pathlib.py", line 553, in _select_from
for p in successor_select(starting_point, is_dir, exists, scandir):
File "/usr/local/miniconda/lib/python3.7/pathlib.py", line 510, in _select_from
entries = list(scandir(parent_path))
OSError: [Errno 5] Input/output error: '/home/users/hassan.bassam/.cache/templateflow/tpl-MNIInfant/cohort$
Finished running fMRIPrep on sub-85107
@mgxd I am running 62 subjects concurrently, but all on different nodes. Are they interfering with each other through this cache? It seems to mention an infant atlas, which I don't think I asked for at all; there are no infants in my data.
Thanks again all, @pcamach2 I did as you did for a run too and am running this alongside the version without this, will reply here tomorrow with what has happened with the other jobs.
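To check whether the Input/output error in the traceback is reproducible outside fMRIPrep, a rough sketch is to walk the same cache directory from the shell on the affected node. The `scan_tree` helper is hypothetical; it relies only on find exiting non-zero when it cannot read part of the tree.

```shell
#!/bin/bash
# Hypothetical helper: walk a directory tree, much as the failing
# glob("**/*") call does, discarding the file names and keeping only
# the exit status.
scan_tree() {
    local root=$1
    # find reports unreadable paths on stderr and exits non-zero.
    find "$root" > /dev/null
}

# Example (cache location taken from the traceback):
#   scan_tree "$HOME/.cache/templateflow" && echo "cache readable"
```

Running this from several nodes at once would show whether the filesystem itself returns errors under concurrent access.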
@hassan-bassam, hope things are going well! Has the processing run through successfully?
@hassan-bassam Please try exporting the following environment variable:
export SINGULARITYENV_TEMPLATEFLOW_AUTOUPDATE=0
singularity run --cleanenv \
/home/users/hassan.bassam/images/fmriprep-20.2.1.simg \
--skip_bids_validation \
--participant-label sub-$1 \
--omp-nthreads 1 \
--mem 32000 --bold2t1w-dof 6 \
--medial-surface-nan \
--output-spaces anat func fsaverage fsaverage6 fsaverage5 \
MNI152NLin2009cAsym:res-native MNI152NLin6Asym:res-native \
--fs-license-file /home/users/hassan.bassam/license.txt --notrack \
--use-syn-sdc --write-graph --work-dir /home/users/hassan.bassam/nifti/work \
/home/users/hassan.bassam/nifti/Nifti /home/users/hassan.bassam/fmriprepped participant
@mgxd the autoupdate is triggered, and the zipfile seems to fail to expand, possibly because of some existing folder and/or permission issues (was the templateflow folder created in another way and then made non-writable?)
@pcamach2
Sorry for the slow reply - the last week or two have been very busy. As it turns out, I think your solution fixed the previous problem I was having, and the only remaining problem was the templateflow bug, which I initially got past for a first run-through simply by re-running failed participants (the success rate was around 70% and the failures were seemingly random). Edit: bit rude of me - thanks so much for all your help!
I now need to re-run all the data, so I will do that with the suggested fix from @oesteban and update! Thanks!
@hassan-bassam No worries! Happy to help!
You may already have this in your debugging workflow, but one more thing that might be worth trying if errors persist is to delete the contents of the cache and tmp directories, so that any error-generating files from prior failed runs do not interfere with further testing. Do this only after the runs for those participants have finished and the contents have been backed up somewhere with fewer storage constraints.
@pcamach2
I had already got this set up with the following lines:
SINGULARITY_CACHEDIR=/scratch/users/hassan.bassam/cache/sub-$1
SINGULARITY_TMPDIR=/scratch/users/hassan.bassam/tmp/sub-$1
if [ -d "$SINGULARITY_CACHEDIR" ]; then rm -Rf "$SINGULARITY_CACHEDIR"; fi
if [ -d "$SINGULARITY_TMPDIR" ]; then rm -Rf "$SINGULARITY_TMPDIR"; fi
Before running fmriprep, which should do the job I hope. Thanks again!
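A slightly more defensive variant of this cleanup, as a sketch only: the hypothetical `reset_dir` helper removes and then recreates a directory. Recreating it rests on my assumption that Singularity expects the tmp and cache directories to exist when it starts. Note also that a plain assignment is only visible to child processes such as singularity if the variable is exported.

```shell
#!/bin/bash
# Hypothetical helper: empty a directory by removing and recreating it.
reset_dir() {
    local dir=$1
    # Refuse an empty argument so that an unset variable is caught early.
    if [ -z "$dir" ]; then
        echo "reset_dir: empty path" >&2
        return 1
    fi
    rm -rf -- "$dir"
    mkdir -p -- "$dir"
}

# Example use before the singularity call (same paths as in the post):
#   export SINGULARITY_CACHEDIR=/scratch/users/hassan.bassam/cache/sub-$1
#   export SINGULARITY_TMPDIR=/scratch/users/hassan.bassam/tmp/sub-$1
#   reset_dir "$SINGULARITY_CACHEDIR" && reset_dir "$SINGULARITY_TMPDIR"
```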