Related to https://github.com/nextflow-io/nextflow/issues/286 - I keep hitting the same problem, with nextflow throwing various 'unbound variable' errors. Our cluster setup has a bunch of unbound variables spread across multiple scripts on many machines; I started to try to fix them up with our sysadmin but in the end just changed process.shell as suggested:
$ cat ~/.nextflow/config
process.shell = ['/bin/bash','-e']
This works, with my .command.sh scripts now starting like this:
$ head -n1 work/5a/c18c14383839ee788020fec00b40c9/.command.sh
#!/bin/bash -e
But now I'm having problems with directives. For example, I'm trying to run khmer, which requires a python virtualenv:
beforeScript "source /biol/programs/khmer/khmerEnv/bin/activate"
This fails:
Command wrapper:
/etc/bashrc: line 81: PS1: unbound variable
/biol/programs/khmer/khmerEnv/bin/activate: line 57: PS1: unbound variable
The top of the .command.run script looks like this:
$ cat work/5a/c18c14383839ee788020fec00b40c9/.command.run
#!/bin/bash
# NEXTFLOW TASK: qualityFilterPE (11_AGCCTT)
set -e
set -u
The /etc/bashrc error is what I was seeing before changing the process.shell configuration. It doesn't happen if I remove the beforeScript directive, but I'm not sure why it's coming up again - does nextflow create a subshell to run source?
But the real problem is the use of PS1 in activate, which is generated automatically by python virtualenv. So fixing this would require changing how virtualenv works. Arguably this is what should happen, but...
Is there something else I can configure to get around this - preventing nextflow using set -u in .command.run for example? (I'd prefer not to have to set PS1 to empty for non-interactive use if I can avoid it.)
Yes, NF creates a sub-shell to run the task, but this is happening because the beforeScript is sourced in the wrapper script, i.e. .command.run, which is where set -u is enabled.
You can try to put the source /biol/programs/khmer/khmerEnv/bin/activate on top of your command script instead of using beforeScript.
Let me know if this solves the issue.
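To illustrate the difference between the two places a file can be sourced, here is a minimal sketch. It assumes a hypothetical file, activate_demo, standing in for virtualenv's activate script (it is not the real one), and compares a shell that has set -u enabled, like the wrapper, with one that only uses -e, like the process.shell setting above.

```shell
#!/bin/bash
# Hypothetical stand-in for an activate script: it reads PS1 without a default.
tmpdir=$(mktemp -d)
printf '%s\n' '_OLD_PS1="$PS1"' 'echo "activated"' > "$tmpdir/activate_demo"

# Case 1: sourced in a shell where `set -u` is active (as in the wrapper script).
strict_out=$(env -u PS1 bash -c 'set -e; set -u; source "$1"' _ "$tmpdir/activate_demo" 2>&1)
strict_rc=$?

# Case 2: sourced in a shell started with -e only (as with process.shell above).
relaxed_out=$(env -u PS1 bash -e -c 'source "$1"' _ "$tmpdir/activate_demo" 2>&1)
relaxed_rc=$?

printf 'strict:  rc=%s out=%s\n' "$strict_rc" "$strict_out"
printf 'relaxed: rc=%s out=%s\n' "$relaxed_rc" "$relaxed_out"
rm -rf "$tmpdir"
```

The strict case stops with a "PS1: unbound variable" message and a non-zero exit status, while the relaxed case sources the same file without complaint.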
Thanks - OK, sourcing activate in the script works. However, I just noticed that I'm also getting the /etc/bashrc error at the start of every .command.log:
$ head -n1 */*/.command.log
==> 29/e680093a539c9c8c8d1c5e77d47979/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> 2a/3f952765e2d14d1a2b9be37f41e109/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> 80/7bab72003988b1081950e8b7c0739f/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> af/1170db1f75c32572832e3eea5cb02f/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> bf/434ee832334b80cb2e64164b64568d/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> dc/42900b7ff92c01b985f2c9d74b5a9d/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> df/5e540274fe100da998976c059c6d27/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
==> e2/2435f69b5a6ed6c098606ab6079a73/.command.log <==
/etc/bashrc: line 81: PS1: unbound variable
All the other output files appear to be fine and the tasks complete successfully, but I'd prefer to fix this if possible. Is there a way to get around these messages?
I fear you need to set it to an empty value. It's very bad practice to use unbound variables.
OK, fair enough - but just out of interest, what creates the .command.log file? What is starting the subshell, separate from the definition of process.shell?
The .command.log is created by the top NF process that launches .command.run, which in turn launches .command.sh (when using the local executor).
If you are using a batch scheduler, it is created by the scheduler instead.
I'm closing this issue because there's no more feedback. Feel free to comment/reopen if needed.
So I've been getting similar errors. Specifically, commands that fail always have this line in their .command.log file:
"/n/sw/fasrcsw/apps/lmod/lmod/init/bash: line 87: PS1: unbound variable"
I've tried putting export PS1="" ; before the command in NF's script section, but that doesn't seem to have done the trick either. Any ideas?
Can you change into the work dir of the failing task and execute the following command:
bash -x .command.run
Then include the printed output here.
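When reading bash -x output it can be hard to tell which file a traced line came from. A minimal sketch of one way to label each line with its source file and line number, using bash's PS4 variable. The files main.sh and helper.sh and the variable name TRACE_PS4 are hypothetical, made up for this illustration.

```shell
#!/bin/bash
# Hypothetical two-file setup: main.sh sources helper.sh.
tmpdir=$(mktemp -d)
printf '%s\n' 'HELPER_VAR=1' > "$tmpdir/helper.sh"
printf '%s\n' "source \"$tmpdir/helper.sh\"" 'echo done' > "$tmpdir/main.sh"

# PS4 is expanded each time a trace line is printed, so BASH_SOURCE and LINENO
# identify the file and line being executed at that moment.
trace=$(TRACE_PS4='+ ${BASH_SOURCE}:${LINENO}: ' \
  bash -c 'PS4=$TRACE_PS4; set -x; source "$1"' _ "$tmpdir/main.sh" 2>&1)

printf '%s\n' "$trace"
rm -rf "$tmpdir"
```

Each traced command is then prefixed with the path of the file it belongs to, which makes it easier to spot a startup file that is being pulled in unexpectedly.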
Of course. When I run that command, it initially kicks this out:
+ LMOD_PKG=/n/sw/fasrcsw/apps/lmod/lmod
+ LMOD_DIR=/n/sw/fasrcsw/apps/lmod/lmod/libexec/
+ LMOD_CMD=/n/sw/fasrcsw/apps/lmod/lmod/libexec/lmod
+ export LMOD_PKG
+ export LMOD_CMD
+ export LMOD_DIR
+ '[' : '!=' : ']'
+ '[' '' ']'
+ '[' 4 -ge 3 ']'
+ '[' -r /n/sw/fasrcsw/apps/lmod/lmod/init/lmod_bash_completions ']'
+ . /n/sw/fasrcsw/apps/lmod/lmod/init/lmod_bash_completions
++ complete -F _module module
++ complete -F _ml ml
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.begin
+ '[' -f /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.env ']'
+ [[ -n '' ]]
+ rm -f 262598492008449892556870034539879592913.3.stderr
+ rm -f 262598492008449892556870034539879592913.3.orcaout
+ ln -s /n/home04/tantrev/pure/calculations/orca/output/ground_state/262598492008449892556870034539879592913.3/262598492008449892556870034539879592913.3.stderr 262598492008449892556870034539879592913.3.stderr
+ ln -s /n/home04/tantrev/pure/calculations/orca/output/ground_state/262598492008449892556870034539879592913.3/262598492008449892556870034539879592913.3.orcaout 262598492008449892556870034539879592913.3.orcaout
+ set +e
+ COUT=/n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.po
+ mkfifo /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.po
+ CERR=/n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.pe
+ mkfifo /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.pe
+ tee1=5617
+ tee2=5618
+ pid=5619
+ wait 5619
+ tee .command.err
+ tee .command.out
+ /bin/bash /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.run.1
/n/sw/fasrcsw/apps/lmod/lmod/init/bash: line 87: PS1: unbound variable
And then when I CTRL+C, it kicks this out.
What is invoking /n/sw/fasrcsw/apps/lmod/lmod/init/bash? Is it in your ~/.bashrc or ~/.profile file?
I have no idea. When I try running which bash in my regular user account, I get back /bin/bash. I can't find anything in my .bashrc or .bash_profile that would obviously be the culprit.
My .bashrc file is as follows:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Shell settings
export HISTCONTROL=ignoredups
# Editor settings
export EDITOR=vi
export VISUAL=vi
# Limits
ulimit -c 0
ulimit -s unlimited
source new-modules.sh
module load gcc/6.1.0-fasrc01 openmpi/2.0.2.40dc0399-fasrc01
export QCSCRATCH=${HOME}/scratch
export PATH=/n/home04/tantrev/pure/orca_4_0_0_2_linux_x86-64:$PATH
export LD_LIBRARY_PATH=/n/sw/terachem-1.9/TeraChem/lib:$LD_LIBRARY_PATH
module load cuda/6.5-fasrc02
export QCPLATFORM=LINUX_Ix86_64
export QCSCRATCH=/scratch/tantrev
export QCMPI=mpich
export QCRSH=ssh
export QCFILEPREF=$QCLOCALSCR
export PATH="/usr/bin:$PATH"
And my .bash_profile is:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH
Can you try editing the file .command.run and replacing the line
/bin/bash /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.run.1
with
/bin/bash --norc /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.run.1
and then run bash -x .command.run as before?
Sure thing.
Here's the immediate output after the modification:
+ LMOD_PKG=/n/sw/fasrcsw/apps/lmod/lmod
+ LMOD_DIR=/n/sw/fasrcsw/apps/lmod/lmod/libexec/
+ LMOD_CMD=/n/sw/fasrcsw/apps/lmod/lmod/libexec/lmod
+ export LMOD_PKG
+ export LMOD_CMD
+ export LMOD_DIR
+ '[' : '!=' : ']'
+ '[' '' ']'
+ '[' 4 -ge 3 ']'
+ '[' -r /n/sw/fasrcsw/apps/lmod/lmod/init/lmod_bash_completions ']'
+ . /n/sw/fasrcsw/apps/lmod/lmod/init/lmod_bash_completions
++ complete -F _module module
++ complete -F _ml ml
+ set -e
+ set -u
+ NXF_DEBUG=0
+ [[ 0 > 1 ]]
+ trap on_exit EXIT
+ trap on_term TERM INT USR1 USR2
+ NXF_SCRATCH=
+ [[ 0 > 0 ]]
+ touch /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.begin
+ '[' -f /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.env ']'
+ [[ -n '' ]]
+ rm -f 262598492008449892556870034539879592913.3.stderr
+ rm -f 262598492008449892556870034539879592913.3.orcaout
+ ln -s /n/home04/tantrev/pure/calculations/orca/output/ground_state/262598492008449892556870034539879592913.3/262598492008449892556870034539879592913.3.stderr 262598492008449892556870034539879592913.3.stderr
+ ln -s /n/home04/tantrev/pure/calculations/orca/output/ground_state/262598492008449892556870034539879592913.3/262598492008449892556870034539879592913.3.orcaout 262598492008449892556870034539879592913.3.orcaout
+ set +e
+ COUT=/n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.po
+ mkfifo /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.po
+ CERR=/n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.pe
+ mkfifo /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.pe
+ tee1=27231
+ tee2=27232
+ pid=27233
+ wait 27233
+ tee .command.out
+ tee .command.err
+ /bin/bash --norc /n/home04/tantrev/pure/work/32/49f154e88485c954a21e0c52e767bc/.command.run.1
/n/sw/fasrcsw/apps/lmod/lmod/init/bash: line 87: PS1: unbound variable
And here's the output after CTRL+C.
OK, no difference. However, I think it can be solved by adding the following entry to the nextflow.config file:
env.PS1=''
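The thread never establishes what pulls in the Lmod init file. One possible explanation, offered here as an assumption rather than a diagnosis, is the BASH_ENV variable: when bash runs a script non-interactively it reads the file named by BASH_ENV at startup, while --norc only concerns ~/.bashrc for interactive shells. The sketch below uses hypothetical files (site_init.sh, task.sh) to show that --norc does not prevent such a file from being read, and that a -u given on the command line is already in effect while it is read.

```shell
#!/bin/bash
tmpdir=$(mktemp -d)
# Hypothetical stand-in for a site init file that reads PS1 without a default.
printf '%s\n' ': "$PS1"' > "$tmpdir/site_init.sh"
printf '%s\n' 'echo "task ran"' > "$tmpdir/task.sh"

# With BASH_ENV set, the init file is read even though --norc is given.
with_env_out=$(env -u PS1 BASH_ENV="$tmpdir/site_init.sh" \
  bash --norc -u "$tmpdir/task.sh" 2>&1)

# Without BASH_ENV, the same invocation produces no such message.
without_env_out=$(env -u PS1 -u BASH_ENV bash --norc -u "$tmpdir/task.sh" 2>&1)

printf 'with BASH_ENV:    %s\n' "$with_env_out"
printf 'without BASH_ENV: %s\n' "$without_env_out"
rm -rf "$tmpdir"
```

Whether this applies on a given cluster can be checked by printing "${BASH_ENV:-unset}" from inside a task.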
Thank you! I'll try that and get back to you soon. Sometimes it takes a little bit for the error to manifest itself.
So I'm afraid I'm still getting errors. Here's what the .command.log file is still saying, after the NF config modification:
/n/sw/fasrcsw/apps/lmod/lmod/init/bash: line 87: PS1: unbound variable
The odd thing is that it only happens with some jobs, not all of them...
Are you using a cluster or local execution?
Just a SLURM cluster through "fasrc".
The fact that you are getting this problem only for some jobs suggests to me that something is odd (the missing PS1 variable) only on certain nodes in the cluster.
I would suggest asking your sysadmins for help and posting any progress on this issue here.
Turns out I just had a faulty script that was hanging - the PS1 error wasn't actually affecting anything. Sorry for the confusion, thanks for all your help!
So, unfortunately, it turns out there's still a problem. For some reason, even when jobs execute just fine, NF is reporting them as having "FAILED". But the expected output is produced and the ".exitcode" is zero. The PS1 error is the only error in the logfiles, however. Is it possible this PS1 error is the root cause of such behavior?
It looks like a different problem. Please open a new issue including the NF stdout, the .nextflow.log and, if possible, the code causing the problem.
Just for reference for the next person who finds this, here's the complete code snippet I used in my Nextflow process to fix the original issue (based on the discussion here and in the linked thread):
script:
"""
export PS=\${PS:-''}
export PS1=\${PS1:-''}
source venv/bin/activate
"""
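In a Nextflow script block written with triple double quotes, \$ keeps the dollar sign for bash, so the task shell sees export PS1=${PS1:-''}. A minimal sketch of what that expansion does under set -u, again using a hypothetical activate_demo file rather than a real virtualenv:

```shell
#!/bin/bash
# Hypothetical stand-in for an activate script that reads and rewrites PS1.
tmpdir=$(mktemp -d)
printf '%s\n' '_OLD_PS1="$PS1"' 'PS1="(venv) $PS1"' > "$tmpdir/activate_demo"

# Run in a subshell so the caller's options and PS1 are left untouched.
result=$(
  set -u
  unset PS1
  # ${PS1:-''} expands to the empty string when PS1 is unset or empty,
  # so the assignment never trips `set -u`.
  export PS1=${PS1:-''}
  source "$tmpdir/activate_demo"
  printf '%s' "$PS1"
)

printf 'PS1 after sourcing: [%s]\n' "$result"
rm -rf "$tmpdir"
```

Because the default only applies when the variable is unset or empty, a PS1 that already has a value would be left as it is.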