I have not used HZDR hemera for a while. Today I tried to submit a 4-node job to the k80 queue, but got:
Your job has requested more processes than the ppr for
this topology can support:
App: /bigdata/hplsim/external/andriy27/cheese_01/input/bin/cuda_memtest.sh
Number of procs: 32
PPR: 8:node
Please revise the conflict and try again.
The code and profiles are updated to the latest dev branch.
I saw an email from Henrik yesterday about defining the number of CPUs on hemera, but I'm not sure whether this is already accounted for in the profiles.
PS: I can also confirm seeing #3047.
@hightower8083, the one from September 2nd was just about soft- and hard limits for the number of processes per user on hypnos.
Nevertheless, @henrikschulz could that be connected to the announcement from August 5th?
... Beginning next Monday I will cancel all jobs which are submitted onto several nodes, but with a number of tasks less than 40. ...
This issue is not related to my e-mail from August 5th. In order to analyze this problem, I need more information:
Is the error reproducible?
I have just tried running our standard LWFA with 32.cfg on Hemera k80, so it should be the same configuration node-wise. I have not encountered the issue described. (The issue is related to the CUDA memtest that runs before picongpu, so different physics does not matter here.)
@hightower8083 can you please check that the right .tpl file was used? The tbg directory inside your output directory has copies of the .cfg and .tpl that were used, stored as submit.cfg and submit.tpl.
OK, I've taken the straight path -- rebuilt everything and now it works properly.
So basically no idea what could have caused this issue, hence, sorry for the bother. Anyway, thanks for the prompt reaction, guys!
@hightower8083 not sure if you have time and/or motivation for further investigation.
In case you do, please check whether there is a difference between the failing and passing runs in the submit.cfg and submit.tpl files. (Or just attach all 4 here.)
Iirc, I once ran into an issue with the same workflow as you describe (setup created from an older version, then the code updated, then the setup rebuilt) while using the default value of -t in my tbg command. It turned out the .tpl file had been changed in the repository, but since I did not re-run pic-create, my setup still had the old version, which ended up being used.
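For anyone who wants to do the comparison suggested above, here is a minimal sketch in Python. It assumes both output directories still exist and each contains a tbg subdirectory with submit.cfg and submit.tpl, as described earlier in this thread. The function name diff_submit_files and the command-line usage are made up for illustration and are not part of tbg or PIConGPU.

```python
import difflib
import sys
from pathlib import Path


def diff_submit_files(failing_run, passing_run,
                      names=("submit.cfg", "submit.tpl")):
    """Return {file name: unified-diff lines} for the tbg copies of two runs.

    Hypothetical helper: it only assumes the layout <output dir>/tbg/<name>.
    An empty list means the two copies are identical.
    """
    result = {}
    for name in names:
        a = Path(failing_run) / "tbg" / name
        b = Path(passing_run) / "tbg" / name
        a_lines = a.read_text().splitlines(keepends=True)
        b_lines = b.read_text().splitlines(keepends=True)
        result[name] = list(
            difflib.unified_diff(a_lines, b_lines,
                                 fromfile=str(a), tofile=str(b))
        )
    return result


if __name__ == "__main__":
    # Usage: python diff_submit.py <failing output dir> <passing output dir>
    diffs = diff_submit_files(sys.argv[1], sys.argv[2])
    for name, lines in diffs.items():
        if lines:
            sys.stdout.writelines(lines)
        else:
            print(f"{name}: identical")
```

If submit.tpl differs between the two runs, a stale template picked up by the default -t is a likely suspect.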
Right, that's a common one where an infinite number of different errors can come up. ^^
It's always a good idea to start from a fresh example then.