Good evening!
There was an issue where the submit_restart.start script didn't go to the right queue.
https://github.com/ComputationalRadiationPhysics/picongpu/issues/2616#issuecomment-392493125
Now this issue is solved and there is another one with the same script.
Something is wrong with ADIOS. Here is the error message.
stderr.txt
Was the checkpoint you tried to restart from created with the same PIConGPU version (source code)?
Development on the dev branch moves quickly, so checkpoints can be incompatible with older dev versions.
Read as: please re-create the checkpoint by running the simulation with the latest source code updates (re-compile, resubmit) again. Then restart from it.
Yes, I already did that before opening the issue.
You can also just skip ADIOS as the checkpoint engine and rely fully on HDF5.
In order to do so, recreate the checkpoint and restart from it with this option added in both .cfg files: --checkpoint.backend hdf5 --checkpoint.restart.backend hdf5
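As a rough sketch of what that could look like in a .cfg file: the variable name `TBG_ckptBackend` is hypothetical and chosen only for illustration, and `TBG_plugins` is shown with this single entry, whereas a real file will list other plugins too. Only the two option strings come from the reply above.

```shell
# Sketch of a .cfg fragment; TBG_ckptBackend is a hypothetical variable name.
# Both options are needed: one for writing checkpoints, one for reading them on restart.
TBG_ckptBackend="--checkpoint.backend hdf5 --checkpoint.restart.backend hdf5"

# Reference the variable with a leading "!" so that tbg substitutes its value.
# Only this entry is shown here; keep your existing plugin entries next to it.
TBG_plugins="!TBG_ckptBackend"
```

The same fragment would have to go into both .cfg files, the one that writes the checkpoint and the one used for the restart.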
If you still have issues with ADIOS or HDF5, please post:
- picongpu.profile
- ?.cfg
- ?.tpl
- stdout and stderr from the broken run
@NastasiaM I tested the setup I sent you with ADIOS, and everything (simulation and restart) runs well. Do you still have issues? If so, please read https://github.com/ComputationalRadiationPhysics/picongpu/issues/2618#issuecomment-394344273
I am testing it now without ADIOS. I can try one more time with ADIOS too, and I will send you the result of the test.
@psychocoderHPC I tested the setup you gave me and, as far as I can see, it always starts from the beginning of the simulation. I am definitely doing something wrong, but the restart itself is working nicely. The last simulation is in /bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=6.3_afterRene_7 and the input is in $HOME/paramSets/2018_05_17_dev_colloidalMelting_TF. Or I can send you the files you need.
In run ColloidalMelting_TF_I=6.3_afterRene_7 you did not activate the restart in the cfg.
#TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios" is commented out. If you remove the hashtag, please also add the variable TBG_restart to TBG_plugins.
I did that once, and then the simulation kept crashing and restarting every 2 minutes.
As I wrote, you must remove the hashtag before #TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios" and add the variable to TBG_plugins.
Please post the cfg.
Good morning!
Here it is!
$HOME/paramSets/2018_05_17_dev_colloidalMelting_TF
tbg -s qsub -c etc/picongpu/0004gpus.cfg -t k20_restart.tpl /bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=6.3_afterRene_8
I am not sure whether you sent the wrong file or used the wrong tbg command, but the file extension is txt and in the tbg command you used cfg.
GitHub doesn't like strange file formats, so I converted it to .txt.
I asked @PrometheusPi, who developed the restart.tpl, and learned that you need to remove --checkpoint.restart from your cfg file. This option is set by the tpl if needed.
Should I remove only --checkpoint.restart, or both --checkpoint.restart and --checkpoint.restart.backend adios? And what about !TBG_restart in plugins?
Please keep the backend option: --checkpoint.restart.backend adios
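Putting the last few replies together, the restart-related lines of the .cfg might end up looking like the sketch below. It assumes the variable keeps the name `TBG_restart` from the commented-out line quoted earlier and that it stays referenced in `TBG_plugins`. The thread does not confirm that last point explicitly, and a real `TBG_plugins` will contain other entries as well.

```shell
# Sketch of the restart-related .cfg lines (an assumption, not the verified file).
# --checkpoint.restart is left out on purpose: the restart tpl sets it when needed.
TBG_restart="--checkpoint.restart.backend adios"

# Keep the variable referenced so the backend option still reaches the command line.
# Only this entry is shown here; keep your other plugin entries alongside it.
TBG_plugins="!TBG_restart"
```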
I tried that, and the simulation just didn't restart.
Sorry, I checked again: the simulation did restart, but it crashed a couple of minutes later.
@NastasiaM we need the stderr, stdout and last used .cfg and .tpl files attached to help you.
Here it is!
stderr.txt
stdout.txt
Also, I want to add that this simulation is behaving weirdly: a couple of times the restarted run crashed after a couple of minutes, and now the restarted simulation is running but new h5 files are not appearing.
@NastasiaM Your restart template k20_restart.tpl is broken. There is a variable missing in the mpiexec line. I am not sure whether there are more mistakes in it. Marco will post the correct file soon.
ping @n01r
There you go!
I just ran a simulation for an hour with restarts from HDF5 every 500 steps.
k20_restart.txt (overwrite your k20_restart.tpl with it)
It works! And it passed this 22 000 step point, where it was crashing before!
Yuppie!
Thanks!