Picongpu: ADIOS not working properly with restart script

Created on 1 Jun 2018 · 26 Comments · Source: ComputationalRadiationPhysics/picongpu

Good evening!
Previously there was an issue where the submit_restart.start script did not submit to the right queue.
https://github.com/ComputationalRadiationPhysics/picongpu/issues/2616#issuecomment-392493125
Now this issue is solved and there is another one with the same script.
Something is wrong with ADIOS. Here is the error message.
stderr.txt

plugin machine/system

NastasiaM

Most helpful comment

It works! And it passed this 22 000 step point, where it was crashing before!
Yuppie!
Thanks!

NastasiaM on 8 Jun 2018

🎉2

All 26 comments

Was the checkpoint you tried to restart from created with the same PIConGPU version (source code)?
Development in the dev branch moves quickly, so checkpoints can be incompatible with older dev versions.

psychocoderHPC on 4 Jun 2018

👍1

Read as: please re-create the checkpoint by running the simulation with the latest source code updates (re-compile, resubmit) again. Then restart from it.

ax3l on 4 Jun 2018

Yes, I already did that before opening the issue.

NastasiaM on 4 Jun 2018

👍1

You can also just skip ADIOS as the checkpoint engine and rely fully on HDF5.
In order to do so, recreate the checkpoint and restart from it with this option added in both .cfg files: --checkpoint.backend hdf5 --checkpoint.restart.backend hdf5
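
A minimal sketch of how this could look in a .cfg file, assuming the usual tbg convention of collecting options in TBG_* shell variables. The variable name TBG_checkpointBackend is made up for illustration; only the two backend flags come from the comment above.

```shell
# Hypothetical excerpt of a PIConGPU .cfg file (variable names are assumptions).
# Only the two backend flags are taken from the comment above.
TBG_checkpointBackend="--checkpoint.backend hdf5 --checkpoint.restart.backend hdf5"

# Assumption: options are passed on by referencing them as !TBG_<name>
# in TBG_plugins, the form used elsewhere in this thread.
# A real cfg would list its other plugin variables here as well.
TBG_plugins="!TBG_checkpointBackend"
```

The same change would go into both .cfg files (the one that writes the checkpoint and the one used for the restart), so that both sides agree on HDF5.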

ax3l on 4 Jun 2018

If you still have issues with ADIOS or HDF5, please post:

the file you sourced to load the modules e.g. picongpu.profile
the used config file ?.cfg
the used template file for the system ?.tpl
stdout and stderr from the broken run

psychocoderHPC on 4 Jun 2018

@NastasiaM I tested the setup I sent you with ADIOS, and everything (simulation and restart) runs well. Do you still have the issue? If so, please read https://github.com/ComputationalRadiationPhysics/picongpu/issues/2618#issuecomment-394344273

psychocoderHPC on 6 Jun 2018

I am testing it now without ADIOS. I can try one more time with ADIOS too, and I will send you the result of the test.

NastasiaM on 6 Jun 2018

👍1

@psychocoderHPC I tested the setup you gave me and, as far as I can see, it always starts from the beginning of the simulation. I am definitely doing something wrong, but the restart itself is working nicely. The last simulation is in /bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=6.3_afterRene_7 and the input is in $HOME/paramSets/2018_05_17_dev_colloidalMelting_TF. Or I can send you the files you need.

NastasiaM on 6 Jun 2018

In the run ColloidalMelting_TF_I=6.3_afterRene_7 you did not activate the restart in the cfg.
The line #TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios" is commented out. If you remove the hashtag, please also add the variable TBG_restart to TBG_plugins.
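
A quick way to check both conditions on a cfg file is sketched below. It is only an illustration: the file written here is a stand-in (the plugin name TBG_somePlugin is hypothetical), and the second check assumes TBG_plugins is defined on a single line.

```shell
# Sketch: report whether TBG_restart is commented out and whether it is
# referenced in TBG_plugins. The file below is a stand-in for a real cfg;
# point "cfg" at your own file instead.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
#TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios"
TBG_plugins="!TBG_somePlugin"
EOF

# Is the assignment preceded by a comment sign?
if grep -q '^[[:space:]]*#[[:space:]]*TBG_restart=' "$cfg"; then
    restart_state="commented out"
else
    restart_state="active or absent"
fi

# Is the variable referenced in the plugin list?
# Assumes TBG_plugins fits on one line.
if grep -q '^TBG_plugins=.*!TBG_restart' "$cfg"; then
    plugins_state="referenced"
else
    plugins_state="not referenced"
fi

echo "TBG_restart is $restart_state and $plugins_state in TBG_plugins"
rm -f "$cfg"
```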

psychocoderHPC on 6 Jun 2018

I did that once, and then the simulation kept crashing and restarting every 2 minutes.

NastasiaM on 6 Jun 2018

As I wrote, you must remove the hashtag in front of #TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios" and add the variable to TBG_plugins.

psychocoderHPC on 6 Jun 2018

Please post

the cfg you used
the path to the folder where you used the cfg.
the tbg command you used to start the simulation

psychocoderHPC on 6 Jun 2018

Good morning!
Here it is!

0004gpus.txt

$HOME/paramSets/2018_05_17_dev_colloidalMelting_TF

tbg -s qsub -c etc/picongpu/0004gpus.cfg -t k20_restart.tpl /bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=6.3_afterRene_8

NastasiaM on 7 Jun 2018

I am not sure whether you sent the wrong file or used the wrong tbg command, but the file extension is txt and in the tbg command you used cfg.

psychocoderHPC on 7 Jun 2018

GitHub doesn't like strange file formats, so I converted it to .txt.

NastasiaM on 7 Jun 2018

I asked @PrometheusPi, who developed the restart.tpl, and learned that you need to remove --checkpoint.restart from your cfg file. This option is set by the tpl if needed.

psychocoderHPC on 7 Jun 2018

Should I remove only --checkpoint.restart or both --checkpoint.restart and --checkpoint.restart.backend adios? And what about !TBG_restart in plugins?

NastasiaM on 7 Jun 2018

Please keep the backend option --checkpoint.restart.backend adios.
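
Put together, the cfg change would look roughly like the sketch below. The flag values are taken from this thread. Keeping !TBG_restart in TBG_plugins is an assumption, made so that the backend option still reaches PIConGPU.

```shell
# Hypothetical .cfg excerpt; only the flag values come from this thread.

# Before: the cfg requested the restart itself.
# TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios"

# After: keep only the backend selection. Per the advice above, the restart
# template sets --checkpoint.restart on its own when it is needed.
TBG_restart="--checkpoint.restart.backend adios"

# Assumption: TBG_restart stays referenced in the plugin list.
# A real cfg would list its other plugin variables here as well.
TBG_plugins="!TBG_restart"
```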

psychocoderHPC on 7 Jun 2018

I tried that and the simulation just didn't restart.

NastasiaM on 7 Jun 2018

Sorry, I checked and the simulation did restart, but it crashed a couple of minutes later.

NastasiaM on 7 Jun 2018

@NastasiaM we need the stderr, stdout and last used .cfg and .tpl files attached to help you.

ax3l on 8 Jun 2018

Here it is!
stderr.txt
stdout.txt
Also, I want to add that this simulation is behaving weirdly: a couple of times the restarted run crashed after a couple of minutes, and now the restarted simulation is running but new h5 files are not appearing.

NastasiaM on 8 Jun 2018

@NastasiaM Your restart template k20_restart.tpl is broken. There is a variable missing in the mpiexec line. I am not sure whether it contains more mistakes. Marco will post the correct file soon.

psychocoderHPC on 8 Jun 2018

ping @n01r

ax3l on 8 Jun 2018

There you go!
I just ran a simulation for an hour with restarts from HDF5 every 500 steps.
k20_restart.txt (overwrite your k20_restart.tpl with it)