Picongpu: ADIOS not working properly with restart script

Created on 1 Jun 2018 · 26 Comments · Source: ComputationalRadiationPhysics/picongpu

Good evening!
Previously, the submit_restart.start script did not submit to the right queue:
https://github.com/ComputationalRadiationPhysics/picongpu/issues/2616#issuecomment-392493125
That issue is now solved, but there is another one with the same script.
Something is wrong with ADIOS; here is the error message:
stderr.txt

plugin machine/system

All 26 comments

Was the checkpoint you tried to restart from created with the same PIConGPU version (source code)?
Development in the dev branch moves quickly, so checkpoints can be incompatible with older dev versions.

Read as: please re-create the checkpoint by running the simulation with the latest source code updates (re-compile, resubmit) again. Then restart from it.

Yes, I already did that before opening the issue.

You can also just skip ADIOS as the checkpoint engine and rely fully on HDF5.
In order to do so, recreate the checkpoint and restart from it with this option added in both .cfg files: --checkpoint.backend hdf5 --checkpoint.restart.backend hdf5
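As a rough sketch of the suggestion above, the two flags could be placed in the .cfg like this. Only the flags themselves come from the comment; the variable name TBG_checkpointBackend is hypothetical, and the real TBG_plugins line is assumed to contain further entries that are not shown here.

```shell
# Hypothetical variable name; only the two flags are taken from the thread.
TBG_checkpointBackend="--checkpoint.backend hdf5 --checkpoint.restart.backend hdf5"

# Reference the variable from the plugin list so tbg passes the flags on.
# Assumption: the real TBG_plugins holds more entries, which stay unchanged.
TBG_plugins="!TBG_checkpointBackend"
```

The same change would go into both .cfg files (the one used for the initial run and the one used for the restart), so that the checkpoint is written and read with the same backend.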

If you still have issues with ADIOS or HDF5, please post:

  • the file you sourced to load the modules e.g. picongpu.profile
  • the used config file ?.cfg
  • the used template file for the system ?.tpl
  • stdout and stderr from the broken run

@NastasiaM I tested the setup I sent you with ADIOS, and everything (simulation and restart) runs well. Do you still have issues? If so, please read https://github.com/ComputationalRadiationPhysics/picongpu/issues/2618#issuecomment-394344273

I am testing it now without ADIOS. I can try once more with ADIOS too, and I will send you the results of the test.

@psychocoderHPC I tested the setup you gave me and, as far as I can see, it always starts from the beginning of the simulation. I am definitely doing something wrong, but the restart itself is working nicely. The last simulation is in /bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=6.3_afterRene_7 and the input is in $HOME/paramSets/2018_05_17_dev_colloidalMelting_TF. Or I can send you the files you need.

In the run ColloidalMelting_TF_I=6.3_afterRene_7 you did not activate the restart in the cfg.
The line #TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios" is commented out. If you remove the hash sign, please also add the variable TBG_restart to TBG_plugins.

I did that once, and then the simulation kept crashing and restarting every 2 minutes.

As I wrote, you must remove the hash sign before #TBG_restart="--checkpoint.restart --checkpoint.restart.backend adios" and add the variable to TBG_plugins.

Please post

  • the cfg you used
  • the path to the folder where you used the cfg.
  • the tbg command you used to start the simulation

Good morning!
Here it is!

0004gpus.txt

$HOME/paramSets/2018_05_17_dev_colloidalMelting_TF

tbg -s qsub -c etc/picongpu/0004gpus.cfg -t k20_restart.tpl /bigdata/hplsim/external/mukhar40/ColloidalMelting_TF_I=6.3_afterRene_8

I am not sure whether you sent the wrong file or used the wrong tbg command, but the file extension is txt and in the tbg command you used cfg.

GitHub doesn't accept unusual file formats, so I converted it to .txt.

I asked @PrometheusPi, the developer of the restart.tpl, and learned that you need to remove --checkpoint.restart from your cfg file. This option is set by the tpl if needed.

Should I remove only --checkpoint.restart, or both --checkpoint.restart and --checkpoint.restart.backend adios? And what about !TBG_restart in plugins?

Please keep the backend option --checkpoint.restart.backend adios.
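Taken together, the last few comments suggest that the restart-related lines in the .cfg would look roughly like the sketch below. The variable name TBG_restart, the !TBG_restart reference, and the backend flag are taken from the thread; the rest of TBG_plugins is not reproduced here and is assumed to contain further entries.

```shell
# Keep only the backend choice. --checkpoint.restart itself is left out,
# because according to the thread the restart .tpl sets it when needed.
TBG_restart="--checkpoint.restart.backend adios"

# Reference the variable from the plugin list so tbg passes it on.
# Assumption: the real TBG_plugins holds more entries, which stay unchanged.
TBG_plugins="!TBG_restart"
```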

I tried that and the simulation just didn't restart.

Sorry, I checked and the simulation did restart, but it crashed a couple of minutes later.

@NastasiaM we need the stderr, stdout and last used .cfg and .tpl files attached to help you.

Here it is!
stderr.txt
stdout.txt
Also, I want to add that this simulation is behaving weirdly: a couple of times a restarted run crashed after a couple of minutes, and now the restarted simulation is running but new h5 files are not appearing.

@NastasiaM Your k20_restart.tpl is broken: a variable is missing in the mpiexec line. I am not sure whether it contains more mistakes. Marco will post the correct file soon.

ping @n01r

There you go!
I just ran a simulation for an hour with restarts from HDF5 every 500 steps.
k20_restart.txt (overwrite your k20_restart.tpl with it)

It works! And it passed this 22 000 step point, where it was crashing before!
Yuppie!
Thanks!
