PIConGPU: splash::DCException

Created on 16 Jan 2020 · 6 comments · Source: ComputationalRadiationPhysics/picongpu

I get this error in various circumstances. Sometimes as the solver runs (first file) and sometimes when it just starts (second file). In both cases the model was built after removing the current .build folder. What can it be?

first-error.txt
second-error.txt

This error seems to appear randomly. A new run on exactly the same files can start successfully after a first failure.

Labels: plugin, question

All 6 comments

Hello @cbontoiu.
Unfortunately, from this output I don't have a good idea. In the first file, the exception arises when trying to create a new HDF5 data set, which can happen for several reasons: somehow wrong parameters end up being used (due to a bug in PIConGPU or, e.g., a mistake in your .cfg), or one of the two usual suspects: no disk space or a filesystem issue.
In the second file, please pay attention to the warning on the second line of the output. It looks like you modified the setup after building PIConGPU (which may or may not have contributed to the exception). It is indeed strange that the error appears randomly. For lack of better ideas, if we choose to suspect HDF5/libSplash, ADIOS could be tried as the output backend.

@cbontoiu What kind of filesystem do you use? Is it an HPC system with Lustre or GPFS, or your local machine?
If it is a local workstation, could you please post the output of df -h?


Thank you for your reply.
It is a local machine; here is the output of the command:

Filesystem Size Used Avail Use% Mounted on
udev 7.8G 0 7.8G 0% /dev
tmpfs 1.6G 1.8M 1.6G 1% /run
/dev/sda1 110G 62G 43G 60% /
tmpfs 7.9G 112M 7.8G 2% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
tmpfs 1.6G 48K 1.6G 1% /run/user/1000
/dev/sdb1 29G 281M 29G 1% /media/cristi/cheita

If storage space could have caused this exception: I can confirm that I use a small disk, and a few times it has happened that it filled up with data.
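Since a full disk is one of the usual suspects for a failed HDF5 write, a quick check of the target filesystem before (or during) a run can rule it out. A minimal sketch using POSIX df -P; OUTPUT_DIR is a placeholder, not a path from this thread:

```shell
# Check free space on the filesystem holding the simulation output.
# OUTPUT_DIR is a placeholder; point it at your actual run directory.
OUTPUT_DIR="${OUTPUT_DIR:-.}"

# df -P prints one POSIX-format line per filesystem:
# field 4 = available 1K-blocks, field 5 = use%.
avail_kb=$(df -P "$OUTPUT_DIR" | awk 'NR==2 {print $4}')
use_pct=$(df -P "$OUTPUT_DIR" | awk 'NR==2 {sub("%","",$5); print $5}')

echo "available: ${avail_kb} KiB (${use_pct}% used)"
if [ "$use_pct" -ge 90 ]; then
  echo "WARNING: filesystem almost full; HDF5 writes may fail" >&2
fi
```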

A side question: is it possible for us from the Cockcroft Institute/University of Liverpool to access a GPU cluster in Dresden or elsewhere in order to run PIConGPU? It would help us a lot, and of course we are happy to pay for this service. What would you recommend? What about the Amazon cloud?
https://aws.amazon.com/nvidia/

Regards.
Cristian

@cbontoiu I have no idea about the availability of the Dresden clusters. What is often done for large simulation campaigns is submitting applications for clusters available to academia; e.g. PRACE has a number of such clusters. All this, of course, takes quite some time.

Concerning the storage size: with PIConGPU one can generate lots of output very quickly with 3D output. I am not sure this is what led to this particular issue, but it is indeed a problem you might face on this machine. In that case it might make sense to avoid or limit the 3D output by doing postprocessing with our plugins on-the-fly during the simulation, and outputting only the postprocessed results rather than the raw data.
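For example, on-the-fly postprocessing can be configured in the run's .cfg file by giving an analysis plugin a frequent period while keeping full raw dumps rare. This is only a sketch using PIConGPU 0.4.x-style flag names (`--e_energyHistogram.*` for the electron energy histogram plugin, `--hdf5.period` for raw HDF5 dumps); check the plugin documentation for your version before copying values:

```shell
# Fragment of a PIConGPU .cfg file (these files are sourced as bash by tbg).
# Postprocess on-the-fly: write an electron energy histogram every 100 steps,
# but a full raw HDF5 dump only every 5000 steps (flag names and values are
# illustrative, following the 0.4.x plugin docs).
TBG_plugins="--e_energyHistogram.period 100    \
             --e_energyHistogram.binCount 1024 \
             --hdf5.period 5000"
```

Dropping `--hdf5.period` entirely would disable the raw dumps and keep only the small, postprocessed histogram files on disk.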

Hi Cristian,
we have had collaborators working with us share some on-boarding compute time on our sites before; please contact the group leader at HZDR, Michael Bussmann, by mail to see if this might work here.

Otherwise, there are also smaller international sites with GPUs in Europe that one can apply to. PRACE was mentioned above and is rather large-scale, but individual labs such as FZ Jülich in the Helmholtz Association, CINECA in Italy, etc. might have programs where international/European researchers can apply for moderate-size compute time.

We haven't used much cloud computing ourselves, since we have access to various HPC systems for production. Nevertheless, at SC19 I again saw offers from AWS, Google Cloud and others that provide quite decent HPC setups one can self-configure. They do have support staff specifically for HPC users and provide the usual software stack that our software relies on (MPI, schedulers, etc.), so it may really be worth contacting them to ask about their services and pricing.

Alternatively, you can also compile PIConGPU with its omp2b (OpenMP) backend and run it on CPU clusters. CPUs are traditionally an order of magnitude slower in time-to-solution (but provide more RAM per HPC node). So if you already have access to CPU clusters locally, just compile PIConGPU for CPU:
https://picongpu.readthedocs.io/en/0.4.3/usage/basics.html#usage-basics-build
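Following the linked basics guide, switching to the CPU backend comes down to passing omp2b to pic-build. A sketch assuming a 0.4.x-style workflow; the input directory path is a placeholder:

```shell
# Create a simulation from a shipped example and build it for CPUs.
# $HOME/picInputs/myLWFA is a placeholder input directory.
pic-create $PICSRC/share/picongpu/examples/LaserWakefield $HOME/picInputs/myLWFA
cd $HOME/picInputs/myLWFA

# Select the OpenMP 2 backend, optimized for the host CPU,
# instead of the default CUDA backend.
pic-build -b "omp2b:native"
```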

