Picongpu: Hypnos: CUDA 9 driver Restart Hangs

Created on 29 Oct 2017 · 8 Comments · Source: ComputationalRadiationPhysics/picongpu

With the new driver for CUDA 9, I encountered multiple hangs while restarting PIConGPU from an HDF5 checkpoint using libSplash.
The error message was:

Error in `[...]/picongpu/bin/picongpu': corrupted size vs. prev_size: 0x0000000005082cf0

which points to a double free or to memory that was never allocated.
The hang during restart seems to be random: sometimes everything goes fine, sometimes the run just stops making progress without crashing.

I compiled for the K80 architecture using 37 (compute capability 3.7), activated blocking kernels and set SPLASH_VERBOSE=100. Nothing worked.
The last entry before the error message came from libsplash:

[1,62]<stderr>:[SPLASH_LOG:62] Entry 'particles/e/particlePatches/offset/x' (17) is of type: UInt64
[1,62]<stderr>:[SPLASH_LOG:62] readCompleteDataSet
[1,62]<stderr>:[SPLASH_LOG:62] DCDataSet::read (x)
[1,62]<stderr>:[SPLASH_LOG:62] 
[1,62]<stderr>: ndims         = 1
[1,62]<stderr>: logical_size  = (64,1,1)
[1,62]<stderr>: physical_size = (64,1,1)
[1,62]<stderr>: dstBuffer     = (64,1,1)
[1,62]<stderr>: dstOffset     = (0,0,0)
[1,62]<stderr>: srcSize       = (64,1,1)
[1,62]<stderr>: srcOffset     = (0,0,0)
[1,62]<stderr>:
[1,8]<stderr>:*** Error in `[...]/picongpu/bin/picongpu': corrupted size vs. prev_size: 0x0000000004266de0 ***
[1,8]<stderr>:[kepler021:16502] *** Process received signal ***
[1,8]<stderr>:[kepler021:16502] Signal: Aborted (6)
[1,8]<stderr>:[kepler021:16502] Signal code:  (-6)

(qdel was needed to stop the job)

This might be a bug in libSplash, but since it only started to occur after the CUDA 9 driver update, I expect it to be something in PIConGPU.

I am working on the latest release and with various setups (LWFA, TWTS).

Any suggestions how to test this further?


All 8 comments

Another simulation failed after particles/e/particlePatches/offset/x again.

EDIT: number of simulations that hung after the above read: 2

Workaround for now:

Because a hanging simulation can block the cluster for hours to days, I implemented the following workaround.
Right before calling picongpu, I execute the following script in the background (~/checkRestart.sh &):

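#!/bin/bash

# Give PIConGPU 15 minutes to get through initialization.
sleep 900

# If the "initialization time" line never showed up in the output file,
# assume the restart hung and kill picongpu on all nodes.
if ! grep -q "initialization time" output
then
    mpiexec --prefix "$MPIHOME" -x LIBRARY_PATH -x LD_LIBRARY_PATH -npernode 8 -n 64 killall -9 picongpu 2>/dev/null
fi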

It waits for 15 minutes and then kills picongpu if initialization has not finished (i.e. the restart hung).
Because I use k80_profileRestart.tpl, the job is then resubmitted and another restart attempt is made.
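A possible variation on the script above, as a minimal sketch only: instead of sleeping for the full 15 minutes, poll the output file so the watchdog can exit as soon as initialization is reported. The function name wait_for_marker and its arguments are hypothetical and not part of PIConGPU or of the script above.

```bash
#!/bin/bash

# wait_for_marker FILE MARKER TIMEOUT [INTERVAL]   (hypothetical helper)
# Returns 0 as soon as MARKER appears in FILE.
# Returns 1 if TIMEOUT seconds pass without MARKER showing up
# (a missing FILE is treated the same as "not found yet").
wait_for_marker() {
    local file=$1 marker=$2 timeout=$3 interval=${4:-10}
    local waited=0

    while [ "$waited" -lt "$timeout" ]; do
        if grep -q -- "$marker" "$file" 2>/dev/null; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done

    # One last look after the final sleep.
    if grep -q -- "$marker" "$file" 2>/dev/null; then
        return 0
    fi
    return 1
}
```

In the workaround this would replace the sleep and grep steps: call wait_for_marker output "initialization time" 900 30 and run the mpiexec ... killall line only if it returns non-zero.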

Thanks for the report.

Since we have not tested the fresh Hypnos CUDA 9 chain yet, do not use it for production yet please. The currently recommended picongpu.profile is in our manual (using CUDA 8) and under etc/picongpu/hypnos-hzdr.

The issue could come from the MPI(IO) or HDF5 layer of the new toolchain you experimented with. It could also just be a missing CUDA awareness issue again in MPI. Please report the full set of modules / picongpu.profile file you used so others can follow. It's likely an environment issue, less likely a libSplash issue and unlikely a PIConGPU issue.

Which commit do you use? 0.3.1?

@ax3l Sorry for the misleading title - I am using CUDA8 just with the new (CUDA9 capable) drivers on the k80 on hypnos. So there is no way to avoid this in production runs right now.

I am loading the following modules:

Currently Loaded Modulefiles:
  1) gcc/4.9.2                     4) cuda/8.0                      7) hdf5-parallel/1.8.15
  2) cmake/3.7.2                   5) openmpi/1.8.6.kepler.cuda80   8) libsplash/1.6.0
  3) boost/1.62.0                  6) pngwriter/0.5.6               9) editor/emacs/23.1

I am running on the following PIConGPU versions:

  • Backports for 0.3.1-rc2 #2165 8491da8d9f0cb14d11400bec7536a42c72b10480 (without any changes)
  • Backport fix #2139 to 0.3.0 30903b10ff775465ff4e55f27cd0d5ff5f42b017 with various changes on the TWTS pulse

There is currently an issue open for the IT team as well (number 7047). They just gave the following reply:
(translated from German)

In addition to the Nvidia driver 384.81, we updated the Ubuntu release on the k20 and k80 nodes to 14.04.5. Furthermore, Nvidia actually no longer supports Ubuntu 14.04, i.e. the driver installer is intended for version 16.04, but it ran through without errors.

However, I suspect the error is more likely in the area of HDF5 or libsplash. I would first recompile OpenMPI and the modules that follow it in the chain on the basis of cuda/8.0.

The following modules have now been recompiled for gcc/4.9.2 and cuda/8.0:

openmpi/1.8.6.kepler.cuda80
hdf5-parallel/1.8.15
libsplash/1.6.0
pngwriter/0.5.6
boost/1.62.0

Does this change the error behavior in any way?

Ah that makes sense, thanks.

Can you try to use the actual latest release, 0.3.1? There have been three further backports to the release branch after rc2. At first glance I do not see anything among them that affects you, but it is easier to support a release than a pre-release.

The error you report is a malloc.c stdlib issue and looks a lot like a driver/MPI/CUDA issue. We can only try to reproduce this with our default examples or smaller projects (outside PIConGPU).

Unfortunately, this really is a system acceptance test that needs to be done by IT before driver roll-outs, and if they do not have the coverage we cannot jump in and provide it from the user side. We will still try to debug your issue as well as possible since we now have to deal with it. Maybe they also just need to rebuild their openmpi-cuda8 module in case it links statically against some CUDA driver libs (dunno).

With the newly compiled modules, the setup still hangs during restart.

This was an issue with the last cluster update of hypnos (@HZDR cluster).
