Running PIConGPU on 8 GPUs with the example LWFA setup and increased resolution (see 8.cfg) results in the error (full output):
terminate called after throwing an instance of 'splash::DCException'
what(): Exception for DCDataSet [x] write: Failed to write dataset
when HDF5 raw data output is activated. The simulation finishes successfully with HDF5 disabled.
Looks similar to the crashes that @BeyondEspresso was reporting in https://github.com/ComputationalRadiationPhysics/libSplash/issues/160.
Thx for the report! Is that the first error that is thrown?
We had some interesting "can not open file" errors last week on our local filesystem. But maybe it's something in Splash or the way we use it...
Can you reproducibly trigger this to dig deeper? Can you try to increase the libSplash verbosity as described in https://github.com/ComputationalRadiationPhysics/libSplash/issues/160? On what system are you running? We are talking about libSplash 1.7.0 (yes, judging from the output)?
Can you post your mpiexec.tpl, and is your OpenMPI compiled CUDA-aware?
After following the spack install, I do:
export SPLASH_VERBOSE=99
pic-create $PIC_EXAMPLES/LaserWakefield $HOME/picInputs/h5LWFA
pic-build -b "cuda:60"
then increase the resolution and activate hdf5 output in 8.cfg (see 8.cfg.hdf5.patch) and submit with
tbg -s bash -c etc/picongpu/8.cfg -t etc/picongpu/bash/mpiexec.tpl $SCRATCH/runs/lwfa_wh5
Here is the full PIConGPU output with verbose libSplash.
The simulation _always_ crashes in the middle (17/43G) of writing $SCRATCH/runs/lwfa_wh5/simOutput/h5/simData_600.h5. The previous .h5 files are successfully written, but this one is only about half the size.
Turning off hdf5 output in 8.cfg makes the simulation complete successfully.
The system is a supermicro SYS-4028GR-TRT2 with dual Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz and 512 GB DDR4 RAM. It has 8 Tesla P100-PCIE@16GB (see nvidia-smi output) and is running Ubuntu 16.04.5 LTS. Full specs here. The system has no batch system installed.
List of installed spack packages and mpiexec.tpl. I have [email protected] and [email protected].
Regarding CUDA awareness, here is the OpenMPI info. It is _not_ CUDA aware, since it was configured as --without-cuda.
cc @psychocoderHPC maybe we still have a race condition or memory violation in our HDF5 plugin? Could also be in the MPI (or MPI-IO) stack itself, but looks interesting...
@berceanu if you recklessly switch hdf5 output with adios output, does that work? (we share little, recently added code to make both multi-plugins).
What's the HDF5 version you are using? (spack spec picongpu)
I installed ADIOS 1.13.1 and activated it in 8.cfg (see 8.cfg.adios.patch here). The simulation completes successfully in 1h, producing 1.8T of output :)
I tried both hdf5 1.10.3 and 1.10.4 and they crash with the above splash::DCException, at the same point in the simulation.
cc @psychocoderHPC when you are back at work today, can you take a look at this as well? Could this be something on our side or the MPI level? This is a single-node machine.
@berceanu Thanks for the report with all the useful information. Is it possible that you compile HDF5 with the debug-symbols compiler option -g and reproduce the crash? It is only possible to find the source of the issue if we know why it crashes within HDF5.
@ax3l I looked into the issue but do not see anything which can be the reason for the problem.
I think debug symbols won't help here for HDF5, although good to have them still.
But in order to dig deeper, we should output inside libSplash why the Exception is thrown.
I think it is this one:
https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/DCDataSet.cpp#L585-L586
We could first take the return code or internal HDF5 error stack of this H5Dwrite and put it into the HDF5 error interface (Example 12):
if (H5Dwrite(dataset, this->datatype, dsp_src, dataspace, dsetWriteProperties, data) < 0)
{
    H5Eprint(H5E_DEFAULT, stderr);
    throw DCException(getExceptionString("write: Failed to write dataset"));
}
With this patch, we might get the underlying HDF5 error printed.
@berceanu Thanks for the report with all the useful information. Is it possible that you compile HDF5 with the debug-symbols compiler option -g and reproduce the crash? It is only possible to find the source of the issue if we know why it crashes within HDF5.
Is it possible to add this compiler option via Spack and if so, how?
@ax3l I looked into the issue but do not see anything which can be the reason for the problem.
iirc, you guys have access to P100 nodes on Hemera, would it be useful to reproduce the issue there?
Is it possible to add this compiler option via Spack and if so, how?
yes, just do spack install picongpu cppflags="-g".
access to P100
That would help, but I can prepare you a little something to install the patched libSplash in spack. Give me a minute.
In your $SPACK_ROOT edit var/spack/repos/builtin/packages/libsplash/package.py and add the line
version('printHDF5ErrorStackOnWrite', branch='topic-printHDF5ErrorStackOnWrite', git='https://github.com/ax3l/libSplash.git')
In your src/spack-repo/packages/picongpu/package.py, loosen the constraint on libSplash:
depends_on('[email protected],printHDF5ErrorStackOnWrite', when='+hdf5')
Then just do
# add what you usually put here, compilers, etc. ---v
spack install picongpu ^libsplash@printHDF5ErrorStackOnWrite
To load,
# add what you usually put here, compilers, etc. ---v
spack load picongpu ^libsplash@printHDF5ErrorStackOnWrite
Do I need to uninstall the current version of PIConGPU or will this just install another one alongside it, and I can select which one I want to load each time?
This will install another PIConGPU alongside. You can always spack uninstall <spec> when it gets too convoluted with parallel installations later on.
(As a side note, it will reuse most of the dependencies in this case under the hood.)
@berceanu did you get more HDF5 output that way? :)
@ax3l Should I also set SPLASH_VERBOSE=99?
Yes, keep it all on so we have the full info.
OK, so I did
spack install picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected]
and then
spack load picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected]
export SPLASH_VERBOSE=99
followed by the usual pic-create, pic-build and tbg.
The crash happens in the same place. Here is the full output with verbose libSplash from topic-printHDF5ErrorStackOnWrite branch.
The interesting part is here:
HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 1:
#000: H5Dio.c line 336 in H5Dwrite(): can't write data
major: Dataset
minor: Write failed
#001: H5Dio.c line 828 in H5D__write(): can't write data
major: Dataset
minor: Write failed
#002: H5Dmpio.c line 893 in H5D__chunk_collective_write(): write error
major: Dataspace
minor: Write failed
#003: H5Dmpio.c line 816 in H5D__chunk_collective_io(): couldn't finish linked chunk MPI-IO
major: Low-level I/O
minor: Can't get value
#004: H5Dmpio.c line 1167 in H5D__link_chunk_collective_io(): couldn't finish MPI-IO
major: Low-level I/O
minor: Can't get value
#005: H5Dmpio.c line 2057 in H5D__final_collective_io(): optimized write failed
major: Dataset
minor: Write failed
#006: H5Dmpio.c line 426 in H5D__mpio_select_write(): can't finish collective parallel write
major: Low-level I/O
minor: Write failed
#007: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#008: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#009: H5Faccum.c line 826 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#010: H5FDint.c line 258 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#011: H5FDmpio.c line 1844 in H5FD_mpio_write(): file write failed
major: Low-level I/O
minor: Write failed
terminate called after throwing an instance of 'splash::DCException'
what(): Exception for DCDataSet [x] write: Failed to write dataset
Thanks for the details. I am currently looking into the issue.
@berceanu can you please post the output of
spack spec picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected]
so we know what dependencies, e.g. for MPI were taken?
@berceanu which filesystem are you writing to? Local or network? What's the FS type, EXT4?
As a side-test, can you switch to MPICH instead of the spack-default OpenMPI and verify the bug is seen there as well?
spack install picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected] ^mpich
spack load picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected] ^mpich
export SPLASH_VERBOSE=99
Could still be a memory violation on our side and we are getting lucky with another implementation, but maybe it's also just an HDF5 or MPI-IO bug.
@berceanu Could you please post the PIConGPU output with the verbose splash information from the last crashed run where you already posted the call stack.
output is already in https://github.com/ComputationalRadiationPhysics/picongpu/issues/2777#issuecomment-436685063
so we know what dependencies, e.g. for MPI were taken?
here are the dependencies
@berceanu which filesystem are you writing to? Local or network? What's the FS type, EXT4?
local filesystem /dev/sdc1, partition type ext4
As a side-test, can you switch to MPICH instead of the spack-default OpenMPI and verify the bug is seen there as well?
==> Installing mpich
[...]
==> Error: ProcessError: Command exited with status 1:
'/home/andrei/src/spack/var/spack/stage/mpich-3.2.1-q7xvsy7ehr7ztxrowim7mdm7t4x4wsz6/mpich-3.2.1/configure' '--prefix=/home/andrei/src/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-7.3.0/mpich-3.2.1-q7xvsy7ehr7ztxrowim7mdm7t4x4wsz6' '--enable-shared' '--with-pm=hydra' '--with-pmi=yes' '--enable-romio' '--without-ibverbs' '--with-device=ch3:nemesis:tcp'
1 error found in build log:
422 checking for type of weak symbol alias support... pragma weak
423 checking whether __attribute__ ((weak)) allowed... yes
424 checking whether __attribute__ ((weak_import)) allowed... yes
425 checking whether __attribute__((weak,alias(...))) allowed... yes
426 checking for multiple weak symbol support... yes
427 checking for shared library (esp. rpath) characteristics of CC... done (results in src/env/cc_shlib.conf)
>> 428 configure: error: F90 and F90FLAGS are replaced by FC and FCFLAGS respectively in this configure, please unset F90/F90FLAGS and set FC/FCFLAGS instead and rerun configure again.
Hm, can't reproduce the MPICH install issue. Can you unset F90 and unset F90FLAGS before spack install ...?
Yeah, that seems to work. Now I get
==> Error: FetchError: All fetchers failed for libsplash-printHDF5ErrorStackOnWrite-wj27hgbb5j4qcajb526a26gy3hcin4n3
Guess it's because it got merged?
So what do I need to use instead of ^libsplash@printHDF5ErrorStackOnWrite ?
Oops, I deleted that branch already... Sorry, can you just replace @printHDF5ErrorStackOnWrite with @develop? (cmd line and package.py)
(P.S.: branch restored for convenience as well, I wasn't thinking.)
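Concretely, the substitution would look like this (a sketch; keep whatever compiler and variant flags you normally use):
spack install picongpu backend=cuda cppflags="-g" ^libsplash@develop
and change the version in the depends_on() line of your picongpu package.py accordingly.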
Running with MPICH:
Running program...
[mpiexec@ServerS] match_arg (utils/args/args.c:159): unrecognized argument am
[mpiexec@ServerS] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@ServerS] parse_args (ui/mpich/utils.c:1596): error parsing input array
[mpiexec@ServerS] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1648): unable to parse user arguments
[mpiexec@ServerS] main (ui/mpich/mpiexec.c:149): error parsing parameters
Here's the spec.
Ah, your .tpl file is tuned for OpenMPI I guess; just remove the -am argument maybe (there is no InfiniBand that needs tweaking here).
$ tbg -s bash -c etc/picongpu/8.cfg -t etc/picongpu/bash/mpiexec.tpl $SCRATCH/runs/mpich
Running program...
[proxy:0:0@ServerS] HYDU_create_process (utils/launch/launch.c:75): execvp error on file /Date2/andrei/runs/mpich/tbg/openib.conf (Permission denied)
$ ls -lsa /Date2/andrei/runs/mpich/tbg/openib.conf
4 -rw-r--r-- 1 andrei andrei 232 nov 9 18:34 /Date2/andrei/runs/mpich/tbg/openib.conf
Yes, the -am .../openib.conf option needs to be skipped, including its argument :) Sorry, I was too brief in my description.
Oh, my bad, should've checked the mpirun manpage :)
Running program...
[mpiexec@ServerS] match_arg (utils/args/args.c:159): unrecognized argument mca
[mpiexec@ServerS] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@ServerS] parse_args (ui/mpich/utils.c:1596): error parsing input array
[mpiexec@ServerS] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1648): unable to parse user arguments
[mpiexec@ServerS] main (ui/mpich/mpiexec.c:149): error parsing parameters
I could remove the --mca mpi_leave_pinned 0 switch but wouldn't that affect https://github.com/ComputationalRadiationPhysics/picongpu/issues/2782#issuecomment-435053460 ?
These last few messages made me realise that, even though the docs specify OpenMPI 1.7+ / MVAPICH2 1.8+ or similar as dependencies, the file etc/picongpu/bash/mpiexec.tpl is actually hard-coded for OpenMPI.
Yes, it's really bad that they are hardcoded. The problem for devs on our side is that MPI has a defined C API but no standardized command-line parameters... they differ heavily.
--mca mpi_leave_pinned 0 is quite important if the built MPI is not CUDA-aware, because in that case the communication with the network card also pins and, worse, unpins the pinned buffers that we need for device communication. Maybe this MPICH equivalent should therefore be set: https://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-November/004121.html
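If I read the linked thread correctly, the MVAPICH2 counterpart would be the following; this is an assumption based on that mail, untested here:
export MV2_USE_LAZY_MEM_UNREGISTER=0  # assumption: disable MVAPICH2's lazy unregistration (registration cache)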
I got more insight into the original issue. It seems that the crash depends on the value of the --hdf5.period switch.
So far I only tried --hdf5.period 100, which is the default value of the LWFA example.
For a value of 200, the simulation still crashes, but for 500 and above there is no crash.
Was the last crash with --hdf5.period 200 seen with MPICH or OpenMPI?
Oh, this is still all OpenMPI; MPICH seems too complicated to make work.
I also have trouble dumping particles like this, but on Hemera (HZDR).
@psychocoderHPC I think there is a particle preparation bug or something. My case is quite late in a sim (step 75'000); when I dump particles for the 4th time, it hangs.
@berceanu it's ok not to try MPICH any further, looks like we don't have great templates for it in place yet.
@ax3l We somehow need a reproducible parameter set, else it will not be possible to debug the issue :-(
I'll try to create a restart step for you close to the issue I can trigger. But this would be on Hemera, and there it does not throw the beautiful error we have here but just hangs. I also see a .loc file close to it and am currently waiting for an HPC response on whether some kind of antivirus, etc. might be the reason on Hemera.
.loc is created by HDF5 and is visible if you crash during the write without closing the HDF5 file.
Yep, and there is nothing else running either.
I'm trying to create a reproducer on Hemera by tomorrow. If this does not help, maybe @berceanu can create an ssh account for us on his machine, unless @psychocoderHPC wants to try this first, since it's a small example in @berceanu's case (LWFA default).
For my part, it's not a problem to give you guys access if you think that helps.
@psychocoderHPC is not at work tomorrow and I'll travel on Thursday. But if you can already create huebl and widera accounts for us (via mail), that would help.
Here is my SSH pub key: https://keybase.pub/ax3l/ssh_public_keys.txt
@berceanu I have now moved from your system back to our hypnos. I can reproduce it on our system with the simulation sizes you provided (-g 512 2048 512 -d 2 4 1)
@psychocoderHPC Did you manage to reproduce the issue on the Hypnos cluster?
@berceanu just be warned, today we have a public holiday in Saxony :)
@berceanu Yes, I can reproduce the issue on our system. I am running different cases on hypnos and trying to get a clue what's going wrong.
If you change your simulation size or the number of GPUs per direction, you can currently work around the bug.
Decreasing the hdf5 output frequency is another workaround. I'm curious whether this also happens with other MPI implementations...
I have now been debugging this issue for a week and found https://support.hdfgroup.org/HDF5/faq/limits.html
What is the limit on the amount of data you can read in Parallel HDF5?
The limit is 2 GB because the MPI standard specifies that the 'count' parameter passed to MPI_File_read() be a 32-bit integer.
And if the maximum size for a read is 2 GB, then maybe the maximum size for a write per dataset is also 2 GB.
https://portal.hdfgroup.org/pages/viewpage.action?pageId=48809737
I would say we need at least HDF5 1.10.2, but since the HDF5 version on @berceanu's system is 1.10.4, I do not know why it is crashing there.
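A quick sanity check on the numbers involved here (assuming 4-byte ints and the 8 ranks used above): 134217729 elements per rank are 134217729 × 4 B = 536870916 B, i.e. about 512 MiB per rank and about 4 GiB accumulated. Each rank's write is therefore far below the FAQ's 2 GB per-call limit, so that limit alone cannot explain the crash.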
I verified that HDF5/splash is the source of all the pain. I manipulated PIConGPU so that we write X elements per GPU (4 bytes per element) with 8 GPUs, such that the accumulated dataset was <4 GiB and >4 GiB.
X           accumulated bytes   status
134217727   <4 GiB              OK
134217728   =4 GiB              OK
134217729   >4 GiB              Fail
The reason why 134217728 is working is that the size is too large but there is no access after 4 GiB.
Task todo: write a mini app to check whether this behaviour comes from libSplash or pure HDF5.
https://portal.hdfgroup.org/pages/viewpage.action?pageId=48808714 is saying
What is the limit on the chunk size?
The current maximum number of bytes in a chunk is 2^32-1 (4 GB). As a result of this restriction, the number of elements in a chunk cannot be greater than 4 GB. You must also account for the datatype size of each element.
The chunk size can be larger than a dataset's dimension if the dataset's maximum dimension sizes are declared as unlimited or if the chunk size is less than the maximum dimension size. The chunk size for fixed-size datasets cannot exceed the size of a fixed-size dataset. For example, a dataset consisting of a 5 x 4 fixed-size array cannot be defined with 10 x 10 chunks, as the following error will occur:
but since each MPI rank has its own dataset this should not be the limitation.
Tested with HDF5 1.8.20 on hemera
Gosh... it's really a too-long dimension; it's not even about chunks or size:
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    int* data, size_t len)
{
    // property list
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    // MPI-I/O driver
    H5Pset_fapl_mpio(plist_id, comm, info);

    // file create
    char file_name[100];
    sprintf(file_name, "%zu", len);
    strcat(file_name, ".h5");
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,
                              H5P_DEFAULT, plist_id);

    // dataspace
    hsize_t dims[1] = {len};
    hsize_t max_dims[1] = {len};
    // hsize_t* max_dims = NULL;
    hid_t filespace = H5Screate_simple(1,
                                       dims,
                                       max_dims);

    // chunking
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);

    // write
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE);
    // H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default
    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_INT,
                      H5S_ALL, filespace, dset_plist_id, data);

    // close all
    status = H5Pclose(plist_id);
    status = H5Pclose(dset_plist_id);
    status = H5Dclose(dset_id);
    status = H5Fclose(file_id);

    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Info info = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);

    size_t lengths[3] = {134217727u, 134217728u, 134217729u};
    for( size_t i = 0; i < 3; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);

        int* data = malloc(len * sizeof(int));
        for( size_t k = 0; k < len; ++k )
            data[k] = 420;

        write_HDF5(comm, info, data, len);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }

    MPI_Finalize();
    return 0;
}
int main(int argc, char* argv[])
{
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Info info = MPI_INFO_NULL;
MPI_Init(&argc, &argv);
size_t lengths[3] = {134217727u, 134217728u, 134217729u};
for( size_t i = 0; i < 3; ++i )
{
size_t len = lengths[i];
printf("Writing for len=%zu ...\n", len);
int* data = malloc(len * sizeof(int));
for( size_t k=0; k<len; ++k)
data[k] = 420;
write_HDF5(comm, info, data, len);
free(data);
printf("Finished write for len=%zu ...\n", len);
}
MPI_Finalize();
return 0;
}
$ h5pcc phdf5.c && ./a.out
Writing for len=134217727 ...
Finished write for len=134217727 ...
Writing for len=134217728 ...
Finished write for len=134217728 ...
Writing for len=134217729 ...
HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 0:
#000: H5Dio.c line 336 in H5Dwrite(): can't write data
major: Dataset
minor: Write failed
#001: H5Dio.c line 828 in H5D__write(): can't write data
major: Dataset
minor: Write failed
#002: H5Dmpio.c line 671 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
major: Low-level I/O
minor: Write failed
#003: H5Dmpio.c line 2013 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
major: Low-level I/O
minor: Can't get value
#004: H5Dmpio.c line 2057 in H5D__final_collective_io(): optimized write failed
major: Dataset
minor: Write failed
#005: H5Dmpio.c line 426 in H5D__mpio_select_write(): can't finish collective parallel write
major: Low-level I/O
minor: Write failed
#006: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#007: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#008: H5Faccum.c line 826 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#009: H5FDint.c line 258 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#010: H5FDmpio.c line 1844 in H5FD_mpio_write(): file write failed
major: Low-level I/O
minor: Write failed
Finished write for len=134217729 ...
$ du -hs 13421772*
513M 134217727.h5
513M 134217728.h5
4,0K 134217729.h5
Fun fact: we found a dead range: [134217727u + 2; 134217727u + 9] triggers the error.
Just for your interest, this is how I valgrind-ed PIConGPU over the weekend with this script (it only helps if the memory violation is triggered at runtime and not already present in the first, suppression-generating run):
valgrind --leak-check=full --show-reachable=yes --error-limit=no --gen-suppressions=all --log-file=picongpu_all.log \
./bin/picongpu -d 1 1 1 -g 12 80 12 --periodic 1 1 1 -s 10
cat picongpu_all.log | ./parse_valgrind_suppressions.sh > picongpu_all.supp
valgrind --leak-check=full --show-reachable=yes --error-limit=no --suppressions=./picongpu_all.supp --log-file=picongpu_hdf5.log \
./bin/picongpu -d 1 1 1 -g 12 80 12 --periodic 1 1 1 -s 10 --hdf5.period 10 --hdf5.file simData
For the example above, restricted to only length 134217736u, I get from valgrind:
valgrind --leak-check=full --show-reachable=yes --error-limit=no --gen-suppressions=all --log-file=ph5_all.log ./a.out
==28325== Warning: set address range perms: large range [0x96bd040, 0x296bd03c) (undefined)
==28325== Syscall param pwritev(vector[...]) points to uninitialised byte(s)
==28325== at 0x5CCEF53: ??? (syscall-template.S:84)
==28325== by 0x5656FD9: mca_fbtl_posix_pwritev (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x5612873: mca_common_ompio_file_write (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x561232D: mca_common_ompio_file_write_at_all (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x56A2B25: mca_io_ompio_file_write_at_all (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x55ED9C7: PMPI_File_write_at_all (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x36A983: H5FD_mpio_write (in /home/axel/a.out)
==28325== by 0x1B7C94: H5FD_write (in /home/axel/a.out)
==28325== by 0x388873: H5F__accum_write (in /home/axel/a.out)
==28325== by 0x265294: H5PB_write (in /home/axel/a.out)
==28325== by 0x1A8F2A: H5F_block_write (in /home/axel/a.out)
==28325== by 0x36847C: H5D__mpio_select_write (in /home/axel/a.out)
==28325== Address 0x116bd03f is 134,217,727 bytes inside a block of size 536,870,908 alloc'd
==28325== at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==28325== by 0x114345: main (in /home/axel/a.out)
...
(Cannot stably reproduce this diagnostic when trying further...)
My little MPI-IO test program works xD
$ mpicc pmpio.c && ./a.out && du -hs testfile
Writing for len=134217736 ...
Finished write for len=134217736 ...
Writing for len=134217728 ...
Finished write for len=134217728 ...
Writing for len=134217729 ...
Finished write for len=134217729 ...
Writing for len=134217736 ...
Finished write for len=134217736 ...
Writing for len=134217737 ...
Finished write for len=134217737 ...
513M testfile
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int write_MPIO(
    MPI_Comm const comm, MPI_Info const info,
    int* data, size_t len)
{
    int myrank = 0;
    MPI_File thefile;

    // open (and create) a shared file for writing
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &thefile);

    // each rank writes at its own offset (single rank here, so offset 0)
    MPI_File_set_view(thefile, myrank * len * sizeof(int),
                      MPI_INT, MPI_INT, "native",
                      MPI_INFO_NULL);
    MPI_File_write(thefile, data, len, MPI_INT,
                   MPI_STATUS_IGNORE);

    MPI_File_close(&thefile);
    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Info info = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);

    size_t lengths[5] = {134217736u, 134217728u, 134217729u, 134217736u, 134217737u};
    for( size_t i = 0; i < 5; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);

        int* data = malloc(len * sizeof(int));
        for( size_t k = 0; k < len; ++k )
            data[k] = 420;

        write_MPIO(comm, info, data, len);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }

    MPI_Finalize();
    return 0;
}
You guys are heroes! :)
We still have absolutely no idea what's going on there xD Ahhh.
Yea but you've put so much work into this!
The error occurs exactly at the boundary of H5S_MAX_MPI_COUNT, as we just debugged (H5Smpio.c, in function H5S_mpio_all_type).
Note: the magic limit H5S_MAX_MPI_COUNT = 536870911 comes from the fact that MPI counts are ints; it is sized for 4-byte types, not for char as HDF5 uses. The correct limit for char is (2^31-1) == 2147483647; if we divide this by the size of e.g. int, we get floor(536870911.75) == 536870911, our magic border.
This only clears up where the magic number comes from, but not why we have the issue, because it should be safe to use any count smaller than 2^31-1 if we do byte writes with MPI file I/O.
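Cross-checking against the reproducer sizes (my arithmetic, assuming 4-byte ints): 134217729 ints are 536870916 bytes per rank, only a few bytes past the 536870911 border, so the failures show up right where HDF5 starts taking its type-splitting path.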
True, forgot to write.
One can now of course move the number, but I guess there is a logic bug hidden somewhere in the splitting/new-type creation that could re-appear for other user-defined types in different situations later on.
I am trying to push this further up the HDF5 issue escalation chain, but they allow themselves 2 weeks for even answering "community" reports...
I am looking so much forward to switching to ADIOS2 only.
@ax3l I played around a little bit and think the bug is later in the hdf5 chain. I changed H5S_MAX_MPI_COUNT but it still crashes at e.g. 134217727u + 5u elements and each multiple of it.
IMO the type creation is fine. I think it has something to do with the collective buffered writes or their preparation.
Could then also be in the MPI-I/O (ROMIO) layer if we are unlucky.
I extended your MPI code to reproduce the issue. https://gist.github.com/psychocoderHPC/dfcc0ebe7c2547d1e0e40a14cbaf6072
The current problem in the extended code snippet is that MPI_Get_elements_x is not reporting the correct number of written bytes back, but if you check the file size you can see that only 17 bytes are written. NOTE: you must remove the file each time, else the file size is wrong.
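For reference, the byte-count check in the extended snippet presumably looks roughly like this (a sketch of the approach, not the exact gist code; MPI_Get_elements_x is the MPI-3 call for querying how many basic elements a completed operation transferred):
// write with a real status object instead of MPI_STATUS_IGNORE ...
MPI_Status status;
MPI_Count written = 0;
MPI_File_write(thefile, data, (int)len, MPI_INT, &status);
// ... then ask MPI how many bytes it claims to have transferred
MPI_Get_elements_x(&status, MPI_BYTE, &written);
printf("byte %lld %zu\n", (long long)written, len * sizeof(int));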
Can you please post the output you expect vs. the one you get? (If you add .c/.cpp to your gist's file name, syntax highlighting will be enabled.)
you must remove the file each time
open with truncation to avoid this?
The output is:
rm testfile; ./a.out ; ls -la testfile
Writing for len=134217732 ...
left over: 17
byte 140737488325576 536870928
Finished write for len=134217732 ...
-rw-r--r-- 1 widera fwt 16 Nov 30 16:49 testfile
140737488325576 is the number of written bytes. That value is currently broken and IMO an issue on my side.
In the last line 16 is the size of the file.
Here is the output if I use 134217727u + 10u:
rm testfile; ./a.out ; ls -la testfile
Writing for len=134217737 ...
left over: 37
byte 140737488325576 536870948
Finished write for len=134217737 ...
-rw-r--r-- 1 widera fwt 536870948 Nov 30 16:52 testfile
They added the bug now as HDFFV-10638.
The issue is very likely OpenMPI specific, as I cannot reproduce it when using MPICH 3.3 https://github.com/open-mpi/ompi/issues/6285
That's good to know, I will try installing MPICH. So, what is the failsafe way of running the code now? Is it MPICH + adios, or is MPICH + hdf5 good enough?
Yes, both should work! You might need to slightly adjust your .tpl for mpirun/exec options. Those are, unfortunately, not standardized in the MPI standard.
Oh yes, I remember running into that issue as I was trying to test the setup with MPICH. Do you have an example working .tpl?
New temporary work-around surfaced: OpenMPI also ships a ported ROMIO implementation for IO. You can switch the OpenMPI I/O backend away from its default to ROMIO (same as in MPICH) via:
mpirun --mca io ^ompio ...
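In a tbg setup that simply means adding --mca io ^ompio to the mpirun/mpiexec flags in etc/picongpu/bash/mpiexec.tpl (the exact line depends on your template).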
Who would have known the rabbit hole runs so deep :)
I just saw the PR.