PIConGPU: HDF5: Field "Kinks" with FreeFormula

Created on 19 Dec 2018 · 22 Comments · Source: ComputationalRadiationPhysics/picongpu

I am seeing density-"kinks"/dips at multiples of about 1024 cells in one of my simulations:

density_kink

Looks like either an initialization or an HDF5 output bug to me, since I later see the same dip not only in the density but also in the electric fields, and it stays at the same y over time. Or something with our field strides...

The same output dip is seen for multiples of this position at later times when particles move into the simulation box.

Cells in y: 14720
GPUs in y: 4
GPUs in x: 4
y-cells per GPU: 3680 (<- not at a GPU boundary)

Simulation: 2D
Supercell: 16x16

Particles per cell: 20
Init in cell: random
Shape: PCS

Software / System

PIConGPU 0.4.1 on Hypnos (K80) and Hemera (P100).

Files: (per mail)

Labels: affects latest release, bug, plugin


All 22 comments

Zooming further in on the individual cell with a linear y axis, it looks like it is really just one cell at N*1024, and it does not go to zero but to some lower value:
density_kink2

Could you please provide the function you used to create the density?
Is the shown density plot the initial density at time step zero?

Could you please also provide the version of PIConGPU, and the commit hash if it is the dev branch.

It's PIConGPU 0.4.1 and I sent you a mail with the details :)
Yep, this is the t=0 plot for density (grey is what it should be, red is what I see in the fields).

But since this shows up weirdly again later on in time steps where the target expanded, I suspect it is in the field dumping routines (E and density!) somehow...

We can also see it in the transversal direction directly in the HDF5 viewer (zoom in to see the one cell large holes)

ax3l_density

Perfect catch, we saw that indeed the whole front is moved by one cell as well, occurring in sections at transversal GPU boundaries (of which I have four). Since I averaged transversely in the images above, the artifact was not perfectly zero, as one can see now.

Could be a host-to-device preparation or a splash dump issue.

I have hacked the E-field calculation and set the border to 100 (red), the core to 50 (green), and the guard to 1 (gray).
The image shows many wrong features in the composed output: black lines (where the value is zero) and several other artifacts.

fail

Is this a PNG plugin output or also HDF5?

This output is HDF5. I also tried ADIOS and cannot find such artifacts.

So it's hopefully just a libSplash bug, since the deviceToHost functionality for fields is identical in our HDF5 & ADIOS plugins.
Critically, this will affect restarts when ADIOS is not compiled in (it is used for checkpoints by default, if found).

The reason for this bug is https://github.com/ComputationalRadiationPhysics/libSplash/blob/140d6ce903d265909c3f8021d5d9cc04e5a72d37/src/DCDataSet.cpp#L220. Somehow the chunking is creating these artefacts.
If I disable chunking all works fine.

Ouch. What do we do about this? Looks like a HDF5 bug to me since I can't see a limitation that we violate:
https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/

besides that our chunks are a little small for parallel I/O:
https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/include/splash/core/DCHelper.hpp#L87-L101
https://support.hdfgroup.org/HDF5/faq/parallel.html

We can of course just disable the whole feature in libSplash, reducing some functionality such as serial compression and extensible data sets (off by default).

I also read all the documentation I could find and cannot see anything we did wrong. I am currently not sure whether we should disable chunking. Maybe we should first check other HDF5 versions and create a mini app.

As suggested by @ax3l, I created an example based on this code snippet.
I opened a request in the HDF5 forum: https://forum.hdfgroup.org/t/hdf5bug-h5fd-mpio-collective-chunking/5279

compile:

  • broken: mpicc -g main.cpp -lhdf5 -L$HDF5_ROOT/lib && mpiexec -n 16 ./a.out
  • fix: mpicc -g main.cpp -lhdf5 -L$HDF5_ROOT/lib -DFIX && mpiexec -n 16 ./a.out

The problem is H5FD_MPIO_COLLECTIVE combined with chunking. The fix uses H5FD_MPIO_INDEPENDENT. Another fix is to disable chunking.

#include <mpi.h>
#include <hdf5.h>

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 1872llu
#define Y 1872llu

int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    float* data, size_t len, int rank)
{
    // property list
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);

    // MPI-I/O driver
    H5Pset_fapl_mpio(plist_id, comm, info); 

    // file create
    char file_name[100];
    sprintf(file_name, "%zu", len);
    strcat(file_name, ".h5");
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,  
                              H5P_DEFAULT, plist_id); 

    // dataspace
    hsize_t dims[2] = {Y, X};                // per-rank extent (2D)
    hsize_t globalDims[2] = {Y * 4, X * 4};  // 4x4 rank grid
    hsize_t max_dims[2] = {Y * 4, X * 4};
    hsize_t offset[2] = {rank/4 * Y, rank%4 * X};

    hid_t srcSize = H5Screate_simple(2, dims, NULL);
    hid_t filespace = H5Screate_simple(2,
        globalDims,
        max_dims);

    printf("%i: %llu,%llu %llu,%llu \n", rank,offset[0],offset[1],globalDims[0],globalDims[1]);

    // chunking
    hsize_t chunk[2] = {128, 128};
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(datasetCreationProperty, 2, chunk);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_FLOAT,  
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);

    // write
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);

#ifdef FIX
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default
#else
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE); 
#endif

    hid_t dd = H5Dget_space(dset_id);
    H5Sselect_hyperslab(dd, H5S_SELECT_SET, offset,
                        NULL, dims, NULL);

    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_FLOAT, 
                      srcSize, dd, dset_plist_id, data); 

    // close all
    status = H5Sclose(dd);
    status = H5Sclose(srcSize);
    status = H5Sclose(filespace);
    status = H5Pclose(datasetCreationProperty);
    status = H5Pclose(plist_id);
    status = H5Pclose(dset_plist_id);
    status = H5Dclose(dset_id);
    status = H5Fclose(file_id);
    (void)status; // silence set-but-unused warning

    return 0;
}

int main(int argc, char* argv[])
{

    MPI_Comm comm = MPI_COMM_WORLD; 
    MPI_Info info = MPI_INFO_NULL;  

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t lengths[1] = {X*Y};
    for( size_t i = 0; i < 1; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);
        float* data = (float*)malloc(len * sizeof(float));
        for( size_t y = 0; y < Y; ++y)
            for( size_t x = 0; x < X; ++x)
                data[y * X + x] = 100.f + y % 1024; // row-major: X columns per row (X == Y here)

        write_HDF5(comm, info, data, len, rank);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }

    MPI_Finalize();

    return 0;
}

screenshot from 2019-01-10 14-07-54

Thanks, great reproducer. In libSplash we also default to collective writing. We know that independent writes are broken in other ways, e.g. they sometimes deadlock and have bad performance if I remember correctly.

@ax3l could you have a look at my report in the HDF5 forum and reproduce https://forum.hdfgroup.org/t/hdf5bug-h5fd-mpio-collective-chunking/5279/4

It looks like some grid sizes are also broken even if I use my workaround with H5FD_MPIO_INDEPENDENT or with chunking disabled.

I need an additional pair of eyes to validate that I did not do something wrong.

I solved my issue with the small-size arrays and updated the example code here and in the HDF5 issue.
I had missed that 130 is not a multiple of 4, and therefore the input data was not well initialized.

I would not be surprised if the following constraint applies in HDF5:
"your per-rank size must be a multiple of the chunk size (per dimension)." That would effectively let HDF5 dictate how you are allowed to domain-decompose if you want to write more than 4 GByte per rank and dataset... Let's see.

Suggestions to improve the report; it currently reads confusingly:

You are mixing the terms "threads" and "MPI ranks" in your HDF5 forum post.
The term "cell" is not known to people outside our community; please write "offset of one" or "by one element" instead.

Please remove all wrong text and wrong images; do not cross them out. It will confuse readers, who will read this post next week at the earliest anyway. You want them to reproduce only the actual bug demonstrator, not usage errors.

Please report whether this is seen for the HDF5 1.10 releases as well; they will not fix 1.8 with high priority.

The issue is very likely OpenMPI specific, as I cannot reproduce it with MPICH 3.3 https://github.com/open-mpi/ompi/issues/6285

New temporary work-around surfaced: OpenMPI also ships a ported ROMIO implementation for IO. We can switch the OpenMPI I/O backend away from its defaults to ROMIO (same as in MPICH) via:

mpirun --mca io ^ompio ...

This issue is worked around in #2857 and reported upstream to OpenMPI. (Affects releases 2.0-4.0.)

