I am seeing density-"kinks"/dips at multiples of about 1024 cells in one of my simulations:

Looks like either an initialization or HDF5 output bug to me, since I can see later on the same dip not only in density but also in electric fields, and it stays at the same y over time. Or something with our field strides...
The same output dip is seen at multiples of this position at later times when particles move into the simulation box.
Cells in y: 14720
GPUs in y: 4
GPUs in x: 4
y-cells per GPU: 3680 (<- not at a GPU boundary)
Simulation: 2D
Supercell: 16x16
Particles per cell: 20
Init in cell: random
Shape: PCS
PIConGPU 0.4.1 on Hypnos (K80) and Hemera (P100).
Files: (per mail)
Zooming further in onto the individual cell with linear y axis, it looks like it is really just one cell at N*1024 and does not go to zero but to some lower value:

Could you please provide the function you used to create the density?
Is the shown density plot the initial density at time step zero?
Could you please also provide the version of PIConGPU, and the commit hash if it is the dev branch?
It's PIConGPU 0.4.1 and I sent you a mail with the details :)
Yep, this is the t=0 plot for density (grey is what it should be, red is what I see in the fields).
But since this shows up again later, in time steps where the target has expanded, I suspect it is somehow in the field dumping routines (E and density!)...
We can also see it in the transversal direction directly in the HDF5 viewer (zoom in to see the one-cell-large holes):

Perfect catch, we saw that indeed the whole front is moved by one cell as well, occurring in sections at transversal GPU boundaries (of which I have four). Since I averaged transversely for the images above, my artifact was not perfectly zero, as one can see now.
Could be a host-to-device preparation or a splash dump issue.
I have hacked the EField calculation and set the border to 100 (red), the core to 50 (green) and the guard to 1 (gray).
The image shows that there are many wrong features in the composed image: the black lines (where the value is zero) and many other issues.

Is this a PNG plugin output or also HDF5?
This output is HDF5. I also tried ADIOS and cannot find such artifacts.
So it's hopefully just a libSplash bug since deviceToHost functionality for fields is identical in our HDF5 & ADIOS plugins.
Critically, this will also affect restarts when ADIOS is not compiled in (ADIOS is used for checkpoints by default, if found).
The reason for this bug is https://github.com/ComputationalRadiationPhysics/libSplash/blob/140d6ce903d265909c3f8021d5d9cc04e5a72d37/src/DCDataSet.cpp#L220. Somehow the chunking is creating these artefacts.
If I disable chunking all works fine.
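For context, "disabling chunking" here means creating the dataset with HDF5's default contiguous layout, i.e. simply skipping the H5Pset_chunk call. A minimal sketch of the dataset-creation property list (the variable names are illustrative, not libSplash's actual code):

```c
// Dataset-creation property list: contiguous (default) vs. chunked layout.
// Not calling H5Pset_chunk leaves the dataset contiguous, which avoids the
// artifacts -- at the cost of compression and extensible datasets.
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
/* chunked variant (what triggers the artifacts):
hsize_t chunk[2] = {128, 128};
H5Pset_chunk(dcpl, 2, chunk);
*/
H5Pset_layout(dcpl, H5D_CONTIGUOUS); // explicit, though contiguous is the default
```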
Ouch. What do we do about this? Looks like an HDF5 bug to me, since I can't see a limitation that we violate:
https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/
besides that our chunks are a little small for parallel I/O:
https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/include/splash/core/DCHelper.hpp#L87-L101
https://support.hdfgroup.org/HDF5/faq/parallel.html
We can of course just disable the whole feature in libSplash, which would cost some functionality such as serial compression and extensible datasets (both off by default).
I also read all the documentation I could find and cannot see anything we did wrong. I am currently not sure if we should disable chunking. Maybe we should first check other HDF5 versions and create a mini app.
As suggested by @ax3l, I created an example based on the code snippet below.
I opened a request in the HDF5 forum: https://forum.hdfgroup.org/t/hdf5bug-h5fd-mpio-collective-chunking/5279
Compile and run:
mpicc -g main.cpp -lhdf5 -L$HDF5_ROOT/lib && mpiexec -n 16 ./a.out
mpicc -g main.cpp -lhdf5 -L$HDF5_ROOT/lib -DFIX && mpiexec -n 16 ./a.out
The problem is H5FD_MPIO_COLLECTIVE combined with chunking. The fix uses H5FD_MPIO_INDEPENDENT. Another fix is to disable chunking.
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 1872llu
#define Y 1872llu

int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    float* data, size_t len, int rank)
{
    // file access property list with MPI-I/O driver
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(plist_id, comm, info);

    // create file, named after the local array length
    char file_name[100];
    sprintf(file_name, "%zu.h5", len);
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,
                              H5P_DEFAULT, plist_id);

    // local and global (4x4 ranks) dataspaces
    hsize_t dims[2] = {Y, X};
    hsize_t globalDims[2] = {Y * 4, X * 4};
    hsize_t max_dims[2] = {Y * 4, X * 4};
    hsize_t offset[2] = {rank / 4 * Y, rank % 4 * X};
    hid_t srcSize = H5Screate_simple(2, dims, NULL);
    hid_t filespace = H5Screate_simple(2, globalDims, max_dims);
    printf("%i: %llu,%llu %llu,%llu\n", rank,
           offset[0], offset[1], globalDims[0], globalDims[1]);

    // chunking
    hsize_t chunk[2] = {128, 128};
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(datasetCreationProperty, 2, chunk);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_FLOAT,
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);

    // transfer property list: collective mode triggers the bug
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);
#ifdef FIX
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default
#else
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE);
#endif

    // select this rank's hyperslab in the file and write
    hid_t dd = H5Dget_space(dset_id);
    H5Sselect_hyperslab(dd, H5S_SELECT_SET, offset,
                        NULL, dims, NULL);
    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_FLOAT,
                      srcSize, dd, dset_plist_id, data);
    (void)status;

    // close all handles
    H5Sclose(dd);
    H5Sclose(filespace);
    H5Sclose(srcSize);
    H5Pclose(datasetCreationProperty);
    H5Pclose(plist_id);
    H5Pclose(dset_plist_id);
    H5Dclose(dset_id);
    H5Fclose(file_id);

    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Info info = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t lengths[1] = {X * Y};
    for( size_t i = 0; i < 1; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);

        float* data = (float*)malloc(len * sizeof(float));
        for( size_t y = 0; y < Y; ++y )
            for( size_t x = 0; x < X; ++x )
                data[y * X + x] = 100.f + y % 1024; // row-major: stride is X

        write_HDF5(comm, info, data, len, rank);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }

    MPI_Finalize();
    return 0;
}

Thanks, great reproducer. In libSplash we also default to collective writing. We know that independent writes are broken in other ways, e.g. they sometimes deadlock and have bad performance if I remember correctly.
@ax3l could you have a look at my report in the HDF5 forum and try to reproduce it? https://forum.hdfgroup.org/t/hdf5bug-h5fd-mpio-collective-chunking/5279/4
It looks like some grid sizes are also broken if I use my workaround with H5FD_MPIO_INDEPENDENT or with chunking disabled.
I need an additional pair of eyes to validate that I did not do something wrong.
I solved my issue with the small size arrays and updated the example code here and in the HDF5 issue.
I missed that 130 is not a multiple of 4 and therefore the input data was not well initialized.
I would not be surprised if the following constraint applies in HDF5:
"your per-rank size must be a multiple of the chunk size (per dimension)." That would basically dictate from the HDF5 side how you are allowed to domain-decompose if you want to write more than 4 GByte per rank and dataset... Let's see.
Suggestions to improve the report, which currently reads confusingly:
You are mixing the terms "threads" and "MPI ranks" in your HDF5 forum post.
The term "cell" is not known to other people; please write "offset of one" or "by one element".
Please remove all wrong text and wrong images rather than crossing them out. They will confuse readers, who will get to this post next week at the earliest anyway. You want them to reproduce the actual bug demonstrator, not usage errors.
Please report whether this is seen with the HDF5 1.10 releases as well; they will not fix 1.8 with high priority.
The issue is very likely OpenMPI specific, as I cannot reproduce it with MPICH 3.3 https://github.com/open-mpi/ompi/issues/6285
A new temporary workaround surfaced: OpenMPI also ships a ported ROMIO implementation for I/O. We can switch the OpenMPI I/O backend away from its default to ROMIO (the same implementation MPICH uses) via:
mpirun --mca io ^ompio ...
This issue is worked around in #2857 and reported upstream to OpenMPI (affects releases 2.0-4.0).