Running PIConGPU on 8 GPUs with the example LWFA setup and increased resolution (see 8.cfg) results in the error (full output):
terminate called after throwing an instance of 'splash::DCException'
what(): Exception for DCDataSet [x] write: Failed to write dataset
when HDF5 raw data output is activated. The simulation finishes successfully with HDF5 disabled.
Looks similar to the crashes that @BeyondEspresso was reporting in https://github.com/ComputationalRadiationPhysics/libSplash/issues/160.
Thx for the report! Is that the first error that is thrown?
We had some interesting "can not open file" errors last week on our local filesystem. But maybe it's something in Splash or the way we use it...
Can you reproducibly trigger this to dig deeper? Can you try to increase the libSplash verbosity as described in https://github.com/ComputationalRadiationPhysics/libSplash/issues/160? On what system are you running? We are talking about libSplash 1.7.0 (yes, judging from the output)?
Can you post your mpiexec.tpl, and is your OpenMPI compiled CUDA-aware?
After following the spack install, I do:
export SPLASH_VERBOSE=99
pic-create $PIC_EXAMPLES/LaserWakefield $HOME/picInputs/h5LWFA
pic-build -b "cuda:60"
then increase the resolution and activate hdf5 output in 8.cfg (see 8.cfg.hdf5.patch) and submit with
tbg -s bash -c etc/picongpu/8.cfg -t etc/picongpu/bash/mpiexec.tpl $SCRATCH/runs/lwfa_wh5
Here is the full PIConGPU output with verbose libSplash.
The simulation _always_ crashes in the middle (17/43G) of writing $SCRATCH/runs/lwfa_wh5/simOutput/h5/simData_600.h5. The previous .h5 files are successfully written, but this one is only about half the size.
Turning off hdf5 output in 8.cfg makes the simulation complete successfully.
The system is a supermicro SYS-4028GR-TRT2 with dual Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz and 512 GB DDR4 RAM. It has 8 Tesla P100-PCIE@16GB (see nvidia-smi output) and is running Ubuntu 16.04.5 LTS. Full specs here. The system has no batch system installed.
List of installed spack packages and mpiexec.tpl. I have [email protected] and [email protected].
Regarding CUDA awareness, here is the OpenMPI info. It is _not_ CUDA aware, since it was configured as --without-cuda.
cc @psychocoderHPC maybe we still have a race condition or memory violation in our HDF5 plugin? Could also be in the MPI (or MPI-IO) stack itself, but looks interesting...
@berceanu if you recklessly switch hdf5 output with adios output, does that work? (we share little, recently added code to make both multi-plugins).
What's the HDF5 version you are using? (spack spec picongpu)
I installed ADIOS 1.13.1 and activated it in 8.cfg (see 8.cfg.adios.patch here). The simulation completes successfully in 1h, producing 1.8T of output :)
I tried both hdf5 1.10.3 and 1.10.4 and they crash with the above splash::DCException, at the same point in the simulation.
cc @psychocoderHPC when you are back at work today, can you take a look at this as well? Could this be something on our side or the MPI level? This is a single-node machine.
@berceanu Thanks for the report with all the useful information. Is it possible that you compile HDF5 with the debug-symbols compiler option -g and reproduce the crash? It is only possible to find the source of the issue if we know why it crashes within HDF5.
@ax3l I looked into the issue but do not see anything which can be the reason for the problem.
I think debug symbols won't help here for HDF5, although good to have them still.
But in order to dig deeper, we should output inside libSplash why the Exception is thrown.
I think it is this one:
https://github.com/ComputationalRadiationPhysics/libSplash/blob/v1.7.0/src/DCDataSet.cpp#L585-L586
We could first take the return code or internal HDF5 error stack of this H5Dwrite and put it into the HDF5 error interface (Example 12):
if (H5Dwrite(dataset, this->datatype, dsp_src, dataspace, dsetWriteProperties, data) < 0)
{
    H5Eprint(H5E_DEFAULT, stderr);
    throw DCException(getExceptionString("write: Failed to write dataset"));
}
With this patch, we might get the underlying HDF5 error printed.
@berceanu Thanks for the report with all the useful information. Is it possible that you compile HDF5 with the debug-symbols compiler option -g and reproduce the crash? It is only possible to find the source of the issue if we know why it crashes within HDF5.
Is it possible to add this compiler option via Spack and if so, how?
@ax3l I looked into the issue but do not see anything which can be the reason for the problem.
iirc, you guys have access to P100 nodes on Hemera, would it be useful to reproduce the issue there?
Is it possible to add this compiler option via Spack and if so, how?
yes, just do spack install picongpu cppflags="-g".
access to P100
That would help, but I can prepare you a little something to install the patched libSplash in spack. Give me a minute.
In your $SPACK_ROOT edit var/spack/repos/builtin/packages/libsplash/package.py and add the line
version('printHDF5ErrorStackOnWrite', branch='topic-printHDF5ErrorStackOnWrite', git='https://github.com/ax3l/libSplash.git')
In your src/spack-repo/packages/picongpu/package.py, loosen the constraint on libSplash:
depends_on('[email protected],printHDF5ErrorStackOnWrite', when='+hdf5')
Then just do
# add what you usually put here, compilers, etc. ---v
spack install picongpu ^libsplash@printHDF5ErrorStackOnWrite
To load,
# add what you usually put here, compilers, etc. ---v
spack load picongpu ^libsplash@printHDF5ErrorStackOnWrite
Do I need to uninstall the current version of PIConGPU or will this just install another one alongside it, and I can select which one I want to load each time?
This will install another PIConGPU alongside. You can always spack uninstall <spec> when it gets too convoluted with parallel installations later on.
(As a side note, it will reuse most of the dependencies in this case under the hood.)
@berceanu did you get more HDF5 output that way? :)
@ax3l Should I also set SPLASH_VERBOSE=99?
Yes, keep it all on so we have the full info.
OK, so I did
spack install picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected]
and then
spack load picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected]
export SPLASH_VERBOSE=99
followed by the usual pic-create, pic-build and tbg.
The crash happens in the same place. Here is the full output with verbose libSplash from topic-printHDF5ErrorStackOnWrite branch.
The interesting part is here:
HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 1:
#000: H5Dio.c line 336 in H5Dwrite(): can't write data
major: Dataset
minor: Write failed
#001: H5Dio.c line 828 in H5D__write(): can't write data
major: Dataset
minor: Write failed
#002: H5Dmpio.c line 893 in H5D__chunk_collective_write(): write error
major: Dataspace
minor: Write failed
#003: H5Dmpio.c line 816 in H5D__chunk_collective_io(): couldn't finish linked chunk MPI-IO
major: Low-level I/O
minor: Can't get value
#004: H5Dmpio.c line 1167 in H5D__link_chunk_collective_io(): couldn't finish MPI-IO
major: Low-level I/O
minor: Can't get value
#005: H5Dmpio.c line 2057 in H5D__final_collective_io(): optimized write failed
major: Dataset
minor: Write failed
#006: H5Dmpio.c line 426 in H5D__mpio_select_write(): can't finish collective parallel write
major: Low-level I/O
minor: Write failed
#007: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#008: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#009: H5Faccum.c line 826 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#010: H5FDint.c line 258 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#011: H5FDmpio.c line 1844 in H5FD_mpio_write(): file write failed
major: Low-level I/O
minor: Write failed
terminate called after throwing an instance of 'splash::DCException'
what(): Exception for DCDataSet [x] write: Failed to write dataset
Thanks for the details. I am currently looking into the issue.
@berceanu can you please post the output of
spack spec picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected]
so we know what dependencies, e.g. for MPI were taken?
@berceanu which filesystem are you writing to? Local or network? What's the FS type, EXT4?
As a side-test, can you switch to MPICH instead of the spack-default OpenMPI and verify the bug is seen there as well?
spack install picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected] ^mpich
spack load picongpu backend=cuda cppflags="-g" ^libsplash@printHDF5ErrorStackOnWrite %[email protected] ^mpich
export SPLASH_VERBOSE=99
Could still be a memory violation on our side and we are getting lucky with another implementation, but maybe it's also just an HDF5 or MPI-IO bug.
@berceanu Could you please post the PIConGPU output with the verbose splash information from the last crashed run where you already posted the call stack.
output is already in https://github.com/ComputationalRadiationPhysics/picongpu/issues/2777#issuecomment-436685063
so we know what dependencies, e.g. for MPI were taken?
here are the dependencies
@berceanu which filesystem are you writing to? Local or network? What's the FS type, EXT4?
local filesystem /dev/sdc1, partition type ext4
As a side-test, can you switch to MPICH instead of the spack-default OpenMPI and verify the bug is seen there as well?
==> Installing mpich
[...]
==> Error: ProcessError: Command exited with status 1:
'/home/andrei/src/spack/var/spack/stage/mpich-3.2.1-q7xvsy7ehr7ztxrowim7mdm7t4x4wsz6/mpich-3.2.1/configure' '--prefix=/home/andrei/src/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-7.3.0/mpich-3.2.1-q7xvsy7ehr7ztxrowim7mdm7t4x4wsz6' '--enable-shared' '--with-pm=hydra' '--with-pmi=yes' '--enable-romio' '--without-ibverbs' '--with-device=ch3:nemesis:tcp'
1 error found in build log:
422 checking for type of weak symbol alias support... pragma weak
423 checking whether __attribute__ ((weak)) allowed... yes
424 checking whether __attribute__ ((weak_import)) allowed... yes
425 checking whether __attribute__((weak,alias(...))) allowed... yes
426 checking for multiple weak symbol support... yes
427 checking for shared library (esp. rpath) characteristics of CC... done (results in src/env/cc_shlib.conf)
>> 428 configure: error: F90 and F90FLAGS are replaced by FC and FCFLAGS respectively in this configure, please unset F90/F90FLAGS and set FC/FCFLAGS instead and rerun configure again.
Hm, can't reproduce the MPICH install issue. Can you unset F90 and unset F90FLAGS before spack install ...?
Yeah, that seems to work. Now I get
==> Error: FetchError: All fetchers failed for libsplash-printHDF5ErrorStackOnWrite-wj27hgbb5j4qcajb526a26gy3hcin4n3
Guess it's because it got merged?
So what do I need to use instead of ^libsplash@printHDF5ErrorStackOnWrite ?
Oops, I deleted that branch already... Sorry, can you just replace @printHDF5ErrorStackOnWrite with @develop? (cmd line and package.py)
(P.S.: branch restored for convenience as well, I wasn't thinking.)
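Concretely, the substitution would look like this (a sketch; keep whatever compiler and variant flags you normally use):
spack install picongpu backend=cuda cppflags="-g" ^libsplash@develop
and change the version in the depends_on() line of your picongpu package.py accordingly.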
Running with MPICH:
Running program...
[mpiexec@ServerS] match_arg (utils/args/args.c:159): unrecognized argument am
[mpiexec@ServerS] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@ServerS] parse_args (ui/mpich/utils.c:1596): error parsing input array
[mpiexec@ServerS] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1648): unable to parse user arguments
[mpiexec@ServerS] main (ui/mpich/mpiexec.c:149): error parsing parameters
Here's the spec.
Ah, your .tpl file is tuned for OpenMPI I guess; just remove the -am argument maybe (there is no InfiniBand that needs tweaking here).
$ tbg -s bash -c etc/picongpu/8.cfg -t etc/picongpu/bash/mpiexec.tpl $SCRATCH/runs/mpich
Running program...
[proxy:0:0@ServerS] HYDU_create_process (utils/launch/launch.c:75): execvp error on file /Date2/andrei/runs/mpich/tbg/openib.conf (Permission denied)
$ ls -lsa /Date2/andrei/runs/mpich/tbg/openib.conf
4 -rw-r--r-- 1 andrei andrei 232 nov 9 18:34 /Date2/andrei/runs/mpich/tbg/openib.conf
Yes, the -am .../openib.conf option needs to be skipped, including its argument :) Sorry, I was too brief in my description.
Oh, my bad, should've checked the mpirun manpage :)
Running program...
[mpiexec@ServerS] match_arg (utils/args/args.c:159): unrecognized argument mca
[mpiexec@ServerS] HYDU_parse_array (utils/args/args.c:174): argument matching returned error
[mpiexec@ServerS] parse_args (ui/mpich/utils.c:1596): error parsing input array
[mpiexec@ServerS] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1648): unable to parse user arguments
[mpiexec@ServerS] main (ui/mpich/mpiexec.c:149): error parsing parameters
I could remove the --mca mpi_leave_pinned 0 switch but wouldn't that affect https://github.com/ComputationalRadiationPhysics/picongpu/issues/2782#issuecomment-435053460 ?
These last few messages made me realise that, even though the docs specify OpenMPI 1.7+ / MVAPICH2 1.8+ or similar as dependencies, the file etc/picongpu/bash/mpiexec.tpl is actually hard-coded for OpenMPI.
Yes, it's really bad that they are hardcoded. The problem for devs on our side is that MPI has a defined C API but no standardized command-line parameters... they differ heavily.
--mca mpi_leave_pinned 0 is quite important if the built MPI is not CUDA-aware, because in that case the communication with the network card also pins and, worse, unpins the pinned buffers that we need for device communication. Maybe this MPICH equivalent should therefore be set: https://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2012-November/004121.html
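If I read the linked thread correctly, the MVAPICH2 counterpart would be the following; this is an assumption based on that mail, untested here:
export MV2_USE_LAZY_MEM_UNREGISTER=0  # assumption: disable MVAPICH2's lazy unregistration (registration cache)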
I got more insight into the original issue. It seems that the crash depends on the value of the --hdf5.period switch.
So far I only tried --hdf5.period 100, which is the default value of the LWFA example.
For a value of 200, the simulation still crashes, but for 500 and above there is no crash.
Was the last crash with --hdf5.period 200 seen with MPICH or OpenMPI?
Oh, this is still all OpenMPI; MPICH seems too complicated to make work.
I also have trouble dumping particles like this, but on Hemera (HZDR).
@psychocoderHPC I think there is a particle preparation bug or something. My case is quite late in a sim (step 75'000); when I dump particles for the 4th time, it hangs.
@berceanu it's ok not to try MPICH any further, looks like we don't have great templates for it in place yet.
@ax3l We somehow need a reproducible parameter set, else it will not be possible to debug the issue :-(
I'll try to create a restart step for you close to the issue I can trigger. But this would be on Hemera, and there it does not throw the beautiful error we have here but just hangs. I also see a .loc file close to it and am currently waiting for an HPC response on whether some kind of antivirus, etc. might be the reason on Hemera.
.loc is created by HDF5 and is visible if you crash during the write without closing the HDF5 file.
Yep, and there is nothing else running either.
I'm trying to create a reproducer on Hemera by tomorrow. If this does not help, maybe @berceanu can create an ssh account for us on his machine, unless @psychocoderHPC wants to try this first, since it's a small example in @berceanu's case (LWFA default).
For my part, it's not a problem to give you guys access if you think that helps.
@psychocoderHPC is not at work tomorrow and I'll travel on Thursday. But if you can already create huebl and widera accounts for us (via mail), that would help.
Here is my SSH pub key: https://keybase.pub/ax3l/ssh_public_keys.txt
@berceanu I have now moved from your system back to our hypnos. I can reproduce it on our system with the simulation sizes you provided (-g 512 2048 512 -d 2 4 1)
@psychocoderHPC Did you manage to reproduce the issue on the Hypnos cluster?
@berceanu just be warned, today we have a public holiday in Saxony :)
@berceanu Yes, I can reproduce the issue on our system. I am running different cases on hypnos and trying to get a clue what's going wrong.
If you change your simulation size or the number of GPUs per direction, you can currently work around the bug.
Decreasing the hdf5 output frequency is another workaround. I'm curious whether this also happens with other MPI implementations...
I have now been debugging this issue for a week and found https://support.hdfgroup.org/HDF5/faq/limits.html
What is the limit on the amount of data you can read in Parallel HDF5?
The limit is 2 GB because the MPI standard specifies that the 'count' parameter passed to MPI_File_read() be a 32-bit integer.
And if the maximum size for a read is 2 GB, then maybe the maximum size for a write per dataset is also 2 GB.
https://portal.hdfgroup.org/pages/viewpage.action?pageId=48809737
I would say we need at least HDF5 1.10.2, but since the HDF5 version on @berceanu's system is 1.10.4, I do not know why it is crashing there.
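A quick sanity check on the numbers involved here (assuming 4-byte ints and the 8 ranks used above): 134217729 elements per rank are 134217729 × 4 B = 536870916 B, i.e. about 512 MiB per rank and about 4 GiB accumulated. Each rank's write is therefore far below the FAQ's 2 GB per-call limit, so that limit alone cannot explain the crash.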
I verified that HDF5/splash is the source of all the pain. I manipulated PIConGPU so that we write X elements per GPU (4 bytes per element) with 8 GPUs, such that the accumulated dataset was <4 GiB and >4 GiB.
X           accumulated bytes   status
134217727   <4 GiB              OK
134217728   =4 GiB              OK
134217729   >4 GiB              Fail
The reason why 134217728 is working is that the size is too large but there is no access after 4 GiB.
Task todo: write a mini app to check whether this behaviour comes from libSplash or pure HDF5.
https://portal.hdfgroup.org/pages/viewpage.action?pageId=48808714 is saying
What is the limit on the chunk size?
The current maximum number of bytes in a chunk is 2^32-1 (4 GB). As a result of this restriction, the number of elements in a chunk cannot be greater than 4 GB. You must also account for the datatype size of each element.
The chunk size can be larger than a dataset's dimension if the dataset's maximum dimension sizes are declared as unlimited or if the chunk size is less than the maximum dimension size. The chunk size for fixed-size datasets cannot exceed the size of a fixed-size dataset. For example, a dataset consisting of a 5 x 4 fixed-size array cannot be defined with 10 x 10 chunks, as the following error will occur:
but since each MPI rank has its own dataset this should not be the limitation.
Tested with HDF5 1.8.20 on hemera
Gosh... it's really a too-long dimension; it's not even about chunks or size:
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    int* data, size_t len)
{
    // property list
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    // MPI-I/O driver
    H5Pset_fapl_mpio(plist_id, comm, info);

    // file create
    char file_name[100];
    sprintf(file_name, "%zu", len);
    strcat(file_name, ".h5");
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,
                              H5P_DEFAULT, plist_id);

    // dataspace
    hsize_t dims[1] = {len};
    hsize_t max_dims[1] = {len};
    // hsize_t* max_dims = NULL;
    hid_t filespace = H5Screate_simple(1,
                                       dims,
                                       max_dims);

    // chunking
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);

    // write
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE);
    // H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default
    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_INT,
                      H5S_ALL, filespace, dset_plist_id, data);

    // close all
    status = H5Pclose(plist_id);
    status = H5Pclose(dset_plist_id);
    status = H5Dclose(dset_id);
    status = H5Fclose(file_id);

    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Info info = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);

    size_t lengths[3] = {134217727u, 134217728u, 134217729u};
    for( size_t i = 0; i < 3; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);

        int* data = malloc(len * sizeof(int));
        for( size_t k = 0; k < len; ++k )
            data[k] = 420;

        write_HDF5(comm, info, data, len);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }

    MPI_Finalize();
    return 0;
}
int main(int argc, char* argv[])
{
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Info info = MPI_INFO_NULL;
MPI_Init(&argc, &argv);
size_t lengths[3] = {134217727u, 134217728u, 134217729u};
for( size_t i = 0; i < 3; ++i )
{
size_t len = lengths[i];
printf("Writing for len=%zu ...\n", len);
int* data = malloc(len * sizeof(int));
for( size_t k=0; k<len; ++k)
data[k] = 420;
write_HDF5(comm, info, data, len);
free(data);
printf("Finished write for len=%zu ...\n", len);
}
MPI_Finalize();
return 0;
}
$ h5pcc phdf5.c && ./a.out
Writing for len=134217727 ...
Finished write for len=134217727 ...
Writing for len=134217728 ...
Finished write for len=134217728 ...
Writing for len=134217729 ...
HDF5-DIAG: Error detected in HDF5 (1.10.4) MPI-process 0:
#000: H5Dio.c line 336 in H5Dwrite(): can't write data
major: Dataset
minor: Write failed
#001: H5Dio.c line 828 in H5D__write(): can't write data
major: Dataset
minor: Write failed
#002: H5Dmpio.c line 671 in H5D__contig_collective_write(): couldn't finish shared collective MPI-IO
major: Low-level I/O
minor: Write failed
#003: H5Dmpio.c line 2013 in H5D__inter_collective_io(): couldn't finish collective MPI-IO
major: Low-level I/O
minor: Can't get value
#004: H5Dmpio.c line 2057 in H5D__final_collective_io(): optimized write failed
major: Dataset
minor: Write failed
#005: H5Dmpio.c line 426 in H5D__mpio_select_write(): can't finish collective parallel write
major: Low-level I/O
minor: Write failed
#006: H5Fio.c line 165 in H5F_block_write(): write through page buffer failed
major: Low-level I/O
minor: Write failed
#007: H5PB.c line 1028 in H5PB_write(): write through metadata accumulator failed
major: Page Buffering
minor: Write failed
#008: H5Faccum.c line 826 in H5F__accum_write(): file write failed
major: Low-level I/O
minor: Write failed
#009: H5FDint.c line 258 in H5FD_write(): driver write request failed
major: Virtual File Layer
minor: Write failed
#010: H5FDmpio.c line 1844 in H5FD_mpio_write(): file write failed
major: Low-level I/O
minor: Write failed
Finished write for len=134217729 ...
$ du -hs 13421772*
513M 134217727.h5
513M 134217728.h5
4,0K 134217729.h5
Fun fact: we found a dead range: [134217727u + 2; 134217727u + 9] triggers the error.
Just for your interest, this is how I valgrind-ed PIConGPU over the weekend with this script (it only helps if the memory violation is triggered at runtime and not already present in the first, suppression-generating run):
valgrind --leak-check=full --show-reachable=yes --error-limit=no --gen-suppressions=all --log-file=picongpu_all.log \
./bin/picongpu -d 1 1 1 -g 12 80 12 --periodic 1 1 1 -s 10
cat picongpu_all.log | ./parse_valgrind_suppressions.sh > picongpu_all.supp
valgrind --leak-check=full --show-reachable=yes --error-limit=no --suppressions=./picongpu_all.supp --log-file=picongpu_hdf5.log \
./bin/picongpu -d 1 1 1 -g 12 80 12 --periodic 1 1 1 -s 10 --hdf5.period 10 --hdf5.file simData
For the example above, restricted to only length 134217736u, I get from valgrind:
valgrind --leak-check=full --show-reachable=yes --error-limit=no --gen-suppressions=all --log-file=ph5_all.log ./a.out
==28325== Warning: set address range perms: large range [0x96bd040, 0x296bd03c) (undefined)
==28325== Syscall param pwritev(vector[...]) points to uninitialised byte(s)
==28325== at 0x5CCEF53: ??? (syscall-template.S:84)
==28325== by 0x5656FD9: mca_fbtl_posix_pwritev (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x5612873: mca_common_ompio_file_write (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x561232D: mca_common_ompio_file_write_at_all (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x56A2B25: mca_io_ompio_file_write_at_all (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x55ED9C7: PMPI_File_write_at_all (in /home/axel/src/spack/opt/spack/linux-debian9-x86_64/gcc-6.3.0/openmpi-3.1.3-eiu5hplnnginmk4kezfqtxqwmy4yrph5/lib/libmpi.so.40.10.3)
==28325== by 0x36A983: H5FD_mpio_write (in /home/axel/a.out)
==28325== by 0x1B7C94: H5FD_write (in /home/axel/a.out)
==28325== by 0x388873: H5F__accum_write (in /home/axel/a.out)
==28325== by 0x265294: H5PB_write (in /home/axel/a.out)
==28325== by 0x1A8F2A: H5F_block_write (in /home/axel/a.out)
==28325== by 0x36847C: H5D__mpio_select_write (in /home/axel/a.out)
==28325== Address 0x116bd03f is 134,217,727 bytes inside a block of size 536,870,908 alloc'd
==28325== at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==28325== by 0x114345: main (in /home/axel/a.out)
...
(Cannot stably reproduce this diagnostic when trying further...)
My little MPI-IO test program works xD
$ mpicc pmpio.c && ./a.out && du -hs testfile
Writing for len=134217736 ...
Finished write for len=134217736 ...
Writing for len=134217728 ...
Finished write for len=134217728 ...
Writing for len=134217729 ...
Finished write for len=134217729 ...
Writing for len=134217736 ...
Finished write for len=134217736 ...
Writing for len=134217737 ...
Finished write for len=134217737 ...
513M testfile
#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int write_MPIO(
    MPI_Comm const comm, MPI_Info const info,
    int* data, size_t len)
{
    int myrank = 0;
    MPI_File thefile;

    // open (and create) a shared file for writing
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &thefile);

    // each rank writes at its own offset (single rank here, so offset 0)
    MPI_File_set_view(thefile, myrank * len * sizeof(int),
                      MPI_INT, MPI_INT, "native",
                      MPI_INFO_NULL);
    MPI_File_write(thefile, data, len, MPI_INT,
                   MPI_STATUS_IGNORE);

    MPI_File_close(&thefile);
    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Info info = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);

    size_t lengths[5] = {134217736u, 134217728u, 134217729u, 134217736u, 134217737u};
    for( size_t i = 0; i < 5; ++i )
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);

        int* data = malloc(len * sizeof(int));
        for( size_t k = 0; k < len; ++k )
            data[k] = 420;

        write_MPIO(comm, info, data, len);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }

    MPI_Finalize();
    return 0;
}
You guys are heroes! :)
We still have absolutely no idea what's going on there xD Ahhh.
Yea but you've put so much work into this!
The error occurs exactly at the boundary of H5S_MAX_MPI_COUNT, as we just debugged (H5Smpio.c, in function H5S_mpio_all_type).
Note: the magic limit H5S_MAX_MPI_COUNT = 536870911 comes from the fact that MPI counts are ints; it is sized for 4-byte types, not for char as HDF5 uses. The correct limit for char is (2^31-1) == 2147483647; if we divide this by the size of e.g. int, we get floor(536870911.75) == 536870911, our magic border.
This only clears up where the magic number comes from, but not why we have the issue, because it should be safe to use any count smaller than 2^31-1 if we do byte writes with MPI file I/O.
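Cross-checking against the reproducer sizes (my arithmetic, assuming 4-byte ints): 134217729 ints are 536870916 bytes per rank, only a few bytes past the 536870911 border, so the failures show up right where HDF5 starts taking its type-splitting path.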
True, forgot to write.
One can now of course move the number, but I guess there is a logic bug hidden somewhere in the splitting/new-type creation that could re-appear for other user-defined types in different situations later on.
I am trying to push this further up the HDF5 issue escalation chain, but they allow themselves 2 weeks for even answering "community" reports...
I am looking so much forward to switching to ADIOS2 only.
@ax3l I played around a little bit and think the bug is later in the hdf5 chain. I changed H5S_MAX_MPI_COUNT but it still crashes at e.g. 134217727u + 5u elements and each multiple of it.
IMO the type creation is fine. I think it has something to do with the collective buffered writes or their preparation.
Could then also be in the MPI-I/O (ROMIO) layer if we are unlucky.
I extended your MPI code to reproduce the issue. https://gist.github.com/psychocoderHPC/dfcc0ebe7c2547d1e0e40a14cbaf6072
The current problem in the extended code snippet is that MPI_Get_elements_x is not reporting the correct number of written bytes back, but if you check the file size you can see that only 17 bytes are written. NOTE: you must remove the file each time, else the file size is wrong.
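For reference, the byte-count check in the extended snippet presumably looks roughly like this (a sketch of the approach, not the exact gist code; MPI_Get_elements_x is the MPI-3 call for querying how many basic elements a completed operation transferred):
// write with a real status object instead of MPI_STATUS_IGNORE ...
MPI_Status status;
MPI_Count written = 0;
MPI_File_write(thefile, data, (int)len, MPI_INT, &status);
// ... then ask MPI how many bytes it claims to have transferred
MPI_Get_elements_x(&status, MPI_BYTE, &written);
printf("byte %lld %zu\n", (long long)written, len * sizeof(int));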
Can you please post the output you expect vs. the one you get? (If you add .c/.cpp to your gist's file name, syntax highlighting will be enabled.)
you must remove the file each time
open with truncation to avoid this?
The output is:
rm testfile; ./a.out ; ls -la testfile
Writing for len=134217732 ...
left over: 17
byte 140737488325576 536870928
Finished write for len=134217732 ...
-rw-r--r-- 1 widera fwt 16 Nov 30 16:49 testfile
140737488325576 is the number of written bytes. That value is currently broken and IMO an issue on my side.
In the last line 16 is the size of the file.
Here is the output if I use 134217727u + 10u:
rm testfile; ./a.out ; ls -la testfile
Writing for len=134217737 ...
left over: 37
byte 140737488325576 536870948
Finished write for len=134217737 ...
-rw-r--r-- 1 widera fwt 536870948 Nov 30 16:52 testfile
They added the bug now as HDFFV-10638.
The issue is very likely OpenMPI specific, as I cannot reproduce it when using MPICH 3.3 https://github.com/open-mpi/ompi/issues/6285
That's good to know, I will try installing MPICH. So, what is the failsafe way of running the code now? Is it MPICH + adios, or is MPICH + hdf5 good enough?
Yes, both should work! You might need to slightly adjust your .tpl for mpirun/exec options. Those are, unfortunately, not standardized in the MPI standard.
Oh yes, I remember running into that issue as I was trying to test the setup with MPICH. Do you have an example working .tpl?
New temporary work-around surfaced: OpenMPI also ships a ported ROMIO implementation for IO. You can switch the OpenMPI I/O backend away from its default to ROMIO (same as in MPICH) via:
mpirun --mca io ^ompio ...
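In a tbg setup that simply means adding --mca io ^ompio to the mpirun/mpiexec flags in etc/picongpu/bash/mpiexec.tpl (the exact line depends on your template).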
Who would have known the rabbit hole runs so deep :)
I just saw the PR.