When running PIConGPU on the V100 at hemera, I get the following warning:
libibverbs: Warning: couldn't load driver 'librxe-rdmav25.so': librxe-rdmav25.so: cannot open shared object file: No such file or directory
@psychocoderHPC and @sbastrakov do you know where this is coming from and whether it can be ignored?
(I have encountered this warning before and it did not cause a crash, but now PIConGPU crashed without any specific error.)
I am not familiar with this library. From googling, it seems to be related to the interconnect. So I think it makes sense to create an issue so that the cluster admins can have a look.
Okay - thanks @sbastrakov - I will open a ticket.
Most likely, the crash my simulation encountered was caused by a node failure. There seems to be no notification in stdout / stderr for that on hemera.
The library issue will be investigated next year.
Looks like an InfiniBand library. Most likely the modules forget to set LD_LIBRARY_PATH, or they were compiled on a node that had a different image than the default node image.
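To check the LD_LIBRARY_PATH hypothesis on a compute node, something like the following could be run after loading the modules. It is only a minimal sketch: `find_library` is a hypothetical helper, not part of PIConGPU or libibverbs, and it only looks at LD_LIBRARY_PATH, while the dynamic loader also searches other locations (for example the system default library directories).

```python
import os
from pathlib import Path


def find_library(name, search_path=None):
    """Return the directories on search_path that contain a file called name.

    search_path is a colon-separated string; if omitted, the current
    LD_LIBRARY_PATH is used (assumption: that is the relevant variable).
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing ':'
        if (Path(entry) / name).is_file():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Library name taken from the warning message above.
    name = "librxe-rdmav25.so"
    hits = find_library(name)
    if hits:
        print(f"{name} found in: {', '.join(hits)}")
    else:
        print(f"{name} not found in any LD_LIBRARY_PATH directory")
```

An empty result would be consistent with the missing-path guess, but it does not prove it, since the library may also simply not be installed in the node image.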
@psychocoderHPC As suggested by Jens Lasch, this warning might originate from an ibv_devinfo call. From the order of output, my guess is that this warning is triggered by our cuda_memtest.sh, which performs an mpiInfo call. Does this executable internally call ibv_devinfo?
No, we never call ibv_devinfo ourselves. IMO this is coming from MPI itself, and there is a very high possibility that some RDMA feature does not work correctly because of that.
The cluster admins resolved the issue with a (driver) update.