PIConGPU: Warning on hemera V100 libibverbs

Created on 23 Dec 2020 · 7 Comments · Source: ComputationalRadiationPhysics/picongpu

When running PIConGPU on the V100 at hemera, I get the following warning:

libibverbs: Warning: couldn't load driver 'librxe-rdmav25.so': librxe-rdmav25.so: cannot open shared object file: No such file or directory

@psychocoderHPC and @sbastrakov do you know where this is coming from and whether it can be ignored?
(I encountered this warning before and it did not cause a crash, but now PIConGPU crashed without any specific error.)

machine/system question

All 7 comments

I am not familiar with this library. From a quick search, it seems to be related to the interconnect. So I think it makes sense to create an issue so that the cluster admins can have a look.

Okay - thanks @sbastrakov - I will open a ticket.

Most likely, the crash my simulation encountered was caused by a node failure. There seems to be no notification in stdout / stderr for that on hemera.

The library issue will be investigated next year.

Looks like an InfiniBand library. Most likely the modules forget to set LD_LIBRARY_PATH, or they were compiled on a node that had a different image than the default node image.
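
To make the LD_LIBRARY_PATH hypothesis easier to check, here is a minimal Python sketch that looks for the driver file in each directory listed in that variable. The function name `find_library` is made up for illustration. The check is also incomplete: the dynamic loader and libibverbs consult other locations besides LD_LIBRARY_PATH, so "not found" here does not prove the file is missing from the node.

```python
import os
from pathlib import Path


def find_library(name, search_path):
    """Return the directories in a path-list string that contain the file `name`.

    `find_library` is a hypothetical helper, not part of PIConGPU or libibverbs.
    """
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing separator
        if (Path(entry) / name).is_file():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Driver name taken from the warning message in this issue.
    lib = "librxe-rdmav25.so"
    hits = find_library(lib, os.environ.get("LD_LIBRARY_PATH", ""))
    if hits:
        print(f"{lib} found in: {hits}")
    else:
        print(f"{lib} not found on LD_LIBRARY_PATH")
```

Running this once on a login node and once inside a job on a compute node, with the same modules loaded, would show whether the two environments differ.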

@psychocoderHPC As suggested by Jens Lasch, this warning might originate from an ibv_devinfo call. From the order of output, my guess is that this warning is triggered by our cuda_memtest.sh, which performs an mpiInfo call. Does this executable internally call ibv_devinfo?
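
Guessing from the order of output can be made a bit more systematic. The following Python sketch scans captured job output for the libibverbs warning and reports the last non-empty line printed before it. The helper name `locate_warnings` is an assumption for illustration. Because stdout and stderr may be buffered differently, the preceding line is only a hint about which step triggered the warning.

```python
import re

# Matches the warning text quoted in this issue and captures the driver name.
WARNING_RE = re.compile(r"libibverbs: Warning: couldn't load driver '([^']+)'")


def locate_warnings(lines):
    """Return (line_number, driver, previous_nonempty_line) for each warning.

    `locate_warnings` is a hypothetical helper for inspecting a job log.
    """
    results = []
    previous = None
    for number, line in enumerate(lines, start=1):
        match = WARNING_RE.search(line)
        if match:
            results.append((number, match.group(1), previous))
        elif line.strip():
            # Remember the most recent non-empty, non-warning line.
            previous = line.rstrip("\n")
    return results
```

It can be applied to any iterable of lines, for example an open file object for the job's combined output file.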

No, we never call ibv_devinfo ourselves. IMO this is coming from MPI itself, and there is a very high possibility that some RDMA feature does not work correctly because of that.

The cluster admins resolved the issue with a (driver) update.
