Flux-core: broker: segfault in broker_pmi_barrier

Created on 13 Nov 2020 · 14 Comments · Source: flux-framework/flux-core

I am working on a Flux extension to Parsl and as part of that extension I am trying to get a single-node Flux instance to execute a few trivial applications to get started. However, I am occasionally seeing a segfault in the broker. The stack trace (as far as I can see source code for, anyway) is main -> boot_pmi -> broker_pmi_barrier.

I'm executing /usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux start flux mini run --ntasks 1 -l bash script.sh. The segfault doesn't seem to occur when I prefix this with the usual srun -n1 --mpibind=off command. Is this a known issue? Usually I don't have any problems starting single-node Flux instances without srun.

I'm happy to share my core dumps if it would help---just let me know.

All 14 comments

Does executing /usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux start alone also crash?

It might be useful to try prefixing your flux start with FLUX_PMI_DEBUG=1. With no arguments, and not running in a PMI environment, the broker PMI client should use its "singleton" config, e.g.

$ FLUX_PMI_DEBUG=1 ./flux start
flux-broker: pmi-debug-dlopen: skipping /home/garlick/proj/flux-core/src/common/.libs/libpmi.so
flux-broker: pmi-debug-dlopen: skipping /home/garlick/proj/flux-core/src/common/.libs/libpmi.so
flux-broker: pmi-debug-dlopen: skipping /home/garlick/proj/flux-core/src/common/.libs/libpmi.so
pmi-debug-singleton[-1]: init = operation completed successfully
pmi-debug-singleton[0]: get_params (rank=0 size=1 kvsname=singleton) = operation completed successfully
pmi-debug-singleton[0]: kvs_get (kvsname=singleton key=flux.instance-level value=<none>) = operation failed
pmi-debug-singleton[0]: barrier = operation completed successfully
pmi-debug-singleton[0]: finalize = operation completed successfully

If it's finding some PMI environment and trying to use it, it would be interesting to get the details that lead to the segfault from the PMI debug output.

One thing you could try if it's _not_ using the singleton environment is just to add --size=1 to your flux start command. If the size is specified, then flux start provides the PMI environment, e.g.

$ FLUX_PMI_DEBUG=1 ./flux start --size=1
pmi-debug-wire.1[-1]: init = operation completed successfully
pmi-debug-wire.1[0]: get_params (rank=0 size=1 kvsname=-) = operation completed successfully
pmi-debug-wire.1[0]: kvs_get (kvsname=- key=flux.instance-level value=<none>) = invalid key argument
pmi-debug-wire.1[0]: barrier = operation completed successfully
pmi-debug-wire.1[0]: finalize = operation completed successfully

Thanks, both of you. I don't seem to be able to reproduce the problem very easily---it always happens in my strange Parsl setup, but when I try to modify trivial things, the segfault doesn't happen. E.g. it seems to depend on whether I execute the script containing the flux start... from the login node (no segfault), or via salloc from a login node (segfault), or via salloc from a compute node (no segfault). But anyway @dongahn, yes, I have confirmed that the segfault does still occur with just the flux start and not the rest.

Here is what I got when I added FLUX_PMI_DEBUG=1:

(37T) [corbett8@quartz1538:test3]$ salloc -N1 -ppdebug runinfo/000/submit_scripts/parsl.slurm.1605308411.9603891.submit
salloc: Pending job allocation 6047818
salloc: job 6047818 queued and waiting for resources
salloc: job 6047818 has been allocated resources
salloc: Granted job allocation 6047818
flux-broker: dlopen /lib64/libpmi.so
flux-broker: using dlopen
/collab/usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/libexec/flux/cmd/flux-broker: /usr/WS2/corbett8/test3/runinfo/000/submit_scripts/parsl.slurm.1605308411.9603891.submit: line 28:  6760 Segmentation fault      (core dumped) FLUX_PMI_DEBUG=1 /usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux start
salloc: Relinquishing job allocation 6047818
salloc: Job allocation 6047818 has been revoked.

Adding --size=1 seems to prevent the segfault.
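The runs quoted so far can be told apart by the prefix of their debug lines (singleton, wire.1, or dlopen). As a rough sketch, assuming the log formats shown in this thread (other Flux versions may print them differently), a small hypothetical helper named pmi_method could extract which PMI client the broker picked:

```python
import re

# Prefixes as they appear in the logs quoted in this thread; other Flux
# versions may format these lines differently.
_CLIENT_LINE = re.compile(r"^pmi-debug-([\w.]+)\[-?\d+\]:", re.MULTILINE)
_USING_LINE = re.compile(r"^flux-broker: using (\w+)", re.MULTILINE)


def pmi_method(log_text):
    """Return the PMI client named in the debug output, or None if unknown."""
    # "flux-broker: pmi-debug-dlopen: skipping ..." lines are ignored on
    # purpose: they start with "flux-broker:" and describe a library that
    # was passed over, not the client that was selected.
    match = _CLIENT_LINE.search(log_text) or _USING_LINE.search(log_text)
    return match.group(1) if match else None
```

Feeding it the salloc log above would report dlopen, which is the case that leads into Slurm's libpmi.so.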

flux-broker: dlopen /lib64/libpmi.so

This is interesting. This shouldn't be the PMI library to use, should it, @garlick? Is this path somehow LD_PRELOADed, or is the PMI_LIBRARY environment variable somehow being used?

My PMI_LIBRARY environment variable is not set... how would I verify that I'm not using LD_PRELOAD? If it's an environment variable, it isn't set either.
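LD_PRELOAD is an environment variable read by the dynamic linker, so checking it works the same way as checking PMI_LIBRARY. A minimal sketch of that check follows; the helper name loader_overrides is made up for illustration, and it only looks at the environment, not at system-wide files such as /etc/ld.so.preload:

```python
import os


def loader_overrides(env=None):
    """Return the loader-related variables that are set and non-empty."""
    env = os.environ if env is None else env
    # PMI_LIBRARY is the variable asked about in this thread; LD_PRELOAD is
    # the dynamic linker's preload list.
    names = ("LD_PRELOAD", "PMI_LIBRARY")
    return {name: env[name] for name in names if env.get(name)}
```

Running print(loader_overrides()) at the top of the submit script would show an empty dict if neither variable is in play.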

No, that actually looks like what I would expect. Not finding a PMI-1 server from flux start or the flux shell, the flux broker tries to bootstrap using slurm's libpmi.so. This normally works outside of a slurm job, and under srun [options] flux start. I would have thought that under salloc it would have worked like it does outside of a slurm job, but maybe it fails in a way that makes us segfault, or maybe it segfaults.

I'll see if I can reproduce this and fix. @jameshcorbett - I'm assuming you can add --size=1 in your script and move on?

Thanks for the detailed report!

I would have thought that under salloc it would have worked like it does outside of a slurm job, but maybe it fails in a way that makes us segfault, or maybe it segfaults.

I thought it used to work for me... but then it has not been the regular mode of operation. My guess would be some environment variable collision or similar.

I was able to reproduce on fluke and the segfault is indeed down in Slurm's PMI library:

(gdb) where
#0  strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
#1  0x00002aaaac43b691 in __find_specmb (
    format=0x4 <Address 0x4 out of bounds>) at printf-parse.h:109
#2  _IO_vfprintf_internal (s=0x7fffffff9de0, 
    format=0x4 <Address 0x4 out of bounds>, ap=0x7fffffffc5a8)
    at vfprintf.c:1308
#3  0x00002aaaac440e5b in buffered_vfprintf (
    s=s@entry=0x2aaaac7bb1c0 <_IO_2_1_stderr_>, 
    format=format@entry=0x4 <Address 0x4 out of bounds>, 
    args=args@entry=0x7fffffffc5a8) at vfprintf.c:2319
#4  0x00002aaaac43b81e in _IO_vfprintf_internal (
    s=0x2aaaac7bb1c0 <_IO_2_1_stderr_>, 
    format=format@entry=0x4 <Address 0x4 out of bounds>, 
    ap=ap@entry=0x7fffffffc5a8) at vfprintf.c:1289
#5  0x00002aaaac4ef6a5 in error_tail (status=status@entry=-1364198929, 
    errnum=errnum@entry=-1364199024, 
    message=message@entry=0x4 <Address 0x4 out of bounds>, 
    args=args@entry=0x7fffffffc5a8) at error.c:197
#6  0x00002aaaac4ef7fd in __error (status=status@entry=-1364198929, 
    errnum=-1364199024, message=0x4 <Address 0x4 out of bounds>) at error.c:247
#7  0x00002aaaae9cc9de in slurm_get_kvs_comm_set (
    kvs_set_ptr=kvs_set_ptr@entry=0x7fffffffc930, pmi_rank=0, pmi_size=1)
    at slurm_pmi.c:232
#8  0x00002aaaae76c167 in PMI_Barrier () at pmi.c:683
#9  0x00000000004174c6 in broker_pmi_barrier (pmi=pmi@entry=0x638030)
    at pmiutil.c:240
#10 0x0000000000416ec7 in boot_pmi (overlay=0x637350, attrs=0x633190, tbon_k=2)
    at boot_pmi.c:276
#11 0x00000000004085a2 in main (argc=1, argv=<optimized out>) at broker.c:409
(gdb) 

The key to the reproducer is that the default salloc command is overridden. When running salloc without any arguments, the default command runs something like srun -N1 -n1 --pty --mpi=none $SHELL (which seems to prevent the libpmi segfault). When a command is provided, salloc will set some environment and exec() the provided commandline, which must be what is confusing Slurm's PMI.

My guess is some SLURM environment variable is missing that causes the segv.
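For scripts that may be launched either way, the --size=1 workaround could be applied only in the problematic case. The sketch below assumes that SLURM_JOB_ID is set inside any allocation and that SLURM_STEP_ID is set only for processes launched by srun; that is an assumption about Slurm's environment, not something confirmed in this thread, and flux_start_argv is a hypothetical helper:

```python
import os


def flux_start_argv(env=None, flux="flux"):
    """Build a flux start command line, adding --size=1 under bare salloc."""
    env = os.environ if env is None else env
    # Assumption: salloc exports SLURM_JOB_ID, and only srun adds SLURM_STEP_ID.
    in_allocation = "SLURM_JOB_ID" in env
    in_srun_step = "SLURM_STEP_ID" in env
    argv = [flux, "start"]
    if in_allocation and not in_srun_step:
        # Let flux start provide the PMI environment itself, so the broker
        # never falls through to Slurm's libpmi.so.
        argv.append("--size=1")
    return argv
```

Under srun the command is left unchanged, since forcing a size of 1 there would defeat a multi-node launch.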

Hm, this is interesting:

$ SLURM_SRUN_COMM_HOST=localhost SLURM_SRUN_COMM_PORT=10000 src/cmd/flux broker
/g/g0/grondo/git/flux-core.git/src/broker/.libs/lt-flux-broker: : Unknown error 2147483638

Something is going wrong when the srun host/port can't be determined, or the connection fails.
Well, enough debugging Slurm for now.

The key to the reproducer is that the default salloc command is overridden. When running salloc without any arguments, the default command runs something like srun -N1 -n1 --pty --mpi=none $SHELL (which seems to prevent the libpmi segfault). When a command is provided, salloc will set some environment and exec() the provided commandline, which must be what is confusing Slurm's PMI.

This was a long time ago, but I remember that Slurm's PMI required the head srun process. Maybe, because you don't have that (implicitly invoked) srun, PMI doesn't know who to talk to in order to gather the port info, and it segfaults...

Still I don't quite understand why flux start doesn't use self PMI in this case.

flux start with no arguments can't assume it can use self PMI, it has to try bootstrapping using the first libpmi.so found. Otherwise, srun -N2 flux start would not work, and neither would flux mini run -n 2 flux start.

That is why -s 1 works around the issue.
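The bootstrap order described in this thread can be summarized as a simple fallback chain. This is only a simplification of what the comments say, not the broker's actual code; the function and parameter names are invented, and the return values reuse the labels from the debug output:

```python
def choose_pmi_client(pmi_server_available, libpmi_usable):
    """Pick a PMI client in the order described in the thread."""
    if pmi_server_available:
        # flux start --size=N or the flux shell provides a PMI-1 server.
        return "wire.1"
    if libpmi_usable:
        # Otherwise try the first usable libpmi.so found, e.g. Slurm's.
        return "dlopen"
    # Nothing else available: run as a single-broker instance.
    return "singleton"
```

With -s 1 the first branch is taken, so the dlopen path into Slurm's library is never reached.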

BTW, we could probably reproduce this with a basic PMI client (mvapich hello world?) and open a bug against Slurm.

Thanks everyone, and @grondo for looking into it so closely. I had forgotten about that salloc behavior, even though it has actually bitten me before---it explains the inconsistency I was seeing while trying to reproduce the issue. I can work around this bug fairly easily, so I'll close this issue.
