Flux-core: Flux crashes in Jansson's unpack on UQP ensemble

Created on 31 Oct 2019  ·  79Comments  ·  Source: flux-framework/flux-core

@jameshcorbett integrate Flux into UQP and ran it w/ a large LLNL application. An ensemble ran for awhile (~10mins) and failed because a Flux broker took SIGSEGV and crashes. The backtrace is:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/collab/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017/libexec/flu'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaaaf4d8b7 in unpack () from /lib64/libjansson.so.4
Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.53.0-27.el7.x86_64 boost-graph-1.53.0-27.el7.x86_64 boost-regex-1.53.0-27.el7.x86_64 boost-system-1.53.0-27.el7.x86_64 czmq-3.0.2-3.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 jansson-2.10-1.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.1chaos.ch6.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libicu-50.2-3.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libsodium-1.0.18-1.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 libuuid-2.23.2-61.el7.x86_64 lz4-1.7.5-3.el7.x86_64 munge-libs-0.5.13-1.ch6.x86_64 numactl-libs-2.0.12-3.el7.x86_64 openpgm-5.2.122-2.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 yaml-cpp-0.5.1-2.el7.x86_64 zeromq-4.1.5-3.ch6.x86_64
(gdb) bt
#0  0x00002aaaaaf4d8b7 in unpack () from /lib64/libjansson.so.4
#1  0x00002aaaaaf4e743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2  0x00002aaaaaf4e93b in json_unpack () from /lib64/libjansson.so.4
#3  0x00002aaac0935a0c in treeobj_peek (obj=<optimized out>, typep=0x2aaac0d73708, datap=0x0) at treeobj.c:57
#4  0x00002aaac0935d65 in treeobj_get_type (obj=<optimized out>) at treeobj.c:134
#5  0x00002aaac0935e99 in treeobj_is_dirref (obj=<optimized out>) at treeobj.c:165
#6  0x00002aaac092fa79 in lookup (lh=0x2aab00593dd0) at lookup.c:1057
#7  0x00002aaac092ba5c in lookup_common (h=h@entry=0x2aaac80011c0, mh=0x2aaac800db20, msg=msg@entry=0x2aaacb751bc0,
    arg=0x2aaac800b8b0, replay_cb=replay_cb@entry=0x2aaac092cb40 <lookup_plus_request_cb>, stall=<optimized out>)
    at kvs.c:1388
#8  0x00002aaac092cb6a in lookup_plus_request_cb (h=0x2aaac80011c0, mh=<optimized out>, msg=0x2aaacb751bc0,
    arg=<optimized out>) at kvs.c:1503
#9  0x00002aaac092ed1b in wait_runone (w=0x2aaac84a65d0) at waitqueue.c:173
#10 wait_runqueue (q=<optimized out>) at waitqueue.c:201
#11 0x00002aaac092e11c in cache_entry_set_raw (entry=entry@entry=0x2aaac8673f60, data=0x2aaacb50c458, len=286)
    at cache.c:179
#12 0x00002aaac092c4b1 in content_load_completion (f=0x2aaac86744a0, arg=0x2aaac800b8b0) at kvs.c:531
#13 0x00002aaaaad179d5 in ev_invoke_pending (loop=0x2aaac800a4e0) at ev.c:3322
#14 0x00002aaaaad1ab47 in ev_run (loop=0x2aaac800a4e0, flags=0) at ev.c:3726
#15 0x00002aaaaaceb593 in flux_reactor_run (r=0x2aaac8001270, flags=flags@entry=0) at reactor.c:126
#16 0x00002aaac092db01 in mod_main (h=0x2aaac80011c0, argc=<optimized out>, argv=<optimized out>) at kvs.c:2984
#17 0x000000000040abf6 in module_thread (arg=0x672360) at module.c:162
#18 0x00002aaaab678ea5 in start_thread (arg=0x2aaac0d74700) at pthread_create.c:307
#19 0x00002aaaac2ec8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) quit

The core file size is 435MB.

All 79 comments

Here is the version info:

rztopaz572{dahn}38: /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017/bin/flux --version
commands:           0.13.0-83-g30ccef9
libflux-core:       0.13.0-83-g30ccef9

Looking at the corefile using totalview. It is the KVS module that's crashing.

TotalView's back trace is slightly different in that the leaf function that got the segmentation violation is called json_get_alloc_funcs ():

Screen Shot 2019-10-31 at 11 05 32 AM

Too bad we don't have jansson debuginfo package installed. Not sure how we can investigate what might actually be in that treeobj json_t *, or if obj is even valid memory? Maybe @chu11 will have some ideas.

@grondo: I can build flux with (-O0 -g) and have @jameshcorbett generate another corefile. That way, we should be able to look at the full state before getting into Jansson. If worse comes to the worse, we may need to build our own Jansson in full debug mode and to look into Jansson state.

@chu11: Let me know if you have an insight about this. Otherwise, I will go ahead and try this.

Unlikely that json_get_alloc_funcs is where the segv is really occurring. This function just returns internally stored malloc and free (allowing users to override allocator used by jansson)

void json_get_alloc_funcs(json_malloc_t *malloc_fn, json_free_t *free_fn)
{
    if (malloc_fn)
        *malloc_fn = do_malloc;
    if (free_fn)
        *free_fn = do_free;
}

The other thing we need to look at is the memory footprint of the system when this happen. This was with a large app and depending on the input deck used, out of memory is possibility.

@jameshcorbett: do you know what is the memory usage on a node? type in top on one of the allocated node should give you some info.

I can build flux with (-O0 -g) and have @jameshcorbett generate another corefile. That way, we should be able to look at the full state before getting into Jansson.

Great if the problem is reproducible!

@grondo: somehow I can't locate the unpack() function in Jansson...

somehow I can't locate the unpack() function in Jansson...

In 2.8 it is in src/pack_unpack.c

@jameshcorbett: could you use the following version to see if this also crashes? If so, could you "give" the corefile to me again?

/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg

Sure @dongahn, do you want me to do anything with top? Insert it into the script that launches flux?

FYI @dongahn, RedHat has a new project debuginfod which might be timely. If it works as advertised, we could have one debuginfo server per network zone with all debuginfos readily available

https://developers.redhat.com/blog/2019/10/14/introducing-debuginfod-the-elfutils-debuginfo-server/

I'll see if we have any hope of getting this set up on site. We may want to talk to Ben about STAT and TV support for remote debuginfos.

@dongahn the ensemble has been running for thirty minutes now, which is usually long enough for it to finish or crash. I was running ten simulations, and five at a time. Of the first five running, two crashed with what I believe were internal errors. The other three ran to completion, but towards the end of the simulation's log file there was the line "flux-job: flux_job_event_watch_get: State not recoverable". The UQP launched jobs 6, 7, and 8 to replace the finished jobs, but those seem to running indefinitely with no output.

When I ran the last ensemble (the ensemble that generated that core file I gave you) all the log files instead had the line "flux-job: flux_reactor_run: Success" towards the bottom, shortly before the log file cut off. I hope that helps in some way.

the ensemble has been running for thirty minutes now, which is usually long enough for it to finish or crash.

This is an unoptimized build of Flux and I wonder if that slowed things down further.

Also also mentioned that running this ensemble under Flux was pretty slow compared to the old method. We suspected that an issue could be that there is something wrong with binding/affinity. How do you compare the performance between Flux and your old way on this particular ensemble?

Of the first five running, two crashed with what I believe were internal errors

There are internal errors of the application itself, correct?

The other three ran to completion, but towards the end of the simulation's log file there was the line flux-job: flux_job_event_watch_get: State not recoverable.

@grondo: do you know when such an output will be printed out? Is this typically harmless?

The UQP launched jobs 6, 7, and 8 to replace the finished jobs, but those seem to running indefinitely with no output.

Hmmm. This is bad... I wonder if this is just a follow up problem from the cause of "flux-job: flux_job_event_watch_get: State not recoverable".

When I ran the last ensemble (the ensemble that generated that core file I gave you) all the log files instead had the line "flux-job: flux_reactor_run: Success" towards the bottom, shortly before the log file cut off. I hope that helps in some way.

This is helpful. I am pretty baffled because the only difference between the version you had success and this version is the permission patch @grondo added: https://github.com/flux-framework/flux-core/pull/2468

I am pretty baffled because the only difference between the version you had success and this version is the permission patch @grondo added: #2468

@jameshcorbett: Let's try one more thing which should be very useful given that you had a previous success.

Could you try flux from in-tree build directory?

/usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/src/cmd/flux

This was essentially how you ran your Flux before #2468. If this still gives you the errors, I will build one more Flux version with the exactly same configuration as before the #2468 patch.

This is an unoptimized build of Flux and I wonder if that slowed things down further.

Also also mentioned that running this ensemble under Flux was pretty slow compared to the old method. We suspected that an issue could be that there is something wrong with binding/affinity. How do you compare the performance between Flux and your old way on this particular ensemble?

Actually, the issue I had where Flux was running slowly was a completely separate issue. That only came up when I started to create nested flux instances. Otherwise, I haven't noticed Flux being any slower than srun.

There are internal errors of the application itself, correct?

Yeah, they pop up every once in a while even when I don't use flux at all, and just srun the application.

@dongahn I started to run the in-tree flux, it should be done in half an hour.

flux-job: flux_job_event_watch_get: State not recoverable.

@grondo: do you know when such an output will be printed out? Is this typically harmless?

No, AFAIK, we should never see ENOTRECOVERABLE errors. Maybe more information would be in the flux dmesg output, but all around it seems like something bad happened to this instance.

@dongahn the in-tree build flux worked. However, I did notice that whenever the application crashed (for what I still believe are internal reasons) flux mini run didn't return, so the UQP was never able to mark them as complete. Three applications crashed, so the UQP was left running the rest of the applications just two at a time (instead of five at a time).

@jameshcorbett, the 0.13+ version of flux has a deficiency where it will wait for all tasks to exit before considering the job complete (and flux-mini run exits). If only one task in your job was crashing, this might be what you are experiencing.

Sounds like we need to address this issue sooner rather than later if that is the case.

If all your tasks were exited at the time of the crash, then there is an unknown bug we'll have to investigate further.

@grondo I think that's it. While running the application with sixteen tasks with srun and flux mini run, here's the tail of the log file for each:

Srun:

 0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
 0: slurmstepd: error: *** STEP 2943859.2 ON rztopaz609 CANCELLED AT 2019-10-29T12:50:09 ***
srun: error: rztopaz609: tasks 0-15: Killed

Flux:

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
PMI2_Abort: (15) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
PMI2_Abort: (1) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
PMI2_Abort: (11) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
PMI2_Abort: (7) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
PMI2_Abort: (2) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
PMI2_Abort: (9) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
PMI2_Abort: (5) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
PMI2_Abort: (0) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
PMI2_Abort: (3) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
(hang)

In this particular case, it is strange to me that the rest of the MPI ranks don't abort, though.

In this case I think we could wire our PMI to generate a job exception on abort, which would terminate the job. However, if a task died for some other reason we'd still have the problem described above.

Sorry about the trouble! We'll work on improving it.

Oh, I see. I just expected the MPI_Abort call would have signaled all other MPI ranks to abort too, and that then Flux would have marked the job as complete.

I'm just happy that Flux isn't mysteriously crashing any more.

I'm just happy that Flux isn't mysteriously crashing any more.

Glad this worked but we really need to track down this issue further while you are making progress by using the in-tree flux.

I think some of flux's dependencies are missing when you use the installed flux. I have no idea yet how to track this down though.

I could be wrong but I think it is up to PMI to do something with the abort (though maybe some MPIs are wired up to be able to do something global with an MPI_Abort)

I really wish we understood what was crashing Flux in your reproducer. That issue is concerning!

Let me know if there's anything I can do to help. I'll re-run the simulations with the various flux versions just to confirm the results.

Yes. If you can run the installed version a couple more times to see if this can produce a core file (full debug), this should be extremely helpful. Thanks James!

You mean /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg? Will do.

Yes! Thanks.

@dongahn I just gave you two new core files. I'm not sure why the first debug run didn't crash, because these two new runs I tried overnight crashed around fifteen minutes in. But anyway, I hope they help.

@jameshcorbett: oh great. I will take a look right away.

Seems Flux crashes at the same spot:

#0  0x00002aaaaaf5e8b7 in unpack () from /lib64/libjansson.so.4
#1  0x00002aaaaaf5f743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2  0x00002aaaaaf5f93b in json_unpack () from /lib64/libjansson.so.4
#3  0x00002aaac0941683 in treeobj_peek (obj=0x2aaac91d9610, typep=0x2aaac0d89588, datap=0x0) at treeobj.c:57
#4  0x00002aaac0941993 in treeobj_get_type (obj=0x2aaac91d9610) at treeobj.c:134
#5  0x00002aaac0941ae8 in treeobj_is_dirref (obj=0x2aaac91d9610) at treeobj.c:165
#6  0x00002aaac093a488 in lookup (lh=0x2aaac8044260) at lookup.c:1057
#7  0x00002aaac093338a in lookup_common (h=0x2aaac8001220, mh=0x2aaac800db00, msg=0x2aaaf39df7e0, arg=0x2aaac800b890,
    replay_cb=0x2aaac093373c <lookup_plus_request_cb>, stall=0x2aaac0d89793) at kvs.c:1388
#8  0x00002aaac0933786 in lookup_plus_request_cb (h=0x2aaac8001220, mh=0x2aaac800db00, msg=0x2aaaf39df7e0, arg=0x2aaac800b890) at kvs.c:1503
#9  0x00002aaac09384fb in wait_runone (w=0x2aaac9edd130) at waitqueue.c:173
#10 0x00002aaac09385ab in wait_runqueue (q=0x2aaac9829220) at waitqueue.c:201
#11 0x00002aaac0937610 in cache_entry_set_raw (entry=0x2aaaf24e6f80, data=0x2aaacb69cb58, len=286) at cache.c:179
#12 0x00002aaac0931483 in content_load_completion (f=0x2aaac9829240, arg=0x2aaac800b890) at kvs.c:531
#13 0x00002aaaaacf79a3 in check_cb (r=0x2aaac8009a50, w=0x2aaac9829040, revents=0, arg=0x2aaac9829240) at future.c:796
#14 0x00002aaaaacebfbf in check_cb (loop=0x2aaac800a540, cw=0x2aaac9829068, revents=32768) at reactor.c:853
#15 0x00002aaaaad28d93 in ev_invoke_pending (loop=0x2aaac800a540) at ev.c:3322
#16 0x00002aaaaad29c85 in ev_run (loop=0x2aaac800a540, flags=0) at ev.c:3726
#17 0x00002aaaaacea717 in flux_reactor_run (r=0x2aaac8009a50, flags=0) at reactor.c:126
#18 0x00002aaac093716a in mod_main (h=0x2aaac8001220, argc=0, argv=0x2aaac800b3b0) at kvs.c:2984
#19 0x000000000040bf76 in module_thread (arg=0x67cf30) at module.c:162
#20 0x00002aaaab689ea5 in start_thread (arg=0x2aaac0d8a700) at pthread_create.c:307
#21 0x00002aaaac2fd8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

json_t * is first passed into treeobj_is_dirref and its type field is 4056313424 and refcount 46913661567584. Both seem pretty large and this makes me believe the json object being passed to json_unpack is invalid.

#5  0x00002aaac0941ae8 in treeobj_is_dirref (obj=0x2aaac91d9610) at treeobj.c:165
165     const char *type = treeobj_get_type (obj);
(gdb) print obj
$1 = (const json_t *) 0x2aaac91d9610
(gdb) print *obj
$2 = {type = 4056313424, refcount = 46913661567584}

I think we need to get @chu11 or @garlick to take a look. I don't know the kvs well enough to efficiently determine where to look next.

This could also be memory corruption, and since all modules run in the same address space the corruption could come from anywhere... :-(

I looked at the other corefile. The type and refcount have different values from the previous corefile. Likely this object is corrupted.

#3  0x00002aaac0941683 in treeobj_peek (obj=0x2aaac8e9b8d0, typep=0x2aaac0d89588, datap=0x0) at treeobj.c:57
57      if (!obj || json_unpack ((json_t *)obj, "{s:i s:s s:o !}",
(gdb) print obj
$1 = (const json_t *) 0x2aaac8e9b8d0
(gdb) print *obj
$2 = {type = 828467315, refcount = 7161066662214840674}

I gave core files to @chu11 and @garlick on quartz:

quartz1916{dahn}24: give -l
 achu has been given:
   345 MB Nov 01 10:25 rztopaz290-flux-broker-0-22454.core
   363 MB Nov 01 10:25 rztopaz484-flux-broker-0-48886.core
   2 file(s)
 garlick has been given:
   345 MB Nov 01 10:25 rztopaz290-flux-broker-0-22454.core
   363 MB Nov 01 10:25 rztopaz484-flux-broker-0-48886.core
   2 file(s)
You have given a total of 4 file(s

I wonder if there is any way to reproduce under valgrind or asan.

So the debug markers so far:

  1. When this occurs, it happens exact at the same stack trace;
  2. There could be other manifestations, but that could have happened due to a pilot error;
  3. It crashes because the json_t object is likely invalid;
  4. The user could make it happen with an "installed" flux but not with an "in-tree" flux;

Once @chu11 and @garlick take a look at this. If they have no clue, maybe we should use memory checkers to see if they can spot the root cause.

Because Flux is being launched with srun, we should try memcheck_all and total view's memscape.

@grondo: Yes, memcheck_all is valgrind.

The installed vs in-tree angle is interesting. @dongahn, have we tried just running a simple workload under the installed version to see if this issue reproduces?

I have not but we could.

I gave core files to @chu11 and @garlick on quartz:

Can you point me to which flux-broker binary was run?

file on that core should give you the path. D


From: Al Chu notifications@github.com
Sent: Friday, November 1, 2019 1:39:42 PM
To: flux-framework/flux-core flux-core@noreply.github.com
Cc: Ahn, Dong H. ahn1@llnl.gov; Mention mention@noreply.github.com
Subject: Re: [flux-framework/flux-core] Flux crashes in Jansson's unpack on UQP ensemble (#2500)

I gave core files to @chu11https://github.com/chu11 and @garlickhttps://github.com/garlick on quartz:

Can you point me to which flux-broker binary was run?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHubhttps://github.com/flux-framework/flux-core/issues/2500?email_source=notifications&email_token=AAGSPK4CERSMJLXRVVE2CCLQRSHY5A5CNFSM4JHOZG72YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEC4DGQY#issuecomment-548942659, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAGSPK3QUYFH2ACAJZEPA5DQRSHY5ANCNFSM4JHOZG7Q.

Now I'm in my office: it is /collab/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg/libexec/flux/cmd/flux-broker

Next time you can get this info by typing infile corefile

Assuming the json_t object is corrupted.

(gdb) p *(json_t *)0x2aaac91d9610
$1 = {type = 4056313424, refcount = 46913661567584}

This object is stored in the internal kvs lookup handle, which looks ok

(gdb) p *(lookup_t *)0x2aaac8044260
$2 = {cache = 0x2aaac8009b10, krm = 0x2aaac800b570, current_epoch = 390, ns_name = 0x0, root_ref = 0x2aaaca76cda0 "sha1-4671d4ec87a7c5d61cbc7b111e42c30b90e08899", root_seq = 2293, root_ref_set_by_user = true, path = 0x2aaaca0ba270 "output", h = 0x2aaac8001220, rolemask = 1, userid = 31193, flags = 260, aux = 0x0, val = 0x0, valref_missing_refs = 0x2aaac91d9610, missing_ref = 0x0, missing_namespace = 0x0, errnum = 0, aux_errnum = 0, levels = 0x2aaac9f059d0, wdirent = 0x2aaac91d9610, state = LOOKUP_STATE_VALUE}

and some structs inside this appear to be ok

(gdb) p *(kvsroot_mgr_t *)0x2aaac800b570
$3 = {roothash = 0x2aaac800ccf0, removelist = 0x2aaac800cde0, iterating_roots = false, h = 0x0, arg = 0x2aaac800b890}

(gdb) p *(kvs_ctx_t *)0x2aaac800b890
$4 = {cache = 0x2aaac8009b10, krm = 0x2aaac800b570, faults = 49153, h = 0x2aaac8001220, rank = 0, epoch = 390, prep_w = 0x2aaac8009ac0, idle_w = 0x2aaac800b940, check_w = 0x2aaac800b8f0, transaction_merge = 1, events_init = true, hash_name = 0x2aaac800bf20 "sha1", seq = 25228}

So I've been concentrating on whether its possible just this one json_t is corrupted by use-after-free or something like that. The json_t that's corrupted is the wdirent field in the lookup handle, and note the key user looked up is "output", which is probably from "guest.output" (maybe this isn't relevant). This is also coming from a replay of a lookup, it wasn't in the content store to begin with (again possibly not relevant). (Edit: and one other aside, this is a lookup from a kvs-watch b/c we're going through lookup-plus. Possibly not relevant, but interesting.).

The "wdirent" is a convenience pointer to a json_t stored in a data structure inside the levels zlist. Unfortunately I can't walk the "levels" zlist_t b/c debug symbols for czmq aren't on quartz. Anyone know what the easiest way to go about debugging with those symbols? In the past on test systems, I would install the debug rpms.

Oh yeah, since this is a lookup replay, wondering if it's possible to get the broker log of this particular failure. Make sure there's nothing specific to that that shows up in the logs.

Since the broker crashes I doubt there are logs available from this corefile, but if we have a reproducer we could have @jameshcorbett re-run, this time capturing broker logs to a logfile with the following options to flux-start:

 srun ... flux start -o -Slog-filename=issue2500.log,-Slog-forward-level=7 [existing options]...

Unfortunately I can't walk the "levels" zlist_t b/c debug symbols for czmq aren't on quartz. Anyone know what the easiest way to go about debugging with those symbols? In the past on test systems, I would install the debug rpms.

Even if you have the debug rpms, your mileage would vary because zlist is optimized and in many cases the variables can be optimized out for debugging.

Should I go ahead and run that, @grondo ?

@jameshcorbett: yes that should be very useful for us.

Even if you have the debug rpms, your mileage would vary because zlist is optimized and in many cases the variables can be optimized out for debugging.

This corrupted json_t is mostly likely an object inside another json object. So without walking the list, I can't determine if the outer json_t is corrupted, or if the data structure holding the outer json_t is corrupted.

I haven't been able to find an obvious use-after-free in any part of this area related to any of these possibilities.

@dongahn I just gave you the new core and log files.

@jameshcorbett: you should also have issue2500.log from Flux instance. Could you send that to me as well? Hopefully the corefile came from the outer flux instance.

You should have it already.

Sorry, I only remembered it after I wrote my comment.

@chu11: I gave you the log file.

quartz1922{dahn}22: give achu issue2500.log
Giving issue2500.log.
Gave 1 file(s).

@jameshcorbett: I also noticed that sched-simple was used instead of flux-sched (our fully feature scheduler). Once this debugging dust settles, I will reconfigured the installation so that you can use flux-sched instead.

2019-11-01T23:27:42.862312Z job-manager.debug[0]: submit_cb: added 3 jobs
2019-11-01T23:27:42.864734Z sched-simple.debug[0]: req: 264140488704: spec={1,16,1}
2019-11-01T23:27:42.864929Z sched-simple.debug[0]: alloc: 264140488704: rank0/core[0-15]
2019-11-01T23:27:42.874414Z job-manager.debug[0]: submit_cb: added 2 jobs
2019-11-01T23:27:42.881216Z sched-simple.debug[0]: req: 264224374784: spec={1,16,1}
2019-11-01T23:27:42.881394Z sched-simple.debug[0]: alloc: 264224374784: rank1/core[0-15]
2019-11-01T23:27:42.901526Z sched-simple.debug[0]: req: 264308260864: spec={1,16,1}
2019-11-01T23:27:42.901733Z sched-simple.debug[0]: alloc: 264308260864: rank2/core[0-15]
2019-11-01T23:27:42.907812Z kvs.debug[0]: aggregated 2 transactions (2 ops)
2019-11-01T23:27:42.913110Z sched-simple.debug[0]: req: 264375369728: spec={1,16,1}
2019-11-01T23:27:42.913327Z sched-simple.debug[0]: alloc: 264375369728: rank3/core[0-15]
2019-11-01T23:27:42.923549Z sched-simple.debug[0]: req: 264476033024: spec={1,16,1}
2019-11-01T23:27:42.923743Z sched-simple.debug[0]: alloc: 264476033024: rank0/core[16-31]

/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017 has the flux-sched.

It is just that /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg didn't have it.

@grondo, @garlick: I tried the in-tree flux and it fails on me. I talked to @jameshcorbett and the same command works on an old session but not on the new session (newly ssh into rztopaz).

quartz764{dahn}34: /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/src/cmd/flux start -s 1
flux-start: warning: setting --bootstrap=selfpmi due to --size option
lt-flux-broker: zsecurity_comms_init: /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux: mode should be 0700
lt-flux-broker: overlay_bind failed: Operation not permitted
lt-flux-broker: bootstrap failed
flux-start: 0 (pid 67954) exited with rc=1

The permission is

rztopaz578{dahn}40: ll -d /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux
drwxrwsr-x 4 dahn tools 4096 Oct 31 11:26 /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux/

Do you know what might be causing this?

Sorry. I edited the path to the in-tree flux cmd above.

Nothing in the log KVS wise that looked too interesting to me, but noticed this:

2019-11-01T23:39:56.116152Z kvs.info[0]: tagpool-normal expanded from 1024 to 2048 entries
2019-11-01T23:39:56.129649Z kvs.info[0]: tagpool-normal expanded from 2048 to 4096 entries
2019-11-01T23:39:56.320573Z broker.info[0]: tagpool-normal expanded from 1024 to 2048 entries
2019-11-01T23:39:56.154146Z kvs.info[0]: tagpool-normal expanded from 4096 to 8192 entries
2019-11-01T23:39:56.582265Z broker.info[0]: tagpool-normal expanded from 2048 to 4096 entries
2019-11-01T23:39:57.296304Z broker.debug[0]: content purge: 205 entries
2019-11-01T23:39:57.304147Z broker.info[0]: tagpool-normal expanded from 4096 to 8192 entries
2019-11-01T23:39:56.203997Z kvs.info[0]: tagpool-normal expanded from 8192 to 16384 entries
2019-11-01T23:39:56.255888Z kvs.debug[0]: aggregated 162 transactions (162 ops)
2019-11-01T23:39:58.227132Z kvs.debug[0]: aggregated 204 transactions (204 ops)
2019-11-01T23:39:58.905785Z broker.debug[0]: content purge: 228 entries
2019-11-01T23:40:00.405663Z kvs.info[0]: tagpool-normal expanded from 16384 to 32768 entries
2019-11-01T23:40:01.163255Z broker.debug[0]: content purge: 692 entries
2019-11-01T23:40:00.566211Z kvs.debug[0]: aggregated 207 transactions (207 ops)
2019-11-01T23:40:03.756036Z broker.debug[0]: content purge: 149 entries
2019-11-01T23:40:05.172796Z broker.info[0]: tagpool-normal expanded from 8192 to 16384 entries
2019-11-01T23:40:05.575580Z broker.debug[0]: content purge: 1262 entries
2019-11-01T23:40:03.705158Z kvs.info[0]: tagpool-normal expanded from 32768 to 65536 entries
2019-11-01T23:40:08.664350Z broker.debug[0]: content purge: 280 entries
2019-11-01T23:40:03.842327Z kvs.debug[0]: aggregated 212 transactions (212 ops)

Perhaps it's ok, but it seems like a huge exponential growth in tags over about a 0.6 second span.

@dongahn So I am sort of a the limit of what I can debug with gdb and the current core. Basically, there is a data structure inside the zlist, which holds a pointer to a json KVS directory object. Inside this json object, there are json objects of entries within the directory. It is one of these directory entries that is the corrupted json object in our trace above. The json KVS directory object is owned by yet another data structure in the KVS (cache object). It's hard to get a clue on what's going on without seeing what "parent object" might be corrupted.

Two ideas:

  • we can build a flux that has debug symbols for czmq, jansson, etc. that might get us closer.
  • i could write a patch that will add a bunch of debug stuff into some data structures so we have easier access to them and don't have to go fishing through the zlist. Of course, with changes, who knows if it'll reproduce.

Of the two above, I would maybe prefer the latter, b/c I'm not 100% how I could get to some pointers even if I had access to the zlist (sha hash against KVS data, not sure what the data is b/c json object corrupted, so can't get the cache entry of that KVS data).

@chu11: Let's discuss this a bit offline. We need to involve @jameshcorbett as well since until we will find a simpler reproducer we will need his support. Maybe tomorrow some time?

I'm happy to continue running problems--it doesn't take that much of my time. I can also capture my environment from that old rztopaz session if that would be helpful, @dongahn.

@chu11: Oh btw, do we want to run this with valgrind to see if there is any obvious? I will see if I can run a flux instance under memcheck_all. We will probably need to use a suppression file if Flux already has one.

We will probably need to use a suppression file if Flux already has one.

Yep. We have one as part of our CI tests: https://github.com/flux-framework/flux-core/blob/008ad823bc21683b32918e6bccbc5c5a5ecdc853/t/valgrind/valgrind.supp

Nice!

Oh btw, do we want to run this with valgrind to see if there is any obvious?

I suppose that couldn't hurt to try.

@jameshcorbett:

I tried valgrind memcheck_all on my flux instance on rzgenie in the following way and it worked.

$ srun -N 2 -n 2 --pty /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg/bin/flux start --wrap=/usr/tce/bin/memcheck_all

rzgenie22{dahn}21: flux jobspec srun -N 2 -n 2 ./virtual_ring_mpi | flux job submit -
510430019584
rzgenie22{dahn}22: flux jobspec srun -N 2 -n 2 ./virtual_ring_mpi | flux job submit -
619163156480
rzgenie22{dahn}23: flux mini run -N 2 -n 2 ./virtual_ring_mpi
memcheckview flux-broker.1640811_3.*.mc

Unfortunately memcheck_all does not take additional valgrind options like --suppressions so I was unable to suppress all the false positives. But still it should still be worth trying it this way.

Notice that I didn't have --mpi=none in the srun option because the memcheck_all tool requires MPI_RANK envVariables.

Could you run your crashing program with memcheck_all to see if any apparent memory corruption can be detected?

Note that we have to expect this will slow down the execution by a factor of at least 10x. So please adjust your walltime limit.

Thanks.

@dongahn it's running now, and I gave it a fifteen-hour time limit.

Thanks @jameshcorbett.

I suspect valgrind spotted the root cause. This appears to me as a case of read-after-free. There are many instances of "Invalid read of" varying sizes.

Invalid read of size 4                                             x1 (in /usr/lib64/libjansson.so.4.10.0)
[MPI Rank 0 only]
  at 0x50C63AA: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C7742: json_vunpack_ex (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C793A: json_unpack (in /usr/lib64/libjansson.so.4.10.0)
  by 0xA91B682: treeobj_peek (treeobj.c:57)
  by 0xA91B992: treeobj_get_type (treeobj.c:134)
  by 0xA91BAE7: treeobj_is_dirref (treeobj.c:165)
  by 0xA914487: lookup (lookup.c:1057)
  by 0xA90D389: lookup_common (kvs.c:1388)
  by 0xA90D785: lookup_plus_request_cb (kvs.c:1503)
  by 0xA9124FA: wait_runone (waitqueue.c:173)
  by 0xA9125AA: wait_runqueue (waitqueue.c:201)
  by 0xA91160F: cache_entry_set_raw (cache.c:179)
  by 0xA90B482: content_load_completion (kvs.c:531)
  by 0x4E5F9A2: check_cb (future.c:796)
  by 0x4E53FBE: check_cb (reactor.c:853)
  by 0x4E90D92: ev_invoke_pending (ev.c:3322)
  by 0x4E91C84: ev_run (ev.c:3726)
  by 0x4E52716: flux_reactor_run (reactor.c:126)
  by 0xA911169: mod_main (kvs.c:2984)
  by 0x40BF75: module_thread (module.c:162)
Address 0xea90d60 is 0 bytes inside a block of size 80 free'd
  at 0x4C2AF7D: free (vg_replace_malloc.c:530)
  by 0x50C38B3: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C3958: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C8990: json_delete (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C38B3: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C3958: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C8990: json_delete (in /usr/lib64/libjansson.so.4.10.0)
  by 0xA9111ED: json_decref (jansson.h:112)
  by 0xA91185F: cache_entry_destroy (cache.c:253)
  by 0xA911EBF: cache_entry_destroy_wrapper (cache.c:433)
  by 0x52F3F11: (in /usr/lib64/libczmq.so.3.0.0)
  by 0xA911C1C: cache_expire_entries (cache.c:359)
  by 0xA90CF6C: heartbeat_cb (kvs.c:1273)
  by 0x4E54F57: call_handler (msg_handler.c:236)
  by 0x4E550E4: dispatch_message (msg_handler.c:282)
  by 0x4E5535F: handle_cb (msg_handler.c:374)
  by 0x4E52C45: handle_cb (reactor.c:299)
  by 0x4E5D1CD: check_cb (ev_flux.c:58)
  by 0x4E90D92: ev_invoke_pending (ev.c:3322)
  by 0x4E91C84: ev_run (ev.c:3726)
Address 0xea90d60 is 0 bytes inside a block of size 80 free'dBlock was alloc'd at
  at 0x4C29E83: malloc (vg_replace_malloc.c:299)
  by 0x50C819A: json_object (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C50BC: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C5169: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C5169: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C54F5: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C57AD: json_loadb (in /usr/lib64/libjansson.so.4.10.0)
  by 0xA91C8D1: treeobj_decodeb (treeobj.c:590)
  by 0xA9117FE: cache_entry_get_treeobj (cache.c:242)
  by 0xA90BF28: setroot_event_send (kvs.c:837)
  by 0xA90C7CC: kvstxn_apply (kvs.c:1071)
  by 0xA9124C1: wait_runone (waitqueue.c:171)
  by 0xA9125AA: wait_runqueue (waitqueue.c:201)
  by 0xA9113A4: cache_entry_set_dirty (cache.c:101)
  by 0xA90B98F: content_store_completion (kvs.c:686)
  by 0x4E5F9A2: check_cb (future.c:796)
  by 0x4E53FBE: check_cb (reactor.c:853)
  by 0x4E90D92: ev_invoke_pending (ev.c:3322)
  by 0x4E91C84: ev_run (ev.c:3726)
  by 0x4E52716: flux_reactor_run (reactor.c:126)

While there are various accesses at different points of execution which are marked as "invalid read", many seem to point to a premature "free".

Maybe the conditional statement that leads to zlistx_delete has a logic bug?

https://github.com/flux-framework/flux-core/blob/master/src/modules/kvs/cache.c#L359

It seems that there are also various leaks that memcheck_all found. These are probably the ones that our valgrind unit test couldn't seee. I gave these files to @chu11.

rztopaz578{dahn}36: give -l
 achu has been given:
     8 MB Nov 05 18:49 flux-broker.2963686_0.0.mc
     1 KB Nov 05 18:49 flux-broker.2963686_0.1.mc
     1 KB Nov 05 18:49 flux-broker.2963686_0.2.mc
     1 KB Nov 05 18:49 flux-broker.2963686_0.3.mc
   345 MB Nov 01 10:23 rztopaz290-flux-broker-0-22454.core
   363 MB Nov 01 10:23 rztopaz484-flux-broker-0-48886.core
   208 MB Nov 05 18:49 vgcore.69924
   7 file(s)

Nice find, @dongahn!

Thanks @grondo. It was actually James who did the bulk lifting. And @chu11 pointed me to --wrap option. Thanks James and Al.

Hopefully this can lead to a quick fix...(don't know how deep the bug is though).

@dongahn great find, that definitely gets me looking in a specific area.

Yeah, it appears that a cache entry is getting expired before it is populated, which should be impossible. Will have to look into this.

Ok, I think I see a path where a lookup assumes that a cache entry has not been expired, but it could have been. Couldn't run via reproducer tonight, will try tomorrow.

diff --git a/src/modules/kvs/lookup.c b/src/modules/kvs/lookup.c
index 4119e96..ed82787 100644
--- a/src/modules/kvs/lookup.c
+++ b/src/modules/kvs/lookup.c
@@ -85,7 +85,7 @@ struct lookup {

     /* API internal */
     zlist_t *levels;
-    const json_t *wdirent;       /* result after walk() */
+    json_t *wdirent;       /* result after walk() */
     enum {
         LOOKUP_STATE_INIT,
         LOOKUP_STATE_CHECK_NAMESPACE,
@@ -487,7 +487,7 @@ static lookup_process_t walk (lookup_t *lh)

         zlist_remove (wl->pathcomps, pathcomp);
     }
-    lh->wdirent = wl->dirent;
+    lh->wdirent = json_incref ((json_t *)wl->dirent);

 done:
     return LOOKUP_PROCESS_FINISHED;
@@ -590,6 +590,7 @@ void lookup_destroy (lookup_t *lh)
         json_decref (lh->val);
         free (lh->missing_namespace);
         zlist_destroy (&lh->levels);
+        json_decref (lh->wdirent);
         free (lh);
     }
 }

Am obviously breaking const-ness, gotta figure out what would be better than this later.

tested with above patch and got a different crash than before. I think it is the same logical problem, but in a different place. Just got to think about how to deal with the problem more generically.

#0  0x00002aaaaaf5f8b7 in unpack () from /lib64/libjansson.so.4
#1  0x00002aaaaaf60743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2  0x00002aaaaaf6093b in json_unpack () from /lib64/libjansson.so.4
#3  0x00002aaac094178e in treeobj_peek (obj=0x2aaaf8914d00, typep=0x2aaac0d894f8, datap=0x0) at treeobj.c:57
#4  0x00002aaac0941a9e in treeobj_get_type (obj=0x2aaaf8914d00) at treeobj.c:134
#5  0x00002aaac0941bf3 in treeobj_is_dirref (obj=0x2aaaf8914d00) at treeobj.c:165
#6  0x00002aaac0939040 in walk (lh=0x2aaaf31ad9c0) at lookup.c:362
#7  0x00002aaac093a4c6 in lookup (lh=0x2aaaf31ad9c0) at lookup.c:1028
#8  0x00002aaac093347a in lookup_common (h=0x2aaac8001160, mh=0x2aaac800b9e0, msg=0x2aaaf9e95470, arg=0x2aaac800b8a0, 
    replay_cb=0x2aaac093382c <lookup_plus_request_cb>, stall=0x2aaac0d89793) at kvs.c:1388
#9  0x00002aaac0933876 in lookup_plus_request_cb (h=0x2aaac8001160, mh=0x2aaac800b9e0, msg=0x2aaaf9e95470, arg=0x2aaac800b8a0) at kvs.c:1503
#10 0x00002aaac09385eb in wait_runone (w=0x2aaaf0880180) at waitqueue.c:173
#11 0x00002aaac093869b in wait_runqueue (q=0x2aaaf6c01e40) at waitqueue.c:201
#12 0x00002aaac0937700 in cache_entry_set_raw (entry=0x2aaaf9f08d80, data=0x2aaaf955b708, len=365) at cache.c:179
#13 0x00002aaac0931573 in content_load_completion (f=0x2aaaf7ce7eb0, arg=0x2aaac800b8a0) at kvs.c:531
#14 0x00002aaaaacf88cf in check_cb (r=0x2aaac8009a50, w=0x2aaaf9e954a0, revents=0, arg=0x2aaaf7ce7eb0) at future.c:796
#15 0x00002aaaaacec3d1 in check_cb (loop=0x2aaac800a4f0, cw=0x2aaaf9e954c8, revents=32768) at reactor.c:853
#16 0x00002aaaaad29d59 in ev_invoke_pending (loop=0x2aaac800a4f0) at ev.c:3322
#17 0x00002aaaaad2ac4b in ev_run (loop=0x2aaac800a4f0, flags=0) at ev.c:3726
#18 0x00002aaaaaceab29 in flux_reactor_run (r=0x2aaac8009a50, flags=0) at reactor.c:126
#19 0x00002aaac093725a in mod_main (h=0x2aaac8001160, argc=0, argv=0x2aaac800b3c0) at kvs.c:2984
#20 0x000000000040bfe1 in module_thread (arg=0x679ca0) at module.c:162
#21 0x00002aaaab68aea5 in start_thread (arg=0x2aaac0d8a700) at pthread_create.c:307
#22 0x00002aaaac2fe8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Closing this. If a problem pops out again, we can reopen.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dongahn picture dongahn  ·  7Comments

chu11 picture chu11  ·  3Comments

SteVwonder picture SteVwonder  ·  7Comments

garlick picture garlick  ·  3Comments

SteVwonder picture SteVwonder  ·  7Comments