flux-core 🚀 - EAGAIN experienced when using Python KVS bindings

I'd like it if the python bindings could get off the deprecated KVS interfaces.

Under what circumstances are you getting EAGAIN? Usually that's a symptom of either matchtag exhaustion or improper use of errno residuals.

garlick on 6 Sep 2016

The python code (which is quite hackish and should probably be a kvs_watch) looks like:

while not kvs.exists(job.kvs.fh, complete_key):
   time.sleep(2)

The kvs.exists python function results in a single call to kvs_get() (the latest json string interface).

SteVwonder on 6 Sep 2016

Internally is that kvs_get() being checked for a <0 return _before_ errno is checked?

Just wondering if maybe the code is presuming that errno is zero on success, which is not necessarily true. In that case you might pick up a residual error value from a prior non-fatal error.

garlick on 7 Sep 2016

We have fixed other spots in the python bindings that tried to test errno and assumed error if it was non-zero, but I thought we had fixed those

grondo on 7 Sep 2016

It is checking the return and only raising an Error when the return is <0 . In this case, the return is -1.

SteVwonder on 7 Sep 2016

So if I try/except specifically for EAGAIN and then try again 2 seconds later, everything finishes just fine (as the error code implies). I noticed that zio also checks for EAGAIN and handles that case as not an error. So I assume this is normal behavior for the KVS and we just need to handle it in the Python bindings better.

@trws: any suggestions on how to handle this in the bindings (other than requiring users to constantly check for EAGAIN)? Maybe just check for EAGAIN, and if its set, then return None?

SteVwonder on 7 Sep 2016

Well no EAGAIN shouldn't be happening in this context. The exhausted (64K entry) matchtag pool would be my next worry. Does this happen with a fresh flux handle or does this one have a history such that it could have managed to run out of those?

garlick on 7 Sep 2016

Ok. Sorry to jump the gun there.

Are matchtags only allocated when a kvs watch is performed? I ask this because I can only find flux_matchtag_alloc in the kvs code under watch_rpc (which is also the only source of EAGAIN). The only watches that are performed (directly at least) are FD watches and event watches (no KVS watches). This process does use the JSC, which I believe is using KVS watches under the hood, but I'm only running 5 jobs, so I would be surprised if that is the cause of the pool exhaustion.

In general though, this process is only alive for a few seconds and probably only reads ~100 kvs keys (if that). For a few of the keys, it is performing the loop that checks if the key exists or not (which is a kvs_get under the hood).

Is there any easy way for me to monitor the matchtag pool during execution?

SteVwonder on 7 Sep 2016

There are two matchtag pools, a 64K entry pool used by flux_rpc(), and a 4K "group" pool used by flux_rpc_multi() and kvs_watch(). You can check how many tags are free in each with:

uint32_t flux_matchtag_avail (flux_t h, int flags);

set flags to 0 for the regular pool, or FLUX_MATCHTAG_GROUP for the group pool. It doesn't seem like you're likely hitting this but if you can trivially check perhaps that's a good idea to do.

Incidentally, the group matchtags used by flux_kvs_watch() can only be retired by calling flux_kvs_unwatch() (or closing the handle). I peeked at libjsc and it's not using kvs_watch() anymore.

So make check is working OK here?

garlick on 7 Sep 2016

👍1

I meant to add: kvs_get()is built on top of flux_rpc().

garlick on 7 Sep 2016

Actually, for some reason make check was failing. I did a git clean -fdx and recompiled and now make check passed. Unfortunately, kvs_get is still returning an error and EAGAIN is still being set.

I will monitor the matchtags and see if that is what is happening.

SteVwonder on 7 Sep 2016

Matchtags available before the EAGAIN happens:

regular pool: 65526
group pool: 4095

This was performed immediately before the kvs_get that causes EAGAIN to be set. I used the same flux handle for both the matchtag_avail and the kvs_get. Both operations were through the python bindings.

SteVwonder on 7 Sep 2016

OK, that rules out the matchtag pool on the client side.

I wonder if the error is coming back in the RPC response from the kvs module? If _it_ ran out of matchtags, the new code is doing a better job of getting errors back to the caller. You might try this to isolate the problem to the KVS module (ensure libkvs.so gets rebuilt and python libs are relinked):

diff --git a/src/modules/kvs/libkvs.c b/src/modules/kvs/libkvs.c
index 1e54496..477a858 100644
--- a/src/modules/kvs/libkvs.c
+++ b/src/modules/kvs/libkvs.c
@@ -344,8 +344,10 @@ int kvs_get (flux_t h, const char *key, char **val)
         goto done;
     if (!(rpc = flux_rpc (h, "kvs.get", Jtostr (in), FLUX_NODEID_ANY, 0)))
         goto done;
-    if (flux_rpc_get (rpc, NULL, &json_str) < 0)
+    if (flux_rpc_get (rpc, NULL, &json_str) < 0) {
+        fprintf (stderr, "%s: RPC response: %s\n", __FUNCTION__,  flux_strerror (errno));
         goto done;
+    }
     if (!(out = Jfromstr (json_str))) {
         errno = EPROTO;
         goto done;

Ugh, I hope I didn't introduce some nasty leak with my recent changes in there.

garlick on 7 Sep 2016

Have you by chance run flux dmesg to see if any errors were logged by the KVS module?

garlick on 7 Sep 2016

Just now catching up on this. From a bindings perspective, if it's throwing EAGAIN repeatedly, it's because the function is telling it to. If there is a case where the resulting behavior is unfortunate, it's pretty trivial to put a wrapper around it to catch that and do something else with it, but the user probably wants to know it's an EAGAIN rather than "does not exist" in this case.

For what it's worth, this is probably a good spot to use watch_once, there's a specialized wrapper for the watch_once functions that makes it pretty slick for one-off waits like this.

I may take a bit this morning and clean up some things in the kvs python bindings that have become stupid by the way... there's no reason to be converting json->json-c struct->json->json in python string anymore and it can't possibly be helping performance.

trws on 7 Sep 2016

FYI I opened #796 with an idea to actually include error messages along with rpc and other message error responses. That perhaps would help (or perhaps not) in the "where'd this error originate" game we're currently playing.

grondo on 7 Sep 2016

👍1

@trws: I submitted a PR ( #794 ) with several fixes to the python bindings (including removing json-c). Can you take a look and let me know what you think?

SteVwonder on 7 Sep 2016

If there's a good way to do that, I'd love to incorporate it into the error printer @grondo. It tries to figure it out right now, but only has so much to go on.

trws on 7 Sep 2016

Well there isn't now as we've kind of defaulted to a system call style errno return (it is kind of embedded in the flux_respond() API and CMB1 protocol as well). I'm just suggesting we allow an actual error message to be added in addition to the UNIX errno.

grondo on 7 Sep 2016

@garlick: I haven't seen anything in the logs (collected by setting log-filename). When I run flux list-instances, I don't see anything, so I am unable to proxy in and run flux dmesg.

I should also note that this is a non-deterministic problem. For small runs, I am sometimes able to successfully run without it occurring (I assume it succeeds when the key exists at the time that the checks occur and only fails when the key doesn't exist).

fprintf's from a successful run:

kvs_get: RPC response: Is a directory
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable

fprintf's from a failed run:

kvs_get: RPC response: Is a directory
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
kvs_get: RPC response: Resource temporarily unavailable
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data/hd1/Repositories/flux-framework/capacitor/./flux-capacitor", line 374, in scheduler_proc
    sched.state_change(key, value)
  File "./capacitor/initial_program", line 277, in state_change
    self.interface.state_changed(self.jobs[job_id], state)
  File "./capacitor/initial_program", line 342, in state_changed
    while not try_key_exists(job.kvs.fh, complete_key):
  File "./capacitor/initial_program", line 326, in try_key_exists
    key_exists = kvs.exists(fh, key)
  File "/data/hd1/Repositories/flux-framework/flux-core/src/bindings/python/flux/kvs.py", line 36, in exists
    raise err
EnvironmentError: [Errno 11] Resource temporarily unavailable

SteVwonder on 7 Sep 2016

OK so the error is coming back from the KVS module.

I'm confused - why is EAGAIN fatal in the failing case (and only after five occurrences) and not in the successful case?

garlick on 7 Sep 2016

I am also unsure of what is going on. I added a print to the python bindings, and the kvs_get is indeed returning -1 and setting errno to 11 everytime that "Resource temporarily unavailable" appears. I cannot explain why it nondeterministically throws the EnvironmentError.

After running ~50 times, I also encountered the same error with kvs_get_dir via the python bindings. I also noticed that some runs complete without ever printing the "Resource temporarily unavailable". So I assume this is a race condition of some kind in the KVS. I just don't get why I'm getting EAGAIN instead of ENOENT.

I will keep digging (I'm also starting to think some part of the python bindings is gobbling up the EnvironmentError when it shouldn't be).

SteVwonder on 7 Sep 2016

@garlick: I haven't seen anything in the logs (collected by setting log-filename).

Nothing from "kvs.err[0]"?

It looks to me like you should be seeing a log entry in all the cases where a kvs.get RPC would return an EAGAIN response, though I may be missing something.

garlick on 7 Sep 2016

This issue started as Python using the deprecated interface and then morphed into a discussion about the EAGAIN. PR #794 fixed the deprecated interface, but the EAGAIN is still not fixed. Should we reopen this issue or start another one?

SteVwonder on 8 Sep 2016

Reopened, the PR was ready to go, it just didn't fully fix this issue
AFAICT.

On 8 Sep 2016, at 11:45, Stephen Herbein wrote:

This issue started as Python using the deprecated interface and then
morphed into a discussion about the EAGAIN. The EAGAIN is still not
fixed. Should we reopen this issue or start another one?

You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/793#issuecomment-245698932

trws on 8 Sep 2016

👍1

I attempted to use strace as was suggested in yesterday's meeting, but it seems to "fix" the issue and cause others. Specifically, I attempted to run strace as follows strace ${STRACE_ARGS} ./flux-core/src/cmd/.libs/flux.

When I run with ${STRACE_ARGS}="-f", (which traces al descendents) the simulation will always complete. I assume this is because strace is slowing everything down so much that the proposed race condition/resource exhausting underlying the EAGAIN never occurs. I also ran with IO redirected to a file to speedup strace, but the EAGAIN still does not occur.

When I run with no strace args, the simulation hangs. It is hard to determine why since the strace output contains to grow without bound due to the reactor's continued use of poll. So the "last line" of the strace output doesn't correlate with the hang, and the sheer volume of output makes it impractical to read through by hand. This test probably isn't all that useful though since I'm running strace of flux-start, not on flux-broker directly.

Any thoughts or suggestions on what I should change?

SteVwonder on 13 Sep 2016

You could try with flux start strace flux broker and see if it makes
any difference. I'm not sure if it will, but it's worth a shot. Am I
reading correctly that we're using poll..?

On 13 Sep 2016, at 10:09, Stephen Herbein wrote:

I attempted to use strace as was suggested in yesterday's meeting, but
it seems to "fix" the issue and cause others. Specifically, I
attempted to run strace as follows strace ${STRACE_ARGS} ./flux-core/src/cmd/.libs/flux.

When I run with ${STRACE_ARGS}="-f", (which traces al descendents)
the simulation will always complete. I assume this is because strace
is slowing everything down so much that the proposed race
condition/resource exhausting underlying the EAGAIN never occurs. I
also ran with IO redirected to a file to speedup strace, but the
EAGAIN still does not occur.

When I run with no strace args, the simulation hangs. It is hard to
determine why since the strace output contains to grow without bound
due to the reactor's continued use of poll. So the "last line" of the
strace output doesn't correlate with the hang, and the sheer volume of
output makes it impractical to read through by hand. This test
probably isn't all that useful though since I'm running strace of
flux-start, not on flux-broker directly.

Any thoughts or suggestions on what I should change?

You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/793#issuecomment-246752942

trws on 13 Sep 2016

Also of potential interest: it was also mentioned in yesterday's meeting that I should just retry the KVS operation whenever I see EAGAIN just to determine if that would work as a temporary solution.

I modified my code to do so (which required modifications to both kvs_get and kvs_get_dir since it showed up in both operations), but this doesn't even temporarily fix the problem. The python process ends up in an infinite loop of retrying either the get or the get_dir. The 2 interesting points to note here are:

The matchtag pool decreases by 1 after iteration of this infinite loop (i.e., after every failed get). Looking at when matchtag_free is called, my guess is that this is caused by the KVS RPC never completing. Maybe the KVS module is queueing these requests up and never completing them? Or maybe it's just some leak/bug in the python bindings?
If I proxy into the flux instance, the KVS key that the python process is stuck trying to get does indeed exist. The specific key is 'lwj.X.complete_time'. From the CLI, I can dump the 'lwj.X' dir and see the key, and I can get it directly.

SteVwonder on 13 Sep 2016

Interesting thought. It's possible that an RPC isn't getting destroyed
because it's being held in a garbage collected variable rather than
being bound to a with block, let me take a look at how that cleanup code
is set up.

On 13 Sep 2016, at 10:17, Stephen Herbein wrote:

Also of potential interest: it was also mentioned in yesterday's
meeting that I should just retry the KVS operation whenever I see
EAGAIN just to determine if that would work as a temporary solution.

I modified my code to do so (which required modifications to both
kvs_get and kvs_get_dir since it showed up in both operations), but
this doesn't even temporarily fix the problem. The python process
ends up in an infinite loop of retrying either the get or the get_dir.
The 2 interesting points to note here are:

The matchtag pool decreases by 1 after iteration of this infinite
loop (i.e., after every failed get). Looking at when matchtag_free is
called, my guess is that this is caused by the KVS RPC never
completing. Maybe the KVS module is queueing these requests up and
never completing them? Or maybe it's just some leak/bug in the python
bindings?

If I proxy into the flux instance, the KVS key that the python
process is stuck trying to get does indeed exist. The specific key is
'lwj.X.complete_time'. From the CLI, I can dump the 'lwj.X' dir and
see the key, and I can get it directly.

You are receiving this because you modified the open/close state.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/793#issuecomment-246755190

trws on 13 Sep 2016

@trws: poll is definitely being used by something in flux-core/flux-sched/flux-capacitor. Snippet taken from the output of strace during a hang (this is the output that will be dumped continually):

epoll_ctl(3, EPOLL_CTL_ADD, 13, {EPOLLIN, {u32=13, u64=81604378637}}) = 0
epoll_wait(3, [], 64, 0)                = 0
poll([{fd=96, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=89, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=82, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=22, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=76, events=POLLIN}], 1, 0)    = 0 (Timeout)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47650, 0x7ffe8fe47654) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe475f0, 0x7ffe8fe475f4) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
poll([{fd=60, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=66, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=61, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=55, events=POLLIN}], 1, 0)    = 0 (Timeout)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47650, 0x7ffe8fe47654) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe475f0, 0x7ffe8fe475f4) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=42, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=15, events=POLLIN}], 1, 0)    = 0 (Timeout)
epoll_wait(13, [], 1, 0)                = 0
poll([{fd=96, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=89, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=82, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=22, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=76, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=60, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=66, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=61, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=55, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=42, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=15, events=POLLIN}], 1, 0)    = 0 (Timeout)
epoll_wait(13, [], 1, 0)                = 0
epoll_ctl(3, EPOLL_CTL_ADD, 76, {EPOLLIN, {u32=76, u64=8589934668}}) = -1 EEXIST (File exists)
epoll_ctl(3, EPOLL_CTL_ADD, 55, {EPOLLIN, {u32=55, u64=8589934647}}) = -1 EEXIST (File exists)

SteVwonder on 13 Sep 2016

Looks like it's zmq_poll, interesting that they do that. Found a
discussion here.

On 13 Sep 2016, at 10:20, Stephen Herbein wrote:

@trws: poll is definitely being used by something in
flux-core/flux-sched/flux-capacitor. Snippet taken from the output of
strace during a hang (this is the output that will be dumped
continually):

epoll_ctl(3, EPOLL_CTL_ADD, 13, {EPOLLIN, {u32=13, u64=81604378637}}) 
= 0
epoll_wait(3, [], 64, 0)                = 0
poll([{fd=96, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=89, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=82, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=22, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=76, events=POLLIN}], 1, 0)    = 0 (Timeout)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47650, 
0x7ffe8fe47654) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 
0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe475f0, 
0x7ffe8fe475f4) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 
0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
poll([{fd=60, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=66, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=61, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=55, events=POLLIN}], 1, 0)    = 0 (Timeout)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47650, 
0x7ffe8fe47654) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 
0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe475f0, 
0x7ffe8fe475f4) = -1 EBADF (Bad file descriptor)
getsockopt(-1815885344, SOL_SOCKET, SO_TYPE, 0x7ffe8fe47620, 
0x7ffe8fe47624) = -1 EBADF (Bad file descriptor)
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=42, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=15, events=POLLIN}], 1, 0)    = 0 (Timeout)
epoll_wait(13, [], 1, 0)                = 0
poll([{fd=96, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=89, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=82, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=22, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=76, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=60, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=66, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=61, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=55, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=42, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=15, events=POLLIN}], 1, 0)    = 0 (Timeout)
epoll_wait(13, [], 1, 0)                = 0
epoll_ctl(3, EPOLL_CTL_ADD, 76, {EPOLLIN, {u32=76, u64=8589934668}}) = 
-1 EEXIST (File exists)
epoll_ctl(3, EPOLL_CTL_ADD, 55, {EPOLLIN, {u32=55, u64=8589934647}}) = 
-1 EEXIST (File exists)

You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/flux-framework/flux-core/issues/793#issuecomment-246756283

trws on 13 Sep 2016

It looks like the RPC cleanup stuff might have gotten messed up along the way. I'm looking into it, but I think you're onto something @SteVwonder, the bindings may not be cleaning up RPCs right now if I'm reading this right.

trws on 13 Sep 2016

@SteVwonder, give #807 a try and see if that helps. It's a significant refactor of the cleanup handling in the python bindings, and should ensure cleanup of python generated RPCs, kvsdir objects and so-forth.

trws on 14 Sep 2016

@trws: Thanks for working on that! I think it is a great improvement. Unfortunately, it is not the underlying problem that I'm experiencing (following output generated from your branch that was then rebased onto my branch in order to include the matchtag bindings for python):

lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65495
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65494
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65493
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65492
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65491
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65490
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created
EAGAIN: Failed to check if key exists (lwj.3.complete_time)
regular matchtag avail: 65489
group matchtag avail: 4095
lwj.3.complete_time kvs entry not found, waiting for it to be created

The thing that gets me about this is that the entry does exist:

# sherbein at sealand in /data/hd1/Repositories/flux-framework/logs/2n2c.2j.0i [10:26:32]
→ flux proxy /tmp/flux--1-ad6DdE/0/

# sherbein at sealand in /data/hd1/Repositories/flux-framework/logs/2n2c.2j.0i [10:27:04]
→ flux kvs dir lwj.3 | grep -v -Ee '(environ|rdl)'
lwj.3.state = complete
lwj.3.nnodes = 2
lwj.3.cmdline = [ "sleep", "60" ]
lwj.3.ntasks = 4
lwj.3.is_hierarchical = 0
lwj.3.runtime = 60
lwj.3.cwd = /data/hd1/Repositories/flux-framework
lwj.3.walltime = 65
lwj.3.create-time = 1473862908.578746
lwj.3.execution_time = 60
lwj.3.rank.
lwj.3.starting_time = 3900.000350
lwj.3.running_time = 3900.000350
lwj.3.complete_time = 3960.000350
lwj.3.io_time = 0.000000

# sherbein at sealand in /data/hd1/Repositories/flux-framework/logs/2n2c.2j.0i [10:27:27]
→ flux kvs get lwj.3.complete_time
3960.000350

I'm really not sure what to do at this point other than wait for #806, #792, or #796 to be fixed/merged and then try again. Anybody have any other ideas? This is blocking most of my work since I rely quite heavily on the python KVS bindings for my main script (the initial program). Maybe I should just do a bisect to find a working version? That would both give me a version that I can continue work on while also pointing out what commit caused this regression.

SteVwonder on 14 Sep 2016

git bisect might be a good idea at this point.

Another thought is to open the handle that's getting the EAGAINs with FLUX_O_TRACE, which will cause all messages sent/received to be dumped to stderr. (Or set FLUX_HANDLE_TRACE=1 in the environment).

If there's some branch I can poke at that exhibits this problem, if you can give me instructions on how to reproduce the problem, I'm happy to give it a try.

garlick on 14 Sep 2016

@SteVwonder, Do you already have a known working commit to pass to git bisect? (sorry if I missed that above)? Also if you have a script that reproduces the problem, git bisect run will automate the bisect for you..

grondo on 14 Sep 2016

@garlick: The FLUX_O_TRACE is a good idea. I will try that first. If that fails, I will get you a reproducer. If I make a Docker image (or Dockerfile), do you have somewhere to run that (maybe on Cloud9 or your local Linux box)?

@grondo: As far as I could tell, everything worked around the end of July (I hadn't pulled at all during August). So I would probably go back to July 21: fd8723b2539b116e24aceae9c024ac82a8da1a0c. As far as the script, that is a really good idea. I have a script that experiences the hang non-deterministicly. So when it hangs, it is definitely broken. When it completes, it might be ok. I will look to see if I can have git-bisect repeat a test multiple times, if not, I'll make a wrapper script that does.

SteVwonder on 14 Sep 2016

👍1

Ideally I'd like to pull from git somewhere and run it in my environment
(which is c9 this week). Docker image could be a last resort I guess.

On Sep 14, 2016 8:13 AM, "Stephen Herbein" [email protected] wrote:

@garlick https://github.com/garlick: The FLUX_O_TRACE is a good idea. I
will try that first. If that fails, I will get you a reproducer. If I make
a Docker image (or Dockerfile), do you have somewhere to run that (maybe on
Cloud9 or your local Linux box)?

@grondo https://github.com/grondo: As far as I could tell, everything
worked around the end of July (I hadn't pulled at all during August). So I
would probably go back to July 21: fd8723b
https://github.com/flux-framework/flux-core/commit/fd8723b2539b116e24aceae9c024ac82a8da1a0c.
As far as the script, that is a really good idea. I have a script that
experiences the hang non-deterministicly. So when it hangs, it is
definitely broken. When it completes, it might be ok. I will look to see if
I can have git-bisect repeat a test multiple times, if not, I'll make a
wrapper script that does.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/flux-framework/flux-core/issues/793#issuecomment-247046731,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAKX20u4DVwho7MaOuqD-OFHIyC3hlr7ks5qqA8rgaJpZM4J2PEn
.

garlick on 14 Sep 2016

FWIW, as I was working on the FLUX_O_TRACE, I received a Protocol Error. This isn't the first time that it has popped up while I have been debugging the EAGAIN, but it certainly less frequent than the EAGAIN.

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data/hd1/Repositories/flux-framework/capacitor/./flux-capacitor", line 375, in scheduler_proc
    sched.state_change(key, value)
  File "./capacitor/initial_program", line 278, in state_change
    self.interface.state_changed(self.jobs[job_id], state)
  File "./capacitor/initial_program", line 363, in state_changed
    while not try_key_exists(job.kvs.fh, complete_key):
  File "./capacitor/initial_program", line 332, in try_key_exists
    key_exists = kvs.exists(fh, key)
  File "/data/hd1/Repositories/flux-framework/flux-core/src/bindings/python/flux/kvs.py", line 30, in exists
    get_key_direct(flux_handle, key)
  File "/data/hd1/Repositories/flux-framework/flux-core/src/bindings/python/flux/kvs.py", line 21, in get_key_direct
    _raw.get(flux_handle, key, valp)
  File "/data/hd1/Repositories/flux-framework/flux-core/src/bindings/python/flux/wrapper.py", line 191, in __call__
    raise EnvironmentError(err, os.strerror(err))
EnvironmentError: [Errno 71] Protocol error

@garlick: will do, if it comes to it, I will write up a short list of instructions for setting up and running my reproducer. Hopefully the O_TRACE produces some clues.

SteVwonder on 14 Sep 2016

You didn't happen to grab the patch from #802 that switched kvs.watch over to the regular matchtag pool without the rest of #806 did you? If so, a "zombie" kvs.watch response could continue to use a matchtag retired by kvs_unwatch() and get mixed up with the kvs_get() response above resulting in EPROTO.

garlick on 14 Sep 2016

@garlick: I didn't grab that patch.

Sorry for taking so long to respond, but oddly enough though, it seems that the problem has disappeared. I added FLUX_O_TRACE, and afterwards, the EAGAIN stopped appearing. After removing FLUX_O_TRACE, it still wouldn't reappear. My only guess at this point is that it was due to some inconsistent state due to how I was building. I was diligent about running make distclean and git clean -dxf, but I may have missed deleting the install directory. I am not exactly sure what caused it to work.

I confirmed that it works on https://github.com/SteVwonder/flux-core/commit/d517733e48ab3ac7c86012b7c22cbe09a290733a as well as the latest version of flux-core (https://github.com/flux-framework/flux-core/commit/430bedb0bb1bfeb0c2bcc566e4b7f72b6e9733a6).

Thanks everyone for helping me debug, and sorry for all the trouble. I'll post here if the problem comes back, but for now, I think we can close the issue.

SteVwonder on 14 Sep 2016

@SteVwonder, has this gotten any better? Are we in a position to close?

trws on 10 Nov 2016

@trws, I believe it is good now. Closing.

SteVwonder on 11 Nov 2016

Flux-core: EAGAIN experienced when using Python KVS bindings

All 43 comments

Related issues