Just placing some notes here from some testing I did with sched-simple in unlimited mode.
During that testing, it was noted that there are likely several bottlenecks in the job exec system. These were really evident when running an instance across 512 cores. Running 1024 sleep 0 jobs resulted in runtimes
A perf record run on rank 0 of showed one potential site for optimization in the kvs' namespace_create(perf output below). It won't be a silver bullet, but the implementation seems to make several blocking flux_event_subscribe() calls per namespace. Perhaps these could be made asynchronous, namespace creation could be batched, or the kvs could use some other mechanism to notify itself of namespace events. I just wanted to be sure this issue was open in case someone hda time or the need to look in here :wink:
- 35.71% 0.32% libflux-core.so.2.0.0 [.] ev_run โ
- 35.40% ev_run โ
- 34.88% ev_invoke_pending โ
- 17.80% handle_cb โ
- 17.23% dispatch_message (inlined) โ
- 17.09% call_handler โ
- 9.83% namespace_create_request_cb โ
+ namespace_create (inlined) โ
- 3.63% heartbeat_cb โ
- 3.38% kvsroot_mgr_iter_roots โ
heartbeat_root_cb โ
event_unsubscribe (inlined) โ
flux_event_unsubscribe โ
+ op_event_unsubscribe โ
- 1.17% start_cb โ
- job_start (inlined) โ
+ 1.12% job_get_kill_timeout (inlined) โ
0.60% content_store_request โ
0.52% store_cb โ
- 3.59% transaction_check_cb โ
- 3.57% kvsroot_mgr_iter_roots โ
+ 2.90% kvstxn_check_root_cb โ
0.57% zhash_next โ
- 3.02% jobinfo_start_continue โ
- 2.51% jobinfo_load_implementation (inlined) โ
- exec_init โ
+ 2.48% job_shell_path (inlined) โ
- 2.37% check_cb โ
+ 1.39% module_cb โ
+ 0.56% 0x2aaaab440d32 โ
- 2.13% transaction_prep_cb โ
- kvsroot_mgr_iter_roots โ
+ 1.61% kvstxn_prep_root_cb โ
- 1.07% content_store_completion โ
- 0.70% flux_future_destroy โ
+ 0.66% aux_destroy โ
+ 0.94% prepare_cb โ
0.59% remote_out_check_cb โ
- 35.70% 0.00% libflux-core.so.2.0.0 [.] flux_reactor_run โ
flux_reactor_run โ
- ev_run โ
- 35.10% ev_invoke_pending โ
- 17.98% handle_cb โ
- 17.41% dispatch_message (inlined) โ
- 17.27% call_handler โ
- 9.96% namespace_create_request_cb โ
- namespace_create (inlined) โ
- 9.93% event_subscribe (inlined) โ
flux_event_subscribe โ
+ op_event_subscribe โ
+ 3.68% heartbeat_cb โ
+ 1.17% start_cb โ
0.60% content_store_request โ
0.52% store_cb โ
+ 3.59% transaction_check_cb โ
+ 3.06% jobinfo_start_continue โ
+ 2.37% check_cb โ
+ 2.13% transaction_prep_cb โ
+ 1.07% content_store_completion โ
+ 0.94% prepare_cb โ
0.59% remote_out_check_cb โ
- 35.13% 0.19% libflux-core.so.2.0.0 [.] ev_invoke_pending โ
- 34.95% ev_invoke_pending โ
- 17.89% handle_cb โ
- 17.32% dispatch_message (inlined) โ
- 17.18% call_handler โ
- 9.88% namespace_create_request_cb โ
+ namespace_create (inlined) โ
+ 3.67% heartbeat_cb โ
+ 1.17% start_cb โ
0.60% content_store_request โ
0.52% store_cb โ
Perhaps these could be made asynchronous
Meaning direct rpc calls to local.sub?
Or how connectors work and event_subscribe in connectors is implemented? Would it be ok for event_subscribe / event_unsubscribe to return a future instead?
Meaning direct rpc calls to local.sub?
I didn't dig too much further unfortunately. If subscribe returned a future that would be good, but sounds like that may not be trivial (#1557). Even if there was some way to turn the separate calls to flux_event_subscribe() into a single call, that might help.
I also didn't look to see if there was perhaps some way for the kvs to sidestep the need to subscribe on each namespace create.
I think this may be low priority for now, but I wanted to make sure to capture some notes for posterity.
Also, there's no guarantee this is even the main bottleneck. There are other components of the exec system that could use some slight re-engineering to handle the high throughput case.
Throwing this benchmark in here. Looks like kvs_namespace_create() isn't the whole story, but we're limited to about ~140 namespace/s on an idle instance.
import flux
import os
import time
f = flux.Flux()
owner=os.getuid()
pid=os.getpid()
count=1024
remaining = count
def namespace_create_cb(future, arg):
global remaining
f = future.get_flux()
try:
future.get()
except Exception as e:
f.reactor_stop_error(f.get_reactor())
remaining = remaining - 1
if remaining == 0:
f.reactor_stop(f.get_reactor())
t0 = time.time()
for i in range(count):
payload = dict(namespace="{}-{}".format(pid, i), owner=owner, flags=0)
f.rpc("kvs.namespace-create", payload).then(namespace_create_cb)
f.reactor_run(f.get_reactor(), 0)
dt = time.time() - t0
nps = count/dt
print("created {0} namespaces in {1:.3f}s. {2:.2f} ns/s.".format(count, dt, nps))
Actually, when I add namespace_remove to the benchmark, things get more interesting. This version of the benchmark simulates the namespace create/destroy workload of running a mass of very short jobs.
Note that things start to break down quickly once there is a lot of activity.
# nsbenchmark.py
import flux
import os
import time
import sys
f = flux.Flux()
owner = os.getuid()
pid = os.getpid()
count = 1024
if len(sys.argv) > 1:
count = int(sys.argv[1])
remaining = count
created = 0
t0 = time.time()
def namespace_remove_cb(future, arg):
global remaining
f = future.get_flux()
try:
future.get()
except Exception as e:
f.reactor_stop_error(f.get_reactor())
remaining = remaining - 1
if remaining == 0:
f.reactor_stop(f.get_reactor())
def report_created(count):
global t0
dt = time.time() - t0
nps = count / dt
print(
"created all {0} namespaces after {1:.3f}s. {2:.2f} ns/s".format(count, dt, nps)
)
def namespace_create_cb(future, arg):
global created
global count
f = future.get_flux()
try:
future.get()
except Exception as e:
f.reactor_stop_error(f.get_reactor())
created += 1
if created == count:
report_created(count)
f.rpc("kvs.namespace-remove", dict(namespace=future.ns_name)).then(
namespace_remove_cb
)
for i in range(count):
name = "{}-{}".format(pid, i)
payload = dict(namespace="{}-{}".format(pid, i), owner=owner, flags=0)
future = f.rpc("kvs.namespace-create", payload)
future.then(namespace_create_cb)
future.ns_name = name
f.reactor_run(f.get_reactor(), 0)
dt = time.time() - t0
nps = count / dt
print("create/destroy {0} namespaces in {1:.3f}s. {2:.2f} ns/s.".format(count, dt, nps))
$ for i in 100 200 300 400 500 600 700 800 900 1000 1200 1600; \
do src/cmd/flux start -s 4 flux python nsbenchmark.py $i; done
created all 100 namespaces after 0.158s. 633.92 ns/s
create/destroy 100 namespaces in 0.205s. 488.48 ns/s.
created all 200 namespaces after 0.604s. 331.18 ns/s
create/destroy 200 namespaces in 0.989s. 202.14 ns/s.
created all 300 namespaces after 1.341s. 223.75 ns/s
create/destroy 300 namespaces in 1.807s. 166.03 ns/s.
created all 400 namespaces after 2.364s. 169.23 ns/s
create/destroy 400 namespaces in 2.801s. 142.80 ns/s.
created all 500 namespaces after 3.694s. 135.36 ns/s
create/destroy 500 namespaces in 5.466s. 91.47 ns/s.
created all 600 namespaces after 5.361s. 111.92 ns/s
create/destroy 600 namespaces in 8.335s. 71.99 ns/s.
created all 700 namespaces after 7.224s. 96.90 ns/s
create/destroy 700 namespaces in 11.443s. 61.17 ns/s.
created all 800 namespaces after 9.463s. 84.54 ns/s
create/destroy 800 namespaces in 15.055s. 53.14 ns/s.
created all 900 namespaces after 11.991s. 75.06 ns/s
create/destroy 900 namespaces in 19.644s. 45.82 ns/s.
created all 1000 namespaces after 14.762s. 67.74 ns/s
create/destroy 1000 namespaces in 24.121s. 41.46 ns/s.
created all 1200 namespaces after 21.375s. 56.14 ns/s
create/destroy 1200 namespaces in 34.617s. 34.67 ns/s.
created all 1600 namespaces after 38.435s. 41.63 ns/s
create/destroy 1600 namespaces in 62.701s. 25.52 ns/s.
tweaked the script above to look at creates and destroys separately
# nsbenchmark.py
import flux
import os
import time
import sys
f = flux.Flux()
owner = os.getuid()
pid = os.getpid()
count = 1024
if len(sys.argv) > 1:
count = int(sys.argv[1])
remaining = count
created = 0
t0 = time.time()
def report_created(count):
global t0
dt = time.time() - t0
nps = count / dt
print(
"created all {0} namespaces after {1:.3f}s. {2:.2f} ns/s".format(count, dt, nps)
)
def namespace_create_cb(future, arg):
global created
global count
f = future.get_flux()
try:
future.get()
except Exception as e:
f.reactor_stop_error(f.get_reactor())
created += 1
if created == count:
report_created(count)
f.reactor_stop(f.get_reactor())
def report_removed(count):
global t1
dt = time.time() - t1
nps = count / dt
print(
"destroy {0} namespaces in {1:.3f}s. {2:.2f} ns/s.".format(count, dt, nps)
)
def namespace_remove_cb(future, arg):
global remaining
f = future.get_flux()
try:
future.get()
except Exception as e:
f.reactor_stop_error(f.get_reactor())
remaining = remaining - 1
if remaining == 0:
report_removed(count)
f.reactor_stop(f.get_reactor())
for i in range(count):
payload = dict(namespace="{}-{}".format(pid, i), owner=owner, flags=0)
future = f.rpc("kvs.namespace-create", payload)
future.then(namespace_create_cb)
f.reactor_run(f.get_reactor(), 0)
t1 = time.time()
for i in range(count):
payload = dict(namespace="{}-{}".format(pid, i))
future = f.rpc("kvs.namespace-remove", payload)
future.then(namespace_remove_cb)
f.reactor_run(f.get_reactor(), 0)
created all 10 namespaces after 0.008s. 1248.45 ns/s
destroy 10 namespaces in 0.004s. 2774.74 ns/s.
created all 100 namespaces after 0.123s. 811.42 ns/s
destroy 100 namespaces in 0.030s. 3362.46 ns/s.
created all 500 namespaces after 2.555s. 195.67 ns/s
destroy 500 namespaces in 0.153s. 3259.45 ns/s.
created all 1000 namespaces after 9.901s. 101.00 ns/s
destroy 1000 namespaces in 0.318s. 3141.30 ns/s.
created all 2000 namespaces after 40.676s. 49.17 ns/s
destroy 2000 namespaces in 0.697s. 2870.90 ns/s.
destroy rate seems to remain somewhat consistent, but creates clearly degrades over time.
ran some benchmarks using perf record -g --call-graph=dwarf src/cmd/flux start flux python nsbenchmark.py 1000 and got some worse numbers that what @grondo originally got (I'm suspecting @grondo's first perf stats didn't do quite as much namespace creation).
- 62.07% 0.00% flux-broker-0 kvs.so [.] namespace_create_request_cb
- namespace_create_request_cb
- 62.02% namespace_create (inlined)
- 61.81% event_subscribe (inlined)
- 61.80% flux_event_subscribe
- op_event_subscribe
+ 60.84% flux_future_get
+ 0.69% flux_future_destroy
As suspected, flux_event_subscribe() appears to be a big culprit, we are spending just way too much time sitting on flux_future_get().
I initially thought, could we make flux_event_subscribe() simply take multiple topics? That would probably help, but if we're going down the path of adding/modifying connector callbacks, we might as well just do the fix for #1557, as painful as it might be.
Another option is to just eliminate calls to flux_event_subscribe(). We could instead have events that every KVS module always listens to, checks if the namespaces in the events are relevant to namespaces its aware of, and process accordingly. While doing this would unquestionably improve KVS namespace create performance, I'm unsure of the consequences of this change in regards to the rest of KVS performance (i.e. KVS module processing events it isn't concerned with). It seems like the wrong path to go down as it could be a net win or loss for certain workloads.
Given issue #1557 is already created, I'd lean to try and fix that first.
When I originally ran perf it was with a high throughput job workload, so lots of other stuff was going on compared to the namespace benchmark.
flux_event_subscribe() is always synchronous I think so it would be great if it could be completely avoid, except perhaps at module startup where you subscribe to kvs-namespace.* once or something.
I'm unsure of the consequences of this change in regards to the rest of KVS performance (i.e. KVS module processing events it isn't concerned with). It seems like the wrong path to go down as it could be a net win or loss for certain workloads.
I don't really understand how/why the kvs subscribes to events generated by itself, but could you just subscribe once to all kvs.*, and then avoid "processing events it isn't concerned with?". I assume these are events because the kvs module on all ranks needs to be notified, so direct messages are right out?
Another thing to consider for high throughput is batching. As you work in here think about how "events" could be batched up for notification and sent out at most once per 10ms or so (look at job-manager states event for example.) Again, I'm not cognizant of the design here, so my suggestions could be off the mark.
This might be a good coffee time discussion, first to review how this works (because I've forgotten!), then to brainstorm some solutions?
I'd also like to see the event subscription design changed. The handle "ops" for subscribe unsubscribe should go away. However, I think it's actually may be a couple of weeks work so we should ask if we want to do that right now.
I don't really understand how/why the kvs subscribes to events generated by itself, but could you just subscribe once to all kvs.*, and then avoid "processing events it isn't concerned with?". I assume these are events because the kvs module on all ranks needs to be notified, so direct messages are right out?
What I'm referring to is that ranks only subscribe to events for KVS namespaces they know about. i.e. jobs that are running on that rank. So if the KVS subscribes to all events, there could be a lot of events that get received / parsed that aren't necessary to be received / parsed.
Don't know if direct messages is something that should be entertained, the KVS shouldn't know how many ranks care about a specific namespace.
Another thing to consider for high throughput is batching.
Thinking about it, batching up namespace "removes" might be ok. But I think we may want setroot events to be done in a timely manner?
This might be a good coffee time discussion, first to review how this works (because I've forgotten!), then to brainstorm some solutions?
Yeah, lets talk at coffee time.
What I'm referring to is that ranks only subscribe to events for KVS namespaces they know about. i.e. jobs that are running on that rank.
FYI - the job exec module behaves exactly the same as the benchmark posted above. It doesn't transmit any information about what ranks a job using that namespace will be running on, so I'm not sure that we're talking about the same event subscriptions here. (I was focused on the 3 back-to-back flux_event_subscribe() that seem to be in the main namespace_create path in kvs.c:namespace_create())
I'll wait for coffee-time enlightenment ;-)
One last thing I was wondering, currently the job-exec module doesn't release resources until the "namespace move" operation is fully complete, which is a process of three steps:
In the job-exec redo we'll be incrementally releasing resources, which will have to happen before all this anyway. Would a simple optimization for now be to switch the order of release/finalize here? Then resources aren't waiting on namespace_remove().
FYI - the job exec module behaves exactly the same as the benchmark posted above. It doesn't transmit any information about what ranks a job using that namespace will be running on, so I'm not sure that we're talking about the same event subscriptions here. (I was focused on the 3 back-to-back flux_event_subscribe() that seem to be in the main namespace_create path in kvs.c:namespace_create())
Now that you mention it, I had just assumed that every rank would need to read/write from the guest namespace, but now that I think about it, that may not be the case. IIRC, the shell's I/O only reads/writes on the first rank of a job. So perhaps the number of event messages flying around would be far less than what I thought.
Would a simple optimization for now be to switch the order of release/finalize here? Then resources aren't waiting on namespace_remove().
Eh, I tried this (it was trivial) and it didn't have any impact. Not surprising I guess since we've identified that the creates (during high workload) are the primary bottleneck.
So just writing out things as notes. There are 3 events subscribed to per namespace:
I guess one question is whether these synchronous, back-to-back event subscribes could be removed from namespace_create() itself. If the event subscription done lazily on first access that probably isn't so bad as putting in the critical path of flux_kvs_namespace_create() (at least for the job throughput workload)
I guess one question is whether these synchronous, back-to-back event subscribes could be removed from namespace_create() itself.
Hmmm. Good idea.
On rank 0 ...
But on worker nodes, all of the above would probably still have to be done the first time they recognize that a new namespace exists.
Obviously would prefer to not special case rank-0, but is an option.
For fun, I did exactly what I described above, moving the event subscribes to first need. Code is a little ugly at the moment, but it could be prettied up pretty easily.
Here's the benchmark results
created all 10 namespaces after 0.004s. 2701.99 ns/s
destroy 10 namespaces in 0.004s. 2342.53 ns/s.
created all 100 namespaces after 0.030s. 3296.84 ns/s
destroy 100 namespaces in 0.070s. 1433.55 ns/s.
created all 500 namespaces after 0.148s. 3389.76 ns/s
destroy 500 namespaces in 1.457s. 343.09 ns/s.
created all 1000 namespaces after 0.286s. 3491.77 ns/s
destroy 1000 namespaces in 8.952s. 111.71 ns/s.
created all 2000 namespaces after 0.751s. 2664.59 ns/s
destroy 2000 namespaces in 24.138s. 82.86 ns/s.
Namespace create is fast and relatively stable, but of course destroys are slowed down now that we have the event subscribe in that path instead.
And this is balanced by the fact that the first writes to the namespace will also be slowed down.
Based on coffee time discussion:
Is it useful to special case kvs rank 0 that must subscribe to _all_ namespaces' setroot events? Have it just subscribe to "kvs.setroot-" up front, and skip the inidividual sub/unsub requests?
Is it useful to special case kvs rank 0 that must subscribe to all namespaces' setroot events? Have it just subscribe to "kvs.setroot-" up front, and skip the inidividual sub/unsub requests?
Initial results is yes!
created all 10 namespaces after 0.005s. 2069.63 ns/s
destroy 10 namespaces in 0.004s. 2841.67 ns/s.
created all 100 namespaces after 0.033s. 3005.53 ns/s
destroy 100 namespaces in 0.036s. 2786.14 ns/s.
created all 500 namespaces after 0.152s. 3283.08 ns/s
destroy 500 namespaces in 0.161s. 3108.90 ns/s.
created all 1000 namespaces after 0.294s. 3405.36 ns/s
destroy 1000 namespaces in 0.300s. 3332.54 ns/s.
created all 2000 namespaces after 0.793s. 2521.53 ns/s
destroy 2000 namespaces in 0.828s. 2414.47 ns/s.
namespace creates and destroys stay relatively consistent. I'll whip up a PR.
Wow, what a difference!
Most helpful comment
Initial results is yes!
namespace creates and destroys stay relatively consistent. I'll whip up a PR.