This issue is moved from flux-sched PR #191.
It turned out the performance issue above simply goes back to kvs performance degradation with respect to increasing numbers of lwj directories. As the delay-scheduling optimization made job enqueue operations go much faster,
flux submitwas also made faster. This submit vs. schedule rate difference then generally resulted in higher numbers of lwj directions (with the job state being submitted), which wreck purge cannot purge.
This suggests that a significant performance limiting factor for scheduling is kvs, degrading as a function of the number of subdirectories and their sizes under the single lwj directory. This issue is created to discuss ways to overcome this issue by providing a more efficient lwj schema within kvs, a schema that can make lwj access patterns exhibited by its users (submit, wreck, sched, ...) to become much more kvs-performance friendly.
@grondo's comment:
Sounds like we'll need to address submission scalability in flux-core sometime soon.
Would changing the lwj. directory to 1 level prefix tree help do you think? Or using the least significant couple digits (or suffix tree I suppose) would cause lwj's to be distributed across parent dirs immediately instead of filling a dir and moving to the next directory. Not sure which would be better for performance....
I apologize that you are having to implement workarounds for KVS performance issues I am slow to fix.
A quick thought: rather than imposing a hierarchical naming scheme on all jobs, is it feasible to add a "chroot" option to flux-submit so that a script submiting multiple jobs can put them all in a subdir, and maybe even announce their creation with an event at the end rather than as each one is added? Something very roughly like:
for i in `seq 1 1024`; do
flux-submit --subdir perfest42 --defer-announce hostname
done
flux-submit --subdir perfrest42 --announce
Then perhaps these could be linked or moved to the lwj hierarchy in one go at "announce" time?
Just a thought, maybe a little off base.
Sorry for the late response (busy morning.)
Not sure which would be better for performance....
@grondo: I agree. I have to guess that we want to factor kvs performance characteristics into account to design this right.
When you say 1-level prefix tree, do you mean a fully distributed lwj scheme whereby each job will get lwj
For the suffix scheme, you will have up to 100 lwjnn directories at the root and accesses will be distributed 100-ways. W/ 10K jobs, each lwjnn will have up to 100 subdirectories and if the kvs performance degrades as a strict function of number of subdirectories -- this should be good enough for the current test (i.e., kvs performs well when we have 100 lwjs). Not sure how kvs will perform though under such distributed schema though.
In general, it would be good to have some micro kvs benchmark of some sort.
@garlick: Bulk submit is an excellent idea. However, I'm not sure how this will affect the overlapping scheduling w/ submits in my current testing and hence how this will affect the overall throughput. Scheduler will idle while submit is chugging through the bulk submit... I always thought such a bulk submit will be done through hierarchical jobs, though.
When you say 1-level prefix tree, do you mean a fully distributed lwj scheme whereby each job will get lwj at the root directory or something else?
Sorry I meant only 1 level of subdirectories max, like lwj.01.1, lwj.02.2, ... etc, but not lwj.01.01.101 or similar (seems like probably overkill). Sorry if I mixed up terminology, maybe should have said max _depth_ of 1.
A quick thought: rather than imposing a hierarchical naming scheme on all jobs, is it feasible to add a "chroot" option to flux-submit so that a script submiting multiple jobs can put them all in a subdir, and maybe even announce their creation with an event at the end rather than as each one is added? Something very roughly like:
Bulk submit could be handled easily as you suggest through many different methods. More concerning is that the current scheme will slow down job throughput any time the queue grows large, for instance on a busy system, where you might not be able to organize a bulk submit as you suggest.
Sorry I meant only 1 level of subdirectories max, like lwj.01.1, lwj.02.2, ... etc, but not lwj.01.01.101 or similar (seems like probably overkill). Sorry if I mixed up terminology, maybe should have said max depth of 1.
Your terminology is correct; I was the one who got confused.
I think I can write a simple benchmark script for this. At the moment kvs performance seems to be commit-bound. If we view the performance of a commit as a function of taking hashes over all of the directories leading to the key's path to the root, we can look at the performance differences between shallow hierarchy vs. deeper hierarchy. If we fix the number of keys (100K?) and distribute it as a shallow, medium or deeper hierarchy, their relative performance merit can inform our design decision.
My guess is a deeper hierarchy will do better because you will end up taking hashes over less number of items. But if traversing the hierarchy is more limiting than the hashing performance, the opposite might be true.
If we view the performance of a commit as a function of taking hashes over all of the directories leading to the key's path to the root
Not quite it (I think). Each time a new entry is added to the kvs, a new "version" of the directory for each path component has to be written to the content store. This is due to the hash tree organization.
E.g. add an entry to lwj.1.0.stderr, and you have to write new objects for
So if any of the intervening directories become large (like lwj), each update requires a big object to be sent via RPC to the content cache.
stupid idea: could kvs calculate deltas for large directories and send delta encoded objects?
Ugh, I've embarrassed myself.
@garlick: Thank you for the clarification. Sorry, actually this was more or less my mental model of a commit performance though I failed to describe it clearly. Thinking was, if we distribute the fixed number of keys into a shallow hierarchy, intervening directories tend to be larger than that of a deeper hierarchy and relative performance can provide a useful data point?
stupid idea: could kvs calculate deltas for large directories and send delta encoded objects?
Not stupid. A related idea is to split large directories. In both cases you'd have to fetch multiple objects to reconstruct the directory (if I understood your idea - back references?) so some read performance is traded away for for write performance - maybe the right way to go.
BTW, when we say a large intervening directory, we are talking about the number of items on that directory, rather than the size of the value that the item points to, correct?
Thinking was, if we distribute the fixed number of keys into a shallow hierarchy, intervening directories tend to be larger than that of a deeper hierarchy and relative performance can provide a useful data point?
Agreed, I think the model should take into account how much data that is unrelated to the commit has to be re-written. This is the "overhead" that we want to reduce.
BTW, when we say a large intervening directory, we are talking about the number of items on that directory, rather than the size of the value that the item points to, correct?
Yes.
Not stupid. A related idea is to split large directories. In both cases you'd have to fetch multiple objects to reconstruct the directory (if I understood your idea - back references?) so some read performance is traded away for for write performance - maybe the right way to go.
Back references is another way to do it, but I was thinking more simply a way to keep large RPCs off the wire would be to send a delta encoding of the object relative to some other object already in the store, though that probably doesn't help when getting the object later so never mind.
A related idea is to split large directories. In both cases you'd have to fetch multiple objects to reconstruct the directory (if I understood your idea - back references?) so some read performance is traded away for for write performance - maybe the right way to go.
Excellent idea! I was vaguely thinking about this kind of scheme (though I don't know how difficult it is to implement of course :-). Perhaps one can decouple the logical hierarchical namespace and how it is internally stored. If a deeper hierarchy with a strict directory size limit really works better, cvs can internally use it and the logical namespace contains a mapping to the physical representation...
Back references is another way to do it, but I was thinking more simply a way to keep large RPCs off the wire would be to send a delta encoding of the object relative to some other object already in the store, though that probably doesn't help when getting the object later so never mind.
Hmm, that's interesting! Possibly this in combination with clever cache management could be a useful approach.
In terms of communication vs. computation: How is the overhead of computing new objects (like hashing) relative to the overhead of sending a large RPC?
In terms of communication vs. computation: How is the overhead of computing new objects (like hashing) relative to the overhead of sending a large RPC?
My sense is that computing hashes is negligible compared to the RPCs but that's just the sort of thing somebody says before somebody else sends them a nice graph showing how wrong they were :-)
I have some testing results measured with the current master 3fd9da. As mentioned above, I wrote a test script:
kvs-layout-test.sh
usage: key-layout-test.sh N L [O]
Put and commit N keys into flux kvs as directories of an L-level hierarchy.
If O is given, put and commit O small objects into each of these directories.
positional arguments:
N an integer for the total number of items to distribute
L an integer for the number of levels in this hierarchy
optional arguments:
O an integer for the number of levels in this hierarchy
I ran this with 50K keys (N) with the level (L) ranging from flat 1 level hierarchy to 5 level hierarchy as well as one-off 10 level, each within a new flux session (size=1) on a dedicated compute node on hype. I passed O=5 so that the test puts 5 small objects into each of these directories. The results are the following:
Elapsed time for N=50000 L=1 O=5 (items per dir = ~50000): crashed after 11124 directories
Elapsed time for N=50000 L=2 O=5 (items per dir = ~224+5): 3448 secs
Elapsed time for N=50000 L=3 O=5 (items per dir = ~37+5): 3143 secs
Elapsed time for N=50000 L=4 O=5 (items per dir = ~15+5): 3176 secs
Elapsed time for N=50000 L=5 O=5 (items per dir = ~9+5): 3216 secs
Elapsed time for N=50000 L=10 O=5 (items per dir = ~3+5): 3457 secs
My observations:
I can do more testing using my test scripts (available in this gist) to come up with a better analytic performance model. But it seems it is clear to me that, if we design lwj schema such a way that each and all of its subdirectories will not exceed 100 items, we should get pretty optimal kvs performance. I did also confirm commit performance does not degrade much with the best configuration over time!
Since people like @SteVwonder plans to do one million job simulations, perhaps 3-level or 4-level lwj prefix tree will do? E.g., with 3 levels, all of the first 10K of 1M jobs map to lwj1. Of that first 100 jobs go to lwj1.1 before making its down directory lwj.1.1.1 - lwj.1.1.100?
Thanks for running this test, @dongahn!
A couple of thoughts:
Thinking a bit more about a hierarchical lwj scheme, perhaps we should use a powers of two number as the number of directories that each hierarchical component contains. This way, a flat id can be very efficiently calculated with bitwise operator:
ID = k2 * 2^14 + k1 * 2^7 + k0
Each directory will contain up to 128 and the total number of jobs this can represent will be 2^21-1 which is about 2 million.
BTW we will really want to make sure we will never grow the size of the lower level directories as that will impact larger numbers of commits (the first level affects the performance globally). A bush style hierarchy seems a right kind of schema, not only for lwj but also other types..
Also i started to wonder the perf impact of this on get... I will run the test to do this as well next week...
OK. I modified this test and measured the get performance as well. Essentially, the tester now writes the keys in ascending order according to the specified hierarchy and then reads all them back again in ascending order from key=0. Since it took too long to do this for 50K keys, I used 20K keys this time. (The flat level still hasn't finished so I'm omitting it for now; and I did use persist-filesystem so this should complete if enough time is given).
As you increase the level in the hierarchy, it appears that get performance gets worse so we need to be careful not to grow the level too high for LWJ. In a relative term, get performance seems to get degraded greater but in terms of an absolute term, the degradation won't probably make lots of difference.
So My preference is still 128-ary and expressjobid = k2 * 2^14 + k1 * 2^7 + k0 and find your job at lwj.k2.k1.k0 or lwj.k2.k1.jobid.
#### Writing a Hierarchy (2 Levels) ...
Elapsed time for Writing a hierarchy: 1292 secs
Config: N=20000 L=2 O=5
Config: Items per directory: (~146)
#### Reading back the hierarchy (2 Levels) ...
Elapsed time for Reading the hierarchy: 65 secs
Config: N=20000 L=2 O=5
Config: Items per directory: (~146)
+++++++++++++++++++++++++++++++++++++++++
#### Writing a Hierarchy (3 Levels) ...
Elapsed time for Writing a hierarchy: 1215 secs
Config: N=20000 L=3 O=5
Config: Items per directory: (~32)
#### Reading back the hierarchy (3 Levels) ...
Elapsed time for Reading the hierarchy: 75 secs
Config: N=20000 L=3 O=5
Config: Items per directory: (~32)
+++++++++++++++++++++++++++++++++++++++++
#### Writing a Hierarchy (4 Levels) ...
20000Elapsed time for Writing a hierarchy: 1276 secs
Config: N=20000 L=4 O=5
Config: Items per directory: (~17)
#### Reading back the hierarchy (4 Levels) ...
Elapsed time for Reading the hierarchy: 87 secs
Config: N=20000 L=4 O=5
Config: Items per directory: (~17)
+++++++++++++++++++++++++++++++++++++++++
#### Writing a Hierarchy (5 Levels) ...
20000Elapsed time for Writing a hierarchy: 1336 secs
Config: N=20000 L=5 O=5
Config: Items per directory: (~13)
#### Reading back the hierarchy (5 Levels) ...
Elapsed time for Reading the hierarchy: 95 secs
Config: N=20000 L=5 O=5
Config: Items per directory: (~13)
+++++++++++++++++++++++++++++++++++++++++
#### Writing a Hierarchy (10 Levels) ...
Elapsed time for Writing a hierarchy: 2109 secs
Config: N=20000 L=10 O=5
Config: Items per directory: (~8)
#### Reading back the hierarchy (10 Levels) ...
Elapsed time for Reading the hierarchy: 135 secs
Config: N=20000 L=10 O=5
Config: Items per directory: (~8)
Unfortunately, looking at the current usage of lwj.%ju in the code, making the change to the prefix tree based hierarchy for lwj. is going be non-trivial at this point, though it is certainly do-able. :cry:
@dongahn and I just had a quick chat, and perhaps a good experiment would be to apply the hierarchical approach to only the lwj-active. directory for now. There will still be unbounded growth of the lwj. directory in the kvs, but these will only be kvs links, and thus the dir will only change once at job submission, and once again if a job or set of jobs is purged.
Once jobs are archived, they go under hearbeat numbered directories in lwj-complete. so the directory entries are already naturally distributed somewhat and should result in smaller directory objects (though lwj-complete. itself will be slowly growing with time -- if that becomes a problem it can be addressed easily I think)
Because the current lwj layout in kvs also stupidly creates per-task directory entries under lwj.<id>, there is a similar problem here for large jobs (the lwj.
@grondo: After chatting with you, I briefly looked at the relevant code of sched. Fortunately, the main code only accesses lwj through JSC, I can make some changes to JSC such that it will access lwj from lwj-active when the job exists there, so growing lwj should be manage, _I think._ That leaves us dealing with growing lwj-complete for the scheduler milestone. But as you said, this would be naturally distributed, and combined withwreck purge, my sense is this should be manageable.
My biggest concern is @SteVwonder's current effort of wanting to run one million job simulation. With unbounded growth of links at lwj, the system's performance may become too bad. I have some evidence of this even at 50K flat-level test although the entry type was directory. If this can be purged with wreck purge, this shouldn't be an issue. Does the links also purged?
Does the links also purged?
Yes, links are purged.
I can make some changes to JSC such that it will access lwj from lwj-active when the job exists there, so growing lwj should be manage, I think.
Unless there is some performance issue, you should still be able to access jobs from lwj.. The lwj-active. entry and the corresponding link in lwj.<id> will be created under a single commit, so theoretically users of the lwj. links should not be aware of any changes.
@dongahn: for this million job simulation, the job's directories within the KVS do not need to exist long past their completion time. I plan to have a module/program that listens for complete event, pulls the relevant information out of the KVS, and then deletes the job directory. Basically what purge does but almost instantaneously. This should keep lwj-complete in check.
If you can get lwj-active structure hierarchically, I think that should be sufficient.
@SteVwonder: Good to know. Thanks!
Unless there is some performance issue, you should still be able to access jobs from lwj.. The lwj-active. entry and the corresponding link in lwj.
will be created under a single commit, so theoretically users of the lwj. links should not be aware of any changes.
Just to clarify, I was just worried about a performance issue with a put/commit through a link within lwj. If you write a key through, say, lwj.4.mykey, which is actually lwj-active.4.mykey, would committing mykey involves hashing the entire lwj? or just hashing lwj-active? If latter, JSC code can stay the same; otherwise I will change it to use lwj-active where applicable.
@grondo: I should also say, I have enough techniques to use to meet our short-term goals, an incremental work as you proposed (applying a hierarchical scheme to lwj-active) seems like the most bang for a buck at this point.
I'm not entirely sure what you mean by "hashing". Do you mean updating? Putting a key to a path where some component of the path is a symbolic link does not modify the path leading up to the link, only the "real" path is modified.
@garlick: Nice! Yes, updating should have been a better term, thanks.
Because the current lwj layout in kvs also stupidly creates per-task directory entries under lwj.
, there is a similar problem here for large jobs (the lwj.. directory itself will have large number of entries). This affects single, large job performance certainly, but may not have appreciable impact on the job throughput case. However, it will certainly need to be addressed in the wreck replacement.
@grondo: Just a quick thought. You may have thought about this with your reduction module effort and etc. But I have to think we may want to treat this issue a bit differently than the current KVS performance issue. My issue with small-scale job throughput is a performance issue. Yours seems an inherent scalability challenge with kvs: Even if we use the same hierarchical scheme for the task directories, a large job performance will be ultimately bounded by the single-commit-performance of kvs. Unless we distribute kvs and its commit operations with DHT or some sort, it seems this can continue to be the case. E.g., When I looked at my benchmark results, optimal performance is about 70 commits per second. So a 100K-way directory creation from a large job can still take long even with the best layout.
So it feels to me like this may also be an areas where we can have a good mileage out of your aggregation module. In this case, a more scalable scheme can be to aggregate and reduce the per-task info into a manageable set of objects before they get to be committed.
As far as we can limit the number of commits so that the count grows logarithmically with scale, we may be able to overcome this challenges. Of course, there is also an issue with the amount of data growing much worse than logarithmic. But we will just have to use best efforts to use data reduction as much as possible before storing them to kvs. This will undoubtedly have other impact to read performance (a reader has to read a bigger object and find the corresponding sub key in it). But currently, our get scheme is much more performance and perhaps more importantly scalable.
Just my $0.2 and sorry if this was too obvious :-)
As far as we can limit the number of commits so that the count grows logarithmically with scale, we may be able to overcome this challenges.
Directories for tasks and most of their contents are created under kvs_fence so this should be a single commit. I think currently with -o stdio-delay-commit the number of commits per lwj is not a function of the number of tasks. (I should verify this though)
Directories for tasks and most of their contents are created under kvs_fence so this should be a single commit. I think currently with -o stdio-delay-commit the number of commits per lwj is not a function of the number of tasks. (I should verify this though)
Heh, i knew there was something :-)
I think perhaps #811 resolved this one? @dongahn?
Yes!
Most helpful comment
My sense is that computing hashes is negligible compared to the RPCs but that's just the sort of thing somebody says before somebody else sends them a nice graph showing how wrong they were :-)