Trent is allocating the system named "upgrade" to us for flux testing. I think he is in the process of reinstalling it now with no slurm and no lustre. It has EDR InfiniBand, ~16 low-end compute nodes, and resides in B439 in the "collaboration zone" network.
His plan, as I understand it, is to let us have that for a while, then towards the end of the summer move it to B451 and add ~64 more nodes plus a small number of heterogeneous nodes with GPUs, local storage, etc.
I'm not sure an issue is exactly the right place for this, but I wanted to collect these details somewhere, and be able to add more info as we develop a workflow for development and test on this machine.
Update from Trent: he will build us a new test system in B451 rather than move the old one. The new system should be available sometime in September (waiting for electrical) and will initially consist of 108 Haswell nodes in three racks, with 64GB (or maybe 32GB) of RAM per node, and QDR InfiniBand. He expects some number of the 108 not to be functional.
I requested "10s of terabytes" on the management node in an array of some sort, so we can put the KVS content store on there and decide whether garbage collection needs to be solved this year or not.
He says there are other oddball nodes coming out of B439 (GPUs, local storage, etc.) and if we want any of those in our system we would need to free space in our racks, reducing the count of other nodes.
Awesome news! How many of those 108 nodes are management nodes? Or is that configurable?
> if we want any of those in our system we would need to free space in our racks, reducing the count of other nodes.
IMO, having even just one or two nodes with GPUs and local storage will be useful in preparing for Corona. With those on the system, we can make sure that they are discovered, incorporated into the resource model, and scheduled correctly.
@SteVwonder: I think we can configure it how we like, but as far as I know, we will initially only have one management node. We should think about what we want to swap in.
Latest update from Trent on the small test cluster:
upgrade is now at TOSS 3.3, no lustre, no slurm. running with iscsi and / as an overlay so you can write to anything. All nodes but mgmt are stateless.
power control is crap, need to see if we can make that better.
[root@upgrade1:~]# nodeattr -q mgmt
upgrade1
[root@upgrade1:~]# nodeattr -q login
upgrade2
[root@upgrade1:~]# nodeattr -q gw
upgrade[2-3]
[root@upgrade1:~]# nodeattr -q compute
upgrade[4-19]
These nodes are the same as the nodes I was thinking of moving to B451,
if they are not good enough we can think about it more. They only have
16GB of memory. We could shrink node count and consolidate memory to get
you to 32GB I think.
Trent
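The bracketed ranges in the nodeattr output above (e.g. upgrade[4-19]) are handy to expand when scripting against the node list. Here is a minimal sketch in Python; `expand_hostlist` is a hypothetical helper written for this note, not part of genders or flux, and it only handles the simple `prefix[a-b,c]` form seen here (no zero padding, no nested brackets).

```python
import re

def expand_hostlist(expr):
    """Expand 'upgrade[2-3]' or 'upgrade[4-6,9]' into a list of hostnames.

    Hypothetical helper: only the simple prefix[ranges] form is supported.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([0-9,\-]+)\]", expr)
    if match is None:
        return [expr]  # plain hostname such as 'upgrade1'
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a single number like '9' is a range of one
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n}")
    return hosts

# Role assignments copied from the nodeattr queries above.
roles = {
    "mgmt": "upgrade1",
    "login": "upgrade2",
    "gw": "upgrade[2-3]",
    "compute": "upgrade[4-19]",
}
for role, expr in roles.items():
    hosts = expand_hostlist(expr)
    print(role, len(hosts), hosts[0], hosts[-1])
```

Note that upgrade2 shows up under both login and gw, so the roles overlap; the cluster is 19 nodes total, 16 of them compute.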
Here's how you get to the management node:
$ ssh rzgw
Password: RZ password + RZ OTP
rzgw3@garlick:ssh upgradei
Password : CZ password + CZ OTP
[garlick@upgrade1:~]$
I did get flux booted on this cluster, running as the user "flux", started by systemd, with a config file that looks like this:
session-id = "upgrade"
rank = RANK
tbon-endpoints = [
"tcp://192.168.128.1:8020",
"tcp://192.168.128.2:8020",
"tcp://192.168.128.3:8020",
"tcp://192.168.128.4:8020",
"tcp://192.168.128.5:8020",
"tcp://192.168.128.6:8020",
"tcp://192.168.128.7:8020",
"tcp://192.168.128.8:8020",
"tcp://192.168.128.9:8020",
"tcp://192.168.128.10:8020",
"tcp://192.168.128.11:8020",
"tcp://192.168.128.12:8020",
"tcp://192.168.128.13:8020",
"tcp://192.168.128.14:8020",
"tcp://192.168.128.15:8020",
"tcp://192.168.128.16:8020",
"tcp://192.168.128.17:8020",
"tcp://192.168.128.18:8020",
"tcp://192.168.128.19:8020",
]
where RANK is substituted for the local rank. I couldn't get it to automatically recognize the IPs that correspond to local addresses, so maybe a bug there.
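Since the per-node config differs only in the rank line, it could be generated rather than hand-edited. The following is a rough sketch, assuming rank N lives at 192.168.128.(N+1) on port 8020 as in the listing above. `make_endpoints`, `render_config`, and `find_rank` are hypothetical names, and `find_rank` is just one way to pick the rank from a node's local addresses; it is not how the broker does it.

```python
from urllib.parse import urlparse

def make_endpoints(size, subnet="192.168.128", port=8020):
    """One tcp:// endpoint per rank; rank N is assumed to be host N+1."""
    return [f"tcp://{subnet}.{i}:{port}" for i in range(1, size + 1)]

def render_config(session_id, rank, endpoints):
    """Render a config in the same shape as the file shown above."""
    lines = [
        f'session-id = "{session_id}"',
        f"rank = {rank}",
        "tbon-endpoints = [",
    ]
    lines += [f'    "{ep}",' for ep in endpoints]
    lines.append("]")
    return "\n".join(lines) + "\n"

def find_rank(endpoints, local_addrs):
    """Return the index of the first endpoint whose host is a local address.

    local_addrs is a set of IP strings supplied by the caller; how to
    collect them on a given node is left out of this sketch.
    """
    for rank, ep in enumerate(endpoints):
        if urlparse(ep).hostname in local_addrs:
            return rank
    return None

endpoints = make_endpoints(19)
# Example: a node whose interfaces include 192.168.128.4.
rank = find_rank(endpoints, {"127.0.0.1", "192.168.128.4"})
print(render_config("upgrade", rank, endpoints))
```

A systemd unit or provisioning script could run something like this once per node at boot to write out the file.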
The log is going to /tmp/flux.log on upgrade1.
The CURVE keys are located in the home directory of the flux user, which I set to /etc/flux.
So far I haven't been able to connect to the broker on any rank, e.g. with "flux ping 0" or similar. I have to go, so more fun tomorrow.
Here's the boot log:
2018-07-11T23:34:26.886837Z broker.debug[0]: insmod connector-local
2018-07-11T23:34:36.888933Z broker.info[0]: wireup: 1/19 (incomplete) 10.0s
2018-07-11T23:34:39.215213Z broker.debug[13]: insmod connector-local
2018-07-11T23:34:39.212339Z broker.info[0]: wireup: 2/19 (incomplete) 12.3s
2018-07-11T23:34:39.215377Z broker.debug[1]: insmod connector-local
2018-07-11T23:34:39.212510Z broker.info[0]: wireup: 3/19 (incomplete) 12.3s
2018-07-11T23:34:39.215871Z broker.debug[6]: insmod connector-local
2018-07-11T23:34:39.215984Z broker.debug[4]: insmod connector-local
2018-07-11T23:34:39.215955Z broker.debug[12]: insmod connector-local
2018-07-11T23:34:39.212582Z broker.info[0]: wireup: 4/19 (incomplete) 12.3s
2018-07-11T23:34:39.212600Z broker.info[0]: wireup: 5/19 (incomplete) 12.3s
2018-07-11T23:34:39.212618Z broker.info[0]: wireup: 6/19 (incomplete) 12.3s
2018-07-11T23:34:39.216062Z broker.debug[7]: insmod connector-local
2018-07-11T23:34:39.213343Z broker.info[0]: wireup: 7/19 (incomplete) 12.3s
2018-07-11T23:34:39.216388Z broker.debug[16]: insmod connector-local
2018-07-11T23:34:39.216527Z broker.debug[9]: insmod connector-local
2018-07-11T23:34:39.216861Z broker.debug[3]: insmod connector-local
2018-07-11T23:34:39.216939Z broker.debug[10]: insmod connector-local
2018-07-11T23:34:39.216567Z broker.debug[5]: insmod connector-local
2018-07-11T23:34:39.216923Z broker.debug[15]: insmod connector-local
2018-07-11T23:34:39.213470Z broker.info[0]: wireup: 8/19 (incomplete) 12.3s
2018-07-11T23:34:39.213489Z broker.info[0]: wireup: 9/19 (incomplete) 12.3s
2018-07-11T23:34:39.213507Z broker.info[0]: wireup: 10/19 (incomplete) 12.3s
2018-07-11T23:34:39.213524Z broker.info[0]: wireup: 11/19 (incomplete) 12.3s
2018-07-11T23:34:39.213542Z broker.info[0]: wireup: 12/19 (incomplete) 12.3s
2018-07-11T23:34:39.213562Z broker.info[0]: wireup: 13/19 (incomplete) 12.3s
2018-07-11T23:34:39.216904Z broker.debug[11]: insmod connector-local
2018-07-11T23:34:39.216981Z broker.debug[18]: insmod connector-local
2018-07-11T23:34:39.213734Z broker.info[0]: wireup: 14/19 (incomplete) 12.3s
2018-07-11T23:34:39.213758Z broker.info[0]: wireup: 15/19 (incomplete) 12.3s
2018-07-11T23:34:39.217130Z broker.debug[8]: insmod connector-local
2018-07-11T23:34:39.214493Z broker.info[0]: wireup: 16/19 (incomplete) 12.3s
2018-07-11T23:34:39.217399Z broker.debug[2]: insmod connector-local
2018-07-11T23:34:39.214742Z broker.info[0]: wireup: 17/19 (incomplete) 12.3s
2018-07-11T23:34:39.217505Z broker.debug[17]: insmod connector-local
2018-07-11T23:34:39.214787Z broker.info[0]: wireup: 18/19 (incomplete) 12.3s
2018-07-11T23:34:39.217597Z broker.debug[14]: insmod connector-local
2018-07-11T23:34:39.214858Z broker.info[0]: wireup: 19/19 (complete) 12.3s
2018-07-11T23:34:39.214863Z broker.info[0]: Run level 1 starting
2018-07-11T23:34:39.228074Z broker.debug[0]: insmod barrier
2018-07-11T23:34:39.237503Z broker.debug[1]: insmod barrier
2018-07-11T23:34:39.237519Z broker.debug[3]: insmod barrier
2018-07-11T23:34:39.237558Z broker.debug[2]: insmod barrier
2018-07-11T23:34:39.237627Z broker.debug[5]: insmod barrier
2018-07-11T23:34:39.237625Z broker.debug[4]: insmod barrier
2018-07-11T23:34:39.237692Z broker.debug[7]: insmod barrier
2018-07-11T23:34:39.237687Z broker.debug[6]: insmod barrier
2018-07-11T23:34:39.237837Z broker.debug[8]: insmod barrier
2018-07-11T23:34:39.237854Z broker.debug[9]: insmod barrier
2018-07-11T23:34:39.237877Z broker.debug[10]: insmod barrier
2018-07-11T23:34:39.237942Z broker.debug[12]: insmod barrier
2018-07-11T23:34:39.237947Z broker.debug[11]: insmod barrier
2018-07-11T23:34:39.237981Z broker.debug[13]: insmod barrier
2018-07-11T23:34:39.238028Z broker.debug[15]: insmod barrier
2018-07-11T23:34:39.238038Z broker.debug[14]: insmod barrier
2018-07-11T23:34:39.238067Z broker.debug[16]: insmod barrier
2018-07-11T23:34:39.238159Z broker.debug[17]: insmod barrier
2018-07-11T23:34:39.238191Z broker.debug[18]: insmod barrier
2018-07-11T23:34:39.245648Z broker.debug[0]: insmod content-sqlite
2018-07-11T23:34:39.247053Z broker.debug[0]: content backing store: enabled content-sqlite
2018-07-11T23:34:39.259664Z broker.debug[0]: insmod kvs
2018-07-11T23:34:39.277075Z broker.debug[1]: insmod kvs
2018-07-11T23:34:39.277100Z broker.debug[4]: insmod kvs
2018-07-11T23:34:39.277108Z broker.debug[5]: insmod kvs
2018-07-11T23:34:39.277124Z broker.debug[2]: insmod kvs
2018-07-11T23:34:39.277135Z broker.debug[6]: insmod kvs
2018-07-11T23:34:39.277134Z broker.debug[3]: insmod kvs
2018-07-11T23:34:39.277208Z broker.debug[7]: insmod kvs
2018-07-11T23:34:39.277264Z broker.debug[9]: insmod kvs
2018-07-11T23:34:39.277255Z broker.debug[8]: insmod kvs
2018-07-11T23:34:39.277287Z broker.debug[11]: insmod kvs
2018-07-11T23:34:39.277295Z broker.debug[12]: insmod kvs
2018-07-11T23:34:39.277325Z broker.debug[10]: insmod kvs
2018-07-11T23:34:39.277413Z broker.debug[15]: insmod kvs
2018-07-11T23:34:39.277398Z broker.debug[13]: insmod kvs
2018-07-11T23:34:39.277438Z broker.debug[16]: insmod kvs
2018-07-11T23:34:39.277465Z broker.debug[14]: insmod kvs
2018-07-11T23:34:39.277560Z broker.debug[17]: insmod kvs
2018-07-11T23:34:39.277579Z broker.debug[18]: insmod kvs
2018-07-11T23:34:39.286295Z broker.debug[0]: insmod aggregator
2018-07-11T23:34:39.295791Z broker.debug[1]: insmod aggregator
2018-07-11T23:34:39.295959Z broker.debug[2]: insmod aggregator
2018-07-11T23:34:39.295993Z broker.debug[4]: insmod aggregator
2018-07-11T23:34:39.296003Z broker.debug[5]: insmod aggregator
2018-07-11T23:34:39.295993Z broker.debug[3]: insmod aggregator
2018-07-11T23:34:39.296049Z broker.debug[6]: insmod aggregator
2018-07-11T23:34:39.296177Z broker.debug[7]: insmod aggregator
2018-07-11T23:34:39.296186Z broker.debug[8]: insmod aggregator
2018-07-11T23:34:39.296259Z broker.debug[9]: insmod aggregator
2018-07-11T23:34:39.296317Z broker.debug[10]: insmod aggregator
2018-07-11T23:34:39.296343Z broker.debug[11]: insmod aggregator
2018-07-11T23:34:39.296383Z broker.debug[12]: insmod aggregator
2018-07-11T23:34:39.296408Z broker.debug[13]: insmod aggregator
2018-07-11T23:34:39.296453Z broker.debug[16]: insmod aggregator
2018-07-11T23:34:39.296454Z broker.debug[14]: insmod aggregator
2018-07-11T23:34:39.296478Z broker.debug[15]: insmod aggregator
2018-07-11T23:34:39.296547Z broker.debug[17]: insmod aggregator
2018-07-11T23:34:39.296611Z broker.debug[18]: insmod aggregator
2018-07-11T23:34:39.326284Z broker.debug[2]: insmod resource-hwloc
2018-07-11T23:34:39.326402Z broker.debug[3]: insmod resource-hwloc
2018-07-11T23:34:39.326586Z broker.debug[4]: insmod resource-hwloc
2018-07-11T23:34:39.326624Z broker.debug[5]: insmod resource-hwloc
2018-07-11T23:34:39.326754Z broker.debug[6]: insmod resource-hwloc
2018-07-11T23:34:39.326917Z broker.debug[7]: insmod resource-hwloc
2018-07-11T23:34:39.326942Z broker.debug[9]: insmod resource-hwloc
2018-07-11T23:34:39.326977Z broker.debug[8]: insmod resource-hwloc
2018-07-11T23:34:39.327157Z broker.debug[10]: insmod resource-hwloc
2018-07-11T23:34:39.327141Z broker.debug[12]: insmod resource-hwloc
2018-07-11T23:34:39.327197Z broker.debug[11]: insmod resource-hwloc
2018-07-11T23:34:39.327320Z broker.debug[13]: insmod resource-hwloc
2018-07-11T23:34:39.327401Z broker.debug[15]: insmod resource-hwloc
2018-07-11T23:34:39.327416Z broker.debug[16]: insmod resource-hwloc
2018-07-11T23:34:39.327465Z broker.debug[14]: insmod resource-hwloc
2018-07-11T23:34:39.327623Z broker.debug[17]: insmod resource-hwloc
2018-07-11T23:34:39.327669Z broker.debug[18]: insmod resource-hwloc
2018-07-11T23:34:39.333856Z resource-hwloc.debug[3]: loaded
2018-07-11T23:34:39.332964Z kvs.debug[0]: aggregated 12 transactions (120 ops)
2018-07-11T23:34:39.338320Z resource-hwloc.debug[2]: loaded
2018-07-11T23:34:39.338356Z resource-hwloc.debug[4]: loaded
2018-07-11T23:34:39.338370Z resource-hwloc.debug[6]: loaded
2018-07-11T23:34:39.338374Z resource-hwloc.debug[5]: loaded
2018-07-11T23:34:39.338414Z resource-hwloc.debug[8]: loaded
2018-07-11T23:34:39.338418Z resource-hwloc.debug[7]: loaded
2018-07-11T23:34:39.338489Z resource-hwloc.debug[11]: loaded
2018-07-11T23:34:39.338487Z resource-hwloc.debug[9]: loaded
2018-07-11T23:34:39.338497Z resource-hwloc.debug[10]: loaded
2018-07-11T23:34:39.338513Z resource-hwloc.debug[12]: loaded
2018-07-11T23:34:39.338510Z resource-hwloc.debug[13]: loaded
2018-07-11T23:34:39.338537Z resource-hwloc.debug[15]: loaded
2018-07-11T23:34:39.335273Z kvs.debug[0]: aggregated 4 transactions (40 ops)
2018-07-11T23:34:39.340421Z resource-hwloc.debug[14]: loaded
2018-07-11T23:34:39.340438Z resource-hwloc.debug[17]: loaded
2018-07-11T23:34:39.340432Z resource-hwloc.debug[16]: loaded
2018-07-11T23:34:39.340468Z resource-hwloc.debug[18]: loaded
2018-07-11T23:34:39.343834Z broker.debug[1]: insmod resource-hwloc
2018-07-11T23:34:39.350159Z resource-hwloc.debug[1]: loaded
2018-07-11T23:34:39.354149Z broker.debug[0]: insmod cron
2018-07-11T23:34:39.354459Z cron.info[0]: synchronizing cron tasks to event hb
2018-07-11T23:34:39.366989Z broker.debug[0]: insmod userdb
2018-07-11T23:34:39.367242Z userdb.info[0]: default rolemask override=0x2
2018-07-11T23:34:39.368294Z broker.info[0]: rc1: running /etc/flux/rc1.d/01-enclosing-instance
2018-07-11T23:34:39.372197Z broker.info[0]: rc1: running /etc/flux/rc1.d/02-hostlist
2018-07-11T23:34:39.399592Z broker.info[0]: rc1: running /etc/flux/rc1.d/sched-start
2018-07-11T23:34:39.409446Z broker.debug[0]: insmod sched
2018-07-11T23:34:39.409721Z sched.info[0]: sched comms module starting
2018-07-11T23:34:39.410111Z sched.debug[0]: loaded: sched.fcfs
2018-07-11T23:34:39.410127Z sched.info[0]: sched.fcfs plugin loaded
2018-07-11T23:34:39.410135Z sched.debug[0]: LUA_PATH /usr/share/lua/5.1/?.lua;;;
2018-07-11T23:34:39.410141Z sched.debug[0]: LUA_CPATH /usr/lib64/lua/5.1/?.so;;;
2018-07-11T23:34:39.410145Z sched.info[0]: start to read resources
2018-07-11T23:34:39.422818Z sched.info[0]: resrc constructed using hwloc
2018-07-11T23:34:39.422835Z sched.info[0]: loaded resrc
2018-07-11T23:34:39.422842Z sched.info[0]: resources loaded
2018-07-11T23:34:39.423078Z sched.info[0]: events registered
2018-07-11T23:34:39.423781Z broker.info[0]: Run level 1 Exited (rc=0) 0.2s
2018-07-11T23:34:39.423795Z broker.info[0]: Run level 2 starting
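To pull the wireup timing out of a log like the one above without scrolling through the insmod noise, something like the following sketch could be used. The regular expression is written against the line format as it appears in this log only, so treat it as an assumption rather than a stable format; `wireup_progress` is a hypothetical helper.

```python
import re

# Matches rank-0 lines such as:
#   <timestamp> broker.info[0]: wireup: 19/19 (complete) 12.3s
WIREUP = re.compile(
    r"^(\S+) broker\.info\[0\]: wireup: (\d+)/(\d+) \((\w+)\) ([\d.]+)s$"
)

def wireup_progress(lines):
    """Return (connected, size, status, elapsed_seconds) per wireup line."""
    progress = []
    for line in lines:
        m = WIREUP.match(line.strip())
        if m:
            _ts, up, size, status, elapsed = m.groups()
            progress.append((int(up), int(size), status, float(elapsed)))
    return progress

# Three lines copied from the boot log above.
sample = [
    "2018-07-11T23:34:36.888933Z broker.info[0]: wireup: 1/19 (incomplete) 10.0s",
    "2018-07-11T23:34:39.215213Z broker.debug[13]: insmod connector-local",
    "2018-07-11T23:34:39.214858Z broker.info[0]: wireup: 19/19 (complete) 12.3s",
]
progress = wireup_progress(sample)
print(progress[-1])
```

In this boot, rank 0 sat alone for the first 10 seconds and the other 18 ranks all connected within the next few seconds, with wireup complete at 12.3s.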
Oh this works:
$ FLUX_URI=local:///run/flux flux ping 0
0!cmb.ping pad=0 seq=0 time=0.260 ms (6DA46!0F181!0)
0!cmb.ping pad=0 seq=1 time=0.213 ms (6DA46!0F181!0)
As user flux:
$ FLUX_URI=local:///run/flux flux module list -r all
Module Size Digest Idle S Nodeset
barrier 1392872 1EC3381 30 S [0-18]
cron 1478480 E6BC96C 0 S 0
sched 488800 8B45110 30 S 0
kvs 1837960 A9A18B3 0 S [0-18]
resource-hwloc 1420192 28ECAFF 30 S [0-18]
aggregator 1410408 0970F5B 30 S [0-18]
content-sqlite 1402152 730D18C 30 S 0
job 1470616 92ECB07 30 S [0-18]
userdb 1394640 31BD3EA 5 S 0
connector-local 1425264 D6E6249 0 R [0-18]
$
$ FLUX_URI=local:///run/flux flux wreckrun -N 19 hostname
upgrade1
upgrade13
upgrade2
upgrade3
upgrade4
upgrade5
upgrade6
upgrade7
upgrade8
upgrade9
upgrade10
upgrade11
upgrade12
upgrade14
upgrade15
upgrade16
upgrade17
upgrade18
upgrade19
$
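The hostname output above does not come back in rank order (upgrade13 is second), so eyeballing it for gaps is error-prone. A small sketch for checking that every node answered exactly once; `check_hosts` is a hypothetical helper and the expected list assumes the 19 nodes are named upgrade1 through upgrade19.

```python
from collections import Counter

def check_hosts(expected, reported):
    """Return (missing, unexpected, duplicated) hostnames, each sorted."""
    counts = Counter(reported)
    missing = sorted(set(expected) - set(counts))
    unexpected = sorted(set(counts) - set(expected))
    duplicated = sorted(h for h, n in counts.items() if n > 1)
    return missing, unexpected, duplicated

expected = [f"upgrade{i}" for i in range(1, 20)]

# Output of the wreckrun hostname command above, in the order it arrived.
reported = """
upgrade1 upgrade13 upgrade2 upgrade3 upgrade4 upgrade5 upgrade6
upgrade7 upgrade8 upgrade9 upgrade10 upgrade11 upgrade12 upgrade14
upgrade15 upgrade16 upgrade17 upgrade18 upgrade19
""".split()

missing, unexpected, duplicated = check_hosts(expected, reported)
print("missing:", missing)
print("unexpected:", unexpected)
print("duplicated:", duplicated)
```

For the run above, all three lists come back empty.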
I upgraded "upgrade" to flux-core-0.10.0. I was able to drop the rank setting from the config file and use the same config on all the nodes. The default local connector path works now too, e.g.
[garlick@upgrade2:conf.d]$ flux ping 18
18!cmb.ping pad=0 seq=0 time=0.990 ms (13B1C!598ED!1!0!18)
18!cmb.ping pad=0 seq=1 time=0.931 ms (13B1C!598ED!1!0!18)
18!cmb.ping pad=0 seq=2 time=0.925 ms (13B1C!598ED!1!0!18)
18!cmb.ping pad=0 seq=3 time=0.810 ms (13B1C!598ED!1!0!18)
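For comparing round-trip times across ranks or before and after a change, the ping output can be summarized with a few lines of Python. This is only a sketch: the regular expression assumes the `time=<value> ms` field exactly as printed above, and `ping_times_ms` is a hypothetical helper.

```python
import re
from statistics import mean

def ping_times_ms(lines):
    """Extract the time=... ms values from flux ping output lines."""
    times = []
    for line in lines:
        m = re.search(r"time=([\d.]+) ms", line)
        if m:
            times.append(float(m.group(1)))
    return times

# Output of the ping to rank 18 shown above.
output = [
    "18!cmb.ping pad=0 seq=0 time=0.990 ms (13B1C!598ED!1!0!18)",
    "18!cmb.ping pad=0 seq=1 time=0.931 ms (13B1C!598ED!1!0!18)",
    "18!cmb.ping pad=0 seq=2 time=0.925 ms (13B1C!598ED!1!0!18)",
    "18!cmb.ping pad=0 seq=3 time=0.810 ms (13B1C!598ED!1!0!18)",
]
times = ping_times_ms(output)
print(f"n={len(times)} min={min(times):.3f} mean={mean(times):.3f} max={max(times):.3f} ms")
```

The four samples to rank 18 average a bit over 0.9 ms, compared with roughly 0.2 ms for the earlier local ping to rank 0.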