Initial SoD implementation uses a nasty hack to store task proxy prerequisites in the DB (because I'm "database-challenged"):
cylc.flow.task_pool
    # Update prerequisite satisfaction status from DB
    sat = {}
    for k, v in json.loads(satisfied).items():
        sat[tuple(json.loads(k))] = v
    # TODO (from Oliver's PR review):
    #     Wait, what, the keys to a JSON dictionary are themselves JSON
    #     :vomiting_face:!
    #     This should be converted to its own DB table pre-8.0.0.
    for pre in itask.state.prerequisites:
        for k, v in pre.satisfied.items():
            pre.satisfied[k] = sat[k]
This works fine but is fundamentally :vomiting_face:
For reference, the task_pool.satisfied field currently contains data in the format:
{
[<name>, <cycle>, <output>]: <satisfied>,
...
}
Where <satisfied> records whether an output has been generated and whether it was generated by the workflow (naturally) or by user intervention (artificially).
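To make the "JSON keys inside JSON" problem concrete, here is a minimal sketch of the round trip. The helper names `encode_satisfied` and `decode_satisfied` are hypothetical (they are not in cylc-flow), and the example values are assumptions borrowed from the table later in this thread.

```python
import json


def encode_satisfied(satisfied):
    """Serialise {(name, cycle, output): state} to one JSON string.

    JSON object keys must be strings, so each tuple key is itself
    JSON-encoded: this is the hack described above.
    """
    return json.dumps(
        {json.dumps(list(key)): value for key, value in satisfied.items()}
    )


def decode_satisfied(text):
    """Reverse encode_satisfied, mirroring the loop in task_pool."""
    return {
        tuple(json.loads(key)): value
        for key, value in json.loads(text).items()
    }


# Assumed example data, not taken from a real workflow DB.
example = {
    ("foo", "1", "succeeded"): False,
    ("ipsum", "1", "custom1"): "satisfied naturally",
}
encoded = encode_satisfied(example)
round_tripped = decode_satisfied(encoded)
```

Every key in `encoded` is a string that has to be parsed a second time before it is usable, which is why a dedicated table looks cleaner.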
This approach makes housekeeping a little simpler, but it also results in a lot of data duplication and effectively uses the database as a JSON file, which is the sort of thing that makes database engineers sick.
I think what we probably want is one table of tasks, another of outputs and a third mapping between the two. This turns the housekeeping problem into a garbage collection problem: whenever you remove a task from the database, you check the mapping to see if you can remove any outputs at the same time.
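A rough sketch of that three-table idea with SQLite, just to show the garbage collection step. All table and column names here (`tasks`, `outputs`, `task_outputs`) are hypothetical, and `baz` is an invented second task that shares an upstream output with `bar`.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, not cylc-flow's.
SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    cycle TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (cycle, name)
);
CREATE TABLE outputs (
    id INTEGER PRIMARY KEY,
    cycle TEXT NOT NULL,
    name TEXT NOT NULL,
    output TEXT NOT NULL,
    satisfied TEXT NOT NULL,
    UNIQUE (cycle, name, output)
);
CREATE TABLE task_outputs (
    task_id INTEGER NOT NULL REFERENCES tasks (id),
    output_id INTEGER NOT NULL REFERENCES outputs (id),
    PRIMARY KEY (task_id, output_id)
);
"""


def remove_task(conn, cycle, name):
    """Delete a task, then garbage-collect outputs nothing maps to."""
    with conn:
        conn.execute(
            "DELETE FROM task_outputs WHERE task_id IN "
            "(SELECT id FROM tasks WHERE cycle = ? AND name = ?)",
            (cycle, name),
        )
        conn.execute(
            "DELETE FROM tasks WHERE cycle = ? AND name = ?", (cycle, name)
        )
        conn.execute(
            "DELETE FROM outputs WHERE id NOT IN "
            "(SELECT output_id FROM task_outputs)"
        )


def remaining_outputs(conn):
    return conn.execute(
        "SELECT name, cycle, output FROM outputs ORDER BY name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.executemany(
        "INSERT INTO tasks (id, cycle, name) VALUES (?, ?, ?)",
        [(1, "1", "bar"), (2, "1", "baz")],
    )
    conn.executemany(
        "INSERT INTO outputs (id, cycle, name, output, satisfied) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, "1", "foo", "succeeded", "false"),
            (2, "1", "ipsum", "custom1", "satisfied naturally"),
        ],
    )
    # bar.1 needs both outputs; baz.1 (invented) only needs foo.1:succeeded.
    conn.executemany(
        "INSERT INTO task_outputs (task_id, output_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1)],
    )

remove_task(conn, "1", "bar")
after_bar = remaining_outputs(conn)
remove_task(conn, "1", "baz")
after_baz = remaining_outputs(conn)
```

Removing `bar` drops the `ipsum` output because nothing else maps to it, while the `foo` output survives until `baz` is removed as well.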
> it also results in a lot of data duplication
That's not really true: the task_pool table is continually pruned (it only records the current task pool), so the only duplication in the DB now will be over current active (and current partially satisfied waiting) tasks that have the same upstream parent, e.g. members of a family that just triggered together. And that duplication does not accumulate in the DB, because the table is pruned as the task pool evolves.
Example workflow (let's say foo.1 fails):
P1 = foo & ipsum:custom1 => bar
I think we would need to have a table task_prerequisites (asterisk denotes primary key):
| cycle* TEXT | name* TEXT | prereq_name* TEXT | prereq_cycle* TEXT | prereq_output* TEXT | satisfied TEXT |
| ----------- | ---------- | ----------------- | ------------------ | ------------------- | ------------------- |
| 1 | bar | foo | 1 | succeeded | false |
| 1 | bar | ipsum | 1 | custom1 | satisfied naturally |
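As a sketch, the table above could be declared in SQLite roughly like this. The column names follow the table; the exact DDL is my assumption, not what cylc-flow ships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed DDL: compound primary key over the five starred columns.
conn.execute(
    """
    CREATE TABLE task_prerequisites (
        cycle TEXT,
        name TEXT,
        prereq_name TEXT,
        prereq_cycle TEXT,
        prereq_output TEXT,
        satisfied TEXT,
        PRIMARY KEY (cycle, name, prereq_name, prereq_cycle, prereq_output)
    )
    """
)

INSERT = "INSERT INTO task_prerequisites VALUES (?, ?, ?, ?, ?, ?)"
rows = [
    ("1", "bar", "foo", "1", "succeeded", "false"),
    ("1", "bar", "ipsum", "1", "custom1", "satisfied naturally"),
]
with conn:
    conn.executemany(INSERT, rows)

# The compound key rejects a second row for the same prerequisite.
try:
    with conn:
        conn.execute(INSERT, rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute(
    "SELECT COUNT(*) FROM task_prerequisites"
).fetchone()[0]
```

Updating a prerequisite's state would then be an `UPDATE` (or `INSERT OR REPLACE`) on the same key rather than rewriting a JSON blob.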
I'm trying to figure out how to use this with the way things are loaded from the database: https://github.com/cylc/cylc-flow/blob/96f56035985fdf465f6f8e734d98804398576f99/cylc/flow/scheduler.py#L707-L718
The callback handles the task_pool table row by row: https://github.com/cylc/cylc-flow/blob/96f56035985fdf465f6f8e734d98804398576f99/cylc/flow/task_pool.py#L348-L369
But there can be >1 row of the task_prerequisites table per row of the task_pool table. So I think we would need a function call to select from the task_prerequisites table down here in the load_db_task_pool_for_restart() method, replacing the JSON parsing stuff: https://github.com/cylc/cylc-flow/blob/96f56035985fdf465f6f8e734d98804398576f99/cylc/flow/task_pool.py#L420-L433
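Something like the following is what I have in mind for that function call. This is a sketch only: `select_task_prerequisites` is a hypothetical name, and it talks to sqlite3 directly rather than going through the real cylc DB layer. It returns a dict keyed the same way as `sat` in the existing JSON-parsing code, so the loop over `itask.state.prerequisites` could stay as it is.

```python
import sqlite3


def select_task_prerequisites(conn, cycle, name):
    """Return {(prereq_name, prereq_cycle, prereq_output): satisfied}
    for one task, i.e. all task_prerequisites rows for that task."""
    rows = conn.execute(
        "SELECT prereq_name, prereq_cycle, prereq_output, satisfied "
        "FROM task_prerequisites WHERE cycle = ? AND name = ?",
        (cycle, name),
    )
    return {
        (p_name, p_cycle, p_output): satisfied
        for p_name, p_cycle, p_output, satisfied in rows
    }


# Stand-in database holding the example rows from the table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_prerequisites ("
    "cycle TEXT, name TEXT, prereq_name TEXT, prereq_cycle TEXT, "
    "prereq_output TEXT, satisfied TEXT, "
    "PRIMARY KEY (cycle, name, prereq_name, prereq_cycle, prereq_output))"
)
with conn:
    conn.executemany(
        "INSERT INTO task_prerequisites VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("1", "bar", "foo", "1", "succeeded", "false"),
            ("1", "bar", "ipsum", "1", "custom1", "satisfied naturally"),
        ],
    )

sat = select_task_prerequisites(conn, "1", "bar")
sat_missing = select_task_prerequisites(conn, "2", "bar")
```

A task with no stored rows simply gets an empty dict, so the caller would need to decide whether that means "nothing satisfied" or "nothing to restore".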
Thanks for taking this on @MetRonnie :bouquet: (I'm afraid I won't be much help on it unless I spend time trying to figure it out myself, which I guess is what you're doing).
No problem, just brainstorming/checking if anyone can see anything glaringly wrong with my approach...
Looks OK. In case you were concerned about using a compound primary key made out of so many fields, don't worry. We would need to refactor the DB to make compound primary keys go away, and that's a job for another day.
Are we just storing the flat prerequisites in the DB, and having the scheduler patch together the conditional graph complexities?
i.e.
(A.1 | B.1) | C.1 => D.1
Currently we store this as a bunch of aliases to the flat set (A.1, B.1, C.1), and the expression as a whole ((c1 | c2) | c3) as its own prerequisite also.
Is this just an internal running-scheduler thing, or just for the API dump, and/or is it part of that same JSON dump we've been using as a key?
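To illustrate what I mean by aliases plus a whole expression, here is a toy sketch. It is not how the scheduler actually evaluates prerequisites: the `evaluate` helper, the use of `eval`, and the assumption that each alias refers to a `succeeded` output are all mine.

```python
def evaluate(expression, aliases, satisfied):
    """Evaluate a conditional expression over the flat prerequisite set.

    aliases maps names like "c1" to (name, cycle, output) keys;
    satisfied maps those keys to a falsy or truthy state.
    """
    namespace = {
        alias: bool(satisfied[key]) for alias, key in aliases.items()
    }
    # eval with no builtins, for illustration only.
    return bool(eval(expression, {"__builtins__": {}}, namespace))


# (A.1 | B.1) | C.1 => D.1, assuming the ":succeeded" output of each.
aliases = {
    "c1": ("A", "1", "succeeded"),
    "c2": ("B", "1", "succeeded"),
    "c3": ("C", "1", "succeeded"),
}
expression = "(c1 | c2) | c3"

none_done = {key: False for key in aliases.values()}
c_done = dict(none_done)
c_done[("C", "1", "succeeded")] = "satisfied naturally"

result_none = evaluate(expression, aliases, none_done)
result_c = evaluate(expression, aliases, c_done)
```

In this picture only the flat `satisfied` mapping would need to live in the DB; the expression and aliases can be rebuilt from the graph.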
This is just about storing partially satisfied prerequisites in the DB so that restarts work (because under SoD, there is no dependency matching to re-satisfy them after a restart, which is what used to happen). I don't think it affects anything else.
Ah, right 👍