Cylc-flow: Better DB task pool prerequisite storage

Created on 5 Aug 2020 · 9 Comments · Source: cylc/cylc-flow

Initial SoD implementation uses a nasty hack to store task proxy prerequisites in the DB (because I'm "database-challenged"):

cylc.flow.task_pool

    # Update prerequisite satisfaction status from DB
    sat = {}
    for k, v in json.loads(satisfied).items():
        sat[tuple(json.loads(k))] = v
    # TODO (from Oliver's PR review):
    #   Wait, what, the keys to a JSON dictionary are themselves JSON
    #     :vomiting_face:!
    #   This should be converted to its own DB table pre-8.0.0.
    for pre in itask.state.prerequisites:
        for k, v in pre.satisfied.items():
            pre.satisfied[k] = sat[k]

This works fine but is fundamentally :vomiting_face:

sod-follow-up

hjoliver

😄2

All 9 comments

For reference the task_pool.satisfied field currently contains data in the format:

{
    [<name>, <cycle>, <output>]: <satisfied>,
    ...
}

Where <satisfied> records whether an output has been generated and whether it was generated by the workflow (naturally) or by user intervention (artificially).
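A minimal sketch of how that format round-trips, mirroring the decoding loop quoted at the top of the issue. The example keys and the values `False` and `"satisfied naturally"` are assumptions for illustration, and `encode_satisfied` / `decode_satisfied` are hypothetical helper names, not cylc-flow functions.

```python
import json

# Hypothetical data: (name, cycle, output) -> satisfaction state.
satisfied = {
    ("foo", "1", "succeeded"): False,
    ("ipsum", "1", "custom1"): "satisfied naturally",
}


def encode_satisfied(sat):
    """Serialise to one JSON string.

    JSON object keys must be strings, so each tuple key is itself
    JSON-encoded first. This is the "JSON inside JSON" being discussed.
    """
    return json.dumps({json.dumps(list(k)): v for k, v in sat.items()})


def decode_satisfied(text):
    """Reverse encode_satisfied: decode the outer object, then each key."""
    return {tuple(json.loads(k)): v for k, v in json.loads(text).items()}


encoded = encode_satisfied(satisfied)
decoded = decode_satisfied(encoded)
```

Every key has to be parsed a second time on load, which is the awkwardness a dedicated table would remove.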

This approach makes housekeeping a little simpler, but it also results in a lot of data duplication and effectively uses the database as a JSON file, which is the sort of thing that makes database engineers sick.

I think what we probably want is one table of tasks, another of outputs, and a third mapping between the two. This turns the housekeeping problem into a garbage collection problem: whenever you remove a task from the database, you check the mapping to see if you can remove any outputs at the same time.
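A rough sketch of that three-table idea using Python's `sqlite3`. The table names (`tasks`, `outputs`, `task_outputs`), the column names, and the `remove_task` helper are all hypothetical; nothing here is taken from the cylc-flow schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY, name TEXT, cycle TEXT,
        UNIQUE (name, cycle));
    CREATE TABLE outputs (
        id INTEGER PRIMARY KEY, name TEXT, cycle TEXT, output TEXT,
        satisfied TEXT, UNIQUE (name, cycle, output));
    CREATE TABLE task_outputs (
        task_id INTEGER REFERENCES tasks(id),
        output_id INTEGER REFERENCES outputs(id),
        PRIMARY KEY (task_id, output_id));
""")


def remove_task(conn, task_id):
    """Delete a task, then garbage-collect outputs no task refers to."""
    conn.execute("DELETE FROM task_outputs WHERE task_id = ?", (task_id,))
    conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
    conn.execute(
        "DELETE FROM outputs"
        " WHERE id NOT IN (SELECT output_id FROM task_outputs)"
    )


# Two tasks share the foo.1:succeeded output; only baz.1 needs ipsum.1:custom1.
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)",
                 [(1, "bar", "1"), (2, "baz", "1")])
conn.executemany("INSERT INTO outputs VALUES (?, ?, ?, ?, ?)",
                 [(1, "foo", "1", "succeeded", "false"),
                  (2, "ipsum", "1", "custom1", "satisfied naturally")])
conn.executemany("INSERT INTO task_outputs VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2)])

remove_task(conn, 2)  # baz.1 goes; the shared output must survive
remaining = conn.execute(
    "SELECT name, cycle, output FROM outputs ORDER BY id").fetchall()
```

Each output is stored once however many tasks depend on it, at the cost of the extra clean-up query on removal.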

oliver-sanders on 5 Aug 2020

> it also results in a lot of data duplication

That's not really true: the task_pool table is continually pruned (it only records the current task pool), so the only duplication in the DB now will be over current active (and current partially satisfied waiting) tasks that have the same upstream parent, e.g. members of a family that just triggered together. And that duplication does not accumulate in the DB, because the table is pruned as the task pool evolves.

hjoliver on 5 Aug 2020

Example workflow (let's say foo.1 fails):

P1 = foo & ipsum:custom1 => bar

I think we would need to have a table task_prerequisites (asterisk denotes primary key):
| cycle* TEXT | name* TEXT | prereq_name* TEXT | prereq_cycle* TEXT | prereq_output* TEXT | satisfied TEXT |
| ----------- | ---------- | ----------------- | ------------------ | ------------------- | ------------------- |
| 1 | bar | foo | 1 | succeeded | false |
| 1 | bar | ipsum | 1 | custom1 | satisfied naturally |
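As a sketch, the table above could be created and populated with `sqlite3` as follows. The column names and the two rows come from the table; the in-memory connection and the duplicate-insert check are only there for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_prerequisites (
        cycle TEXT,
        name TEXT,
        prereq_name TEXT,
        prereq_cycle TEXT,
        prereq_output TEXT,
        satisfied TEXT,
        PRIMARY KEY (cycle, name, prereq_name, prereq_cycle, prereq_output)
    )
""")

rows = [
    ("1", "bar", "foo", "1", "succeeded", "false"),
    ("1", "bar", "ipsum", "1", "custom1", "satisfied naturally"),
]
conn.executemany(
    "INSERT INTO task_prerequisites VALUES (?, ?, ?, ?, ?, ?)", rows)

# The compound primary key allows only one row per (task, prerequisite) pair.
try:
    conn.execute(
        "INSERT INTO task_prerequisites VALUES (?, ?, ?, ?, ?, ?)", rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute(
    "SELECT COUNT(*) FROM task_prerequisites").fetchone()[0]
```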

I'm trying to figure out how to use this with the way things are loaded from the database: https://github.com/cylc/cylc-flow/blob/96f56035985fdf465f6f8e734d98804398576f99/cylc/flow/scheduler.py#L707-L718

The callback handles the task_pool table row by row: https://github.com/cylc/cylc-flow/blob/96f56035985fdf465f6f8e734d98804398576f99/cylc/flow/task_pool.py#L348-L369

But there can be >1 row of the task_prerequisites table per row of the task_pool table. So I think we would need a function call to select from the task_prerequisites table down here in the load_db_task_pool_for_restart() method, replacing the JSON parsing stuff: https://github.com/cylc/cylc-flow/blob/96f56035985fdf465f6f8e734d98804398576f99/cylc/flow/task_pool.py#L420-L433
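A minimal sketch of that per-task select, assuming the `task_prerequisites` table proposed above. `load_prerequisites` is a hypothetical helper, the plain dict `pre_satisfied` stands in for one prerequisite's `satisfied` mapping on a task proxy, and treating the stored text `'false'` as Python `False` is an assumption.

```python
import sqlite3


def load_prerequisites(conn, cycle, name):
    """Return {(prereq_name, prereq_cycle, prereq_output): satisfied}
    for a single task, from however many rows that task has."""
    rows = conn.execute(
        "SELECT prereq_name, prereq_cycle, prereq_output, satisfied"
        " FROM task_prerequisites WHERE cycle = ? AND name = ?",
        (cycle, name),
    )
    # Assumption: an unsatisfied prerequisite is stored as the text 'false'.
    return {
        (p_name, p_cycle, p_output): False if sat == "false" else sat
        for p_name, p_cycle, p_output, sat in rows
    }


# Illustration-only setup: the proposed table with the two example rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_prerequisites ("
    " cycle TEXT, name TEXT, prereq_name TEXT, prereq_cycle TEXT,"
    " prereq_output TEXT, satisfied TEXT,"
    " PRIMARY KEY (cycle, name, prereq_name, prereq_cycle, prereq_output))"
)
conn.executemany(
    "INSERT INTO task_prerequisites VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("1", "bar", "foo", "1", "succeeded", "false"),
        ("1", "bar", "ipsum", "1", "custom1", "satisfied naturally"),
    ],
)

# Stand-in for a freshly created task proxy: nothing satisfied yet.
pre_satisfied = {
    ("foo", "1", "succeeded"): False,
    ("ipsum", "1", "custom1"): False,
}

# Called once per task_pool row, in place of the JSON parsing.
sat = load_prerequisites(conn, "1", "bar")
for key in pre_satisfied:
    pre_satisfied[key] = sat[key]
```

The row-by-row callback for task_pool stays as it is; only the lookup of the satisfaction state changes.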

MetRonnie on 8 Oct 2020

Thanks for taking this on @MetRonnie :bouquet: (I'm afraid I won't be much help on it unless I spend time trying to figure it out myself, which I guess is what you're doing).

hjoliver on 8 Oct 2020

No problem, just brainstorming/checking if anyone can see anything glaringly wrong with my approach...

MetRonnie on 9 Oct 2020

👍2

Looks OK. In case you were concerned about using a compound primary key made out of so many fields, don't worry. We would need to refactor the DB to make compound primary keys go away, and that's a job for another day.

oliver-sanders on 9 Oct 2020

👍1

Are we just storing the flat prerequisites in the DB, and having the scheduler patch together the conditional graph complexities?

i.e.

(A.1 | B.1) | C.1 => D.1

Currently we store this as a bunch of aliases to the flat set (A.1, B.1, C.1), and the expression as a whole ((c1 | c2) | c3) as its own prerequisite too.
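To make the distinction concrete, here is a small sketch of a flat satisfaction map combined with an aliased expression. The `succeeded` outputs, the satisfaction values, and the `is_satisfied` helper are assumptions for illustration; this is not how cylc-flow is known to evaluate conditionals.

```python
# Flat set: only this would need to be persisted across a restart.
satisfied = {
    ("A", "1", "succeeded"): False,
    ("B", "1", "succeeded"): "satisfied naturally",
    ("C", "1", "succeeded"): False,
}

# Aliases and the expression can be rebuilt from the graph on restart.
aliases = {
    "c1": ("A", "1", "succeeded"),
    "c2": ("B", "1", "succeeded"),
    "c3": ("C", "1", "succeeded"),
}
expression = "(c1 | c2) | c3"


def is_satisfied(expression, aliases, satisfied):
    """Evaluate the expression with each alias bound to a boolean.

    eval() is used only to keep the sketch short; it is safe here because
    the expression is a fixed string and builtins are disabled.
    """
    names = {alias: bool(satisfied[key]) for alias, key in aliases.items()}
    return eval(expression, {"__builtins__": {}}, names)


d1_ready = is_satisfied(expression, aliases, satisfied)
```

Under this reading, the DB holds only the flat rows and the scheduler reapplies the expression after loading them.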

Is this just an internal running-scheduler thing, or just for the API dump? And/or is it part of that same JSON dump we've been using as a key?

dwsutherland on 26 Oct 2020

This is just about storing partially satisfied prerequisites in the DB so that restarts work (because under SoD, there is no dependency matching to re-satisfy them after a restart, which is what used to happen). I don't think it affects anything else.

hjoliver on 26 Oct 2020

👍1

> This is just about storing partially satisfied prerequisites in the DB so that restarts work (because under SoD, there is no dependency matching to re-satisfy them after a restart, which is what used to happen). I don't think it affects anything else.

Ah, right 👍

dwsutherland on 26 Oct 2020
