Prefect: Duplicate runs scheduled

Created on 8 Nov 2020  路  15Comments  路  Source: PrefectHQ/prefect

Description

Occasionally a specific flow will get double-scheduled in the latest version of Prefect server (we've run into this bug several times on previous versions). I.e. for a single clock we will get two identical runs at the same time (both visible in the UI, then both executing at the same time).

Unfortunately this does not happen on every scheduler cycle or we could troubleshoot more easily.

This may or may not be relevant, but it only happens on the times when the schedule is set to run with specific parameters. Another clock on the same schedule does not have this problem.

We can temporarily fix it by:

  • Toggling the schedule on and off in the UI (only works sometimes).
  • Deleting the postgres data and re-registering the flow

Expected Behavior

Any unique combination of time, flow, and parameters should only get scheduled once by the server.

Reproduction

Unfortunately we can't always reproduce it, but this problem eventually recurs even when we completely wipe the server setup and redeploy.

The schedule in question is generated as:

# normally imported from a config:
CRON_TIMES_LOADONLY = ["0 6 * * *"] 
CRON_TIMES_FULLRUN = ["0 9 * * *"]

start_date = pendulum.now(tz="America/Chicago")  # pin clock to CDT
clocks_loadonly = [
    clocks.CronClock(s, start_date=start_date) for s in CRON_TIMES_LOADONLY
]
clocks_fullrun = [
    clocks.CronClock(
        s, start_date=start_date, parameter_defaults={"run-dbtbuild": True}
    )
    for s in CRON_TIMES_FULLRUN
]
schedule = Schedule(clocks=clocks_loadonly + clocks_fullrun)

This definition style hasn't caused us any issues when not used with parameter defaults.

Environment

{
"config_overrides": {
"server": {
"host": true,
"telemetry": {
"enabled": true
},
"ui": {
"apollo_url": true
}
}
},
"env_vars": [],
"system_information": {
"platform": "Linux-4.15.0-118-generic-x86_64-with-glibc2.10",
"prefect_backend": "cloud",
"prefect_version": "0.13.14",
"python_version": "3.8.5"
}
}

Most helpful comment

Same here. I also use parameter_defaults. Flow re-registration seems to fix this issue temporarily. In my case, the scheduling is de-regulated after one or two days. Really difficult to reproduce it locally :/

Environment

```
{
"config_overrides": {
"api": {
"url": true
},
"context": {
"secrets": false
},
"server": {
"database": {
"host_port": true
},
"ui": {
"graphql_url": true
}
}
},
"env_vars": [
"PREFECT__CONTEXT__SECRETS__...",
"PREFECT__CONTEXT__SECRETS__..."
],
"system_information": {
"platform": "Linux-4.19.0-10-amd64-x86_64-with-debian-10.6",
"prefect_backend": "server",
"prefect_version": "0.13.14",
"python_version": "3.7.3"
}
}

All 15 comments

Same here. I also use parameter_defaults. Flow re-registration seems to fix this issue temporarily. In my case, the scheduling is de-regulated after one or two days. Really difficult to reproduce it locally :/

Environment

```
{
"config_overrides": {
"api": {
"url": true
},
"context": {
"secrets": false
},
"server": {
"database": {
"host_port": true
},
"ui": {
"graphql_url": true
}
}
},
"env_vars": [
"PREFECT__CONTEXT__SECRETS__...",
"PREFECT__CONTEXT__SECRETS__..."
],
"system_information": {
"platform": "Linux-4.19.0-10-amd64-x86_64-with-debian-10.6",
"prefect_backend": "server",
"prefect_version": "0.13.14",
"python_version": "3.7.3"
}
}

@jnoynaert you mentioned that this has happened on pervious versions, I'm wondering if you know how long you have been encountering this issue? Looking through recent server releases this PR https://github.com/PrefectHQ/server/pull/111 seems like it could be related since it changes behavior when scheduling with parameter defaults.

Also, would anyone mind providing a flow example that they are seeing this with? (schedule, parameters, etc.)

Another thing that would be very useful to know: the scheduler uses idempotency_keys when creating flow runs to avoid duplication. In Server, these keys are stored in the database on the flow_run table. For your duplicate runs, could you query for the associated idempotency keys and let us know what you see? If they are different, then this suggest the idempotency key logic is somehow flawed in a non-deterministic way.

@joshmeek I can't say conclusively because some logs have already been dropped, but it looks like we were on 0.13.11 when we first saw this.

@cicdw I've included an export of the flow_run table (filtered to just the relevant flow, and including the non-duplicate runs running on the clock with no parameters) here; it should be open but let me know if you run into access issues. The idempotency keys are the same for the duplicated runs...

See competent-skua & quirky-frigatebird runs for an example of the double schedule (note also that it's starting to triple schedule future runs, beginning tomorrow a.m.).

Thanks @jnoynaert this is immensely useful - I already see some clues: the idempotency key is only valid for 24 hours, and these runs were created more than 24 hours apart (which shouldn't happen - the scheduler should only begin scheduling after the currently max scheduled start time, so something fishy is going on). Are you by any chance turning the schedule on / off at any point?

This specific schedule has been toggled off and back on once (after the first double schedule caused some failures). That temporarily eliminated the duplicates from the upcoming runs but then the issue recurred. In past cases we've never toggled the schedule on our production environment and still see the error (same case for generating the first set of duplicates).

The one thing we're doing that might be different than a usual test case is that our CI/CD processes redeploy over current flow versions pretty regularly when our data pipelines get redeployed due to changes in other parts of the same codebase. It's not unusual to get 5-10 "version" bumps in Prefect with no change to the flow definitions. However, I've never been able to get that deploy/registration process to have any unusual impact or server or noticed any correlated behavior with this issue.

@cicdw FYI it would not be any trouble to reset our environment with additional logging or some debug flags enabled if that helps

OK awesome, good to know - I'll keep digging and let you know if I find some good places to include logs; I'll report back soon when I know more!

I'm also experiencing duplicate runs, I'm not sure if this is related to the reported issue from @jnoynaert, but in my current setup I've two local agent's with the same labels and my environment is as follows:

{
"config_overrides": {},
"env_vars": [],
"system_information": {
"platform": "Linux-5.4.0-1031-azure-x86_64-with-glibc2.29",
"prefect_backend": "server",
"prefect_version": "0.13.14",
"python_version": "3.8.5"
}

The log also shows the flow gets trigger twice from different Agent's:

7 November 2020,03:28:38 Agent01 INFO Submitted for execution: PID: 1161
7 November 2020,03:28:38 Agent02 INFO Submitted for execution: PID: 1147
....
7 November 2020,03:28:41 prefect.CloudTaskRunner INFO Task 'post': Finished task run for task with final state: 'Success'
7 November 2020,03:28:41 prefect.CloudFlowRunner INFO Flow run SUCCESS: all reference tasks succeeded
7 November 2020,03:28:41 prefect.CloudTaskRunner INFO Task 'post': Finished task run for task with final state: 'Success'
7 November 2020,03:28:41 prefect.CloudFlowRunner INFO Flow run SUCCESS: all reference tasks succeeded

I'm registering the flows within the same server and with the same Prefect version as Server and UI and they both have the same version.
If I don't configure my agent's at the exact same time it only trigger it once, meaning if I wait one second starting the other agent the above doesn't happen, but at some point they always overlaps.

Hi @davidmorch - this is different issue; Prefect Server does not implement any lock on states so if you run horizontally scale agents there is a chance they submit the same run twice. Prefect Cloud uses a global lock that prevents this behavior.

The one thing we're doing that might be different than a usual test case is that our CI/CD processes redeploy over current flow versions pretty regularly when our data pipelines get redeployed due to changes in other parts of the same codebase. It's not unusual to get 5-10 "version" bumps in Prefect with no change to the flow definitions. However, I've never been able to get that deploy/registration process to have any unusual impact or server or noticed any correlated behavior with this issue.

Hey @jnoynaert, this doesn't relate to this issue but heads up that we've just released an idempotency key feature for flow registration that would help your versioning be more meaningful :) https://docs.prefect.io/orchestration/tutorial/first.html#register-flow-with-the-prefect-api

@madkinsz thanks! I actually added that to our deploy right after reading the release notes; unfortunately for some reason calling flow.register() with idempotency_key=flow.serialized_hash() does not give the same hash on repeated runs. Might have something to do with the fact that I re-import all of the flows from other submodules before registering them, but I haven't dug into it to see if it needs a bug report

@madkinsz thanks! I actually added that to our deploy right after reading the release notes; unfortunately for some reason calling flow.register() with idempotency_key=flow.serialized_hash() does not give the same hash on repeated runs. Might have something to do with the fact that I re-import all of the flows from other submodules before registering them, but I haven't dug into it to see if it needs a bug report

Interesting! It should give the same hash on repeated runs. If you dig into it and have a MRE feel free to tag me in an issue and I'll take a look. You could also use a different key like the git last modified timestamp for the file with the flow instead of the serialized hash.

Update: the fix for the duplicate scheduled runs has been merged into server and the next core release will use the updated images 馃憤

Was this page helpful?
0 / 5 - 0 ratings

Related issues

emcake picture emcake  路  3Comments

kforti picture kforti  路  3Comments

ludwigm picture ludwigm  路  3Comments

rej-jsa picture rej-jsa  路  4Comments

mark-w-325 picture mark-w-325  路  3Comments