Prefect: Investigate reports of failed mapped tasks returning None to downstream tasks

Created on 26 May 2020 · 10 Comments · Source: PrefectHQ/prefect

Description

A contributor reported unstable mapping behavior. As it was described to me:

in a mapped pipeline
a dask worker unexpectedly dies
the downstream task unexpectedly runs and receives None as input, causing a runtime error because of the unexpected input

I also found reports of this in our Slack history (archived here: https://github.com/PrefectHQ/prefect/issues/2655) that suggest a link to specific deployment environments and to high-volume mapped pipelines.

Note: is this possibly related to https://github.com/PrefectHQ/prefect/issues/2430?

Expected Behavior

What did you expect to happen instead?
The upstream mapped task is Failed, and the downstream mapped task does not run.
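To make the expected behavior concrete, here is a minimal sketch in plain Python. It does not use Prefect's API: the `State` enum and `downstream_may_run` are hypothetical names invented for illustration, and they only model the rule that a downstream task should run when every upstream mapped child succeeded.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    """Hypothetical stand-in for task run states (not Prefect's classes)."""
    SUCCESS = "Success"
    FAILED = "Failed"
    RETRYING = "Retrying"


def downstream_may_run(upstream_states: Iterable[State]) -> bool:
    """Return True only when every upstream mapped child succeeded."""
    return all(state is State.SUCCESS for state in upstream_states)


# All mapped children finished normally: the downstream task may run.
healthy = [State.SUCCESS, State.SUCCESS, State.SUCCESS]

# One child was lost to a dead worker: the downstream task must not run.
worker_died = [State.SUCCESS, State.FAILED, State.SUCCESS]
```

Under this model, the reported bug corresponds to the downstream task running anyway and seeing None in place of the lost child's result.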

Reproduction

A minimal example that exhibits the behavior.
I have not observed it myself yet, but based on the Slack thread, a high-volume mapped task on an unstable network using DaskKubernetesEnvironment seems to be the best way to reproduce it.

Environment

Any additional information about your environment

Optionally run prefect diagnostics from the command line and paste the information here

bug

Source

lauralorenz

Most helpful comment

Without a reproducible example, I'm not sure how to progress on this, especially since it may have been resolved by the mapping refactor. +0.5 on closing if others are ok with it, since we don't have an immediate action plan or reproducer.

jcrist on 16 Jun 2020

👍2

All 10 comments

I think this is possibly solved by #2646 but I'm not sure how to confirm.

cicdw on 1 Jun 2020

I think this just happened to me: https://cloud.prefect.io/prefect/task-run/2cf36073-939b-47c5-9d51-a94089889e1b?logId=

dylanbhughes on 27 Jun 2020

@dylanbhughes do you have a Result configured for either your Flow or for any of your tasks?

cicdw on 29 Jun 2020

Just a GCS ResultHandler

dylanbhughes on 29 Jun 2020

Here's another example: {staging_url}prefect-staging/flow-run/be28eebe-06c5-4cb9-b0b4-516722416ffd?logId=

dylanbhughes on 30 Jun 2020

Another: {staging_url}/prefect-staging/task-run/fe1c7aac-c7ae-48de-9c97-5fba8c1a3f95?logId=

dylanbhughes on 9 Jul 2020

Finally I'll say: this looks unresolved by the mapping refactor. What I'm seeing is that if there's a failure to communicate with Cloud, or the run is restarted by the zombie killer, mapped tasks in the same level can return None when resurrected.

dylanbhughes on 9 Jul 2020

I'm digging into this - I have a theory.

cicdw on 10 Jul 2020

Good news everyone! I have a reproducible example of this behavior. @jcrist it's for your favorite part of the codebase - results! It's specific to the following situation:

zombies occur mid-way through a mapped pipeline on tasks that have retries
there is a reduce task immediately after the zombie-level

It appears that all data that was produced by the successfully mapped children prior to the zombie-death is _not_ properly rehydrated on the other end whenever the process is resurrected for a retry.

Here's the flow I used locally to test:

import prefect
from prefect import task, Flow

from datetime import timedelta
import os
import time
import sys


@task
def return_list():
    prefect.context['logger'].debug(f'PID: {os.getpid()}')
    return list(range(10))


@task(max_retries=2, retry_delay=timedelta(seconds=0))
def map_task(x):
    if x == 5:
        prefect.context['logger'].critical('Waiting: do it! do it!')
        time.sleep(20)
    return x


@task
def reducer(ll):
    msg = '\n'.join("{i}: {v}".format(i=i, v=v) for i, v in enumerate(ll))
    prefect.context['logger'].debug(msg)


with Flow("zombie") as flow:
    reducer(map_task.map(return_list))

When I saw the waiting log, I killed both the flow runner process and the heartbeat process for the task. After waiting for Cloud to do its thing, I then saw:

[2020-07-10 03:26:31] 807-- DEBUG - prefect.CloudTaskRunner | Task 'reducer': Calling task.run() method...
[2020-07-10 03:26:31] 29-- DEBUG - prefect.reducer | 0: None
1: None
2: None
3: None
4: None
5: None
6: None
7: 7
8: 8
9: 9

It appears that our load_results logic doesn't quite work whenever the immediate upstream was a mapped task. I can resolve tomorrow 👍
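The failure mode described above can be sketched in plain Python. This is an illustration under assumptions, not Prefect's load_results implementation: `ChildState`, `rehydrate`, and the dict used as storage are hypothetical stand-ins for a mapped child's state, the reload step, and a result backend.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChildState:
    """Hypothetical stand-in for the state of one mapped child."""
    location: Optional[str]  # where the result was persisted
    value: Any = None        # in-memory value, lost when the process dies


def rehydrate(children: List[ChildState], storage: Dict[str, Any]) -> List[Any]:
    """Reload any missing in-memory value from its persisted location."""
    values = []
    for child in children:
        if child.value is None and child.location is not None:
            child.value = storage[child.location]
        values.append(child.value)
    return values


# Hypothetical result backend: every child's result was persisted.
storage = {f"results/{i}": i for i in range(10)}

# Mirrors the log above: children 0-6 have only a location left,
# children 7-9 still hold their values in memory.
children = [
    ChildState(location=f"results/{i}", value=None if i < 7 else i)
    for i in range(10)
]

# What the reducer receives if mapped children are never reloaded.
without_rehydration = [child.value for child in children]

# What it receives once each child is reloaded from its location.
with_rehydration = rehydrate(children, storage)
```

The point of the sketch is that the reload has to happen per mapped child; reloading only the parent mapped state would leave the children's values as None.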

cicdw on 10 Jul 2020

🚀1
