Pipelines: [BUG] Downloading archived artifacts through UI truncates output file

Created on 22 May 2020 · 26 Comments · Source: kubeflow/pipelines

What steps did you take:

Hi everyone. We've found a strange problem: sometimes artifacts cannot be opened properly after downloading them through the Pipelines UI. One step of our pipeline produces a parquet file, which opens properly when we download the tgz directly from the Minio UI and extract it manually with tar -xzvf <artifact.tar.gz>.

What happened:

However, when we click on the artifact link in the Pipelines UI, the browser "successfully" (without an exception or bad response) downloads a file that is significantly smaller than the original and can't be read by pandas.

What did you expect to happen:

Downloading artifacts from the Pipelines UI should provide the correct file, with its original size and content.

How to reproduce:

Put any tgz artifact file into Minio, then try to download it via backend's /artifacts/get endpoint. Please, let me know if you can't reproduce it. I'm really concerned that I couldn't find any similar issue since 0.5.0 version, so it may be some outlandish bug in our environment.

How did you deploy Kubeflow Pipelines (KFP)?

Private AWS cluster

KFP version:
0.5.0

KFP SDK version:
0.5.1

Anything else you would like to add:

I've been hunting this bug for hours and for me it's still quite unclear. Maybe this information can help to figure out the root of the problem.

0) parquet files (and another testing file mentioned in point 5) are written inside component with (if engine and compression matter)

data_frame.to_parquet(args.output_path, engine='fastparquet', compression='gzip')

1) I could see this problem for several tgz files, but for some reason it is unlikely to reproduce when the archive is small (in my experience, less than 20MB).
2) The problem does not occur for the same files when they are not archived: I could download everything even though the files are larger than their archives.
3) Size of downloaded file is almost random, but when I download it few times repeatedly, size for two files can be the same.
4) I meticulously tried to reproduce the problem for one of our artifacts and found that every incompletely downloaded (and backend-extracted) file is a prefix of the original file. Why do I think so? If the downloaded file is K bytes long, I copy the first K bytes of the original file into a third one (head -c K original_file > third_file), and the diff CLI tool considers the two binary files (the downloaded one and the "third", of equal size) identical in content.
5) Since I can't share our artifact file in case if you need it for investigation and in order to be sure that it's not clearly our mistake, I took random parquet file from kaggle, read it in ipython and re-wrote it to another file with options from point 0. The problem persists, however, for some reason, point 4 does not (maybe it depends on some content format, I'm not familiar with pandas and parquet files).

6) I guess there might be some implicit problem with streaming unarchive modules in nodejs. My main argument here is that non-archived files seem to be downloaded smoothly. I looked into pod's logs but there are only messages below

GET /pipeline/artifacts/get?source=minio&bucket=mlpipeline&key=artifacts ...
Getting storage artifact at: minio: mlpipeline/artifacts/...

/kind bug
/area frontend

area/frontend kind/bug lifecycle/stale status/triaged

anatolyzolotar

All 26 comments

Can you check the size of the file in the storage so that we can be sure this is the issue in the Frontend and not Argo?

Do you get the same file size every time you try downloading?

Ark-kun on 22 May 2020

@Ark-kun

Can you check the size of the file in the storage so that we can be sure this is the issue in the Frontend and not Argo?

Sure. As I mentioned above, artifacts and their archives have correct size in Minio UI and I can download files from there without any problems.

Do you get the same file size every time you try downloading?

Yes,

Size of downloaded file is almost random, but when I download it few times repeatedly, size for two files can be the same.

If you mean downloading via the Minio UI, the downloaded files have the same size and they're correct tgz archives (or other types, depending on the file).

anatolyzolotar on 22 May 2020

I somehow missed "parquet file, which can be opened properly when we download tgz directly from Minio UI and extract it manually by tar -xzvf <artifact.tar.gz>".

I've assigned the people who know the most about that part of code.

Ark-kun on 22 May 2020

❤1

@eterna2 is this related to the endpoint not able to handle binary data?

Bobgy on 22 May 2020

@Bobgy at first I would say "yes", but I tried to download non-archived parquet files via the endpoint and everything seemed good. I get this problem when the file is a tgz. However, that may not be true in general. I tend to think that the problem is somewhere in the streaming processing of archives.

anatolyzolotar on 22 May 2020

there is an implicit logic in the frontend server in handling tgz.

If it is a tgz, the node server will automatically gunzip it and return the 1st record in the archive, instead of the actual tarball.

Maybe we shld revisit this logic? Cuz the original use case in the past was most of the artifacts are string-based.

Let me see if I can reproduce this. Parquet > 20 mb?
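
To illustrate the "first record" behaviour described above, here is a minimal Python sketch. It is not the frontend's Node.js code: `first_record` and `make_tgz` are hypothetical helpers, and treating "1st record" as "first regular file" is an assumption made for the example. It shows that when an archive holds an extra entry ahead of the payload, that entry is what comes back.

```python
import io
import tarfile
from typing import Iterable, Optional, Tuple


def first_record(tgz_bytes: bytes) -> Optional[Tuple[str, bytes]]:
    """Return (name, content) of the first regular file in a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                extracted = tar.extractfile(member)
                if extracted is not None:
                    return member.name, extracted.read()
    return None


def make_tgz(files: Iterable[Tuple[str, bytes]]) -> bytes:
    """Build an in-memory .tgz from (name, content) pairs, in the given order."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, content in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


# Illustrative archive: a small hidden entry stored before the real payload.
archive = make_tgz([("._data.pqt", b"metadata"), ("data.pqt", b"PAR1 payload PAR1")])
name, content = first_record(archive)
print(name)  # the entry stored first is the one that is returned
```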

eterna2 on 22 May 2020

@eterna2 yes, I reproduced it for this file https://github.com/kubeflow/pipelines/files/4665228/rewritten_baseline_data.tar.gz (I don't think it matters, but before upload it as an attachment, I had to change extension from tgz to tar.gz). Please, try too.
UPD: both tgz files for which I could reproduce this problem also had only one file inside.

anatolyzolotar on 24 May 2020

I see 2 files inside the tarball.

._rewritten_baseline_data.pqt
rewritten_baseline_data.pqt

eterna2 on 26 May 2020

I think this is the reason. There is a hidden file inside your tarball.

This is sample code I wrote to download your tarball, extract only one parquet file, and output it as an artifact. I managed to download the parquet correctly.

import functools
import typing

import kfp.dsl
import kfp.components
import kfp.compiler


func_to_container_op = functools.partial(
    kfp.components.func_to_container_op,
    base_image="python:3.7-slim",
    packages_to_install=["pandas", "pyarrow", "requests"],
)


@func_to_container_op
def get_artifact(
    download_from: str,
    parquet_path: kfp.components.OutputPath(str),
):
    import io
    import logging
    import tarfile
    import requests

    logging.basicConfig(level=logging.INFO)
    logging.info("Downloading from %s", download_from)
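    response = requests.get(download_from)
    response.raise_for_status()  # fail early on a bad HTTP status
    stream = io.BytesIO(response.content)

    logging.info("Unpacking file")
    tar = tarfile.open(fileobj=stream)
    # Skip hidden entries (e.g. "._*" files) and take the first remaining name.
    filename = [
        name for name in tar.getnames()
        if not name.rsplit("/", 1)[-1].startswith(".")
    ][0]
    logging.info("Extracting %s", filename)
    extracted_parquet = tar.extractfile(filename)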

    if parquet_path and extracted_parquet:
        with open(parquet_path, "wb") as writer:
            logging.info("Writing artifact to %s", parquet_path)
            writer.write(extracted_parquet.read())


@func_to_container_op
def trivial(parquet_path: kfp.components.InputPath(str)):
    import pandas as pd

    with open(parquet_path, "rb") as reader:
        df = pd.read_parquet(reader)
        print(df.head())  # a bare df.head() would discard the result



@kfp.dsl.pipeline(name="get_parquet_pipeline")
def get_parquet_pipeline(
    download_from: str = "https://github.com/kubeflow/pipelines/files/4665228/rewritten_baseline_data.tar.gz",
):
    op1 = get_artifact(download_from=download_from)
    trivial(op1.outputs["parquet"])


kfp.compiler.Compiler().compile(get_parquet_pipeline, "get_parquet_pipeline.yaml")

eterna2 on 26 May 2020

@Bobgy

Do u think we shld add a new flag unpack? If the unpack flag is provided, we will gunzip and unpack the 1st record (what we are doing now), otherwise we just return the artifact as is?

eterna2 on 26 May 2020

@eterna2 Sounds reasonable for the api.
Which link should we provide on KFP UI then? with unpack, is that right? Then how do we guide users to remove the unpack flag when downloading the complete artifact?

one idea: without unpack flag, if we detected more than one file in the tarball, we should add a warning sentence at the beginning of response that this is not full content?
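
A rough Python sketch of how such an `unpack` switch could behave is shown below. It is illustrative only: `read_artifact` is a hypothetical name, the real endpoint lives in the Node.js frontend server, and returning warnings as a separate list (rather than prepending text to the response) is an assumption chosen so that binary payloads stay intact.

```python
import io
import tarfile
from typing import List, Tuple


def read_artifact(raw: bytes, unpack: bool = False) -> Tuple[bytes, List[str]]:
    """Return (payload, warnings) for an artifact stored as a .tgz.

    unpack=False: hand back the stored bytes untouched.
    unpack=True:  gunzip, untar, and return only the first regular file,
                  warning the caller when the archive holds more than one.
    """
    if not unpack:
        return raw, []

    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        files = [member for member in tar.getmembers() if member.isfile()]
        if not files:
            return b"", ["archive contains no regular files"]
        payload = tar.extractfile(files[0]).read()

    warnings: List[str] = []
    if len(files) > 1:
        warnings.append(
            f"archive contains {len(files)} files; only {files[0].name!r} was returned"
        )
    return payload, warnings
```

With this shape, a download link would call `read_artifact(raw)` and a preview would call `read_artifact(raw, unpack=True)` and display any warnings next to the content.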

Bobgy on 26 May 2020

@eterna2 you're right, there is one more hidden file in the archive. I must have messed up when re-packing the example parquet into a tar with gz compression; this hidden file seems to be one of the apple quarantine files that gave me a headache when I tried to reproduce this problem. It really looks like for this tar.gz the API endpoint always returns this hidden file. I've rewritten this parquet file and re-packed it into a new tgz without excess/hidden files:
rewritten2.tar.gz. This is just one of the files for which the problem persists (I can't share the others).

However I don't think that this hidden file is the reason of the problem because I've just reproduced the problem for rewritten2.tar.gz:
1) I just tried to download+unarchive the artifact through frontend API endpoint (/artifacts/get) few times,
2) the files are not fully downloaded and can't be read by pandas.read_parquet
Original parquet file:
-rw-r--r-- 1 me staff 21482643 May 27 14:17 rewritten2.pqt
Downloaded files:

-rw-r--r--@  1 me  staff   3653120 May 27 14:22 get
-rw-r--r--@  1 me  staff   4439552 May 27 14:23 get (1)
-rw-r--r--@  1 me  staff   4587008 May 27 14:23 get (2)
-rw-r--r--@  1 me  staff   5333522 May 27 14:23 get (3)
-rw-r--r--@  1 me  staff   4275712 May 27 14:33 get (4)
-rw-r--r--@  1 me  staff   4767232 May 27 14:33 get (5)

3) the files are the beginning of the original file, for example

>>> head -c 4587008 rewritten2.pqt > rewritten2-first-4587008-bytes.pqt
>>> diff get\ \(2\) rewritten2-first-4587008-bytes.pqt

diff doesn't see any difference there.
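
The same prefix check can be scripted instead of running head and diff by hand. The sketch below is a minimal illustration; `is_prefix_of` is a hypothetical helper name and the file paths passed to it are up to the caller.

```python
import os


def is_prefix_of(partial_path: str, original_path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at partial_path equals the first bytes of original_path."""
    if os.path.getsize(partial_path) > os.path.getsize(original_path):
        return False  # a longer file cannot be a prefix
    with open(partial_path, "rb") as partial, open(original_path, "rb") as original:
        while True:
            chunk = partial.read(chunk_size)
            if not chunk:
                return True  # reached the end of the partial file without a mismatch
            if original.read(len(chunk)) != chunk:
                return False
```

For example, `is_prefix_of("get (2)", "rewritten2.pqt")` would perform the comparison shown above in a single call.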

So, downloading tgz files (with unarchiving) through the frontend API endpoint seems to be randomly interrupted for some reason. Downloading non-archived files works perfectly (I've just done it 10 times and everything is ok), and this makes me think that it is not a connectivity problem.

What I'm asking you to do:
Please, try to upload fixed artifact rewritten2.tar.gz to your Minio and download it through /artifacts/get API endpoint.

Thank you very much

anatolyzolotar on 27 May 2020

I managed to reproduce this, but I am still investigating the cause.

I was unable to figure out where it goes wrong. I suspect there might be some "special" char in the binary, or it might be an express or middleware issue.

This is what I did so far:

uploaded the tar to minio
retrieve with API
I get the same bug as u described. Incomplete file download.

Then I

do everything exactly the same, except no express, no middleware, i.e. I use port-forwarding and call the getObjectStream function directly (this function streams from minio, gunzips, and untars the artifact) and pipe the result to a file
I do not encounter this issue - I get the full file content

So I suspect the issue is at the express server layer. Either the server terminates prematurely because of an EOF or some null characters, or there is some other issue. Need to investigate.

eterna2 on 29 May 2020

🎉1 👍1

@eterna2, Thank you very much. However, I want to point out again the interesting fact that everything works fine for non-archived artifacts. You can also try to extract the parquet file on your local computer, put it in Minio and download it via the API endpoint. As far as I understand, in this case the download stream should be the same as when you download the same parquet from a tar.gz that is being extracted. Do you agree, or is there any subtle difference between downloads with and without tar extraction?

anatolyzolotar on 29 May 2020

I can confirm it is an express problem.

i first tried w/o express (just the function to retrieve the artifact) - it works
then I added a simple express server with a handler that uses the function - I see incomplete, inconsistent downloads

Let me debug further, and trace

eterna2 on 29 May 2020

Almost there. I suspect it is all the piping around. And the stream is prematurely terminated.

eterna2 on 29 May 2020

I can't figure out the issue, might need to do even more testing:

purely functional test - I used the function to retrieve the tarball from minio - no issue
light-weight service - I constructed a simple express server over the function (very similar to the actual server) - no issue
actual local test (no containerization, no istio, no k8s) - I build and run the actual node server in my local machine (config it to read from the minio service) - no issue
full deployment in k8s - early termination of the stream (but i check the data is right, just terminated early)

eterna2 on 29 May 2020

👍2

@eterna2 Any chance this can be a problem in ingress?
Did you try kubectl port-forward to a full deployment in k8s? Does it give an error too?

Bobgy on 1 Jun 2020

@eterna2 Any chance this can be a problem in ingress?
Did you try kubectl port-forward to a full deployment in k8s? Does it give an error too?

@Bobgy
yes u r right! This is what I did:

port-forward to the ml-pipeline-ui pod <- downloads properly
port-forward to istio-gateway <- downloads incomplete
port-forward to ml-pipeline-ui nodeport service <- downloads incomplete

This is very strange.

I have never looked in-depth at how k8s services do the networking, so I am clueless about what the issue might be. But these are my observations

the parquet itself shld not have an issue because I tried with a pipeline and output it as an artifact - argo managed to tarball it properly and u can download the parquet
the folder structure of the provided tarball is a bit different from how argo would do it (is this a possible reason?)
http chunked transfer encoding assumes the transfer is complete if it receives an empty chunk (because the size is not known in a streaming protocol). I was able to download the artifact completely if I remove the streaming code and replace it with a single buffer array.

I will see if I can add a content-length header to the http response, and see if it works. But might need 1 more API call to minio to get the object size.
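
To make the chunked-encoding point concrete: in HTTP/1.1 chunked transfer coding the body ends with a zero-length chunk, so a receiver that never sees that terminator has no way of knowing the total size. The sketch below is a minimal illustrative decoder, not the code path used by express, istio, or any browser; `decode_chunked` is a hypothetical helper and it ignores trailers.

```python
from typing import Tuple


def decode_chunked(raw: bytes) -> Tuple[bytes, bool]:
    """Decode an HTTP/1.1 chunked body.

    Returns (payload, complete). complete is True only when the terminating
    zero-length chunk was seen; otherwise the payload is whatever arrived.
    """
    payload = bytearray()
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            return bytes(payload), False  # stream ended before a chunk-size line
        # Chunk extensions (after ';') are ignored.
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
        except ValueError:
            return bytes(payload), False  # malformed chunk-size line
        pos = line_end + 2
        if size == 0:
            return bytes(payload), True  # terminating chunk: body is complete
        chunk = raw[pos:pos + size]
        payload += chunk
        if len(chunk) < size:
            return bytes(payload), False  # stream ended in the middle of a chunk
        pos += size + 2  # skip the chunk data and its trailing CRLF


print(decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))  # complete body
print(decode_chunked(b"4\r\nWiki\r\n5\r\nped"))  # cut off mid-chunk
```

With a Content-Length header, by contrast, a client can compare the number of bytes received against the declared size.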

eterna2 on 1 Jun 2020

I noticed that if I am using wget <<address>> it works

something like:

wget http://localhost:8080/artifacts/get\?source\=minio\&namespace\=kubeflow\&bucket\=mlpipeline\&key\=artifacts%2Fkubeflow-ai-wtv67%2Fkubeflow-ai-wtv67-1129671773%2Ffiles.tgz -O test.tgz

The same URL doesn't work in different browsers!
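
For repeated comparisons between clients, a scripted download that reports how many bytes actually arrived can be handy. The sketch below uses only the Python standard library; `copy_stream` and `download` are hypothetical helper names, and the URL is assumed to be reachable from where the script runs (for example through a port-forward, as in the wget call above).

```python
import urllib.request
from typing import BinaryIO, Optional, Tuple


def copy_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Copy src to dst in fixed-size chunks and return the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)


def download(url: str, path: str) -> Tuple[int, Optional[int]]:
    """Download url to path; return (bytes_written, declared Content-Length or None)."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        header = response.headers.get("Content-Length")
        declared = int(header) if header is not None else None
        return copy_stream(response, out), declared
```

Calling `download(url, "test.tgz")` several times and logging the returned pair gives a quick record of whether the size varies between attempts.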

Gsantomaggio on 8 Jun 2020

@eterna2 @Bobgy do you have any clues where to look further or how the bug can be fixed? Thanks.

anatolyzolotar on 9 Jun 2020

sorry. this is tough. I can propose a workaround - that is to not automatically unpack tarballs.

A fix might need more effort to drill down to the issue. Cuz from my testing, i suspect it might be a combination of code and data issues that is very specific to how the data chunks are handled (hence the difference when using different browsers).

I might need to setup wireshark to look at the individual packets, and currently might not have the bandwidth to do that.

Would be happy if anyone else can help to do more testing? I reported my observation above.

@Bobgy what do u think of the workaround - default no unpacking of gzip and tar for artifact for the download links (i.e. this feature will be only used for preview).

eterna2 on 16 Jun 2020

Can we provide an extra link for downloading the original tar file until this issue is fixed? That way we don't get rid of existing convenient features.

Bobgy on 18 Jun 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 17 Sep 2020

Please unstale, non-text files are not downloadable at the moment

solovyevt on 21 Sep 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 24 Dec 2020
