Hi everyone. We've found a strange problem: artifacts sometimes cannot be opened properly after downloading them through the Pipelines UI. One step of our pipeline produces a parquet file, which opens properly when we download the tgz directly from the Minio UI and extract it manually with tar -xzvf <artifact.tar.gz>.
However, when we click the artifact link in the Pipelines UI, the browser "successfully" (without an exception or bad response) downloads a file that is significantly smaller than the original and cannot be read by pandas.
Downloading artifacts from the Pipelines UI should return the correct file, with its original size and content.
Put any tgz artifact file into Minio, then try to download it via the backend's /artifacts/get endpoint. Please let me know if you can't reproduce it. I'm really concerned that I couldn't find any similar issue reported since version 0.5.0, so it may be some outlandish bug in our environment.
How did you deploy Kubeflow Pipelines (KFP)?
Private AWS cluster
KFP version:
0.5.0
KFP SDK version:
0.5.1
I've been hunting this bug for hours and it is still quite unclear to me. Maybe this information can help to find the root of the problem:
0) The parquet files (and the test file mentioned in point 5) are written inside the component with the following call (in case engine and compression matter):
data_frame.to_parquet(args.output_path, engine='fastparquet', compression='gzip')
1) I could see this problem for several tgz files, but for some reason it is unlikely to reproduce when the archive is small (in my experience, less than 20MB).
2) The problem does not affect the same files when they are not archived: I could download everything, even though the files are larger than their archives.
3) The size of the downloaded file is almost random, but when I download it a few times in a row, two downloads can end up with the same size.
4) I meticulously tried to reproduce the problem for one of our artifacts and found that every partially downloaded (and backend-extracted) file is a prefix of the original file. Why do I think so? If the downloaded file is K bytes long, I copy the first K bytes of the original file into a third file (head -c K original_file > third_file), and the diff CLI tool reports that the two binary files (the downloaded one and the "third" one, of equal size) have identical content.
5) Since I can't share our artifact file, and to make sure this is not simply our mistake, I took a random parquet file from Kaggle, read it in ipython and re-wrote it to another file with the options from point 0. The problem persists; however, for some reason, the observation from point 4 does not hold for it (maybe it depends on the content format, I'm not familiar with pandas and parquet files).
6) I guess there might be some implicit problem with the streaming unarchive modules in nodejs. My main argument is that non-archived files seem to download smoothly. I looked into the pod's logs, but there are only the messages below:
GET /pipeline/artifacts/get?source=minio&bucket=mlpipeline&key=artifacts ...
Getting storage artifact at: minio: mlpipeline/artifacts/...
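For anyone trying to reproduce this, the request seen in the log can be assembled by hand. The following is only a minimal sketch: the query parameters source, bucket and key are taken from the log line above, while the base URL and the object key in the example are made-up placeholders, not values from our cluster.

```python
from urllib.parse import urlencode


def artifact_url(base_url: str, bucket: str, key: str, source: str = "minio") -> str:
    """Build a download URL for the UI's artifacts/get endpoint."""
    # urlencode percent-encodes the slashes in the object key (%2F).
    query = urlencode({"source": source, "bucket": bucket, "key": key})
    return f"{base_url.rstrip('/')}/artifacts/get?{query}"


# Hypothetical host and object key, for illustration only.
example = artifact_url(
    "http://localhost:8080/pipeline",
    "mlpipeline",
    "artifacts/my-run/my-step/output.tgz",
)
print(example)
```

The resulting URL can be opened in a browser or passed to a command-line client to compare the two download paths.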
/kind bug
/area frontend
Can you check the size of the file in the storage, so that we can be sure the issue is in the Frontend and not in Argo?
Do you get the same file size every time you try downloading?
@Ark-kun
> Can you check the size of the file in the storage so that we can be sure this is the issue in the Frontend and not Argo?
Sure. As I mentioned above, the artifacts and their archives have the correct size in the Minio UI, and I can download files from there without any problems.
> Do you get the same file size every time you try downloading?
Yes:
> Size of downloaded file is almost random, but when I download it few times repeatedly, size for two files can be the same.
If you mean downloading via the Minio UI, the downloaded files always have the same size and they are correct tgz archives (or other types, depending on the file).
I somehow missed "parquet file, which can be opened properly when we download tgz directly from Minio UI and extract it manually by tar -xzvf <artifact.tar.gz>".
I've assigned the people who know the most about that part of the code.
@eterna2 is this related to the endpoint not being able to handle binary data?
@Bobgy at first I would have said "yes", but I tried to download non-archived parquet files via the endpoint and everything seemed fine. I get this problem when the file is a tgz. However, this may not hold in general. I tend to think that the problem is somewhere in the streaming processing of archives.
There is implicit logic in the frontend server for handling tgz.
If the artifact is a tgz, the node server automatically decompresses it and returns the 1st record in the archive instead of the actual tarball.
Maybe we should revisit this logic? The original use case was that most artifacts were string-based.
Let me see if I can reproduce this. Parquet > 20 MB?
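To make the behaviour described above concrete, here is a rough Python sketch of "gunzip the stream and return only the first record". It is not the frontend's actual Node.js code; the function names and the in-memory archive are purely illustrative.

```python
import io
import tarfile
from typing import Dict


def make_tgz(files: Dict[str, bytes]) -> bytes:
    """Pack the given name -> content mapping into an in-memory .tgz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def first_file_from_tgz(data: bytes) -> bytes:
    """Return the content of the first regular file in a gzipped tarball."""
    # "r|gz" reads the archive as a forward-only stream, similar in spirit
    # to piping a download through gunzip and untar.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                return tar.extractfile(member).read()
    raise ValueError("archive contains no regular file")


archive = make_tgz({"first.txt": b"one", "second.txt": b"two"})
# Only the first record comes back; "second.txt" is silently dropped.
print(first_file_from_tgz(archive))
```

With this kind of logic, whatever happens to be the first entry in the archive is what the user receives, regardless of how many files the tarball holds.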
@eterna2 yes, I reproduced it for this file https://github.com/kubeflow/pipelines/files/4665228/rewritten_baseline_data.tar.gz (I don't think it matters, but before uploading it as an attachment I had to change the extension from tgz to tar.gz). Please try it too.
UPD: both tgz files for which I could reproduce this problem also had only one file inside.
I see 2 files inside the tarball.
I think this is the reason. There is a hidden file inside your tarball.
This is some sample code I wrote to download your tarball, extract only the one parquet file, and output it as an artifact. I managed to download the parquet correctly.
import functools
import typing

import kfp.dsl
import kfp.components
import kfp.compiler

# Shared settings for both components: base image and pip packages.
func_to_container_op = functools.partial(
    kfp.components.func_to_container_op,
    base_image="python:3.7-slim",
    packages_to_install=["pandas", "pyarrow", "requests"],
)


@func_to_container_op
def get_artifact(
    download_from: str,
    parquet_path: kfp.components.OutputPath(str),
):
    import io
    import logging
    import tarfile

    import requests

    logging.basicConfig(level=logging.INFO)
    logging.info("Downloading from %s", download_from)
    stream = io.BytesIO(requests.get(download_from).content)

    logging.info("Unpacking file")
    tar = tarfile.open(fileobj=stream)
    # Skip entries whose name starts with "." (the hidden file) and take the first one left.
    filename = [name for name in tar.getnames() if name[0] != "."][0]
    logging.info("Extracting %s", filename)
    extracted_parquet = tar.extractfile(filename)

    # Write the extracted parquet to the component's output path.
    if parquet_path and extracted_parquet:
        with open(parquet_path, "wb") as writer:
            logging.info("Writing artifact to %s", parquet_path)
            writer.write(extracted_parquet.read())


@func_to_container_op
def trivial(parquet_path: kfp.components.InputPath(str)):
    import pandas as pd

    # Reading the file back checks that the artifact is a valid parquet.
    with open(parquet_path, "rb") as reader:
        df = pd.read_parquet(reader)
        df.head()


@kfp.dsl.pipeline(name="get_parquet_pipeline")
def get_parquet_pipeline(
    download_from: str = "https://github.com/kubeflow/pipelines/files/4665228/rewritten_baseline_data.tar.gz",
):
    op1 = get_artifact(download_from=download_from)
    trivial(op1.outputs["parquet"])


kfp.compiler.Compiler().compile(get_parquet_pipeline, "get_parquet_pipeline.yaml")
@Bobgy
Do you think we should add a new flag, unpack? If the unpack flag is provided, we decompress and unpack the 1st record (what we are doing now); otherwise we just return the artifact as is?
@eterna2 Sounds reasonable for the API.
Which link should we provide in the KFP UI then? The one with unpack, is that right? Then how do we guide users to remove the unpack flag when they want to download the complete artifact?
One idea: without the unpack flag, if we detect more than one file in the tarball, we could add a warning sentence at the beginning of the response saying that this is not the full content?
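A possible shape for the proposal, sketched in Python rather than in the frontend's own language. Everything here is hypothetical: serve_artifact and its return value are invented for illustration, and the warning is returned separately instead of being prepended to the response, which is a simplification. The sketch also assumes the warning is only relevant when the archive is unpacked.

```python
import io
import tarfile
from typing import Optional, Tuple


def serve_artifact(data: bytes, unpack: bool = False) -> Tuple[bytes, Optional[str]]:
    """Return (body, warning) for an artifact download request."""
    if not unpack:
        # Default: hand back the stored object untouched.
        return data, None
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            files = [m for m in tar.getmembers() if m.isfile()]
            if not files:
                return data, None
            body = tar.extractfile(files[0]).read()
    except tarfile.ReadError:
        # Not a gzipped tarball: nothing to unpack.
        return data, None
    warning = None
    if len(files) > 1:
        warning = f"archive holds {len(files)} files; only {files[0].name!r} was returned"
    return body, warning


def build_tgz(files: dict) -> bytes:
    """Helper for the demo: pack name -> content pairs into a .tgz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


demo = build_tgz({"a.txt": b"alpha", "b.txt": b"beta"})
print(serve_artifact(demo, unpack=True))        # first file plus a warning
print(serve_artifact(demo)[0] == demo)          # raw tarball by default
```

Keeping the raw download as the default means a broken unpack path can never corrupt the "download the complete artifact" case.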
@eterna2 you're right, there is one more hidden file in the archive. I must have messed up when re-packing the example parquet into a tar with gz compression; this hidden file seems to be one of the Apple quarantine files that gave me a headache when I tried to reproduce this problem. It really looks like, for this tar.gz, the API endpoint always returns this hidden file. I've rewritten this parquet file and re-packed it into a new tgz without excess/hidden files:
rewritten2.tar.gz. This is just one of the files for which the problem persists (I can't share the others).
However, I don't think that this hidden file is the cause of the problem, because I've just reproduced the problem for rewritten2.tar.gz:
1) I tried to download and unarchive the artifact through the frontend API endpoint (/artifacts/get) a few times.
2) The files are not fully downloaded and can't be read by pandas.read_parquet.
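To double-check that an archive really contains a single visible file before uploading it, something along these lines can help. This is a minimal sketch with invented helper names; treating any entry whose base name starts with a dot as "hidden" is an assumption that covers the macOS-style extra entries mentioned above.

```python
import io
import os
import tarfile
from typing import List


def regular_files(tgz: bytes) -> List[str]:
    """List the names of all regular files inside a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(tgz), mode="r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def hidden_files(tgz: bytes) -> List[str]:
    """Entries whose base name starts with a dot (assumed to be hidden)."""
    return [n for n in regular_files(tgz) if os.path.basename(n).startswith(".")]


def pack_single_file(name: str, content: bytes) -> bytes:
    """Create a .tgz holding exactly one entry and nothing else."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


# Placeholder content; a real test would use the parquet bytes.
clean = pack_single_file("rewritten2.pqt", b"placeholder bytes")
print(regular_files(clean), hidden_files(clean))
```

Building the archive programmatically avoids whatever extra entries a desktop archiver may add.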
Original parquet file:
-rw-r--r-- 1 me staff 21482643 May 27 14:17 rewritten2.pqt
Downloaded files:
-rw-r--r--@ 1 me staff 3653120 May 27 14:22 get
-rw-r--r--@ 1 me staff 4439552 May 27 14:23 get (1)
-rw-r--r--@ 1 me staff 4587008 May 27 14:23 get (2)
-rw-r--r--@ 1 me staff 5333522 May 27 14:23 get (3)
-rw-r--r--@ 1 me staff 4275712 May 27 14:33 get (4)
-rw-r--r--@ 1 me staff 4767232 May 27 14:33 get (5)
3) The downloaded files are prefixes of the original file, for example:
>>> head -c 4587008 rewritten2.pqt > rewritten2-first-4587008-bytes.pqt
>>> diff get\ \(2\) rewritten2-first-4587008-bytes.pqt
diff doesn't report any difference there.
So, downloading tgz files (with unarchiving) through the frontend API endpoint seems to be randomly interrupted for some reason. Downloading non-archived files works perfectly (I've just done it 10 times and everything is ok), which makes me think that this is not a connectivity problem.
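The same head/diff check can be scripted so it is easy to run over many downloads. A minimal sketch, with an invented function name and throwaway temporary files standing in for the real parquet and the truncated download:

```python
import os
import tempfile


def is_prefix_of(partial_path: str, original_path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the partial file equals the first bytes of the original."""
    with open(partial_path, "rb") as partial, open(original_path, "rb") as original:
        while True:
            a = partial.read(chunk_size)
            if not a:
                return True  # reached the end of the partial file without a mismatch
            b = original.read(len(a))
            if a != b:
                return False  # content differs, or the original is shorter


# Demo with temporary files (stand-ins for rewritten2.pqt and a download).
workdir = tempfile.mkdtemp()
original_file = os.path.join(workdir, "original.bin")
truncated_file = os.path.join(workdir, "truncated.bin")
payload = bytes(range(256)) * 100
with open(original_file, "wb") as f:
    f.write(payload)
with open(truncated_file, "wb") as f:
    f.write(payload[:5000])

print(is_prefix_of(truncated_file, original_file))
```

A True result for every truncated download supports the reading that the stream is cut short rather than corrupted.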
What I'm asking you to do:
Please try to upload the fixed artifact rewritten2.tar.gz to your Minio and download it through the /artifacts/get API endpoint.
Thank you very much
I managed to reproduce this, but I am still investigating the cause.
I was unable to figure out where it goes wrong. I suspect there might be some "special" character in the binary, or it might be an express or middleware issue.
This is what I did so far:
Then I called the getObjectStream function directly (this function streams from Minio, decompresses, and untars the artifact) and piped the output to a file.
So I suspect the issue might be at the express server layer: either the server terminates prematurely because of EOF or some null characters, or there is some other issue. Need to investigate.
@eterna2, thank you very much. However, I want to point out again the interesting fact that everything works fine for a non-archived artifact. You can also try to extract the parquet file on your local computer, put it in Minio and download it via the API endpoint. As far as I understand, in this case the download stream should be the same as when you download the same parquet inside a tar.gz that is extracted on the fly. Do you agree, or is there some subtle difference between downloads with and without tar extraction?
I can confirm it is an express problem.
Let me debug further, and trace
Almost there. I suspect it is all the piping around. And the stream is prematurely terminated.
I can't figure out the issue, might need to do even more testing:
@eterna2 Any chance this can be a problem in ingress?
Did you try kubectl port-forward to a full deployment in k8s? Does it give an error too?
> @eterna2 Any chance this can be a problem in ingress?
> Did you try kubectl port-forward to a full deployment in k8s? Does it give an error too?
@Bobgy
Yes, you are right! This is what I did:
This is very strange.
I have never looked in depth at how k8s services do their networking, so I am clueless about what the issue might be. But these are my observations:
The parquet itself should not be the issue, because I tried it with a pipeline that outputs it as an artifact: argo managed to tarball it properly and you can download the parquet.
The folder structure of the provided tarball is a bit different from how argo would do it (is this a possible reason?).
HTTP chunked transfer encoding treats the transfer as complete when it receives a zero-length chunk (because the total size is not known up front in a streaming protocol). I was able to download the artifact completely when I removed the streaming code and replaced it with a single buffer array.
I will see if I can add a content-length header to the HTTP response, and see if it works. But that might need 1 more API call to minio to get the object size.
I noticed that it works if I use wget <<address>>,
something like:
wget http://localhost:8080/artifacts/get\?source\=minio\&namespace\=kubeflow\&bucket\=mlpipeline\&key\=artifacts%2Fkubeflow-ai-wtv67%2Fkubeflow-ai-wtv67-1129671773%2Ffiles.tgz -O test.tgz
The same URL does not work in different browsers!
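For readers less familiar with chunked transfer encoding, the framing mentioned in the observations can be illustrated as follows. This toy encoder/decoder does not reproduce the bug and is not the server's code; it only shows that, without a content-length, the zero-length chunk is the sole marker that the body is complete, so a stream that stops early simply lacks that marker.

```python
from typing import Iterable, Tuple


def encode_chunked(chunks: Iterable[bytes], terminate: bool = True) -> bytes:
    """Frame the chunks as an HTTP/1.1 chunked body (no trailers)."""
    out = bytearray()
    for chunk in chunks:
        if chunk:
            # Each chunk: hex length, CRLF, data, CRLF.
            out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    if terminate:
        out += b"0\r\n\r\n"  # the zero-length chunk marks the end of the body
    return bytes(out)


def decode_chunked(raw: bytes) -> Tuple[bytes, bool]:
    """Return (body, complete); complete is True only if the final 0 chunk was seen."""
    body = bytearray()
    pos = 0
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            return bytes(body), False  # stream ended before another size line
        size = int(raw[pos:eol].split(b";")[0], 16)
        pos = eol + 2
        if size == 0:
            return bytes(body), True
        chunk = raw[pos:pos + size]
        body += chunk
        if len(chunk) < size:
            return bytes(body), False  # stream ended in the middle of a chunk
        pos += size + 2  # skip the chunk data and its trailing CRLF


full = encode_chunked([b"hello ", b"world"])
cut = encode_chunked([b"hello ", b"world"], terminate=False)
print(decode_chunked(full))
print(decode_chunked(cut))
```

A content-length header would give the client an independent way to notice that fewer bytes arrived than were promised.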
@eterna2 @Bobgy do you have any clues where to look further or how the bug can be fixed? Thanks.
Sorry, this is tough. I can propose a workaround: do not automatically unpack tarballs.
A proper fix might need more effort to drill down to the issue. From my testing, I suspect it might be a combination of a code and a data issue that is very specific to how the data chunks are handled (hence the differences between browsers).
I might need to set up wireshark to look at the individual packets, and I currently might not have the bandwidth to do that.
I would be happy if anyone else could help with more testing. I reported my observations above.
@Bobgy what do you think of the workaround: by default, no unpacking of gzip and tar for the artifact download links (i.e. this feature would only be used for preview)?
Can we provide an extra link for downloading the original tar file until this issue is fixed, so that we don't get rid of existing convenient features?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Please unstale, non-text files are not downloadable at the moment
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.