Pipelines: S3 errors in Pipeline examples for reading training data and artifact storage

Created on 27 Dec 2018  Â·  34Comments  Â·  Source: kubeflow/pipelines

I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3 compatible) for storing training data, eval data, and to pass data from step to step in argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"

The issue with s3:// protocol support seems to be specific to TFDV/Apache Beam step. Beam does not seem to provide support for S3 in Python client. We are looking for a way right now to change TFDV step to use local/attached storage.

Minio access parameters seem to be properly configured - the validation step is successfully creating several folders in Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation

The error is on reading or writing any files from the Minio buckets, and it's coming from Tensorflow/Beam tfdv.generate_statistics_from_csv():

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio files are accessed via s3:// protocol, for example
PipelineTFX4.ipynb.txt
OUTPUT_DIR = 's3://demo04kubeflow/output'

This same step worked fine when train.csv was stored in GCS bucket:
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio credentials were provided as env variables to ContainerOp:

return dsl.ContainerOp(
        name = step_name,
        image = DATAFLOW_TFDV_IMAGE,
        arguments = [
            '--csv-data-for-inference', inference_data,
            '--csv-data-to-validate', validation_data,
            '--column-names', column_names,
            '--key-columns', key_columns,
            '--project', project,
            '--mode', mode,
            '--output', validation_output,
        ],
        file_outputs = {
            'schema': '/schema.txt',
        }
    ).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_ENDPOINT', 
            value=S3_ENDPOINT, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ENDPOINT_URL', 
            value='https://{}'.format(S3_ENDPOINT), 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ACCESS_KEY_ID', 
            value=S3_ACCESS_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_SECRET_ACCESS_KEY', 
            value=S3_SECRET_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_REGION', 
            value='us-east-1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='BUCKET_NAME', 
            value='demo04kubeflow', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_USE_HTTPS', 
            value='1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_VERIFY_SSL', 
            value='1'
    ))

This pipeline example was created from Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see attached the Jupyter notebook, and two log files from pipeline execution (validate step). All required files (such as train.csv) were uploaded to Minio from the notebook.
tfx-taxi-cab-classification-pipeline-example-wait.log

tfx-taxi-cab-classification-pipeline-example-main.log

PipelineTFX4.ipynb.zip

arecomponents help wanted kinbug lifecyclfrozen platforaws platforother prioritp0

Most helpful comment

@rummens At the same time, work is in progress on a library that allows to mount any S3 bucket into pipeline steps or Jupyter notebooks. The library is based on Goofys file system, and it will be ready as soon as next week. It allows various frameworks such as Beam and Keras to access data in S3 buckets via POSIX interface. It supports S3, Minio, GCS, Ceph:
https://github.com/kahing/goofys
Currently working on documentation and examples.

All 34 comments

/cc @jlewi can you page in the right folks here? we're blocked on using this until it's solved.

ack I'll loop in some folks; but it sounds like the issue is actually outside Kubeflow and is in Apache Beam.

You might want to repost the issue in the Apache Beam
https://issues.apache.org/jira/browse/BEAM-2500

Or in the TF Data Validation
https://github.com/tensorflow/data-validation

If your goal is trying to use pipelines did you consider trying to use some other example? Or creating a new one that doesn't use TFX.

It looks like this is a known issue with Apache Beam and has been open for a long time.
https://issues.apache.org/jira/browse/BEAM-2572

Resolving since this is an issue in beam.

I understand our desire to close these issues, but I'd like to suggest we take ownership over the problem. Obviously, most Kuebfolw deployments will run against S3, rather than GCP, so most deployments will now be a problem.

At _LEAST_ we should file it as a bug over there. Are the TFX aware of the issue?

@mameshini - would you mind filing a bug?

@aronchick We are currently implementing a storage management approach that mounts S3 bucket as Kubernetes volume, using s3fs dynamic provisioner. It's not just Apache Beam, it's Keras and other libraries that can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as Posix file system, with an optional caching layer. I can share working examples soon, fingers crossed performance seems acceptable. I am waiting on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with.

+1!

@vicaire I would suggest we reopen - we need a solution here.

Got it.

Apologies. Looks like I misunderstood the issue. Reopening. Having things working with S3 is a high priority for us.

Note: Related to volume support: https://github.com/kubeflow/pipelines/issues/801

@mameshini, we are looking forward to your example. Thanks!

I am following up on JIRA ticket https://issues.apache.org/jira/browse/BEAM-2572 and pushing progress for s3 filesystem native support in Apache Beam Python SDK. I will give a hand if necessary since Beam Python will use boto3 which will significantly simplify deployment.

As an alternative, I use NFS as the shared storage and add a fews arguments to make sure
tf-serving deployer are working for non-GKE clusters.

Please check the example here. https://gist.github.com/Jeffwan/5ee66343e48cf52c08c4de98be98cc1d

Change Lists:

  1. Prepare NFS storage and copy all files here to /taxi, create a PV and PVC called efs-claim
  2. Add storage volume efs-claim to every container and remove GCP secret env.
  3. deployer calls google meta data service to get cluster name, to skip this step, you can pass --cluster-name in order to get persist model, --pvc-name is also necessary.
  4. deployer uses pod service account to get access to kubernetes server. Make sure this server account has right RBAC setup. I assume it's kubeflow:default

Thank you for the update @Jeffwan. NFS storage can be acceptable for demos or even real use cases where data sets are relatively small. But NFS does have limitations with cost, performance, scalability. When data sets exceed 100 GB, NFS becomes too expensive, slow, and hard to manage. Even when using managed NFS from cloud providers, we experienced very long delays when simply reading directories with a large number of files. The preferred approach would be to store data in object storage.

We have explored several ways to mount object storage as Posix-like file system, with support for S3, GCS, Minio, or Ceph. Based on our testing, Goofys had the best reliability and performance, and we decided to use it for creating PVs and PVCs for pipeline steps. Preparing to publish a working example soon.

@mameshini Nice, thanks for sharing. Indeed, NFS is not a very good option in HPC/DL area. We internally use Lustre for deep learning training, it can be used as data caching layer and backed by S3 as data repository and provide Posix file system interface. Look forward to your example!

Any recent progress on this? :-)

@rummens Please track status in Apache Beam community https://issues.apache.org/jira/browse/BEAM-2572 . There's a python SDK dependency there to make demo work with S3 natively.

@rummens At the same time, work is in progress on a library that allows to mount any S3 bucket into pipeline steps or Jupyter notebooks. The library is based on Goofys file system, and it will be ready as soon as next week. It allows various frameworks such as Beam and Keras to access data in S3 buckets via POSIX interface. It supports S3, Minio, GCS, Ceph:
https://github.com/kahing/goofys
Currently working on documentation and examples.

Awesome looking forward to it! Basically it will be a PV mounted in each pod, that under the hood is an object storage?

@mameshini This looks very interesting. Do you have an example to share?

Yes the example is now available in this repository:
https://github.com/agilestacks/kubeflow-extensions

It mounts PV into each pod, and under the hood it's object storage (S3,
GCS, Minio).

On Tue, Jul 9, 2019 at 2:01 PM IronPan notifications@github.com wrote:

@mameshini https://github.com/mameshini This looks very interesting. Do
you have an example to share?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubeflow/pipelines/issues/596?email_source=notifications&email_token=ABSIPTA2HL3JXTSRC4BG6EDP6T4DTA5CNFSM4GMJU35KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZRQSUQ#issuecomment-509806930,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABSIPTCUSMJ3B7WOYG44TN3P6T4DTANCNFSM4GMJU35A
.

Can you point me to the documentation of how to create the PVC in the first place? I was only able to find the mounting of it.
Thanks

@rummens You need to deploy Flex plugin to enable goofys support for kubernetes.
Then goofys can be used to mount S3, GCS, or Minio buckets as PVs. Goofys will only mount a specific bucket so you must provide the bucket option, and pre-create the bucket for each volume. An example for PVC is provided in kubeflow-extensions/storage/s3fs/test.yaml, bucket name is "default".
Installation instructions for manual deployment of flex plugin will be added in a few days, automatic install is working on Agile Stacks.

Thanks, I will be looking out for the install instructions and see if it makes sense for us.

@mameshini Hi Mameshini, I am trying to store data from kubeflow to MinIO, but I've got an issue regarding the mounting problem. It seems that I need to modify the yaml file, but I do not know how to modify it, do you have any example? or is there any hint of how do modify it plz?

I am experiencing the same issues as @AnnieWei58

@IronPan @mameshini Can you work together to check the possibility of enabling gcsfuse or goofys or other S3 mounting system as part of KFP deployment?

The CUJ is:
When user adds volumeMount for a specific volume to a Pod, the volume mounts successfully and the files written to that volume appear in GCS in a bucket configured by the cluster admin.

S3 support was merged into Beam and it will be released in 2.19.0. Let's wait on the release and we can update all the examples.

Apache Beam has been released with s3 support. The latest version is 2.20.0.

@gautamkmr Check #3185 The blocker is now on the TFX side.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

TFX v0.22.0 support Apache Beam 2.21.0 with S3 support. https://github.com/tensorflow/tfx#compatible-versions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

/remove-frozen

Is this issue still relevant? There are similar discussions also going on elsewhere, such as here: https://github.com/kubeflow/pipelines/issues/3405

This is a pretty important issue for us. We want to move all in-cluster persistent storage (mysql DBs, MinIO) off to managed AWS services as managing them in-cluster is becoming quite a pain.

Was this page helpful?
0 / 5 - 0 ratings