Pipelines: Enhancement Request: Add AWS S3 Support for TensorBoard in KFP

Created on 12 Aug 2020 · 10 Comments · Source: kubeflow/pipelines

Overview

My team is interested in integrating TensorBoard into Kubeflow Pipelines (v1.0.0, standalone installation), but we find ourselves unable to do so because we depend on AWS instead of GCP. Someone in the Kubeflow Slack recommended that I open this Enhancement Request for adding S3 support for TensorBoard in Kubeflow Pipelines.

Proposal

A great improvement to the KFP UI would be to see the Start TensorBoard button in the output page of a pipeline run as described in the KFP docs (https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/#tensorboard) _even if_ the TensorBoard log directory has been uploaded to an AWS S3 bucket.

End user requirements to leverage this feature:

- The pipeline configuration creates the /mlpipeline-ui-metadata.json file successfully in the container.
- A pipeline step outputs a valid TensorBoard logdir (i.e. tensorboard --inspect --logdir /app/logs/fit/… succeeds).
- A pipeline step uploads the logdir to AWS S3 successfully.
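
The upload requirement could be met in several ways; the following is only a minimal sketch, assuming boto3 is installed in the pipeline image and that AWS credentials are available to the step. The helper names `plan_logdir_upload` and `upload_logdir` are hypothetical and not part of KFP.

```python
from pathlib import Path
from typing import Dict


def plan_logdir_upload(logdir: str, prefix: str) -> Dict[str, str]:
    """Map each file under logdir to the S3 key it would be uploaded to."""
    root = Path(logdir)
    clean_prefix = prefix.strip("/")
    plan = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keep the directory layout so TensorBoard sees the same run structure.
            relative = path.relative_to(root).as_posix()
            plan[str(path)] = f"{clean_prefix}/{relative}"
    return plan


def upload_logdir(logdir: str, bucket: str, prefix: str) -> str:
    """Upload every file in logdir and return the s3:// URI of the uploaded logdir."""
    import boto3  # assumption: boto3 is available in the container image

    s3 = boto3.client("s3")
    for local_path, key in plan_logdir_upload(logdir, prefix).items():
        s3.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, prefix.strip("/"))
```

The URI returned by `upload_logdir` is the kind of value that would go in the "source" field of the metadata file.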

Here’s the content of an example metadata file for an S3 logdir:

{
  "outputs": [
    {
      "type": "tensorboard",
      "source": "s3://my-team-bucket/kubeflow/pipeline-x/run-123/app/logs/fit/20200723-124231"
    }
  ]
}
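
For illustration, a pipeline step might produce a file with this shape using only the Python standard library. This is a sketch rather than an official KFP helper; the function name `write_tensorboard_metadata` is hypothetical, and the default path is the one named in the requirements above.

```python
import json


def write_tensorboard_metadata(source: str, path: str = "/mlpipeline-ui-metadata.json") -> dict:
    """Write a UI metadata file with a single TensorBoard output and return its content."""
    metadata = {
        "outputs": [
            {
                "type": "tensorboard",
                # URI of the uploaded logdir, for example an s3:// path.
                "source": source,
            }
        ]
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```

Calling `write_tensorboard_metadata("s3://my-team-bucket/kubeflow/pipeline-x/run-123/app/logs/fit/20200723-124231")` would write the same structure as the example above.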

@Bobgy pointed me to https://github.com/kubeflow/pipelines/issues/4208 as a potential future workaround using a mount path (note: not yet merged as of creation of this GitHub issue), but it would be great to have AWS S3 support for this as well.

I'm happy to chip in where I can with design discussions/implementation here, especially with regard to AWS integration in general, since my team is exclusively (and mostly successfully) using AWS instead of GCP for our KFP cluster.

Original Slack channel thread:
https://kubeflow.slack.com/archives/CE10KS9M4/p1595512605179800

platform/aws status/triaged

Source

lucinvitae

Most helpful comment

@PatrickXYS we were finally able to get tensorboard working with IAM.

I'll document it below so that other people can leverage the setup, although I'm happy to add the docs somewhere else if there's a better location.

Similar to https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/env/aws/README.md, we have a configmap with the content necessary for the tensorboard launcher.

We're using kustomize to override the VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH variable in an overlay file:

- op: replace
  path: /spec/template/spec/containers/0/env
  value:
    ...
    - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
      value: /etc/config/viewer-tensorboard-template.json
    ...

Then we've modified ml-pipeline-ui-configmap like this:

  viewer-tensorboard-template.json: |-
    {
      "metadata": {
        "annotations": {
          "iam.amazonaws.com/role": "ai-rancher/rancher_ai_training_shared"
        }
      },
      "spec": {
          "serviceAccountName": "kubeflow-pipelines-viewer"
      }
    }
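
If the role or service account differs per environment, the template above could be generated rather than hand-edited. The following is a small sketch under that assumption; the function name `viewer_pod_template` is hypothetical, and the annotation key and field names are copied from the configmap entry above.

```python
import json


def viewer_pod_template(iam_role: str, service_account: str) -> str:
    """Render a viewer pod template JSON string with an IAM role annotation."""
    template = {
        "metadata": {
            # Annotation key taken from the configmap entry above.
            "annotations": {"iam.amazonaws.com/role": iam_role}
        },
        "spec": {"serviceAccountName": service_account},
    }
    return json.dumps(template, indent=2)


if __name__ == "__main__":
    print(viewer_pod_template("ai-rancher/rancher_ai_training_shared", "kubeflow-pipelines-viewer"))
```

The printed JSON can then be pasted under the viewer-tensorboard-template.json key of the configmap.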

After that, the tensorboard viewer pod gets started with the above IAM role and can access our S3 buckets:

$ kubectl -n kubeflow get pods viewer-f4bd94b05ac8e177e75eec0da3bfca29d298289a-deploymentsj6hd -o yaml | grep -C3 iam
kind: Pod
metadata:
  annotations:
    iam.amazonaws.com/role: ai-rancher/rancher_ai_training_shared

I should also note that in our kustomize configs we specify a commonAnnotation:

commonAnnotations:
  iam.amazonaws.com/role: ai-rancher/rancher_ai_training_shared

But we still needed the above viewer-tensorboard-template.json changes to give the tensorboard viewer pod access to our S3 bucket for downloading the log dir. Without them, the viewer pod showed AccessDenied errors and did not render anything.

All of this is to say that we've been able to create TensorBoard artifacts as follows with S3 paths from within our pipelines, and the viewer pod is able to use IAM roles to download the S3 log dir. Example metadata.json:

{
  "outputs": [
    {
      "type": "tensorboard",
      "source": "s3://invitae-ai-training-shared/kubeflow/experiments/ccccfe49-a23e-4a5b-9684-a3e7c5e26095/runs/17400f2b-11dc-4729-b7dc-f2336a81aadd/logs"
    }
  ]
}

__Also, the KFP docs say that _"The pipeline component must write a JSON file specifying metadata for the output viewer(s) that you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json"_, but we've found that the file name can be anything when specifying outputs for Python components. What matters is that there is an output artifact named mlpipeline-ui-metadata; without it, outputs do not work, TensorBoard outputs included.__ Example working output for a Python pipeline component:

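    from kfp import dsl  # KFP v1 SDK

    # docker_image is defined elsewhere in the pipeline code.
    op = dsl.ContainerOp(
        name="Write tensorboard metadata",
        image=docker_image,
        command=["sh", "-c"],
        # Placeholder: some command that produces metadata.json
        arguments=["..."],
        # The artifact name must be mlpipeline-ui-metadata; the file name is free.
        file_outputs={"mlpipeline-ui-metadata": "metadata.json"},
    )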

It would be great to update the KFP docs for the above metadata.json issue, since debugging it cost me a few hours and I felt a bit misled by the existing documentation.

cc @nlarusstone since you were asking in the #kubeflow-pipelines slack channel.

lucinvitae on 25 Sep 2020

👍2

All 10 comments

/assign @Jeffwan @PatrickXYS
who work for AWS.

Bobgy on 13 Aug 2020

AWS standalone KFP support was added recently; check the PR for details.

All the requirements have been addressed. Follow the README for instructions.

Feel free to submit any issues you find during standalone KFP installation.

PatrickXYS on 13 Aug 2020

Hello @PatrickXYS, does that include TensorBoard support? I didn't see anything in that PR related to TensorBoard at first glance. In our last discussion @Bobgy informed me that no such support exists.

lucinvitae on 13 Aug 2020

If you check https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/env/aws/viewer-pod-template.json, this is for ml-pipeline-ui-configmap, which is used when you click Start TensorBoard.

The point here is that you need to configure the AWS parameters correctly, such as the S3 bucket, the AWS access key ID, and the AWS secret access key.

PatrickXYS on 13 Aug 2020

👍1

@PatrickXYS thanks, that looks promising. We'll try that out.

lucinvitae on 14 Aug 2020

@PatrickXYS are there instructions on how to use IAM roles with TensorBoard? We don't have long-lived AWS secrets, so we'll need an IAM-based solution.

nlarusstone on 24 Aug 2020

👍1

Bump ^ we also use IAM roles instead of long-lived AWS secrets.

@PatrickXYS do you know if this is supported? And if it's undocumented but supported, happy to help supply documentation provided we can get it working.

lucinvitae on 22 Sep 2020

We don't support IAM roles yet. I'm not sure whether that's feasible right now, but we'll add it to our roadmap.

PatrickXYS on 24 Sep 2020

@PatrickXYS we were finally able to get tensorboard working with IAM. (The full comment is reproduced above under "Most helpful comment".)

lucinvitae on 25 Sep 2020

👍2

I should note, none of this worked until we bumped to KFP standalone version: v1.0.1
(https://github.com/kubeflow/pipelines/releases/tag/1.0.1)

lucinvitae on 25 Sep 2020

👍1
