My team would like to integrate TensorBoard into Kubeflow Pipelines (v1.0.0, standalone installation), but we are unable to do so because we depend on AWS rather than GCP. Someone in the Kubeflow Slack recommended that I open this enhancement request to add S3 support for TensorBoard in Kubeflow Pipelines.
A great improvement to the KFP UI would be to see the Start TensorBoard button in the output page of a pipeline run as described in the KFP docs (https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/#tensorboard) _even if_ the TensorBoard log directory has been uploaded to an AWS S3 bucket.
End user requirements to leverage this feature:
- Write the /mlpipeline-ui-metadata.json file successfully in the container,
- Generate a logdir that is valid (i.e. tensorboard --inspect --logdir /app/logs/fit/… succeeds)
- Upload the logdir to AWS S3 successfully

Here’s the content of an example metadata file for an S3 logdir:
{
  "outputs": [
    {
      "type": "tensorboard",
      "source": "s3://my-team-bucket/kubeflow/pipeline-x/run-123/app/logs/fit/20200723-124231"
    }
  ]
}
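For reference, a component could produce a file like this with a few lines of Python. This is only a minimal sketch: `build_tensorboard_metadata` and `write_tensorboard_metadata` are hypothetical helpers (not part of the KFP SDK), and the default output path simply follows the `/mlpipeline-ui-metadata.json` convention from the docs linked above.

```python
import json
from pathlib import Path


def build_tensorboard_metadata(logdir_uri: str) -> dict:
    """Return UI metadata declaring a single TensorBoard output at logdir_uri."""
    return {"outputs": [{"type": "tensorboard", "source": logdir_uri}]}


def write_tensorboard_metadata(
    logdir_uri: str, path: str = "/mlpipeline-ui-metadata.json"
) -> None:
    """Serialize the metadata to path (hypothetical helper, not a KFP API)."""
    metadata = build_tensorboard_metadata(logdir_uri)
    Path(path).write_text(json.dumps(metadata, indent=2))
```

The S3 upload of the logdir itself is left out on purpose; how that happens depends on the training code and the credentials available in the container.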
@Bobgy pointed me to https://github.com/kubeflow/pipelines/issues/4208 as a potential future workaround using a mount path (note: not yet merged as of creation of this GitHub issue), but it would be great to have AWS S3 support for this as well.
I'm happy to chip in where I can with design discussions/implementation here, especially with regard to AWS integration in general, since my team is exclusively (and mostly successfully) using AWS instead of GCP for our KFP cluster.
Original Slack channel thread:
https://kubeflow.slack.com/archives/CE10KS9M4/p1595512605179800
/assign @Jeffwan @PatrickXYS
Who work for AWS.
Hello @PatrickXYS, does that include TensorBoard support? I didn't see anything in that PR related to TensorBoard at first glance. In our last discussion @Bobgy informed me that no such support exists.
If you check https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/env/aws/viewer-pod-template.json, this is for ml-pipeline-ui-configmap, which is rendered when you do Start Tensorboard.
The point here is that you need to configure the AWS parameters correctly: the S3 bucket, the AWS access key ID, the AWS secret access key, and so on.
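To make that more concrete, here is a rough sketch of the shape such a pod template might take when static credentials are injected from a Kubernetes Secret. The Secret name `aws-secret`, its key names, and the region value are assumptions for illustration only; check the linked viewer-pod-template.json for the values the AWS manifests actually use.

```python
import json


def secret_env(name: str, secret: str, key: str) -> dict:
    """Build a container env entry whose value is read from a Kubernetes Secret."""
    return {"name": name, "valueFrom": {"secretKeyRef": {"name": secret, "key": key}}}


def build_viewer_pod_template(secret: str = "aws-secret", region: str = "us-west-2") -> dict:
    """Sketch of a viewer pod template carrying AWS credentials (names are assumed)."""
    return {
        "spec": {
            "containers": [
                {
                    "env": [
                        secret_env("AWS_ACCESS_KEY_ID", secret, "AWS_ACCESS_KEY_ID"),
                        secret_env("AWS_SECRET_ACCESS_KEY", secret, "AWS_SECRET_ACCESS_KEY"),
                        {"name": "AWS_REGION", "value": region},
                    ]
                }
            ]
        }
    }


print(json.dumps(build_viewer_pod_template(), indent=2))
```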
@PatrickXYS thanks, that looks promising. We'll try that out.
@PatrickXYS are there instructions on how to use IAM roles with TensorBoard? We don't have long-lived AWS secrets, so we'll need an IAM-based solution.
Bump ^ we also use IAM roles instead of long-lived AWS secrets.
@PatrickXYS do you know if this is supported? And if it's undocumented but supported, happy to help supply documentation provided we can get it working.
We don't support IAM roles yet. I'm also not sure whether that's feasible for now, but we'll add it to our roadmap.
@PatrickXYS we were finally able to get tensorboard working with IAM.
I'll document it below so that other people can leverage the setup. Although I'm happy to add the docs somewhere else if there's a better location.
Similar to https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/env/aws/README.md we have a configmap with the content necessary for the tensorboard launcher.
We're using kustomize to override the VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH variable in an overlay file:
- op: replace
  path: /spec/template/spec/containers/0/env
  value:
    ...
    - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
      value: /etc/config/viewer-tensorboard-template.json
    ...
Then we've modified ml-pipeline-ui-configmap like this:
viewer-tensorboard-template.json: |-
  {
    "metadata": {
      "annotations": {
        "iam.amazonaws.com/role": "ai-rancher/rancher_ai_training_shared"
      }
    },
    "spec": {
      "serviceAccountName": "kubeflow-pipelines-viewer"
    }
  }
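If the role differs per environment, the template can be generated rather than hand-edited. A minimal sketch, assuming the same two fields as above; `build_iam_viewer_template` is a hypothetical helper, and the role and service account names are just the ones from our setup:

```python
import json


def build_iam_viewer_template(iam_role: str, service_account: str) -> dict:
    """Return a viewer pod template that carries an IAM role annotation."""
    return {
        "metadata": {"annotations": {"iam.amazonaws.com/role": iam_role}},
        "spec": {"serviceAccountName": service_account},
    }


template = build_iam_viewer_template(
    "ai-rancher/rancher_ai_training_shared", "kubeflow-pipelines-viewer"
)
# The JSON printed here is what goes under viewer-tensorboard-template.json.
print(json.dumps(template, indent=2))
```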
After that, the tensorboard viewer pod gets started with the above IAM role and can access our S3 buckets:
$ kubectl -n kubeflow get pods viewer-f4bd94b05ac8e177e75eec0da3bfca29d298289a-deploymentsj6hd -o yaml | grep -C3 iam
kind: Pod
metadata:
  annotations:
    iam.amazonaws.com/role: ai-rancher/rancher_ai_training_shared
I should also note that in our kustomize configs we specify a commonAnnotation:
commonAnnotations:
  iam.amazonaws.com/role: ai-rancher/rancher_ai_training_shared
But we still needed the above viewer-tensorboard-template.json changes to give the tensorboard viewer pod access to our S3 bucket for downloading the log dir; otherwise the viewer pod showed AccessDenied errors and did not render anything.
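The annotation check can also be scripted when debugging this. A minimal sketch, assuming the pod manifest was fetched as JSON (for example with `-o json` instead of `-o yaml`) and parsed into a dict; `pod_iam_role` is a hypothetical helper:

```python
from typing import Optional

IAM_ROLE_ANNOTATION = "iam.amazonaws.com/role"


def pod_iam_role(pod: dict) -> Optional[str]:
    """Return the IAM role annotation on a pod manifest, or None if it is missing."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    return annotations.get(IAM_ROLE_ANNOTATION)


# Trimmed-down manifest mirroring the kubectl output above.
viewer_pod = {
    "kind": "Pod",
    "metadata": {
        "annotations": {IAM_ROLE_ANNOTATION: "ai-rancher/rancher_ai_training_shared"}
    },
}
print(pod_iam_role(viewer_pod))
```

A `None` result for a viewer pod would be consistent with the AccessDenied behaviour described above, since the pod would then start without the role.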
All of this is to say that we've been able to create TensorBoard artifacts as follows with S3 paths from within our pipelines, and the viewer pod is able to use IAM roles to download the S3 log dir. Example metadata.json:
{
  "outputs": [
    {
      "type": "tensorboard",
      "source": "s3://invitae-ai-training-shared/kubeflow/experiments/ccccfe49-a23e-4a5b-9684-a3e7c5e26095/runs/17400f2b-11dc-4729-b7dc-f2336a81aadd/logs"
    }
  ]
}
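When chasing permission errors it helps to know exactly which bucket and prefix the viewer pod will try to read. The following is a small sketch (the helper names are hypothetical, not KFP code) that pulls the TensorBoard sources out of a metadata dict and splits each S3 URI into bucket and key prefix:

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/prefix into (bucket, prefix); reject anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def tensorboard_s3_locations(metadata: dict) -> List[Tuple[str, str]]:
    """Return (bucket, prefix) for every tensorboard output in the metadata."""
    return [
        split_s3_uri(output["source"])
        for output in metadata.get("outputs", [])
        if output.get("type") == "tensorboard"
    ]
```

The resulting bucket and prefix are what the IAM role attached to the viewer pod needs read access to.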
__Also, the KFP docs say that _"The pipeline component must write a JSON file specifying metadata for the output viewer(s) that you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json"_, but we've found that the filename can be anything when specifying outputs for Python components. What matters is that there is an artifact named mlpipeline-ui-metadata, or else outputs do not work, TensorBoard outputs included.__ Example working output for a Python pipeline component:
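from kfp import dsl

# docker_image is assumed to be defined elsewhere in the pipeline code.
op = dsl.ContainerOp(
    name="Write tensorboard metadata",
    image=docker_image,
    command=["sh", "-c"],
    # Placeholder: substitute a command that produces metadata.json.
    arguments=["... some command that produces metadata.json ..."],
    # The artifact name must be mlpipeline-ui-metadata; the file name is arbitrary.
    file_outputs={"mlpipeline-ui-metadata": "metadata.json"},
)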
It would be great to update the KFP docs for the above metadata.json issue, since debugging it cost me a few hours and I felt a bit misled by the existing documentation.
cc @nlarusstone since you were asking in the #kubeflow-pipelines slack channel.
I should note, none of this worked until we bumped to KFP standalone version: v1.0.1
(https://github.com/kubeflow/pipelines/releases/tag/1.0.1)