Models: TF2 OD Training on AI Platform - TPU Distribution Strategy Failure.

Created on 15 Aug 2020 · 6 Comments · Source: tensorflow/models

Prerequisites

Please answer the following question for yourself before submitting an issue.

[x] I checked to make sure that this issue has not been filed already.

1. The entire URL of the documentation with the issue

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_training_and_evaluation.md

2. Describe the issue(s)

Under Google Cloud AI Platform, the documentation passes "--python-version 3.6" as a flag in the training and evaluation commands. Python 3.6 is not compatible with runtime version 2.1 (https://cloud.google.com/ai-platform/training/docs/runtime-version-list), so running the command as documented throws a Python compatibility error.
The following was highlighted, but not solved, in issue #8457. I have combined it with this documentation issue for brevity's sake, as both issues relate to the same command. Upon running the Google AI Platform TPU training command with Python 3.7 and TensorFlow (2.2+), one receives the following error: InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified, and the TPU distribution strategy fails. The full stack trace can be found in #8839.

Anyone any ideas? :D

Update: 15/08

From what I can infer from the errors, this is an incompatibility between the TPUs, which use Runtime 2.1 (and therefore TF 2.1), and the OD API, which requires TF 2.2 or later; no runtime is available for TF 2.2. The OD API therefore seems to be incompatible with running on AI Platform, in contradiction with what the documentation describes.

Am I barking up the wrong tree? If not, maybe this issue should be tagged as a bug.

research docs

Agiledom

👍1

All 6 comments

> From what I can infer from the errors, this is an incompatibility issue with the TPU’s using Runtime 2.1, which uses TF 2.1 and no runtime being available for TF2.2. Given the OD API uses TF2.2^, this seems to be incompatible with running on the AI platform

Seems like your hunch may be in line with this comment, where someone was having a similar issue but not on AI Platform.

I added print(f'TF VERSION: {tf.__version__}') to my AI Platform application, and it seems that packaging the Object Detection API as a dependency forces TF 2.3 to be installed. Based on the AI Platform documentation, my assumption is that we end up in a situation where the TPU worker still has Runtime 2.1, while the TF version on the main runtime has been updated to something later than 2.1.
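A minimal sketch of that check, for anyone who wants to reproduce it. The helper names `major_minor` and `matches_runtime` are hypothetical, and the hard-coded `"2.1"` is an assumption standing in for whatever was passed as the runtime version. It only compares major.minor numbers and says nothing about what the TPU worker itself is running.

```python
def major_minor(version: str) -> tuple:
    """Return (major, minor) as ints from a version string such as '2.3.0'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def matches_runtime(installed_tf: str, runtime_version: str) -> bool:
    """True when the installed TF major.minor equals the runtime's major.minor."""
    return major_minor(installed_tf) == major_minor(runtime_version)


if __name__ == "__main__":
    runtime = "2.1"  # assumption: the runtime version the job was submitted with
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        print(f"TF VERSION: {tf.__version__}")
        if not matches_runtime(tf.__version__, runtime):
            print(f"Installed TF {tf.__version__} does not match runtime {runtime}")
```

If the second message is printed inside the training job, the packaged dependencies have probably pulled in a newer TF than the runtime ships with.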

xNeophyte on 15 Aug 2020

👍1

@xNeophyte Hey! I saw this comment (and thread) during my research into this issue, and it does seem that the AI Platform TPU worker runtimes are lagging behind the TensorFlow versions. In my mind this makes it a docs issue, as it doesn't seem like you can currently train an OD model with AI Platform. Would love to be proven wrong!

Agiledom on 15 Aug 2020

👍1

@Agiledom It looks like Runtime 2.2 has been added to AI Platform. I'm hoping that this issue will go away on that version of the runtime. I won't get a chance to try it out for a few days, but when I do I'll let you know how it goes.

xNeophyte on 30 Aug 2020

❤1

@xNeophyte Thanks for this! Unfortunately, there isn't TPU support yet, but it should in theory work for GPUs. I too will run some tests and check back here with my findings.

Agiledom on 31 Aug 2020

Just wanted to confirm that TPU isn't supported yet for runtime 2.2. When I tried, I received this error message from AI Platform:
Error: The specified runtime version '2.2' is not supported for TPU training. Please specify a different runtime version.

xNeophyte on 5 Sep 2020

@Agiledom Looks like TPU Support has been added for Runtime 2.2. Hopefully it'll work now. I'll give it a try sometime in the next couple days.

xNeophyte on 28 Sep 2020
