The master branch is broken because the Docker image we use to test on GPUs is the one TensorFlow uses to test its own source code. That image can be considered private API.
Instead, to build our manylinux2010 wheels and test our code, we should use the official way recommended by tensorflow (the same way external users do it).
Here is the official guide: https://github.com/tensorflow/custom-op
The docker images we should use are from the official dockerhub repo:
tensorflow/tensorflow:2.1.0-custom-op-gpu-ubuntu16 # gpu
tensorflow/tensorflow:2.1.0-custom-op-ubuntu16 # cpu
Those images are only "manylinux2010 compatible" in a loose sense, since they are based on Ubuntu.
By doing that, our testing environment will become stable again, and we shouldn't suffer sudden build breaks.
Thanks for the issue! I agree using a nosla docker image is not the way we should be doing this.
So a couple of things:
1) I'm not convinced of the reason GPU builds are breaking today. Almost looks like a pip networking issue. Not sure why this has just started today. The error:
ERROR: Could not find a version that satisfies the requirement tensorflow==2.1.0 (from -r build-requirements.txt (line 1)) (from versions: none)
ERROR: No matching distribution found for tensorflow==2.1.0 (from -r build-requirements.txt (line 1))
2) We used to use tensorflow/tensorflow:custom-op-ubuntu16 but the images are rarely updated (something we can probably fix with more communication) and we had an issue when TF incremented their cuDNN version. We can build from a base image, but we can't programmatically pull in a new cuDNN since this requires an NVIDIA account.
3) The new tensorflow/tensorflow:2.1.0-custom-op-gpu-ubuntu16 looks promising and wasn't available before. This should align the correct cuDNN in the container, but I wonder what the release process for it is. Come the 2.2 release candidates, how long will it take for us to get a new docker image? I suppose we can use the old one provided CUDA and cuDNN match. It also has the needed devtoolset-7 & devtoolset-8 for building manylinux2010 (/dt7/ in the container).
@av8ramit Do you happen to know why our GPU tests are failing with the above error? I'm not sure why (as of today) pip is unable to locate TF 2.1.0 on PyPI.
I've tested the image locally and it's not a network error :(
Looks like it's because python3.8 was added to the container.
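To illustrate why a Python 3.8 pip would report "from versions: none", here is a minimal sketch of the wheel-tag filtering involved. It assumes the TF 2.1.0 Linux wheels on PyPI only cover CPython 3.5 to 3.7; the filenames below are illustrative, and the helper names are hypothetical. Real pip also checks the ABI and platform tags, which this sketch ignores.

```python
# Illustrative wheel filenames (assumption: TF 2.1.0 ships no cp38 wheel).
ILLUSTRATIVE_WHEELS = [
    "tensorflow-2.1.0-cp35-cp35m-manylinux2010_x86_64.whl",
    "tensorflow-2.1.0-cp36-cp36m-manylinux2010_x86_64.whl",
    "tensorflow-2.1.0-cp37-cp37m-manylinux2010_x86_64.whl",
]


def python_tag(wheel_filename):
    """Return the python tag of a wheel filename.

    Wheel names end with -<python tag>-<abi tag>-<platform tag>.whl,
    so the python tag is the third field from the end.
    """
    stem = wheel_filename[: -len(".whl")]
    return stem.split("-")[-3]


def candidates(wheels, major, minor):
    """Keep only wheels whose python tag matches CPython major.minor.

    Simplification: compares the python tag only, not ABI or platform.
    """
    wanted = "cp{}{}".format(major, minor)
    return [w for w in wheels if python_tag(w) == wanted]


if __name__ == "__main__":
    print(candidates(ILLUSTRATIVE_WHEELS, 3, 6))
    print(candidates(ILLUSTRATIVE_WHEELS, 3, 8))
```

Under that assumption, a pip bound to Python 3.8 ends up with an empty candidate list, which would match the error we saw.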
Nice find @seanmorgan! I still think we should switch images before they make more drastic changes. If they change the CUDA version, the build will break again.
Yeah, #1117 is hanging on by a thread, since the python3 symlink could change to py38, and then our configure script will break as well as the docker build.
Heads up @yongtang. I know TF IO uses the same image as us for builds. I'm not entirely sure about your pipeline, but in the image pip3 now points to py38 while python3 still points to python3.6.
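A quick way to catch this kind of pip/python mismatch in a build script is sketched below. It relies on the fact that `pip --version` ends with the interpreter version in parentheses, e.g. `(python 3.8)`. The function names are hypothetical and this is not part of our current configure script.

```python
import re
import subprocess
import sys


def pip_python_version(pip_version_output):
    """Extract (major, minor) from the output of `pip --version`."""
    match = re.search(r"\(python (\d+)\.(\d+)\)", pip_version_output)
    if match is None:
        raise ValueError("unexpected pip --version output: %r" % pip_version_output)
    return int(match.group(1)), int(match.group(2))


def pip_matches_interpreter(pip_cmd="pip3"):
    """Return True if pip_cmd installs for the interpreter running this script."""
    result = subprocess.run(
        [pip_cmd, "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, unlike text=True
        check=True,
    )
    return pip_python_version(result.stdout) == tuple(sys.version_info[:2])


if __name__ == "__main__":
    if not pip_matches_interpreter("pip3"):
        print("pip3 is bound to a different Python than", sys.executable)
```

Calling `python3 -m pip install ...` instead of a bare `pip3` avoids the mismatch altogether, since pip then always runs under the interpreter you named.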
We would need a simple first pull request as proof of concept. In https://github.com/tensorflow/addons/blob/master/tools/docker/gpu_tests.Dockerfile we should use tensorflow/tensorflow:2.1.0-custom-op-gpu-ubuntu16 as the base image (after the FROM).
@failure-to-thrive would that be something you'd be interested to work on since you seem familiar with the build system?
I'll take a look and fix the image. In the meantime, is there any chance you can specify pip3.6? We did recently add python3.8 to the image.
Yeah, we're good on our side for the moment. Going forward I think we'll move away from the nosla tensorflow testing image, so you won't need to worry about us.
Yes, I highly recommend that, since we offer no support for it. That's our own internal image.
@failure-to-thrive would that be something you'd be interested to work on since you seem familiar with the build system?
Sorry, no. I'm mostly C++ and Python. Other things are casual.
@failure-to-thrive no worries, I'll do it :)