I'm trying to compile DALI with the following command, run from the docker directory:
BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh
DALI compiled and the wheel was generated, but when the script starts building the TF plugin I get the following error:
+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
./build.sh: line 261: nvidia-docker: command not found
I don't recall reading in the documentation about installing nvidia-docker. Is it really needed for building a plugin within a docker image?
I'm on Ubuntu 20.04, using the docker script.
I've installed nvidia-docker and the build process now gets further, but it still ends with an error:
+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python2.7/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
File "../qa/setup_packages.py", line 6, in <module>
import urllib.parse
ImportError: No module named parse
++ LAST_CONFIG_INDEX=
Hi @kindoblue,
Sorry for the confusion about nvidia-docker. I will update the documentation and the script to support the new syntax from the NVIDIA Container Toolkit (docker run --gpus all).
As for the second error, it looks like the script does not propagate the Python version properly and uses Python 2.7 for the TF plugin containers. I will try to post a fix soon and will get back to you when I have a PR.
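As a rough illustration of the two invocation styles mentioned above, the sketch below picks a launcher prefix depending on which executable is found on PATH. The helper name gpu_docker_run_prefix and the preference order are assumptions made for illustration only; this is not how build.sh is implemented.

```python
import shutil


def gpu_docker_run_prefix(which=shutil.which):
    """Return an argv prefix for starting a GPU-enabled container.

    Hypothetical helper. Assumption: the legacy nvidia-docker wrapper is
    used when it is installed, otherwise the `docker run --gpus all` syntax
    (which needs the NVIDIA Container Toolkit) is used.
    """
    if which("nvidia-docker") is not None:
        return ["nvidia-docker", "run"]
    if which("docker") is not None:
        return ["docker", "run", "--gpus", "all"]
    raise RuntimeError("neither nvidia-docker nor docker was found on PATH")


# Demonstration with a stand-in lookup so the example does not depend on
# what is installed locally: only plain docker is "available" here.
only_docker = {"docker": "/usr/bin/docker"}.get
print(" ".join(gpu_docker_run_prefix(which=only_docker)))
```

Passing the lookup function as a parameter keeps the choice easy to exercise without touching the real system.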
Thanks for reporting that.
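For reference, the ImportError in the log above is what Python 2.7 reports for import urllib.parse, because that module only exists in Python 3 (Python 2 ships a separate urlparse module). A minimal sketch of an import that works under either interpreter, shown purely for illustration and not as the fix applied in the PR:

```python
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse  # Python 2 fallback

# Placeholder URL, used only to show that the imported function works.
parts = urlparse("https://example.com/some/path")
print(parts.netloc)
```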
I adjusted the scripts and docs a bit in #2214.
On my machine it successfully built both the wheel and TF plugin.
It's worth mentioning that we recently started building a single wheel that is compatible with several minor Python versions.
It should be enough to use the script as follows, omitting the PYVER:
BUILD_TF_PLUGIN=YES CUDA_VERSION=10.0 ./build.sh
Hi, the PR has been merged. Can you check whether it helps, so we can close the issue?
I'm on vacation now. I'll be able to test the fix next week. Thanks.
I tried again. The build process fails trying to build the plugin, with the error:
+ export DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ docker run --gpus all --name extract_dali_tf_prebuilt_manylinux2010 nvidia/dali:cu100.build_tf_manylinux2010 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python3.6/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
File "../qa/setup_packages.py", line 409, in <module>
main()
File "../qa/setup_packages.py", line 400, in main
print (cal_num_of_configs(args.use, args.cuda) - 1)
File "../qa/setup_packages.py", line 365, in cal_num_of_configs
ret *= pckg.get_num_of_version(cuda_version)
File "../qa/setup_packages.py", line 140, in get_num_of_version
return len(self.get_all_versions(cuda_version))
File "../qa/setup_packages.py", line 218, in get_all_versions
return self.filter_versions(self.versions[cuda_version])
File "../qa/setup_packages.py", line 106, in filter_versions
return [str(v) for v in versions if v]
File "../qa/setup_packages.py", line 106, in <listcomp>
return [str(v) for v in versions if v]
File "../qa/setup_packages.py", line 46, in __bool__
(not self.python_max_ver or parse(PYTHON_VERSION) <= parse(self.python_max_ver))
TypeError: 'module' object is not callable
++ LAST_CONFIG_INDEX=
I tried to edit the file ../qa/setup_packages.py, but apparently my changes are not picked up (and the line numbers don't match), so perhaps the file is taken from a docker image?
PS: I used the original command line
BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh
because I only now realized that PYVER=3.7 can be omitted.
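For context on the TypeError in the traceback above: Python raises it whenever a name that is bound to a module gets called like a function. The snippet below reproduces that general failure mode with a standard-library module and shows a dependency-free way to compare versions. It is only an illustration; the actual cause inside setup_packages.py is not visible in this thread, and version_tuple is a hypothetical helper.

```python
import urllib.parse as parse  # here 'parse' names a module, not a function


def call_module_by_mistake():
    """Call the module object and return the resulting error message."""
    try:
        parse("3.6")
    except TypeError as err:
        return str(err)
    return None


def version_tuple(text):
    """Turn '3.6' into (3, 6) so versions compare numerically.

    Hypothetical helper; handles only dot-separated integer components.
    """
    return tuple(int(part) for part in text.split("."))


print(call_module_by_mistake())
print(version_tuple("3.6") <= version_tuple("3.10"))
```

Comparing tuples of integers avoids the string-comparison pitfall where "3.10" sorts before "3.6".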
======================================
PPS: I even tried the following command to work around the problem
BUILD_TF_PLUGIN=YES PREBUILD_TF_PLUGINS=NO CUDA_VERSION=10.0 ./build.sh
but then I get another error:
Writing nvidia-dali-tf-plugin-cuda100-0.26.0.dev0/setup.cfg
creating dist
Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd
Hmm, the source should be mounted into the Docker container, so I'm not sure what is going on here.
It's controlled by the BUILD_INHOST env variable.
You could try setting the REBUILD_BUILDERS env variable to YES so that it rebuilds the docker images from scratch; maybe there are some leftovers.
I will check this on Monday if the issue still persists.
There is an additional step in build.sh that prepares the plugin builder image. Please relaunch with REBUILD_BUILDERS=YES BUILD_TF_PLUGIN=YES PREBUILD_TF_PLUGINS=NO CUDA_VERSION=10.0 ./build.sh and see if that helps.
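If it is more convenient to drive the build from a script than to prefix the variables on the command line, something along these lines could work. This is a sketch: run_build is a hypothetical helper, the docker_dir default is an assumption about where build.sh lives, and only the variable names quoted in this thread are used.

```python
import os
import subprocess


def run_build(overrides, docker_dir="docker", base_env=None, dry_run=True):
    """Prepare (and optionally launch) ./build.sh with extra env variables.

    Hypothetical helper. With dry_run=True nothing is executed and the
    environment that would be used is simply returned.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(overrides)
    if not dry_run:
        # Runs the script from docker_dir and raises if it exits non-zero.
        subprocess.run(["./build.sh"], env=env, cwd=docker_dir, check=True)
    return env


settings = {
    "REBUILD_BUILDERS": "YES",
    "BUILD_TF_PLUGIN": "YES",
    "PREBUILD_TF_PLUGINS": "NO",
    "CUDA_VERSION": "10.0",
}
env = run_build(settings, dry_run=True)
print({name: env[name] for name in settings})
```

Switching dry_run to False would actually start the build, so it is left off by default here.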
First of all I pruned all the docker stuff on my system with the command:
docker system prune -a
Then I issued the command in the DALI/docker directory:
REBUILD_BUILDERS=YES BUILD_TF_PLUGIN=YES CUDA_VERSION=10.0 ./build.sh
Almost immediately the build script fails with an error similar to this one:
https://github.com/gliderlabs/docker-alpine/issues/307
It is probably due to my system (Ubuntu 20.04), but in any case I replaced all the calls in build.sh of the form
docker build...
with
docker build --network host...
After compiling half the world, I now have the following files in the wheelhouse directory:
➜ wheelhouse git:(master) ✗ ls -ltr
total 261760
-rw-r--r-- 1 ice ice 267728670 aug 29 08:45 nvidia_dali_cuda100-0.26.0.dev0-12345-py3-none-manylinux2014_x86_64.whl
-rw-r--r-- 1 ice ice 306643 aug 29 09:41 nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
I don't see any whl for the DALI TensorFlow plugin, just a tar.gz. Is it supposed to be like this? Note that the script ends with the following output:
Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd
Yes, the TensorFlow plugin is distributed as a source distribution, hence the .tar.gz. If you keep PREBUILD_TF_PLUGINS unchanged (it's YES by default), the archive will contain not only the sources but also the prebuilt plugin libraries.
During installation it checks whether the prebuilt libraries are compatible with the TensorFlow distribution you are using and, if so, installs them. If they are not compatible (for example, you have a TensorFlow built on your machine with a different compiler than expected), it will query TensorFlow for its configuration and attempt to build the plugin libraries during installation. If that fails, you will be notified of what didn't match in the configuration.
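The decision described above can be pictured roughly as follows. This is a simplified sketch, not the plugin's actual installer: plan_install and mismatches are hypothetical helpers, and the configuration keys and version strings are made-up placeholders.

```python
def mismatches(prebuilt, local):
    """Map each differing key to a (prebuilt value, local value) pair."""
    keys = sorted(set(prebuilt) | set(local))
    return {
        key: (prebuilt.get(key), local.get(key))
        for key in keys
        if prebuilt.get(key) != local.get(key)
    }


def plan_install(prebuilt_configs, local):
    """Use a prebuilt library if one matches exactly, else plan a source build.

    Hypothetical helper: returns the chosen action plus the differences
    against the closest prebuilt configuration, which can then be reported.
    """
    for config in prebuilt_configs:
        if not mismatches(config, local):
            return "use-prebuilt", {}
    closest = min(
        prebuilt_configs,
        key=lambda config: len(mismatches(config, local)),
        default={},
    )
    return "build-from-source", mismatches(closest, local)


# Placeholder values for illustration only.
shipped = [
    {"tf_version": "1.15", "compiler": "gcc-4.8"},
    {"tf_version": "2.3", "compiler": "gcc-7.3"},
]
local = {"tf_version": "2.3", "compiler": "gcc-9.3"}
print(plan_install(shipped, local))
```

Here no shipped configuration matches the local one, so the sketch falls back to a source build and reports the compiler difference.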
I managed to install the TF plugin with this command (setting CFLAGS because it wanted to compile the plugin):
CFLAGS="-I$CUDA_HOME/include $CFLAGS" pip install nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
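The same CFLAGS adjustment can be expressed in a script, for example when the install is automated. A minimal sketch, assuming CUDA_HOME is set; cflags_with_cuda_include is a hypothetical helper, and /usr/local/cuda is used only as a sample value.

```python
def cflags_with_cuda_include(env):
    """Prepend the CUDA include directory to any existing CFLAGS value.

    Hypothetical helper; raises KeyError if CUDA_HOME is not set.
    """
    include_flag = "-I{}/include".format(env["CUDA_HOME"].rstrip("/"))
    existing = env.get("CFLAGS", "")
    return (include_flag + " " + existing).strip()


# Sample environment, not read from the real system.
sample_env = {"CUDA_HOME": "/usr/local/cuda", "CFLAGS": "-O2"}
print(cflags_with_cuda_include(sample_env))
```

The result could then be placed in the CFLAGS entry of the environment passed to the pip install process.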
Well, it wasn't smooth, but I finally managed to get DALI and the TF plugin compiled. Thanks for the help.
Hi,
I'm glad it works.
I don't think you need to run docker system prune -a. REBUILD_BUILDERS=YES should rebuild the images if needed, or use the cached versions on your system if the code has not changed.
DALI 0.26 is available and should include the needed functionality.