When running PY=python3.7 ./pants test examples/tests/python/example_test/tensorflow_custom_op:tensorflow_custom_op, we get the exception undefined symbol: _ZN10tensorflow12OpDefBuilder5InputESs https://travis-ci.org/pantsbuild/pants/jobs/509277221#L2171
I suspected this was because we aren't compiling with the flags TensorFlow says you need to use:
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++11 -shared zero_out.cc -o zero_out.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2
Stack Overflow suggests this is the issue https://stackoverflow.com/a/49886418, and TF says to do these steps when compiling https://www.tensorflow.org/guide/extend/op#top_of_page
@cosmicexplorer explained this is likely not the issue for these reasons:
(1) we link with -ltensorflow_framework
(2) compiler_option_sets can be used for this sort of thing, but that shouldn't be the solution to this
A more likely cause is something else, e.g. the wheel used for the Py37 build being different and causing some issue.
I can repro this and am looking into it.
Note that https://www.tensorflow.org/guide/extend/op also says to use self.test_session(), which is deprecated, so there may be more surprises we get to scrape the tensorflow changelog for.
Ugh, it's literally just that the -D_GLIBCXX_USE_CXX11_ABI=0 line is out of date on the master version of https://www.tensorflow.org/guide/extend/op. It's supposed to be -D_GLIBCXX_USE_CXX11_ABI=1 (as in, it works in python 3.7 if you set it to that in pants.ini). Thinking now about how we might want to set different flags for different python versions.
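For anyone checking this themselves, here is a rough sketch of how the mismatch can be spotted. It assumes the flag list comes from tf.sysconfig.get_compile_flags() as in the snippet earlier in the thread; the helper names cxx11_abi_from_flags and string_abi_of_symbol are made up for illustration and are not part of pants or TensorFlow. The symbol check is only a heuristic: in the Itanium C++ mangling, "Ss" abbreviates the old std::string, while the newer libstdc++ ABI puts std::string in the std::__cxx11 namespace, so "__cxx11" shows up in the mangled name.

```python
ABI_FLAG_PREFIX = "-D_GLIBCXX_USE_CXX11_ABI="


def cxx11_abi_from_flags(flags):
    """Return the _GLIBCXX_USE_CXX11_ABI value found in a list of compiler flags, or None."""
    value = None
    for flag in flags:
        if flag.startswith(ABI_FLAG_PREFIX):
            # Keep the last occurrence if the flag is repeated.
            value = int(flag[len(ABI_FLAG_PREFIX):])
    return value


def string_abi_of_symbol(mangled):
    """Heuristic guess at which std::string ABI a mangled symbol refers to.

    This is a substring check, not a real demangler, so it can misfire on
    identifiers that happen to contain "Ss".
    """
    if "__cxx11" in mangled:
        return "cxx11"      # std::__cxx11::basic_string (ABI=1)
    if "Ss" in mangled:
        return "pre-cxx11"  # old std::string abbreviation (ABI=0)
    return "unknown"


if __name__ == "__main__":
    # The symbol from the failing Travis job.
    missing = "_ZN10tensorflow12OpDefBuilder5InputESs"
    print(string_abi_of_symbol(missing))
    # Hypothetical flag list; in practice pass tf.sysconfig.get_compile_flags().
    print(cxx11_abi_from_flags(["-I/some/include", "-D_GLIBCXX_USE_CXX11_ABI=1"]))
```

If the op's missing symbol looks like the old ABI while the installed TensorFlow reports ABI=1 (or the reverse), the two were compiled with different settings, which matches the undefined symbol error above.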
@Eric-Arellano could you try cherry-picking https://github.com/cosmicexplorer/pants/tree/fix-tensorflow-py37 and adding PANTS_NATIVE_BUILD_STEP_CPP_COMPILE_SETTINGS_DEFAULT_COMPILER_OPTION_SETS="[]" to the command line when using python 3.7 and telling me if that passes for you?
Separately, I'm concerned there's a serious caching issue with remote sources or wheel unpacking, or something much later in the pipeline (local dist building?), because once I got the test to pass with that it continued to pass even if the compiler options changed.
Yeah, if I do:
> ./pants clean-all
> rm -rf ~/.cache/pants/python_cache/requirements
> PY=python3.7 PANTS_NATIVE_BUILD_STEP_CPP_COMPILE_SETTINGS_DEFAULT_COMPILER_OPTION_SETS="[]" ./pants -ldebug test examples/tests/python/example_test/tensorflow_custom_op:tensorflow_custom_op
...
[run]
============== test session starts ===============
platform linux -- Python 3.7.2, pytest-3.6.4, py-1.8.0, pluggy-0.7.1
rootdir: /home/cosmicexplorer/tools/pants/.pants.d, inifile: /home/cosmicexplorer/tools/pants/.pants.d/test/pytest-prep/CPython-3.7.2/dbded1650eb9e447c6138ee15ce9ea49d5c0bd43/pytest.ini
plugins: cov-2.4.0, timeout-1.2.1
collected 2 items
examples/tests/python/example_test/tensorflow_custom_op/test_zero_out_op.py s.
- generated xml file: /home/cosmicexplorer/tools/pants/.pants.d/test/pytest/examples.tests.python.example_test.tensorflow_custom_op.tensorflow_custom_op/junitxml/TEST-examples.tests.python.example_test.tensorflow_custom_op.tensorflow_custom_op.xml -
====== 1 passed, 1 skipped in 2.90 seconds =======
2019-03-21 19:23:42.217096: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-21 19:23:42.237060: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200700000 Hz
2019-03-21 19:23:42.237623: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557dcac43ff0 executing computations on platform Host. Devices:
2019-03-21 19:23:42.237665: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
DEBUG] ProjectTree ignore_patterns: None
examples/tests/python/example_test/tensorflow_custom_op ..... SUCCESS
19:23:42 01:04 [junit]
it passes. So I think changing the default option sets to fix the C++ ABI probably works, but there's a caching issue to uncover here.
I think the caching issue might be an instance of pantsbuild/pex#159.
@cosmicexplorer PY=python3.7 PANTS_NATIVE_BUILD_STEP_CPP_COMPILE_SETTINGS_DEFAULT_COMPILER_OPTION_SETS="[]" ./pants -ldebug test examples/tests/python/example_test/tensorflow_custom_op:tensorflow_custom_op passed with the commit cherry picked! Good fix!
I'm going to continue to skip the test in https://github.com/pantsbuild/pants/pull/7261 but this commit looks like a great followup PR.
I'm homing in on the caching issue now. The fix should probably work in CI, but pants users may continue to see failures unless the caching is fixed. The problem is that fingerprinted options in the native backend's subsystems affect the C++ object files and shared libraries, but those options don't end up invalidating the python dist build task, so it's not aware of any changes. This is a general issue with caching in the pants v1 engine. One way to address this might be to map targets to their shared library output product, and then somehow mix the shared lib hash (or the fingerprint of the compile/link subsystems?) into the python_dist() target hash when calculating self.invalidated(). This deserves its own issue and PR.
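To make the "mix the shared lib hash into the target hash" idea concrete, here is a minimal sketch. It is not pants code: combined_fingerprint and its parameters are hypothetical, and how the result would be wired into self.invalidated() is left open. It only shows that folding the shared library contents and the native subsystem fingerprints into the key makes the key change whenever the compiler options change the output.

```python
import hashlib


def combined_fingerprint(target_fingerprint, shared_lib_bytes, native_option_fingerprints=()):
    """Fold a shared library's content hash and native option fingerprints into a target fingerprint.

    All names here are hypothetical; this only illustrates the key derivation.
    """
    hasher = hashlib.sha1()
    hasher.update(target_fingerprint.encode("utf-8"))
    hasher.update(b"\0")  # separator so adjacent fields cannot run together
    hasher.update(hashlib.sha1(shared_lib_bytes).hexdigest().encode("utf-8"))
    # Sort so the result does not depend on the order the subsystems were visited in.
    for option_fingerprint in sorted(native_option_fingerprints):
        hasher.update(b"\0")
        hasher.update(option_fingerprint.encode("utf-8"))
    return hasher.hexdigest()


if __name__ == "__main__":
    # Placeholder inputs standing in for a python_dist() target and two builds of its library.
    before = combined_fingerprint("python_dist-abc123", b"lib built with ABI=0")
    after = combined_fingerprint("python_dist-abc123", b"lib built with ABI=1")
    print(before != after)
```

With a key like this, a test that passed once would no longer keep passing from cache after the compiler options change, because the rebuilt shared library produces a different fingerprint.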
This is still broken! https://travis-ci.org/pantsbuild/pants/jobs/513399449#L1947 fixing now