Addons: `image.rotate` Segfault with TensorFlow 2.2.0rc0 and tf-nightly

Created on 13 Mar 2020  路  21Comments  路  Source: tensorflow/addons

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Colab
  • MacOs 10.15.3

    • TensorFlow version and how it was installed (source or binary):
  • !pip install tensorflow==2.2.0rc0

  • !pip install tf-nightly==* # All nightlies I can see are affectd.

    • TensorFlow-Addons version and how it was installed (source or binary):
  • pip install tensorflow_addons==0.8.3

  • pip install tfa-nightly

    • Python version:
  • Colab: 3.6.9

  • macOS: 3.7.5

    • Is GPU used? (yes/no):
  • No.

Describe the bug

tfa.image.rotate segfaults.

It's the only tfa op I tried.

Code to reproduce the issue

Colab:

https://drive.google.com/open?id=1T5le-3aLhzsxa62uFGOsc6UpzzTEwN3a

Easy to reproduce locally on my macbook.

Other info / logs

See notebook.

I followed it in python up to this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/execute.py#L59

All 21 comments

We're working on the 2.2 release right now, so it would be good to determine if this is a problem in tensorflow, or a problem in tfa.

Thanks @MarkDaoust we've also seen issue with users running tf-nightly. I'll have some time to look into this on the weekend.

@yifeif @gunan @perfinion @angerson Does anything come to mind that would break compaitbility. TFA links against libtesorflow of TF2.1 but I believe we'll still be shipping manylinux2010 whls built with dt7 for 2.2? Not sure why we would be getting segfaults.

@seanpmorgan Since the ABI is not compatible from TF 2.1.0 to 2.2.0, isn't that completely normal that we get segfaults? We compile the nightly against 2.1, so it won't work with 2.2. We just need to recompile right? Or I am missing something here?

When running the notebook with !pip install tfa-nightly I get the warning:

This version of TensorFlow Addons requires TensorFlow 2.1.0.
Detected an installation of version 2.2.0-dev20200201.

While some functions might work, TensorFlow Addons was not tested
with this TensorFlow version. Also custom ops were not compiled
against this version of TensorFlow. If you use custom ops,
you might get errors (segmentation faults for example).

It might help you to fallback to pure Python ops with
TF_ADDONS_PY_OPS . To do that, see
https://github.com/tensorflow/addons#gpucpu-custom-ops

If you encounter errors, do *not* file bugs in GitHub because
the version of TensorFlow you are using is not supported.

So I don't really get why this segfault is a problem.

@seanpmorgan Since the ABI is not compatible from TF 2.1.0 to 2.2.0, isn't that completely normal that we get segfaults? We compile the nightly against 2.1, so it won't work with 2.2. We just need to recompile right? Or I am missing something here?

So to my knowledge TF2.2 and TF2.1 should be compiled in the same way (devtoolset7/manylinux2010 compliant). Linking to the same ops shouldn't raise a segfault. We would like our package to be compatible with multiple versions of TF so that for example: our build matrix 0.8.3 can be compatible with tensorflow>=2.1.

For the moment the version check is the best way to prevent a poor user experience, but it shouldn't be required.

We didn't have this issue when moving from 2.0 -> 2.1. The reason for the incompatibility at that point was we started using the @tf.keras.utils.register_keras_serializable decorator which wasn't available in TF2.0.

I see, I though that even if the same toolchain was used, there was no garantee that the ABI would stay compatible if the C++ code in tensorflow changed. Thanks for the explanation.

I see, I though that even if the same toolchain was used, there was no garantee that the ABI would stay compatible if the C++ code in tensorflow changed. Thanks for the explanation.

Yeah no ABI compatibility between compiler toolchains. When someone uses TFA with conda compiled TF we get a symbol mismatch because the ops are mangled differently (among other possible issues).

If the C++ code in tensorflow is changed it could break if not backwards compatible, but the ops should all be public and stable between minor versions.

When someone uses TFA with conda compiled TF we get a symbol mismatch because the ops are mangled differently (among other possible issues).

Another reason why there should be a standard way(which IMO should be conda) going forward so that we don't run into such issues

Another reason why there should be a standard way(which IMO should be conda) going forward so that we don't run into such issues

Agree https://github.com/tensorflow/community/pull/133 is the fix for that but haven't heard any updates in a while. If that doesn't happen though then conda first would be great, but there's still too many python users not on conda to be excluded

So if that helps:

I tried with tensorflow-cpu to see if we could get the segfault.

pip install tensorflow-cpu==2.1.0
python configure.py --no-deps
bash tools/install_so_files.sh
pip install tensorflow-cpu==2.2.0rc0
pytest tensorflow_addons/image
# segfault here

But recompiling works:

pip install tensorflow-cpu==2.2.0rc0
python configure.py --no-deps
bash tools/install_so_files.sh
pytest tensorflow_addons/image
# tests are passing

I opened a pull request to allow us to transition gracefully to 2.2.0 without having to make an enormous pull request at once: #1303

I hope this help, my knowledge of C++/compiling is limited.

Thanks very helpful! If I'm unable to find a resolution to this we'll likely need to backport the strict version check to 0.8 prior to a 0.9 release pinned to 2.2

So I tested using gdb for an image op and activation op and they both are crash on the same call. Unsure if this is just the first call or the only cause (both are ABSL string calls):

pip install tensorflow_addons
pip install tensorflow==2.2rc0
import tensorflow as tf
import tensorflow_addons as tfa

x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=tf.dtypes.float32)
tfa.activations.gelu(x, True)
gdb python
run debug_gelu.py

yields:

[New Thread 0x7ffea3fff700 (LWP 31150)]
[New Thread 0x7ffe9bfff700 (LWP 31151)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff9806c6e8 in tensorflow::AttrSlice::Find(absl::string_view) const () from /home/sean-morgan/miniconda3/envs/addons-2.2/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so.2

In order to confirm that that linking to libtensorflow_framework.so.2 should work regardless of which version it was compiled against; I installed TFA==0.6.0 (compiled on TF2.0) and ran the same script above with TF2.1 installed. It completed successfully

pip install tensorflow_addons==0.6.0 --no-deps
pip install tensorflow==2.1
python debug_gelu.py
tf.Tensor([-0.04540229 -0.158808    0.          0.841192    1.9545977 ], shape=(5,), dtype=float32)

@mihaimaruseac I was wondering if you could weight in on this?
1) Is it a correct assumption that linking to libtensorflow_framework.so.2 should be stable if compiled against TF2.1 and linking at runtime of TF2.2?
2) Have there been any changes to ABSL that might have caused this? I know there were some RFCs for TF strings but not sure whats been changed on 2.2

FWIW I checked the activation ops compiled against TF2.0 and they work successfully with TF2.0 or TF2.1 but fail on TF2.2rc0. At this point I'm pretty confident there was a breaking change in the 2.2 libtensorflow

Afaik we don't have compatibility guarantees for libtensorflow. So we cannot link against libtensorflow built at a different version.

Afaik we don't have compatibility guarantees for libtensorflow. So we cannot link against libtensorflow built at a different version.

Thanks! That's rather unfortunate because it means users who want new features from Addons will be required to upgrade their TF core version which is a tough ask for many users. I'll make a post on developers group discussion to describe this issue and see what we can do going forward.

@mihaimaruseac if I understand correctly, tensorflow won't give any backward compatibilities garantees about the ABI until https://github.com/tensorflow/community/pull/133 is implemented, is that right? Is that the case too when the minor version in increased? If we compile against TF 2.1.0, will it work with 2.1.1?

@gabrieldemarmiesse patch number increase should be compatible

There are multiple RFCs for modularization and to add ABI guarantees. For example, I'm working on https://github.com/tensorflow/community/pull/101 and the general ideas are in https://github.com/tensorflow/community/pull/77

Just to be clear, if there are breaking changes to libtensorflow as minor versions increase then ABI stable kernels (https://github.com/tensorflow/community/pull/133) will still crash right? We're seeing this issue when compiling with the same toolchain so the ABI should be the same it's just that underlying op has changed

That's rather unfortunate because it means users who want new features from Addons will be required to upgrade their TF core version which is a tough ask for many users.

Agreed, especially if the user has compiled TF from source

They could change something in their public API, too. For example, TFA creates a helper object declared in the "outdated" TF 2.1 headers but TF 2.2 doesn't expect it of course. C++ name mangling does not handle it.

I'm going to close this issue as the bug was adressed and we can't do anything about it. I'm going to open another one about how to make a better UX for the users affected by the unstability of the ABI.

Was this page helpful?
0 / 5 - 0 ratings