Per #1172, we have only been able to get the precision within 1e-5. We need to determine why this is and whether it can be lowered.
That is expected, because the C++ gradient implementation is hand-crafted and [most likely] analytically simplified, so it doesn't accumulate round-off errors. That is why tf.custom_gradient was introduced:
https://www.tensorflow.org/api_docs/python/tf/custom_gradient
This decorator allows fine grained control over the gradients of a sequence for operations. This may be useful for multiple reasons, including providing a more efficient or numerically stable gradient for a sequence of operations.
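To make the point about analytically simplified gradients concrete, here is a minimal sketch adapted from the `log1pexp` example in the linked `tf.custom_gradient` documentation. It is not the op discussed in this issue; `log1pexp_naive` and `grad_at` are hypothetical names used only for illustration.

```python
import tensorflow as tf


def log1pexp_naive(x):
    # Autodiff chains the gradients of log and exp op by op,
    # so an overflow in exp(x) propagates into the gradient.
    return tf.math.log(1.0 + tf.exp(x))


@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)

    def grad(upstream):
        # Hand-simplified derivative: e / (1 + e) == 1 - 1 / (1 + e).
        return upstream * (1.0 - 1.0 / (1.0 + e))

    return tf.math.log(1.0 + e), grad


def grad_at(fn, value):
    """Return d fn / d x at a float32 scalar (hypothetical helper)."""
    x = tf.constant(value, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = fn(x)
    return tape.gradient(y, x)


naive_grad = float(grad_at(log1pexp_naive, 100.0))
custom_grad = float(grad_at(log1pexp, 100.0))
print("naive:", naive_grad, "custom:", custom_grad)
```

At `x = 100` the float32 `exp` overflows, so the op-by-op gradient degenerates while the simplified form stays finite. The discrepancy in this issue is far milder, but the mechanism is the same kind: a hand-written gradient performs fewer rounded operations than the autodiff chain.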
But... Why 1e-6? I see a lot of 1e-4 out there. Isn't that enough?
Another interesting finding - it fails only on CPU with float32.
So, my preliminary conclusion after hours of research (I can be wrong): it's quite natural to observe such a discrepancy. float32 has only about 7 decimal digits of precision, so accumulated round-off errors can climb to 1e-6 with ease.
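As a rough illustration of how quickly float32 round-off accumulates, here is a standard-library-only sketch. It emulates float32 by rounding every intermediate result through `struct`; the helper names `to_f32` and `f32_sum` are hypothetical, and the workload (summing 0.1 a thousand times) is just an assumed stand-in for a chain of gradient ops.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f32_sum(values):
    """Sum sequentially, rounding every partial sum to float32."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total


values = [0.1] * 1000
reference = math.fsum(values)      # accurately rounded float64 sum
approx = f32_sum(values)           # float32-style accumulation
rel_err = abs(approx - reference) / reference

print("float32 epsilon:", 2.0 ** -23)
print("relative error after 1000 additions:", rel_err)
```

A single float32 operation is accurate to roughly 1e-7, yet after a thousand additions the relative error here is already well above 1e-6, which is consistent with a test tolerance of 1e-6 being too tight for float32.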
Thanks a lot @failure-to-thrive for your investigation. Using 10e-6 was @Squadrick's suggestion in https://github.com/tensorflow/addons/pull/1137#issuecomment-592125603 . @Squadrick, if you agree with @failure-to-thrive, should we close this issue?