While working on eclipse/omr#3811 I made a mistake where I accidentally removed the getSupportsHardwareSQRT function. We then ended up defaulting down to the base implementation which returns false [1]. This was a performance mistake and there should have been no functional side-effects, however there were. We ended up crashing in OpenJ9 running trivial tests.
I took a quick look at the trace files and saw the following trees in the good (return true) and bad cases (return false):
------------------------------
n26n ( 0) treetop
n24n ( 1) dcall java/lang/Math.sqrt(D)D[#362 final static Method] [flags 0x20500 0x0 ] (in FPR_0049)
n23n ( 0) dconst 2.14748e+09 [0x41dfffffffc00000]
------------------------------
[ 0x3ffc97e15c0] LARL GPR_0050, &<LiteralPool Base Address> ; LoadLitPool
[ 0x3ffc97e1700] LD FPR_0049,#531 =X(40e6a09e66689b2e) 0(GPR_0050)
------------------------------
n180n ( 0) treetop
n181n ( 1) dcall java/lang/StrictMath.sqrt(D)D[#413 final native static Method] [flags 0x20500 0x0 ] (in GPR_0129)
n188n ( 0) aladd (in GPR_0129)
n182n ( 0) ==>aloadi (in GPR_0128) (classPointerConstant sharedMemory )
n183n ( 0) lconst 48
n187n ( 1) dconst 2.14748e+09 [0x41dfffffffc00000]
------------------------------
[ 0x3ffc966ca30] LA GPR_0129,#537 48(GPR_0128)
[ 0x3ffc966cb20] SQDBR GPR_0129,GPR_0129
Taking a look at the trace files (sorry I forgot to save these, but they're trivial to reproduce) we seem to inline the java/lang/Math.sqrt(D)D static method to expose the java/lang/StrictMath.sqrt(D)D static native method. Then in [2] when processing the static native method the first implicit argument has to be the jclass of the class containing the static native method. This is why we have the aladd as the first argument.
However the implementation of the intrinsic [3] does not seem to accept static native methods, incorrectly assuming the first child is the value we are to square root. This is obviously a mistake which needs to be fixed properly.
[1] https://github.com/eclipse/omr/blob/52a5bee75361651987e703990773084cecaa07e4/compiler/env/OMRCPU.hpp#L109
[2] https://github.com/eclipse/openj9/blob/e1f439afb0278f314d18fe98c9229906cbc3dc3f/runtime/compiler/il/J9Node.cpp#L413
[3] https://github.com/eclipse/openj9/blob/e1f439afb0278f314d18fe98c9229906cbc3dc3f/runtime/compiler/z/codegen/J9TreeEvaluator.cpp#L9472-L9523
I was able to reproduce this problem by modifying the getSupportsHardwareSQRT to return false, and with the following java program:
public static int value = 9;
public static void main(String[] args)
{
double total = 0;
for (int i = 0; i < 5; i++)
{
total += sq(value);
}
System.out.println(total);
}
public static double sq(double value)
{
double res = StrictMath.sqrt(value);
System.out.println(res);
return res;
}
And I ran it with the following command:
java -Xjit:count=1,disableAsyncCompilation,{testsqrt.sq*}\(traceFull,traceCG,log=trace.log,lastOptIndex=0\),dontInline={*testsqrt.sq*},disableSuffixLogs testsqrt
Instead of using dcall for java_lang_Math_sqrt and java_lang_StrictMath_sqrt, we inline the dsqrt evaluator in the J9RecognizedCallTransformer.cpp. In the dsqrt evaluator, we do the same thing as what's in inlineMathSQRT. Also, for the case that getSupportsHardwareSQRT returns false when using java_lang_StrictMath_sqrt(the situation mentioned in the issue), we anchor and delete the reference of the first child (jclass). For now, the transform and inline only happens in the z machine
After the code change, we have some tests on the situations we might affect. And we have the before/after recognizedCallTransformer of the trees below:
java_lang_StrictMath_sqrt and getSupportsHardwareSQRT=false1.1 before
n5n treetop [ 0x3ff94636480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n6n dcall java/lang/StrictMath.sqrt(D)D[#360 final native static Method] [flags 0x20500 0x0 ] () [ 0x3ff946364d0] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=2 flg=0x40000
n9n aladd [ 0x3ff946365c0] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=2
n7n aconst 0x1a8d00 (java/lang/StrictMath.class) (classPointerConstant sharedMemory ) [ 0x3ff94636520] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x10000
n8n lconst 48 [ 0x3ff94636570] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ff946363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n10n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ff94636610] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n6n ==>dcall
1.2 after
n40n treetop [ 0x3ff94636f70] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n9n aladd [ 0x3ff946365c0] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=2
n7n aconst 0x1a8d00 (java/lang/StrictMath.class) (classPointerConstant sharedMemory ) [ 0x3ff94636520] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x10000
n8n lconst 48 [ 0x3ff94636570] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n10n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ff94636610] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n6n dsqrt [ 0x3ff946364d0] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ff946363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
java_lang_StrictMath_sqrt and getSupportsHardwareSQRT=true2.1 before
n5n treetop [ 0x3ff86803480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n dcall java/lang/StrictMath.sqrt(D)D[#360 final native static Method] [flags 0x20500 0x0 ] [ 0x3ff86803430] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ff868033e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n6n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ff868034d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
2.2 after
n6n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ff868034d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n dsqrt [ 0x3ff86803430] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ff868033e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
java_lang_Math_sqrt and getSupportsHardwareSQRT=false3.1 before
n5n treetop [ 0x3ffb4036480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n dcall java/lang/Math.sqrt(D)D[#360 final static Method] [flags 0x20500 0x0 ] [ 0x3ffb4036430] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ffb40363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n6n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ffb40364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n ==>dcall
3.2 after
n6n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ffb40364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n dsqrt [ 0x3ffb4036430] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ffb40363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
java_lang_Math_sqrt and getSupportsHardwareSQRT=true4.1 before
n5n treetop [ 0x3ff90036480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n dcall java/lang/Math.sqrt(D)D[#360 final static Method] [flags 0x20500 0x0 ] [ 0x3ff90036430] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ff900363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n6n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ff900364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n ==>dcall
4.2 after
n6n dstore <auto slot 2>[#361 Auto] [flags 0x6 0x0 ] [ 0x3ff900364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n dsqrt [ 0x3ff90036430] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n dload <parm 0 D>[#358 Parm] [flags 0x40000106 0x0 ] [ 0x3ff900363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
Looks good. Can the solution be extended to x86 and Power? Do we support Java semantics in the dsqrt evaluators across platforms? We do on Z, question is for x86 and Power @andrewcraik @gita-omr, thoughts? I'd rather reduce these calls to IL given it exists than intrinsic recognized calls.
Yes, we do support the Java semantics in the dsqrt evaluators on Z. And we have some questions for x86 which @dchopra001 may want to explain. I haven't looked at the Power.
For x86 issues of rounding mode and similar enter the picture - will have to study carefully if such a reduction is possible.
Looking at x86 inlined function and the dsqrt evaluators, they appear to be equivalent [1] [2]. @simonameng I notice on Z we blindly just use libc sqrt function to compute the constant case while x86 inlined function seems to have additional logic for some corner cases. Can we check that we're doing the right thing by writing a few unit tests against those cases?
We need to common up all this logic for handling constant case and reducing to IL (if deemed functionally correct) in a common optimization to avoid such potential bugs.
[1] https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L8984-L9091
[2] https://github.com/eclipse/omr/blob/b8ed7176897ffa6199bb5cbff970a15717dccc9f/compiler/x/codegen/FPTreeEvaluator.cpp#L681-L723
So we looked into reducing this call to a dsqrt IL on the x86 platform. The evaluator is here:
There seems to be some legacy code here guarded by if (value->getKind() != TR_FPR) // Legacy supported for X87, to be deleted. This evaluator was introduced by this PR last year: https://github.com/eclipse/omr/pull/2575. There was a plan to remove this legacy code pending another pull request. See [1] and [2] for more information. We should follow up on that PR and try to get it merged so that this legacy code can be removed as part of @simonameng's work.
The "legacy" path is taken in fpsqrtEvaluator when the value register is not an FPR. In the fall back path we generate a SQRTSDRegReg. On the other hand, the inlineMathSQRT routine will generate a DSQRTReg instruction when the register kind is not an FPR:
https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L9068-L9081
Which one is the correct approach here? When do we prefer one over the other? Is inlineMathSQRT not supported for x87?
@simonameng Can you look into the differences between these instructions? And if they satisfy the Java specification for the StrictMath.sqrt API?
dsqrt is a constant. In this case we use the sqrt helper provided by the C runtime:sqrt library helper obeys the same rules as the java spec. So why not just use that helper? [1] https://github.com/eclipse/omr/pull/2575/files#r212719412
[2] https://github.com/eclipse/omr/pull/2822
I should also mention that we do some handling in the ILGen for StrictMath.sqrt that might need to be looked at again. While processing this JNI call, we do this check:
https://github.com/eclipse/openj9/blob/9fd0c9b532f1122ce2fc3f5405134f9e950190d0/runtime/compiler/il/J9Node.cpp#L303-L304
If this query returns false, then we end up continuing further into the function and add the unexpected firstChild that Filip describes in the initial post. The child is added here:
https://github.com/eclipse/openj9/blob/9fd0c9b532f1122ce2fc3f5405134f9e950190d0/runtime/compiler/il/J9Node.cpp#L395-L444
Adding this child seems to be the default approach for all JNI calls that end up in this routine. However we choose not to do it because we know that the dcall StrictMath.sqrt node will eventually be handled by inlineMathSQRT which is aware of the fact that there is no "extra" child. Not immediately obvious, but it looks like this was done by design. We should probably remove this check in the ILGen since we now handle recognizing and transforming these calls in the optimizer.
So you guys are the StrictMath.sqrt will return the specific value required in the Javadoc specification http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/StrictMath.html#sqrt(double)?
I suspect that on x86 the AVX sqrt instruction will generate a valid result for Math.sqrt but I'd be loathed to say it will be 100% right for StrictMath.sqrt
I suspect that on x86 the AVX sqrt instruction will generate a valid result for Math.sqrt but I'd be loathed to say it will be 100% right for StrictMath.sqrt
java.lang.Math.sqrt maps to java.lang.StrictMath.sqrt [1]. There is no difference between them.
There is a difference in that the Javadoc specification for StrictMath.sqrt is more strict. You could recognize Math.sqrt and do things to it that you can't do to StrictMath.sqrt
So you guys are the StrictMath.sqrt will return the specific value required in the Javadoc specification http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/StrictMath.html#sqrt(double)?
On Z the instruction we are using does follow the same rules.
I suspect that on x86 the AVX sqrt instruction will generate a valid result for Math.sqrt.
Can you point us to this instruction? We tried looking through the ISA but for a lot of the sqrt instructions the ISA doesn't clarify what would happen in the special cases.
As I say AVX may be good enough for Math.sqrt on x86 but I wouldn't bet on it being 100% correct for StrictMath.
@jdmpapin can you liase on the spec and x86 instructions please?
There is a difference in that the Javadoc specification for StrictMath.sqrt is more strict.
Would you be able to point us to the differences? Having looked at the java doc I don't see a difference:
http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/StrictMath.html#sqrt(double)
http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/Math.html#sqrt(double)
The extra line in Math.sqrt's "return" section saying :
If the argument is NaN or less than zero, the result is NaN.
looks to be redundant.
So after very careful reading it looks like the spec on _this_ function is the same. In general it is not true - seems that because they added the nearest to the actual mathematical value clause to the return that is what ties them to be equivalent. There are many others in these classes where that is not true. @jdmpapin is trying to see if he can find the x86 specs - if we are in doubt we'll just not use the reduction to avoid imprecision which can be a nightmare bug to find.
I'm not seeing what the problem here is. At the end of the day the inlineMathSQRT function on x86 behaves identically to the dsqrt evaluator in that we will generate the same instruction sequence for both as we can see in [1] and [2]. Furthermore we already reduce both java.lang.Math.sqrt and java.lang.StrictMath.sqrt to inlineMathSQRT as seen in [3].
The work here is to common up the logic between platforms and do the transformation at a common level to reduce these Java calls into the dsqrt IL. The functionality should be equivalent to what we do today.
[1] https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L8984-L9091
[2] https://github.com/eclipse/omr/blob/b8ed7176897ffa6199bb5cbff970a15717dccc9f/compiler/x/codegen/FPTreeEvaluator.cpp#L681-L723
[3] https://github.com/eclipse/openj9/blob/e25c46b8dc2cf76db966a1052b2d0d730d748fa6/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L10044-L10048
@fjeremic ok if it is already going through the instruction I'm fine with the commoning - I thought the question was whether the instruction would work. I'm trying to look at this and several other things so I clearly misunderstood the question. If the reduction is already happening commoning the implementation is fine with me. If the reduction is not happening we need to be sure the x86 instruction is spec compliant. Thanks for your patience!
For the record, it does look safe to use sqrtsd for Math.sqrt(). The following isn't a guarantee, but it's pretty suggestive, especially combined with the fact that we've already been reducing Math.sqrt() to sqrtsd without any (known) trouble.
The documentation mentions no limitation on the precision of sqrtsd and sqrtss, as compared to the approximate rsqrtss, which has an explicit error bound. Additionally, IEEE requires square root to be correctly rounded (or so the internets claim - I haven't seen it). From the Intel manual volume 1 appendix E, section E.4:
SSE/SSE2/SSE3 extensions are 100% compatible with the IEEE Standard 754 for Binary Floating-Point Arithmetic, satisfying all of its mandatory requirements [...]
There are caveats about special modes, results provided to FP exception handlers, and simultaneous exceptions in the case of vector operations, none of which should affect our use of the instruction.
Further, tables E-15 and E-18 describe situations in which the operand is a denormal or the result is inexact, respectively. Both tables group sqrtss and sqrtsd with the usual arithmetic instructions addss, addsd, subss, subsd, etc., with results rounded to the destination precision. Note that sqrtss and sqrtsd can't overflow or underflow because they reduce the magnitude of the exponent.
That leaves the corner cases.
Math.sqrt().@fjeremic do you see Power producing TR::dsqrt opcode? I don't see it, although dsqrtEvaluator does exist.
@fjeremic do you see Power producing TR::dsqrt opcode? I don't see it, although dsqrtEvaluator does exist.
@simonameng is currently testing this. I'll let her report back. However I do see inlineDoublePrecisionFP being used on Power and it reduces to TR::InstOpCode::fsqrt so I presume it will work the same as x86 and Z.
@simonameng can you detail some of the experiments you performed on Power? Do we currently reduce a java.lang.Math.sqrt call to an TR::InstOpCode::fsqrt instruction? And following the transformation in #6217 do we also generate the same instruction on Power?
@gita-omr @fjeremic Yes. I wrote a unit test and tested on Power machine. Both java.lang.Math.sqrt and java.lang.StrictMath.sqrt will be transformed to dsqrt and using dsqrtEvaluator. The dsqrtEvaluator will generate the TR::InstOpCode::fsqrt instruction.
Excellent, thanks for checking current behavior!
Most helpful comment
@fjeremic ok if it is already going through the instruction I'm fine with the commoning - I thought the question was whether the instruction would work. I'm trying to look at this and several other things so I clearly misunderstood the question. If the reduction is already happening commoning the implementation is fine with me. If the reduction is not happening we need to be sure the x86 instruction is spec compliant. Thanks for your patience!