Openj9: Fix java/lang/Math.sqrt handling on Z

Created on 9 May 2019 · 24Comments · Source: eclipse/openj9

While working on eclipse/omr#3811 I made a mistake where I accidentally removed the getSupportsHardwareSQRT function. We then ended up defaulting down to the base implementation which returns false [1]. This was a performance mistake and there should have been no functional side-effects, however there were. We ended up crashing in OpenJ9 running trivial tests.

I took a quick look at the trace files and saw the following trees in the good (return true) and bad cases (return false):

------------------------------
 n26n     (  0)  treetop                                  
 n24n     (  1)    dcall  java/lang/Math.sqrt(D)D[#362  final static Method] [flags 0x20500 0x0 ] (in FPR_0049)
 n23n     (  0)      dconst 2.14748e+09 [0x41dfffffffc00000]
------------------------------

 [     0x3ffc97e15c0]   LARL    GPR_0050, &<LiteralPool Base Address>    ; LoadLitPool
 [     0x3ffc97e1700]   LD      FPR_0049,#531 =X(40e6a09e66689b2e) 0(GPR_0050)


------------------------------
 n180n    (  0)  treetop                                           
 n181n    (  1)    dcall  java/lang/StrictMath.sqrt(D)D[#413  final native static Method] [flags 0x20500 0x0 ] (in GPR_0129)
 n188n    (  0)      aladd (in GPR_0129)
 n182n    (  0)        ==>aloadi (in GPR_0128) (classPointerConstant sharedMemory )
 n183n    (  0)        lconst 48                                              
 n187n    (  1)      dconst 2.14748e+09 [0x41dfffffffc00000]           
------------------------------

 [     0x3ffc966ca30]   LA      GPR_0129,#537 48(GPR_0128)
 [     0x3ffc966cb20]   SQDBR   GPR_0129,GPR_0129

Taking a look at the trace files (sorry I forgot to save these, but they're trivial to reproduce) we seem to inline the java/lang/Math.sqrt(D)D static method to expose the java/lang/StrictMath.sqrt(D)D static native method. Then in [2] when processing the static native method the first implicit argument has to be the jclass of the class containing the static native method. This is why we have the aladd as the first argument.

However the implementation of the intrinsic [3] does not seem to accept static native methods, incorrectly assuming the first child is the value we are to square root. This is obviously a mistake which needs to be fixed properly.

[1] https://github.com/eclipse/omr/blob/52a5bee75361651987e703990773084cecaa07e4/compiler/env/OMRCPU.hpp#L109
[2] https://github.com/eclipse/openj9/blob/e1f439afb0278f314d18fe98c9229906cbc3dc3f/runtime/compiler/il/J9Node.cpp#L413
[3] https://github.com/eclipse/openj9/blob/e1f439afb0278f314d18fe98c9229906cbc3dc3f/runtime/compiler/z/codegen/J9TreeEvaluator.cpp#L9472-L9523

z bug jit

Source

fjeremic

Most helpful comment

@fjeremic ok if it is already going through the instruction I'm fine with the commoning - I thought the question was whether the instruction would work. I'm trying to look at this and several other things so I clearly misunderstood the question. If the reduction is already happening commoning the implementation is fine with me. If the reduction is not happening we need to be sure the x86 instruction is spec compliant. Thanks for your patience!

andrewcraik on 12 Jun 2019

👍2

All 24 comments

I was able to reproduce this problem by modifying the getSupportsHardwareSQRT to return false, and with the following java program:

public static int value = 9;

public static void main(String[] args) 
    {
    double total = 0;
    for (int i = 0; i < 5; i++)
        {
        total += sq(value);
        }
    System.out.println(total);
    }

public static double sq(double value)
    {
    double res = StrictMath.sqrt(value);
    System.out.println(res);
    return res;
    }

And I ran it with the following command:

java -Xjit:count=1,disableAsyncCompilation,{testsqrt.sq*}\(traceFull,traceCG,log=trace.log,lastOptIndex=0\),dontInline={*testsqrt.sq*},disableSuffixLogs testsqrt

simonameng on 27 May 2019

👍1

Instead of using dcall for java_lang_Math_sqrt and java_lang_StrictMath_sqrt, we inline the dsqrt evaluator in the J9RecognizedCallTransformer.cpp. In the dsqrt evaluator, we do the same thing as what's in inlineMathSQRT. Also, for the case that getSupportsHardwareSQRT returns false when using java_lang_StrictMath_sqrt(the situation mentioned in the issue), we anchor and delete the reference of the first child (jclass). For now, the transform and inline only happens in the z machine

After the code change, we have some tests on the situations we might affect. And we have the before/after recognizedCallTransformer of the trees below:

use java_lang_StrictMath_sqrt and getSupportsHardwareSQRT=false

1.1 before

n5n       treetop                                                                             [     0x3ff94636480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n6n         dcall  java/lang/StrictMath.sqrt(D)D[#360  final native static Method] [flags 0x20500 0x0 ] ()  [     0x3ff946364d0] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=2 flg=0x40000
n9n           aladd                                                                           [     0x3ff946365c0] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=2
n7n             aconst 0x1a8d00 (java/lang/StrictMath.class) (classPointerConstant sharedMemory )  [     0x3ff94636520] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x10000
n8n             lconst 48                                                                     [     0x3ff94636570] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ff946363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n10n      dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ff94636610] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n6n         ==>dcall

1.2 after

n40n      treetop                                                                             [     0x3ff94636f70] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n9n         aladd                                                                             [     0x3ff946365c0] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=2
n7n           aconst 0x1a8d00 (java/lang/StrictMath.class) (classPointerConstant sharedMemory )  [     0x3ff94636520] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x10000
n8n           lconst 48                                                                       [     0x3ff94636570] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n10n      dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ff94636610] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n6n         dsqrt                                                                             [     0x3ff946364d0] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ff946363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0

use java_lang_StrictMath_sqrt and getSupportsHardwareSQRT=true

2.1 before

n5n       treetop                                                                             [     0x3ff86803480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         dcall  java/lang/StrictMath.sqrt(D)D[#360  final native static Method] [flags 0x20500 0x0 ]  [     0x3ff86803430] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ff868033e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n6n       dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ff868034d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1

2.2 after

n6n       dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ff868034d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         dsqrt                                                                             [     0x3ff86803430] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ff868033e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0

use java_lang_Math_sqrt and getSupportsHardwareSQRT=false

3.1 before

n5n       treetop                                                                             [     0x3ffb4036480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         dcall  java/lang/Math.sqrt(D)D[#360  final static Method] [flags 0x20500 0x0 ]    [     0x3ffb4036430] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ffb40363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n6n       dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ffb40364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         ==>dcall

3.2 after

n6n       dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ffb40364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         dsqrt                                                                             [     0x3ffb4036430] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ffb40363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0

use java_lang_Math_sqrt and getSupportsHardwareSQRT=true

4.1 before

n5n       treetop                                                                             [     0x3ff90036480] bci=[-1,1,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         dcall  java/lang/Math.sqrt(D)D[#360  final static Method] [flags 0x20500 0x0 ]    [     0x3ff90036430] bci=[-1,1,17] rc=2 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ff900363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0
n6n       dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ff900364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         ==>dcall

4.2 after

n6n       dstore  <auto slot 2>[#361  Auto] [flags 0x6 0x0 ]                                  [     0x3ff900364d0] bci=[-1,4,17] rc=0 vc=0 vn=- li=- udi=- nc=1
n4n         dsqrt                                                                             [     0x3ff90036430] bci=[-1,1,17] rc=1 vc=0 vn=- li=- udi=- nc=1
n3n           dload  <parm 0 D>[#358  Parm] [flags 0x40000106 0x0 ]                           [     0x3ff900363e0] bci=[-1,0,17] rc=1 vc=0 vn=- li=- udi=- nc=0

simonameng on 12 Jun 2019

Looks good. Can the solution be extended to x86 and Power? Do we support Java semantics in the dsqrt evaluators across platforms? We do on Z, question is for x86 and Power @andrewcraik @gita-omr, thoughts? I'd rather reduce these calls to IL given it exists than intrinsic recognized calls.

fjeremic on 12 Jun 2019

Yes, we do support the Java semantics in the dsqrt evaluators on Z. And we have some questions for x86 which @dchopra001 may want to explain. I haven't looked at the Power.

simonameng on 12 Jun 2019

For x86 issues of rounding mode and similar enter the picture - will have to study carefully if such a reduction is possible.

andrewcraik on 12 Jun 2019

Looking at x86 inlined function and the dsqrt evaluators, they appear to be equivalent [1] [2]. @simonameng I notice on Z we blindly just use libc sqrt function to compute the constant case while x86 inlined function seems to have additional logic for some corner cases. Can we check that we're doing the right thing by writing a few unit tests against those cases?

We need to common up all this logic for handling constant case and reducing to IL (if deemed functionally correct) in a common optimization to avoid such potential bugs.

fjeremic on 12 Jun 2019

So we looked into reducing this call to a dsqrt IL on the x86 platform. The evaluator is here:

https://github.com/eclipse/omr/blob/06a198583ca252ecb878ab284ba82b6d3cf64042/compiler/x/codegen/FPTreeEvaluator.cpp#L681-L723

There seems to be some legacy code here guarded by if (value->getKind() != TR_FPR) // Legacy supported for X87, to be deleted. This evaluator was introduced by this PR last year: https://github.com/eclipse/omr/pull/2575. There was a plan to remove this legacy code pending another pull request. See [1] and [2] for more information. We should follow up on that PR and try to get it merged so that this legacy code can be removed as part of @simonameng's work.
The "legacy" path is taken in fpsqrtEvaluator when the value register is not an FPR. In the fall back path we generate a SQRTSDRegReg. On the other hand, the inlineMathSQRT routine will generate a DSQRTReg instruction when the register kind is not an FPR:
https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L9068-L9081

Which one is the correct approach here? When do we prefer one over the other? Is inlineMathSQRT not supported for x87?

@simonameng Can you look into the differences between these instructions? And if they satisfy the Java specification for the StrictMath.sqrt API?

On the Z side, we handle a case where the child of the dsqrt is a constant. In this case we use the sqrt helper provided by the C runtime:
https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/z/codegen/J9TreeEvaluator.cpp#L9917-L9924
On the x86 side however, we don't do this and handle the "special" cases (i.e. NaN normalization etc...) ourselves. There is even a specific comment indicating that we do not rely on the C runtime to provide the correct value:
https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L9036-L9056
Why is this? From a quick look we observed that the sqrt library helper obeys the same rules as the java spec. So why not just use that helper?

[1] https://github.com/eclipse/omr/pull/2575/files#r212719412
[2] https://github.com/eclipse/omr/pull/2822

dchopra001 on 12 Jun 2019

I should also mention that we do some handling in the ILGen for StrictMath.sqrt that might need to be looked at again. While processing this JNI call, we do this check:
https://github.com/eclipse/openj9/blob/9fd0c9b532f1122ce2fc3f5405134f9e950190d0/runtime/compiler/il/J9Node.cpp#L303-L304

If this query returns false, then we end up continuing further into the function and add the unexpected firstChild that Filip describes in the initial post. The child is added here:
https://github.com/eclipse/openj9/blob/9fd0c9b532f1122ce2fc3f5405134f9e950190d0/runtime/compiler/il/J9Node.cpp#L395-L444

Adding this child seems to be the default approach for all JNI calls that end up in this routine. However we choose not to do it because we know that the dcall StrictMath.sqrt node will eventually be handled by inlineMathSQRT which is aware of the fact that there is no "extra" child. Not immediately obvious, but it looks like this was done by design. We should probably remove this check in the ILGen since we now handle recognizing and transforming these calls in the optimizer.

dchopra001 on 12 Jun 2019

👍1

So you guys are the StrictMath.sqrt will return the specific value required in the Javadoc specification http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/StrictMath.html#sqrt(double)?

I suspect that on x86 the AVX sqrt instruction will generate a valid result for Math.sqrt but I'd be loathed to say it will be 100% right for StrictMath.sqrt

andrewcraik on 12 Jun 2019

I suspect that on x86 the AVX sqrt instruction will generate a valid result for Math.sqrt but I'd be loathed to say it will be 100% right for StrictMath.sqrt

java.lang.Math.sqrt maps to java.lang.StrictMath.sqrt [1]. There is no difference between them.

[1] https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/0843c3e6c295fb5df8babf32efc879d33780a097/jdk/src/share/classes/java/lang/Math.java#L338-L344

fjeremic on 12 Jun 2019

👍1

There is a difference in that the Javadoc specification for StrictMath.sqrt is more strict. You could recognize Math.sqrt and do things to it that you can't do to StrictMath.sqrt

andrewcraik on 12 Jun 2019

So you guys are the StrictMath.sqrt will return the specific value required in the Javadoc specification http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/StrictMath.html#sqrt(double)?

On Z the instruction we are using does follow the same rules.

I suspect that on x86 the AVX sqrt instruction will generate a valid result for Math.sqrt.

Can you point us to this instruction? We tried looking through the ISA but for a lot of the sqrt instructions the ISA doesn't clarify what would happen in the special cases.

dchopra001 on 12 Jun 2019

As I say AVX may be good enough for Math.sqrt on x86 but I wouldn't bet on it being 100% correct for StrictMath.

andrewcraik on 12 Jun 2019

@jdmpapin can you liase on the spec and x86 instructions please?

andrewcraik on 12 Jun 2019

There is a difference in that the Javadoc specification for StrictMath.sqrt is more strict.

Would you be able to point us to the differences? Having looked at the java doc I don't see a difference:
http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/StrictMath.html#sqrt(double)
http://w3.hursley.ibm.com/java/docs/java7/api/java/lang/Math.html#sqrt(double)

The extra line in Math.sqrt's "return" section saying :

If the argument is NaN or less than zero, the result is NaN.

looks to be redundant.

dchopra001 on 12 Jun 2019

So after very careful reading it looks like the spec on _this_ function is the same. In general it is not true - seems that because they added the nearest to the actual mathematical value clause to the return that is what ties them to be equivalent. There are many others in these classes where that is not true. @jdmpapin is trying to see if he can find the x86 specs - if we are in doubt we'll just not use the reduction to avoid imprecision which can be a nightmare bug to find.

andrewcraik on 12 Jun 2019

I'm not seeing what the problem here is. At the end of the day the inlineMathSQRT function on x86 behaves identically to the dsqrt evaluator in that we will generate the same instruction sequence for both as we can see in [1] and [2]. Furthermore we already reduce both java.lang.Math.sqrt and java.lang.StrictMath.sqrt to inlineMathSQRT as seen in [3].

The work here is to common up the logic between platforms and do the transformation at a common level to reduce these Java calls into the dsqrt IL. The functionality should be equivalent to what we do today.

[1] https://github.com/eclipse/openj9/blob/432df8c642770f89a95f2043d271434331faaafc/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L8984-L9091
[2] https://github.com/eclipse/omr/blob/b8ed7176897ffa6199bb5cbff970a15717dccc9f/compiler/x/codegen/FPTreeEvaluator.cpp#L681-L723
[3] https://github.com/eclipse/openj9/blob/e25c46b8dc2cf76db966a1052b2d0d730d748fa6/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L10044-L10048

fjeremic on 12 Jun 2019

👍1

andrewcraik on 12 Jun 2019

👍2

For the record, it does look safe to use sqrtsd for Math.sqrt(). The following isn't a guarantee, but it's pretty suggestive, especially combined with the fact that we've already been reducing Math.sqrt() to sqrtsd without any (known) trouble.

The documentation mentions no limitation on the precision of sqrtsd and sqrtss, as compared to the approximate rsqrtss, which has an explicit error bound. Additionally, IEEE requires square root to be correctly rounded (or so the internets claim - I haven't seen it). From the Intel manual volume 1 appendix E, section E.4:

SSE/SSE2/SSE3 extensions are 100% compatible with the IEEE Standard 754 for Binary Floating-Point Arithmetic, satisfying all of its mandatory requirements [...]

There are caveats about special modes, results provided to FP exception handlers, and simultaneous exceptions in the case of vector operations, none of which should affect our use of the instruction.

Further, tables E-15 and E-18 describe situations in which the operand is a denormal or the result is inexact, respectively. Both tables group sqrtss and sqrtsd with the usual arithmetic instructions addss, addsd, subss, subsd, etc., with results rounded to the destination precision. Note that sqrtss and sqrtsd can't overflow or underflow because they reduce the magnitude of the exponent.

That leaves the corner cases.

For negative arguments (not including -0), table E-13 shows that the result is a quiet NaN.
If the argument is (quiet) NaN, table E-10 shows that the result is again a quiet NaN.
I don't see where the results are specified for -0 and +inf, but on my machine I get -0 and +inf respectively, as required for Math.sqrt().

jdmpapin on 13 Jun 2019

@fjeremic do you see Power producing TR::dsqrt opcode? I don't see it, although dsqrtEvaluator does exist.

gita-omr on 18 Jun 2019

@fjeremic do you see Power producing TR::dsqrt opcode? I don't see it, although dsqrtEvaluator does exist.

@simonameng is currently testing this. I'll let her report back. However I do see inlineDoublePrecisionFP being used on Power and it reduces to TR::InstOpCode::fsqrt so I presume it will work the same as x86 and Z.

fjeremic on 19 Jun 2019

@simonameng can you detail some of the experiments you performed on Power? Do we currently reduce a java.lang.Math.sqrt call to an TR::InstOpCode::fsqrt instruction? And following the transformation in #6217 do we also generate the same instruction on Power?

fjeremic on 4 Jul 2019

@gita-omr @fjeremic Yes. I wrote a unit test and tested on Power machine. Both java.lang.Math.sqrt and java.lang.StrictMath.sqrt will be transformed to dsqrt and using dsqrtEvaluator. The dsqrtEvaluator will generate the TR::InstOpCode::fsqrt instruction.