Julia: openlibm acos() gives garbage results on i686 linux with GCC 7.1.0

Created on 8 May 2017 · 37Comments · Source: JuliaLang/julia

While testing out GCC 7.1.0 on the testing buildbots, I have discovered some weird behavior at the intersection of ccall, openlibm and GCC 7.1.0.

If I compile libopenlibm with GCC 7.1.0, then run the following code, I get unrealistic answers (and they change after every invocation, making me think that something is returning garbage):

julia> ol = Libdl.dlopen("/tmp/lib_badlibm.so.2.3")
Ptr{Void} @0x091dcb20

julia> f = Libdl.dlsym(ol, "acos")
Ptr{Void} @0xe64a2d27

julia> ccall(f, Float64, (Float64,), 0.8)
-3.586192412729469e185

julia> ccall(f, Float64, (Float64,), 0.8)
-3.3323564455272705e184

However, if I compile libopenlibm with GCC 6.3.0, I get reasonable answers:

julia> ol = Libdl.dlopen("/tmp/lib_goodlibm.so.2.3")
Ptr{Void} @0x090e2d08

julia> f = Libdl.dlsym(ol, "acos")
Ptr{Void} @0xe64d0d6a

julia> ccall(f, Float64, (Float64,), 0.8)
0.6435011087932843

Attempting to reduce this by writing a C test program fails. It always gives the right result, even when mixing and matching GCC versions between which compiled the executable and the library, or changing optimization levels. Here's the C test file:

#include <stdio.h>
#include <openlibm.h>

int main(void) {
        double x = 0.8;
        double y = acos(x);
        printf("%.16f\n", y);
        return 0;
}

So it appears that this bug is triggered only when:

Openlibm is compiled with GCC 7.1.0
The acos() function is called via ccall within Julia
The platform is i686 linux

Does anyone have any ideas on how to debug this further?

linux upstream

Source

staticfloat

Most helpful comment

Usually gcc bug reports get responses faster and llvm/clang bug reports get subscribers faster....

yuyichao on 11 May 2017

😄2

All 37 comments

The mystery deepens. I was noticing that I could call acos(0.8) directly and get reasonable answers sometimes, and discovered this:

julia> ccall(("acos", Base.libm_name), Float64, (Float64,), 0.8)
0.6324557150988988

julia> ccall(Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acos"), Float64, (Float64,), 0.8)
-3.3830265751689396e305

So perhaps there are multiple things going on here. I should note that the Float32 versions work fine:

julia> ccall(Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acosf"), Float32, (Float32,), 0.8)
0.6435011f0

julia> ccall(("acosf", Base.libm_name), Float32, (Float32,), 0.8)
0.6435011f0

staticfloat on 8 May 2017

👍1

Can you post

the @code_native of f(p, x) = ccall(p, Float64, (Float64,), x) on f(C_NULL, 1.0)?
disassemble of the GCC 7.1 acos
disassemble of a simpler similar C function like double f(double x) { return x * x; }

yuyichao on 8 May 2017

If you can reproduce the issue with f(p, x) = ccall(p, Float64, (Float64,), x) on acos or the f (x * x). You can also try f2(p, x) = (ccall(:jl_breakpoint, Void, ()); ccall(p, Float64, (Float64,), x)). Set a breakpoint on jl_breakpoint and single step until you return from the C function. (The process can be easier if you have rr working there).

yuyichao on 8 May 2017

Native code output:

julia> f(p, x) = ccall(p, Float64, (Float64,), x)
f (generic function with 1 method)

julia> @code_native f(C_NULL, 1.0)
        .text
Filename: REPL[5]
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    8(%ebp), %eax
Source line: 1
        testl   %eax, %eax
        je      L30
        movsd   12(%ebp), %xmm0         # xmm0 = mem[0],zero
        movsd   %xmm0, (%esp)
        calll   *%eax
        addl    $8, %esp
        popl    %ebp
        retl
L30:
        movl    $4052630652, (%esp)     # imm = 0xF18E3C7C
        calll   jl_throw
        subl    $4, %esp
        nopl    (%eax)

Disassembly of GCC 7.1 acos and GCC 6.3 acos are here

staticfloat on 8 May 2017

Simple C function:

// simple.c
double foo(double x) {
        return x * x;
}

Compiled into shared library with GCC 7.1.0:

$ gcc -O3 -o libsimple.so -shared simple.c

Works just fine:

julia> ccall(Libdl.dlsym(Libdl.dlopen("libsimple"), "foo"), Float64, (Float64,), 2.5)
6.25

julia> ccall(("foo", "libsimple"), Float64, (Float64,), 0.8)
0.6400000000000001

Disassembly:

$ gdb -batch -ex 'file libsimple.so' -ex 'disassemble foo'
Dump of assembler code for function foo:
   0x00000430 <+0>:     fldl   0x4(%esp)
   0x00000434 <+4>:     fmul   %st(0),%st
   0x00000436 <+6>:     ret
End of assembler dump.

staticfloat on 8 May 2017

If you can reproduce the issue with f(p, x) = ccall(p, Float64, (Float64,), x)

I can:

julia> acos_func = Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acos")
Ptr{Void} @0xe649dd27

julia> f(p, x) = ccall(p, Float64, (Float64,), x)
f (generic function with 1 method)

julia> f(acos_func, 0.8)
-9.486798276379531e184

You can also try f2(p, x) = (ccall(:jl_breakpoint, Void, ()); ccall(p, Float64, (Float64,), x))

I cannot:

julia> acos_func = Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acos")
Ptr{Void} @0xe5588d27

julia> f(p, x) = ccall(p, Float64, (Float64,), x)
f (generic function with 1 method)

julia> f(acos_func, 0.8)
-1.0837269987879214e180

julia> f(acos_func, 0.8)
0.6324557150988988

julia> f2(p, x) = (ccall(:jl_breakpoint, Void, ()); ccall(p, Float64, (Float64,), x))
f2 (generic function with 1 method)

julia> f2(acos_func, 0.8)
0.6324557150988988

julia> f2(acos_func, 0.8)
0.6324557150988988

staticfloat on 8 May 2017

Can you compile the simple examples with -march=pentium4?

vtjnash on 8 May 2017

If by "simple examples" you mean the simple.c I posted above, adding -march=pentium4 doesn't change anything. The disassembly is exactly the same, and ccall() still works fine.

staticfloat on 8 May 2017

What about compiling openlibm and/or the sample code with -mfpmath=sse.

yuyichao on 8 May 2017

If that fixes the issue, there might still be something wrong with GCC 7's x87 codegen. If it doesn't work or if you want to figure out what's wrong with gcc 7. Maybe breakpoint on acos, and run it in a way that you can reproduce the issue. Dump all registers info register and print the arguments on entering of the function and finish should show you what gdb thinks the return value is, also do a info register after the finish.

yuyichao on 8 May 2017

Also, have you tried if acos of 1.2 (i.e. > 1), or 0.4, -0.4 works?

yuyichao on 8 May 2017

Some of the asm difference seems to come from https://www.mail-archive.com/[email protected]/msg524824.html. Not sure if it's related or if we care...

yuyichao on 8 May 2017

Add to info register also info float.

yuyichao on 8 May 2017

Alright, now we're getting somewhere! Compiling libopenlibm with -mfpmath=sse has resolved this particular issue. I'm having difficulties with gdb; can't seem to get it to finish properly. I can dump the registers at the beginning of the acos function:

Breakpoint 1, 0xe6533d27 in acos () from /src/julia/usr/bin/../lib/libopenlibm.so
(gdb) info register
eax            0xe6533d27       -430752473
ecx            0xf0fc92d0       -251882800
edx            0xf0fc8010       -251887600
ebx            0xf76d4000       -143835136
esp            0xffd733ac       0xffd733ac
ebp            0xffd733e8       0xffd733e8
esi            0xf5171c88       -183034744
edi            0x5517   21783
eip            0xe6533d27       0xe6533d27 <acos>
eflags         0x296    [ PF AF SF IF ]
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x0      0
gs             0x63     99
(gdb) info float
  R7: Empty   0x4001e000000000000000
  R6: Empty   0x4000c000000000000000
  R5: Empty   0x4000c000000000000000
  R4: Empty   0xc26be1e687feb9a1c000
  R3: Empty   0x3ffbccccc507b4800000
  R2: Empty   0x3fff8000000000000000
  R1: Empty   0x00000000000000000000
=>R0: Empty   0x00000000000000000000

Status Word:         0x0020                  PE
                       TOP: 0
Control Word:        0x037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0xffff
Instruction Pointer: 0x00:0xf526eb12
Operand Pointer:     0x00:0x00000000
Opcode:              0x0000

But I can't run finish:

(gdb) finish
Run till exit from #0  0xe6533d27 in acos () from /src/julia/usr/bin/../lib/libopenlibm.so
Warning:
Cannot insert breakpoint 0.
Error accessing memory address 0xe786f89f: Input/output error.

0xe6533d28 in acos () from /src/julia/usr/bin/../lib/libopenlibm.so

staticfloat on 8 May 2017

Also, have you tried if acos of 1.2 (i.e. > 1), or 0.4, -0.4 works?

Greater than 1 gives NaN, as we'd expect. 0.4 and -0.4 give the proper values of 1.1592794807274085 and 1.9823131728623846, respectively. One of the things that makes this difficult to debug is that if I compile with debug symbols enabled (-g) the issue seems to disappear. :(

staticfloat on 8 May 2017

Alright, I si'ed my way through the function. Here's the before and after of info register and info float:

Before:

(gdb) info register
eax            0xe6533d27       -430752473
ecx            0xf0fc92d0       -251882800
edx            0xf0fc8010       -251887600
ebx            0xf76d4000       -143835136
esp            0xffd733ac       0xffd733ac
ebp            0xffd733e8       0xffd733e8
esi            0xf5171c88       -183034744
edi            0x5517   21783
eip            0xe6533d27       0xe6533d27 <acos>
eflags         0x296    [ PF AF SF IF ]
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x0      0
gs             0x63     99
(gdb) info float
  R7: Empty   0x4001e000000000000000
  R6: Empty   0x4000c000000000000000
  R5: Empty   0x4000c000000000000000
  R4: Empty   0xc2669d9a87feb9a1c000
  R3: Empty   0x3ffbccccc507b4800000
  R2: Empty   0x3fff8000000000000000
  R1: Empty   0x00000000000000000000
=>R0: Empty   0x00000000000000000000

Status Word:         0x0021   IE             PE
                       TOP: 0
Control Word:        0x037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0xffff
Instruction Pointer: 0x00:0xf526eb12
Operand Pointer:     0x00:0x00000000
Opcode:              0x0000

After:

(gdb) info register
eax            0x3fe99999       1072273817
ecx            0xf0fc92d0       -251882800
edx            0x3fe99999       1072273817
ebx            0xf76d4000       -143835136
esp            0xffd733b0       0xffd733b0
ebp            0xffd733e8       0xffd733e8
esi            0xf5171c88       -183034744
edi            0x5517   21783
eip            0xe787296f       0xe787296f <japi1_anonymous_63425+111>
eflags         0x282    [ SF IF ]
cs             0x23     35
ss             0x2b     43
ds             0x2b     43
es             0x2b     43
fs             0x0      0
gs             0x63     99
(gdb) info float
=>R7: Valid   0xc261b03cccd1e0629000 -5.850402010609902041e+183
  R6: Empty   0xc260b03cccd1e0629000
  R5: Empty   0xbd7fc9d807381343f800
  R4: Empty   0xc2669dac87feb9a1c000
  R3: Empty   0x3ffbccccc507b4800000
  R2: Empty   0x3fff8000000000000000
  R1: Empty   0x00000000000000000000
  R0: Empty   0x00000000000000000000

Status Word:         0x3821   IE             PE
                       TOP: 7
Control Word:        0x037f   IM DM ZM OM UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0x3fff
Instruction Pointer: 0x00:0xe6533e7b
Operand Pointer:     0x00:0x00000000
Opcode:              0x0000

For those following along in the disassembly, we returned from offset +348, which is this line

staticfloat on 8 May 2017

Well this was a learning experience.

Our troubles begin on this line. We fld in something at offset 0x20 on the stack, which, it turns out, is uninitialized memory, then don't do anything with it until we reach here at which point the "random" memory blows up. Now, the reason this blows up when using ccall() but doesn't blow up when invoking acos() directly from my C example, is that the uninitialized memory from the C version is really small (~1e-200), which causes the impact of the C uninitialized memory to be very small, whereas the uninitialized memory from the Julia code tends to be pretty large (~1e+180), which of course is pretty noticeable. This is because when we call acos() for the first time in C, it does its little dl_runtime_resolve() dance, which sets up the stack just right such that when acos() uses uninitialized memory, it has very little effect.

To test this, I wrote a new C program with the intent to setup the stack such that it will give very large numbers if interpreted as a double:

// acos_test.c
#include <stdio.h>
#include <openlibm.h>

// Fill the stack with 0x55, 0x55, 0x55...., which is a big double no matter where you start
void prepare_stack() {
        volatile unsigned char stack[256];
        for( unsigned int i=0; i<256; ++i )
                stack[i] = 0x55;
}

int main(void) {
        // Do this once so that we don't have to bother with the dl loading stuff
        volatile double x = 0.8;
        acos(x);

        prepare_stack();
        double y = acos(x);
        printf("%.16f\n", y);
        return 0;
}

And this works, when linked against a bad copy of libopenlibm, we get a huge number out thanks to our careful stack preparation. We can go further, and compile directly against src/e_acos.c and src/e_sqrt.c, being careful to maintain the compiler flags we're interested in (especially the optimization level). Doing this gets us down to the following minimal compiler invocation:

[root@47a48e9c363b openlibm-1581174c85f7b645b15ba1ac1c3a98fb601f0fe7]# gcc -O1 -o acos_test -march=pentium4 -m32 -fno-gnu89-inline -fno-builtin -std=c99 -fPIC -I src -I include ./acos_test.c ./src/e_acos.c ./src/e_sqrt.c && ./acos_test
0.6435011087932843
[root@47a48e9c363b openlibm-1581174c85f7b645b15ba1ac1c3a98fb601f0fe7]# gcc -O1 -fpeephole2 -o acos_test -march=pentium4 -m32 -fno-gnu89-inline -fno-builtin -std=c99 -fPIC -I src -I include ./acos_test.c ./src/e_acos.c ./src/e_sqrt.c && ./acos_test
6066969536937485543883473977334376971721498622912628554206287716489800464666960761304196338155520.0000000000000000

So something is blowing up within the peephole2 optimization pass on linux32. Next up, time to push this through creduce to give us a true minimal working example that we can hopefully give to the GCC guys.

staticfloat on 9 May 2017

Alright, I think I've managed to get a MWE here: https://gist.github.com/staticfloat/d357b985eab757f393fa7e5ff1ee4101. I used creduce to cut libopenlibm down to where it can still calculate acos(0.8) properly without -fpeephole2, but with that optimization flag turned on, you get a crazy answer.

staticfloat on 9 May 2017

@tkelman do you have ideas on how to meaningfully reduce this further, or should we file a GCC bug as-is? I'm hesitant to allow creduce to continue reducing without constraining it to continue calculating a proper acos().

staticfloat on 9 May 2017

that seems reportable to me

tkelman on 9 May 2017

Reported here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80706

To workaround, I am setting -fno-peephole2 on all dependencies on i686 on the testing buildbots.

staticfloat on 10 May 2017

Didn't -mfpmath=sse also fix it?

yuyichao on 10 May 2017

Oh, I totally forgot about that. That's a better fix, thanks.

staticfloat on 10 May 2017

How did you find that it was peephole2 that caused the problem here?

Looks like gcc devs are looking into it, and have a potential patch already.

tkelman on 11 May 2017

How did you find that it was peephole2 that caused the problem here?

I noticed that -O1 didn't exhibit problems, and -O2 did, so I do something similar to the following:

for flag in $list_of_o2_optimization_flags; do
    gcc -o test -O1 $flag test.c && ./test
done

And I found that -fpeephole2 caused problems. :P

Looks like gcc devs are looking into it, and have a potential patch already.

I am absolutely astounded by the speed with which they have responded. That is really impressive.

staticfloat on 11 May 2017

makes sense. where's the list of flags from?

sometimes they're fast, especially when given a good test case - the suitesparse -O3 internal compiler error on alpine linux only took about a day to fix IIRC

tkelman on 11 May 2017

Usually gcc bug reports get responses faster and llvm/clang bug reports get subscribers faster....

yuyichao on 11 May 2017

😄2

This was fixed upstream, but hasn't made it into a gcc release (7.2?) yet. Should we be warning distros not to use gcc 7.1 without patching this?

tkelman on 2 Jun 2017

We could also probably move to applying the end-result patch to the test buildbots' gcc build, as long as they're using 7.1, instead of turning off an optimization

tkelman on 2 Jun 2017

Should we be warning distros not to use gcc 7.1 without patching this?

Yeah, probably. I'm not sure where we'd make this warning though.

We could also probably move to applying the end-result patch to the test buildbots' gcc build, as long as they're using 7.1, instead of turning off an optimization

@yuyichao had a better idea, which was tell GCC -mfpmath=sse, which shifts floating point math away from using the 387 coprocessor opcodes to SSE opcodes instead. The GCC docs say:

The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits. This is the default choice for the x86-64 compiler, Darwin x86-32 targets, and the default choice for x86-32 targets with the SSE2 instruction set when -ffast-math is enabled.

This still passes our test suite of course, and I remember us going through some troubles a while back to try and ensure that 80-bit temporaries didn't screw up our floating-point tests. Despite not having done any performance tests to see exactly how much this changes things, I think this is a change we can live with.

staticfloat on 4 Jun 2017

I thought that we solved that one with -march=pentium4 ?

tkelman on 4 Jun 2017

I think that did indeed solve our 80-bit precision problems by enabling sse2 instructions, but there is apparently still some internal difference in some FP instructions because this bug can still be triggered even with -march=pentium4. I don't think we set -ffast-math, so I'm not surprised there's still some difference. What I was getting at in my previous message is that setting -mfpmath=sse may have no downsides for us, since we are already trying to use sse-style floating-point math anyway, rather than the 80-bit 387 math. If there are no downsides, then we may as well just keep that flag enabled (or even add it into Make.inc on i686 builds)

staticfloat on 4 Jun 2017

I think we solves the issue in jit code by requiring pentium4. If I understand the GCC doc correctly, since -mfpmath=sse and -mfpmath=x87 produces different result, it needs to be set manually even though the sse result is probably what most people want these days.

Unless we expect external users of openlibm on sse-less x86 hardware, I think we should just set -march=pentium4 and -mfpmath=sse in openlibm directly. FWIW, it seems that it is currently explicitly setting -march=i387....

yuyichao on 4 Jun 2017

I think we solves the issue in jit code by requiring pentium4.

I believe JIT code is influenced by us setting JULIA_CPU_TARGET, and the C code (such as openlibm) is influenced by us setting -march when compiling through clang or gcc.

Unless we expect external users of openlibm on sse-less x86 hardware

Nope, we explicitly decided to not support anything older than pentium4. 😇

I think we should just set -march=pentium4 and -mfpmath=sse in openlibm directly

👍

FWIW, it seems that it is currently explicitly setting -march=i387....

I don't think so, I think it sets -march=i686 by default, which while better, is still not what we do in Julia. I guess this is because the openlibm test suite is designed to support 80-bit fp registers, whereas we don't tolerate that nonsense in the Julia test suite.

staticfloat on 4 Jun 2017

Is this still relevant to Julia given that https://github.com/JuliaLang/julia/pull/23283 has been merged?