While testing out GCC 7.1.0 on the testing buildbots, I have discovered some weird behavior at the intersection of ccall, openlibm and GCC 7.1.0.
If I compile libopenlibm with GCC 7.1.0, then run the following code, I get unrealistic answers (and they change after every invocation, making me think that something is returning garbage):
julia> ol = Libdl.dlopen("/tmp/lib_badlibm.so.2.3")
Ptr{Void} @0x091dcb20
julia> f = Libdl.dlsym(ol, "acos")
Ptr{Void} @0xe64a2d27
julia> ccall(f, Float64, (Float64,), 0.8)
-3.586192412729469e185
julia> ccall(f, Float64, (Float64,), 0.8)
-3.3323564455272705e184
However, if I compile libopenlibm with GCC 6.3.0, I get reasonable answers:
julia> ol = Libdl.dlopen("/tmp/lib_goodlibm.so.2.3")
Ptr{Void} @0x090e2d08
julia> f = Libdl.dlsym(ol, "acos")
Ptr{Void} @0xe64d0d6a
julia> ccall(f, Float64, (Float64,), 0.8)
0.6435011087932843
Attempting to reduce this by writing a C test program fails. It always gives the right result, even when mixing and matching GCC versions between which compiled the executable and the library, or changing optimization levels. Here's the C test file:
#include <stdio.h>
#include <openlibm.h>
int main(void) {
double x = 0.8;
double y = acos(x);
printf("%.16f\n", y);
return 0;
}
So it appears that this bug is triggered only when:
Openlibm is compiled with GCC 7.1.0
The acos() function is called via ccall within Julia
The platform is i686 linux
Does anyone have any ideas on how to debug this further?
The mystery deepens. I was noticing that I could call acos(0.8) directly and get reasonable answers sometimes, and discovered this:
julia> ccall(("acos", Base.libm_name), Float64, (Float64,), 0.8)
0.6324557150988988
julia> ccall(Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acos"), Float64, (Float64,), 0.8)
-3.3830265751689396e305
So perhaps there are multiple things going on here. I should note that the Float32 versions work fine:
julia> ccall(Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acosf"), Float32, (Float32,), 0.8)
0.6435011f0
julia> ccall(("acosf", Base.libm_name), Float32, (Float32,), 0.8)
0.6435011f0
Can you post
@code_native of f(p, x) = ccall(p, Float64, (Float64,), x) on f(C_NULL, 1.0)?acosdouble f(double x) { return x * x; }If you can reproduce the issue with f(p, x) = ccall(p, Float64, (Float64,), x) on acos or the f (x * x). You can also try f2(p, x) = (ccall(:jl_breakpoint, Void, ()); ccall(p, Float64, (Float64,), x)). Set a breakpoint on jl_breakpoint and single step until you return from the C function. (The process can be easier if you have rr working there).
Native code output:
julia> f(p, x) = ccall(p, Float64, (Float64,), x)
f (generic function with 1 method)
julia> @code_native f(C_NULL, 1.0)
.text
Filename: REPL[5]
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
Source line: 1
testl %eax, %eax
je L30
movsd 12(%ebp), %xmm0 # xmm0 = mem[0],zero
movsd %xmm0, (%esp)
calll *%eax
addl $8, %esp
popl %ebp
retl
L30:
movl $4052630652, (%esp) # imm = 0xF18E3C7C
calll jl_throw
subl $4, %esp
nopl (%eax)
Disassembly of GCC 7.1 acos and GCC 6.3 acos are here
Simple C function:
// simple.c
double foo(double x) {
return x * x;
}
Compiled into shared library with GCC 7.1.0:
$ gcc -O3 -o libsimple.so -shared simple.c
Works just fine:
julia> ccall(Libdl.dlsym(Libdl.dlopen("libsimple"), "foo"), Float64, (Float64,), 2.5)
6.25
julia> ccall(("foo", "libsimple"), Float64, (Float64,), 0.8)
0.6400000000000001
Disassembly:
$ gdb -batch -ex 'file libsimple.so' -ex 'disassemble foo'
Dump of assembler code for function foo:
0x00000430 <+0>: fldl 0x4(%esp)
0x00000434 <+4>: fmul %st(0),%st
0x00000436 <+6>: ret
End of assembler dump.
If you can reproduce the issue with f(p, x) = ccall(p, Float64, (Float64,), x)
I can:
julia> acos_func = Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acos")
Ptr{Void} @0xe649dd27
julia> f(p, x) = ccall(p, Float64, (Float64,), x)
f (generic function with 1 method)
julia> f(acos_func, 0.8)
-9.486798276379531e184
You can also try f2(p, x) = (ccall(:jl_breakpoint, Void, ()); ccall(p, Float64, (Float64,), x))
I cannot:
julia> acos_func = Libdl.dlsym(Libdl.dlopen(Base.libm_name), "acos")
Ptr{Void} @0xe5588d27
julia> f(p, x) = ccall(p, Float64, (Float64,), x)
f (generic function with 1 method)
julia> f(acos_func, 0.8)
-1.0837269987879214e180
julia> f(acos_func, 0.8)
0.6324557150988988
julia> f2(p, x) = (ccall(:jl_breakpoint, Void, ()); ccall(p, Float64, (Float64,), x))
f2 (generic function with 1 method)
julia> f2(acos_func, 0.8)
0.6324557150988988
julia> f2(acos_func, 0.8)
0.6324557150988988
Can you compile the simple examples with -march=pentium4?
If by "simple examples" you mean the simple.c I posted above, adding -march=pentium4 doesn't change anything. The disassembly is exactly the same, and ccall() still works fine.
What about compiling openlibm and/or the sample code with -mfpmath=sse.
If that fixes the issue, there might still be something wrong with GCC 7's x87 codegen. If it doesn't work or if you want to figure out what's wrong with gcc 7. Maybe breakpoint on acos, and run it in a way that you can reproduce the issue. Dump all registers info register and print the arguments on entering of the function and finish should show you what gdb thinks the return value is, also do a info register after the finish.
Also, have you tried if acos of 1.2 (i.e. > 1), or 0.4, -0.4 works?
Some of the asm difference seems to come from https://www.mail-archive.com/[email protected]/msg524824.html. Not sure if it's related or if we care...
Add to info register also info float.
Alright, now we're getting somewhere! Compiling libopenlibm with -mfpmath=sse has resolved this particular issue. I'm having difficulties with gdb; can't seem to get it to finish properly. I can dump the registers at the beginning of the acos function:
Breakpoint 1, 0xe6533d27 in acos () from /src/julia/usr/bin/../lib/libopenlibm.so
(gdb) info register
eax 0xe6533d27 -430752473
ecx 0xf0fc92d0 -251882800
edx 0xf0fc8010 -251887600
ebx 0xf76d4000 -143835136
esp 0xffd733ac 0xffd733ac
ebp 0xffd733e8 0xffd733e8
esi 0xf5171c88 -183034744
edi 0x5517 21783
eip 0xe6533d27 0xe6533d27 <acos>
eflags 0x296 [ PF AF SF IF ]
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
fs 0x0 0
gs 0x63 99
(gdb) info float
R7: Empty 0x4001e000000000000000
R6: Empty 0x4000c000000000000000
R5: Empty 0x4000c000000000000000
R4: Empty 0xc26be1e687feb9a1c000
R3: Empty 0x3ffbccccc507b4800000
R2: Empty 0x3fff8000000000000000
R1: Empty 0x00000000000000000000
=>R0: Empty 0x00000000000000000000
Status Word: 0x0020 PE
TOP: 0
Control Word: 0x037f IM DM ZM OM UM PM
PC: Extended Precision (64-bits)
RC: Round to nearest
Tag Word: 0xffff
Instruction Pointer: 0x00:0xf526eb12
Operand Pointer: 0x00:0x00000000
Opcode: 0x0000
But I can't run finish:
(gdb) finish
Run till exit from #0 0xe6533d27 in acos () from /src/julia/usr/bin/../lib/libopenlibm.so
Warning:
Cannot insert breakpoint 0.
Error accessing memory address 0xe786f89f: Input/output error.
0xe6533d28 in acos () from /src/julia/usr/bin/../lib/libopenlibm.so
Also, have you tried if acos of 1.2 (i.e. > 1), or 0.4, -0.4 works?
Greater than 1 gives NaN, as we'd expect. 0.4 and -0.4 give the proper values of 1.1592794807274085 and 1.9823131728623846, respectively. One of the things that makes this difficult to debug is that if I compile with debug symbols enabled (-g) the issue seems to disappear. :(
Alright, I si'ed my way through the function. Here's the before and after of info register and info float:
Before:
(gdb) info register
eax 0xe6533d27 -430752473
ecx 0xf0fc92d0 -251882800
edx 0xf0fc8010 -251887600
ebx 0xf76d4000 -143835136
esp 0xffd733ac 0xffd733ac
ebp 0xffd733e8 0xffd733e8
esi 0xf5171c88 -183034744
edi 0x5517 21783
eip 0xe6533d27 0xe6533d27 <acos>
eflags 0x296 [ PF AF SF IF ]
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
fs 0x0 0
gs 0x63 99
(gdb) info float
R7: Empty 0x4001e000000000000000
R6: Empty 0x4000c000000000000000
R5: Empty 0x4000c000000000000000
R4: Empty 0xc2669d9a87feb9a1c000
R3: Empty 0x3ffbccccc507b4800000
R2: Empty 0x3fff8000000000000000
R1: Empty 0x00000000000000000000
=>R0: Empty 0x00000000000000000000
Status Word: 0x0021 IE PE
TOP: 0
Control Word: 0x037f IM DM ZM OM UM PM
PC: Extended Precision (64-bits)
RC: Round to nearest
Tag Word: 0xffff
Instruction Pointer: 0x00:0xf526eb12
Operand Pointer: 0x00:0x00000000
Opcode: 0x0000
After:
(gdb) info register
eax 0x3fe99999 1072273817
ecx 0xf0fc92d0 -251882800
edx 0x3fe99999 1072273817
ebx 0xf76d4000 -143835136
esp 0xffd733b0 0xffd733b0
ebp 0xffd733e8 0xffd733e8
esi 0xf5171c88 -183034744
edi 0x5517 21783
eip 0xe787296f 0xe787296f <japi1_anonymous_63425+111>
eflags 0x282 [ SF IF ]
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
fs 0x0 0
gs 0x63 99
(gdb) info float
=>R7: Valid 0xc261b03cccd1e0629000 -5.850402010609902041e+183
R6: Empty 0xc260b03cccd1e0629000
R5: Empty 0xbd7fc9d807381343f800
R4: Empty 0xc2669dac87feb9a1c000
R3: Empty 0x3ffbccccc507b4800000
R2: Empty 0x3fff8000000000000000
R1: Empty 0x00000000000000000000
R0: Empty 0x00000000000000000000
Status Word: 0x3821 IE PE
TOP: 7
Control Word: 0x037f IM DM ZM OM UM PM
PC: Extended Precision (64-bits)
RC: Round to nearest
Tag Word: 0x3fff
Instruction Pointer: 0x00:0xe6533e7b
Operand Pointer: 0x00:0x00000000
Opcode: 0x0000
For those following along in the disassembly, we returned from offset +348, which is this line
Well this was a learning experience.
Our troubles begin on this line. We fld in something at offset 0x20 on the stack, which, it turns out, is uninitialized memory, then don't do anything with it until we reach here at which point the "random" memory blows up. Now, the reason this blows up when using ccall() but doesn't blow up when invoking acos() directly from my C example, is that the uninitialized memory from the C version is really small (~1e-200), which causes the impact of the C uninitialized memory to be very small, whereas the uninitialized memory from the Julia code tends to be pretty large (~1e+180), which of course is pretty noticeable. This is because when we call acos() for the first time in C, it does its little dl_runtime_resolve() dance, which sets up the stack just right such that when acos() uses uninitialized memory, it has very little effect.
To test this, I wrote a new C program with the intent to setup the stack such that it will give very large numbers if interpreted as a double:
// acos_test.c
#include <stdio.h>
#include <openlibm.h>
// Fill the stack with 0x55, 0x55, 0x55...., which is a big double no matter where you start
void prepare_stack() {
volatile unsigned char stack[256];
for( unsigned int i=0; i<256; ++i )
stack[i] = 0x55;
}
int main(void) {
// Do this once so that we don't have to bother with the dl loading stuff
volatile double x = 0.8;
acos(x);
prepare_stack();
double y = acos(x);
printf("%.16f\n", y);
return 0;
}
And this works, when linked against a bad copy of libopenlibm, we get a huge number out thanks to our careful stack preparation. We can go further, and compile directly against src/e_acos.c and src/e_sqrt.c, being careful to maintain the compiler flags we're interested in (especially the optimization level). Doing this gets us down to the following minimal compiler invocation:
[root@47a48e9c363b openlibm-1581174c85f7b645b15ba1ac1c3a98fb601f0fe7]# gcc -O1 -o acos_test -march=pentium4 -m32 -fno-gnu89-inline -fno-builtin -std=c99 -fPIC -I src -I include ./acos_test.c ./src/e_acos.c ./src/e_sqrt.c && ./acos_test
0.6435011087932843
[root@47a48e9c363b openlibm-1581174c85f7b645b15ba1ac1c3a98fb601f0fe7]# gcc -O1 -fpeephole2 -o acos_test -march=pentium4 -m32 -fno-gnu89-inline -fno-builtin -std=c99 -fPIC -I src -I include ./acos_test.c ./src/e_acos.c ./src/e_sqrt.c && ./acos_test
6066969536937485543883473977334376971721498622912628554206287716489800464666960761304196338155520.0000000000000000
So something is blowing up within the peephole2 optimization pass on linux32. Next up, time to push this through creduce to give us a true minimal working example that we can hopefully give to the GCC guys.
Alright, I think I've managed to get a MWE here: https://gist.github.com/staticfloat/d357b985eab757f393fa7e5ff1ee4101. I used creduce to cut libopenlibm down to where it can still calculate acos(0.8) properly without -fpeephole2, but with that optimization flag turned on, you get a crazy answer.
@tkelman do you have ideas on how to meaningfully reduce this further, or should we file a GCC bug as-is? I'm hesitant to allow creduce to continue reducing without constraining it to continue calculating a proper acos().
that seems reportable to me
Reported here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80706
To workaround, I am setting -fno-peephole2 on all dependencies on i686 on the testing buildbots.
Didn't -mfpmath=sse also fix it?
Oh, I totally forgot about that. That's a better fix, thanks.
How did you find that it was peephole2 that caused the problem here?
Looks like gcc devs are looking into it, and have a potential patch already.
How did you find that it was peephole2 that caused the problem here?
I noticed that -O1 didn't exhibit problems, and -O2 did, so I do something similar to the following:
for flag in $list_of_o2_optimization_flags; do
gcc -o test -O1 $flag test.c && ./test
done
And I found that -fpeephole2 caused problems. :P
Looks like gcc devs are looking into it, and have a potential patch already.
I am absolutely astounded by the speed with which they have responded. That is really impressive.
makes sense. where's the list of flags from?
sometimes they're fast, especially when given a good test case - the suitesparse -O3 internal compiler error on alpine linux only took about a day to fix IIRC
Usually gcc bug reports get responses faster and llvm/clang bug reports get subscribers faster....
This was fixed upstream, but hasn't made it into a gcc release (7.2?) yet. Should we be warning distros not to use gcc 7.1 without patching this?
We could also probably move to applying the end-result patch to the test buildbots' gcc build, as long as they're using 7.1, instead of turning off an optimization
Should we be warning distros not to use gcc 7.1 without patching this?
Yeah, probably. I'm not sure where we'd make this warning though.
We could also probably move to applying the end-result patch to the test buildbots' gcc build, as long as they're using 7.1, instead of turning off an optimization
@yuyichao had a better idea, which was tell GCC -mfpmath=sse, which shifts floating point math away from using the 387 coprocessor opcodes to SSE opcodes instead. The GCC docs say:
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits. This is the default choice for the x86-64 compiler, Darwin x86-32 targets, and the default choice for x86-32 targets with the SSE2 instruction set when -ffast-math is enabled.
This still passes our test suite of course, and I remember us going through some troubles a while back to try and ensure that 80-bit temporaries didn't screw up our floating-point tests. Despite not having done any performance tests to see exactly how much this changes things, I think this is a change we can live with.
I thought that we solved that one with -march=pentium4 ?
I think that did indeed solve our 80-bit precision problems by enabling sse2 instructions, but there is apparently still some internal difference in some FP instructions because this bug can still be triggered even with -march=pentium4. I don't think we set -ffast-math, so I'm not surprised there's still some difference. What I was getting at in my previous message is that setting -mfpmath=sse may have no downsides for us, since we are already trying to use sse-style floating-point math anyway, rather than the 80-bit 387 math. If there are no downsides, then we may as well just keep that flag enabled (or even add it into Make.inc on i686 builds)
I think we solves the issue in jit code by requiring pentium4. If I understand the GCC doc correctly, since -mfpmath=sse and -mfpmath=x87 produces different result, it needs to be set manually even though the sse result is probably what most people want these days.
Unless we expect external users of openlibm on sse-less x86 hardware, I think we should just set -march=pentium4 and -mfpmath=sse in openlibm directly. FWIW, it seems that it is currently explicitly setting -march=i387....
I think we solves the issue in jit code by requiring pentium4.
I believe JIT code is influenced by us setting JULIA_CPU_TARGET, and the C code (such as openlibm) is influenced by us setting -march when compiling through clang or gcc.
Unless we expect external users of openlibm on sse-less x86 hardware
Nope, we explicitly decided to not support anything older than pentium4. 馃槆
I think we should just set -march=pentium4 and -mfpmath=sse in openlibm directly
馃憤
FWIW, it seems that it is currently explicitly setting
-march=i387....
I don't think so, I think it sets -march=i686 by default, which while better, is still not what we do in Julia. I guess this is because the openlibm test suite is designed to support 80-bit fp registers, whereas we don't tolerate that nonsense in the Julia test suite.
Is this still relevant to Julia given that https://github.com/JuliaLang/julia/pull/23283 has been merged?
Would still be worth reporting upstream to openlibm if you haven't already.
This was a compiler bug; it's been fixed in GCC 7.2.0 (I just verified that is indeed the case), so I think we can reasonably lay this to rest.
Most helpful comment
Usually gcc bug reports get responses faster and llvm/clang bug reports get subscribers faster....