Running the official Julia Docker stretch images fails for Julia versions 1.0.1, 1.0.0 and 0.7 on my iMac Pro with a 3 GHz Intel Xeon W CPU, with the following message:
$ docker run -ti julia:1.0.1-stretch julia
Invalid instruction at 0x7ff98889e8c2: 0x62, 0xf1, 0x7d, 0x48, 0xef, 0xc0, 0x41, 0x56, 0x41, 0x55, 0x41, 0x54, 0x55, 0x53, 0x48
signal (4): Illegal instruction
in expression starting at no file:0
dot_compute at /usr/local/julia/bin/../lib/julia/libopenblas64_.so (unknown line)
unknown function (ip: 0xf)
Allocations: 2956 (Pool: 2948; Big: 8); GC: 0
$ docker run -ti julia:1.0.0-stretch julia
Invalid instruction at 0x7f320623a8c2: 0x62, 0xf1, 0x7d, 0x48, 0xef, 0xc0, 0x41, 0x56, 0x41, 0x55, 0x41, 0x54, 0x55, 0x53, 0x48
signal (4): Illegal instruction
in expression starting at no file:0
dot_compute at /usr/local/julia/bin/../lib/julia/libopenblas64_.so (unknown line)
unknown function (ip: 0xf)
Allocations: 2949 (Pool: 2940; Big: 9); GC: 0
$ docker run -ti julia:0.7-stretch julia
Invalid instruction at 0x7f251587e8c2: 0x62, 0xf1, 0x7d, 0x48, 0xef, 0xc0, 0x41, 0x56, 0x41, 0x55, 0x41, 0x54, 0x55, 0x53, 0x48
signal (4): Illegal instruction
in expression starting at no file:0
dot_compute at /usr/local/julia/bin/../lib/julia/libopenblas64_.so (unknown line)
unknown function (ip: 0xf)
Allocations: 2983 (Pool: 2973; Big: 10); GC: 0
The julia:0.6.2-stretch tag seems to be the latest working version.
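For comparison, the same invocation with that tag (listed here just for reference) starts the REPL without crashing:
$ docker run -ti julia:0.6.2-stretch julia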
I'm a bit lost as to how to debug this further. Happy to run things on my machine if it helps. Or is this an issue that should be filed upstream with OpenBLAS?
The invalid instruction is vpxord %zmm0,%zmm0,%zmm0, an AVX-512 instruction. If you have a debugger, what would help is to run p jl_dump_host_cpu() and also to double-check whether the SIGILL is raised in the OpenBLAS library (bt when you get the SIGILL in the debugger).
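Roughly something like this inside the container should do it (an untested sketch; gdb will most likely need to be installed in the image first):
$ docker run -it julia:1.0.1-stretch bash
# apt-get update && apt-get install -y gdb    (gdb is probably not preinstalled in the image)
# gdb /usr/local/julia/bin/julia
(gdb) run                     # run until the SIGILL is raised
(gdb) bt                      # check whether the faulting frame is inside libopenblas64_
(gdb) p jl_dump_host_cpu()    # print the CPU name and features Julia detected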
Without Docker (where Julia 1.0.1 works), p jl_dump_host_cpu() gives me:
CPU: skylake
Features: sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, rdrnd, fsgsbase, bmi, avx2, bmi2, rtm, mpx, avx512f, avx512dq, rdseed, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, sahf, lzcnt, prfchw, xsaveopt, xsavec, xsaves
This is when I launch Julia via docker using lldb:
$ lldb -- docker run -ti julia:1.0.1-stretch julia
(lldb) target create "docker"
Current executable set to 'docker' (x86_64).
(lldb) settings set -- target.run-args "run" "-ti" "julia:1.0.1-stretch" "julia"
(lldb) target stop-hook add
Enter your stop hook command(s). Type 'DONE' to end.
> bt
> disassemble --pc
Stop hook #1 added.
(lldb) r
* thread #1, stop reason = signal SIGSTOP
* frame #0: 0x0000000006486000 dyld`_dyld_start
dyld`_dyld_start:
-> 0x6486000 <+0>: popq %rdi
0x6486001 <+1>: pushq $0x0
0x6486003 <+3>: movq %rsp, %rbp
0x6486006 <+6>: andq $-0x10, %rsp
Process 1154 launched: '/usr/local/bin/docker' (x86_64)
Invalid instruction at 0x7f2f6cdfa8c2: 0x62, 0xf1, 0x7d, 0x48, 0xef, 0xc0, 0x41, 0x56, 0x41, 0x55, 0x41, 0x54, 0x55, 0x53, 0x48
signal (4): Illegal instruction
in expression starting at no file:0
dot_compute at /usr/local/julia/bin/../lib/julia/libopenblas64_.so (unknown line)
unknown function (ip: 0xf)
Allocations: 2957 (Pool: 2948; Big: 9); GC: 0
Process 1154 exited with status = 132 (0x00000084)
Please let me know if this helps; I have no idea what I'm doing here. I'm using lldb because I couldn't get gdb to work due to code-signing issues.
You should run the function on the julia that is not working rather than the working one. Also, you should run the debugger on julia rather than docker.....
Understood, here's the output from running gdb inside Docker:
root@d63c2d5414dc:/# gdb julia
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from julia...done.
(gdb) r
Starting program: /usr/local/julia/bin/julia
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3768700 (LWP 557)]
[New Thread 0x7fffe2182700 (LWP 558)]
[New Thread 0x7fffe1981700 (LWP 559)]
[New Thread 0x7fffe1180700 (LWP 560)]
[New Thread 0x7fffe097f700 (LWP 561)]
[New Thread 0x7fffe017e700 (LWP 562)]
[New Thread 0x7fffdf97d700 (LWP 563)]
[New Thread 0x7fffdf17c700 (LWP 564)]
Thread 1 "julia" received signal SIGILL, Illegal instruction.
0x00007fffe39008c2 in dot_compute () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
(gdb) bt
#0 0x00007fffe39008c2 in dot_compute () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
#1 0x00007fffe3900b65 in ddot_k_SKYLAKEX () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
#2 0x00007fffe2abd7fb in dpotf2_U () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
#3 0x00007fffe2aba68a in dpotrf_U_single () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
#4 0x00007fffe2abaeb9 in dpotrf_U_parallel () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
#5 0x00007fffe289ac3d in dpotrf_64_ () from /usr/local/julia/bin/../lib/julia/libopenblas64_.so
#6 0x00007fffebc4fafe in potrf! () at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/lapack.jl:3012
#7 macro expansion () at logging.jl:313
#8 japi1_check_22172 () at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/blas.jl:137
#9 0x00007fffeb8bd892 in julia___init___22217 () at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/LinearAlgebra/src/LinearAlgebra.jl:381
#10 0x00007fffeb8bda2b in jfptr___init___22218.clone_1 () at array.jl:769
#11 0x00007ffff76af176 in jl_apply_generic (args=args@entry=0x7fffffffe828, nargs=nargs@entry=1) at /buildworker/worker/package_linux64/build/src/gf.c:2184
#12 0x00007ffff76e41f5 in jl_apply (nargs=1, args=0x7fffffffe828) at /buildworker/worker/package_linux64/build/src/julia.h:1537
#13 jl_module_run_initializer (m=0x7fffec2e6de0 <jl_system_image_data+3246880>) at /buildworker/worker/package_linux64/build/src/toplevel.c:90
#14 0x00007ffff76c9cc7 in _julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME) at /buildworker/worker/package_linux64/build/src/init.c:813
#15 0x00007ffff76ca45b in julia_init__threading (rel=JL_IMAGE_JULIA_HOME) at /buildworker/worker/package_linux64/build/src/task.c:302
#16 0x0000000000401508 in main (argc=<optimized out>, argv=<optimized out>) at /buildworker/worker/package_linux64/build/ui/repl.c:227
And running p jl_dump_host_cpu()
(gdb) p jl_dump_host_cpu()
CPU: skylake
Features: sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, rdrnd, fsgsbase, bmi, avx2, bmi2, rtm, sahf, lzcnt, prfchw, xsaveopt
$1 = void
OK, good. I believe this is an OpenBLAS bug then: the crash is indeed inside OpenBLAS (i.e. the backtrace is correct), and the Julia CPU dispatch correctly detected the available features.
Taking a guess here: it seems that your Docker setup is somehow disabling AVX-512 support (note that the feature list you got inside the container lacks all of the avx512* features the host reports). I searched around briefly and couldn't find any bug report or config option about it, but it should in principle be possible. Looking at OpenBLAS's dispatch code confirms this: there's only a compile-time option to disable AVX-512, and the runtime check is based only on the CPU model, not on the available features; in particular, nothing checks for OS support. (edit: They should also detect the CPU feature via cpuid, but that probably won't help here.)
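One quick way to check that guess (a sketch only; I can't test it without the hardware) is to compare the AVX-512 flags visible inside the container with what the host reports:
$ docker run --rm julia:1.0.1-stretch bash -c "grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u"   # flags the container sees
$ sysctl -a | grep -i AVX512                                                                     # on the macOS host, for comparison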
Should we keep an issue open to track the upstream bug and make sure that we configure OpenBLAS correctly in the circumstances where this can be an issue?
Is there anything else I can do to get this resolved? I think it would be great if we could get the official Julia docker images working on Skylake and agree with @StefanKarpinski that it would be good to keep an issue open.
Please report to openblas instead.
Have the same issue on the iMac Pro. Can anyone point me to the issue in the OpenBLAS repository or something? Do we have a temporary patch?
Please report to openblas instead.
Also, @yuyichao, this doesn't make much sense. Just because the water upstream is foul doesn't mean you forgo purifying it downstream. This causes a workflow disruption for Julia users, and I think you or @StefanKarpinski should re-open this issue.
It doesn't make sense to keep one issue open for each downstream, especially when there's nothing actionable here. Closing also doesn't mean nothing will ever be done here. Pinging back or opening a similar issue about upgrading the dependency after it's fixed upstream is certainly fine and welcome.
If there's really no way to patch this on our end, that might make sense. But one consequence of closing the issue prematurely is that you can't find out one way or the other.
I'll defer to you and Stefan. But as far as I can tell, the only consequence of this has been that (a) OpenBLAS is still unaware of the issue, and (b) the Julia community has not been searching for a workaround.
Keeping this open won't help anything; the OpenBLAS devs won't see it here. That's exactly why I said to report it to them instead, which you regarded as nonsense.
OpenBLAS being unaware (if that's still the case) is the consequence of no one reporting it there, not of this issue being closed.
No workaround being present is because the only way to make one without just disabling AVX-512 is to make the correct fix, which isn't a workaround anymore and should be submitted to OpenBLAS first; at that point, including the patch here is very welcome and doesn't need any issue. I even outlined the necessary check here.
Also note that if I were able to reproduce this, I would have submitted the issue myself. I can't: I don't have the hardware or environment, so I didn't. I can only help with the analysis above, based purely on the observation that my code works but OpenBLAS's doesn't. Anyone who can reproduce and test this is in a much better position to report it. If it wasn't reported and you don't want to report it either, leave a comment here and I'll do it. I didn't think that would be the case, so I didn't ask before.
No worries, I'll file the issue.
I didn't think raising the issue upstream was nonsense; just that we should keep this open to try to find a patch in the meantime.
OK, I'll take your word for it. If there's nothing we can do, I agree that there's no use keeping this around.
See (1).
Just make sure you link this issue for the debugging info here. That should also be enough to create a back-link from here to that issue.
Also note that I said there's not much to do here "without just disabling avx512". Disabling it for everyone or downgrading OpenBLAS is certainly possible as a workaround, but I really don't think that's a good idea. (That was under the assumption, though, that this would have been reported to OpenBLAS two months ago and there would have been a patch we could at least carry a long time ago... Oh well.) In other words, regarding
> make sure that we configure OpenBLAS correctly in the circumstances where this can be an issue?
There's no way to do this only "in the circumstances where this can be an issue"; the way to do that is the right patch. (Unless we want to provide both an AVX-512-enabled version and a non-AVX-512 version, which I really doubt, and in that case it'll also be a buildbot issue, not a Julia issue.)
Errr, well, I still don't see a related OpenBLAS issue reported, so I just filed it myself: https://github.com/xianyi/OpenBLAS/issues/1947.
As I mentioned above and also in the OpenBLAS issue, there's not much else I can do about this, so anyone who actually wants this fixed should comment and help with testing there.
@yuyichao sorry for not reporting the issue upstream. I didn't because I don't really understand what's going on. I should have at least asked for help reporting it here and will do so next time.
Anyway, I'm commenting on the upstream issue now and happy to run things on my machine.
Unlike the above, I have no nobler explanation than that I told myself I'd do it after lunch and never got around to it. Thanks so much for taking the initiative to make this happen.
I've found a workaround for this issue which allows me to start Julia in a Docker container on my iMac Pro: set the environment variable OPENBLAS_CORETYPE to something other than skylakex, for example haswell.
$ docker run -it --rm -e OPENBLAS_CORETYPE=haswell julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.1.0 (2019-01-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia>
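If you don't want to pass the variable on every docker run, the same workaround can be baked into a derived image; a minimal untested sketch (the base tag here is just an example):
# hypothetical derived image applying the workaround by default
FROM julia:1.1.0
ENV OPENBLAS_CORETYPE=haswell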
@sashkab nice, thanks for that.
Also openblas 0.3.6 was just released and contains the fix: https://github.com/xianyi/OpenBLAS/releases/tag/v0.3.6
What's the process of updating to that?
> Also openblas 0.3.6 was just released
Oh, wow. Perfect timing. I checked prior to posting this and it wasn't available yet.
I ran into this issue too. Can we re-open this so this issue can be fixed (by upgrading to openblas 0.3.6 or 0.3.7)?