First introduced by https://github.com/rust-lang/rust/pull/47828#issuecomment-364319910.
Symptom: The following 3 test cases involving error E0432 will fail.
failures:
[compile-fail] compile-fail\rfc-2126-extern-in-paths\single-segment.rs
[compile-fail] compile-fail\use-keyword.rs
[compile-fail] compile-fail\use-mod-2.rs
Mainly affects Windows and macOS machines (maybe because they are tested first).
Current instances:
| PR | Image |
|----|-------|
| https://github.com/rust-lang/rust/pull/47828#issuecomment-363678879 | check i686-pc-windows-gnu |
| https://github.com/rust-lang/rust/pull/47828#issuecomment-363936811 | check x86_64-apple-darwin |
| https://github.com/rust-lang/rust/pull/47828#issuecomment-364316073 | check i686-pc-windows-gnu |
| https://github.com/rust-lang/rust/pull/47657#issuecomment-364632724 | check i686-pc-windows-gnu |
| https://github.com/rust-lang/rust/pull/48092#issuecomment-364678367 | check i686-pc-windows-gnu |
| https://github.com/rust-lang/rust/pull/48127#issuecomment-364739852 | check x86_64-apple-darwin |
| https://github.com/rust-lang/rust/pull/47614#issuecomment-364751349 | check x86_64-apple-darwin |
| https://github.com/rust-lang/rust/pull/47752#issuecomment-364758508 | check x86_64-windows-gnu |
| https://github.com/rust-lang/rust/pull/47804#issuecomment-364805485 | check x86_64-apple-darwin |
| https://github.com/rust-lang/rust/pull/47804#issuecomment-364961617 | check x86_64-apple-darwin |
| https://github.com/rust-lang/rust/pull/48158#issuecomment-365107533 | check x86_64-apple-darwin |
| https://github.com/rust-lang/rust/pull/47906#issuecomment-365151270 | check x86_64-apple-darwin |
The second victim (#47657) contains nothing useful in the log https://ci.appveyor.com/project/rust-lang/rust/build/1.0.6293/job/k1wbokf5c59o4ghj.
cc @alexcrichton @eddyb
Does compiletest really hide ICEs now? How did this happen 😞?
There's some more information as well at the top of https://github.com/rust-lang/rust/pull/47828#issuecomment-364319910, notably that this has been seen at least once on x86_64-apple-darwin.
The fact that it's spurious yet deterministic about these three tests is pretty disturbing, and sort of points to maybe this being a nondeterministic miscompilation in librustc_resolve itself. I think it's pretty certain this was introduced via the LLVM 6 upgrade, but given the lack of ability to reproduce or anything else my preferred course of action here is:
@kennytm if this starts bouncing every other PR though we can certainly reconsider!
@eddyb I've verified that the ICE is swallowed by compile-test 😢. However, even if it is not swallowed, the ICE is useless without RUST_BACKTRACE=1 😓
Edit: Wow, even worse, the test case for #48000 passed successfully even though it is still ICE-ing. check_no_compiler_crash is not effective anymore?
Edit²: I think this (ICE being ignored) is caused by #47634
Edit³: Never mind, the ICE detection is functioning properly with #48048. The test for the ICE in #48000 never failed with --error-format json, but the check specifically catches the words "error: internal compiler error".
Edit⁴: The "hide ICE" thing should be fixed in #48127.
Ok I've been debugging this with @eddyb today and we may (?) have made some progress. @eddyb has been able to deterministically reproduce the test failure on his machine and furthermore has a reduced test case which panics on his machine. Unfortunately I have been unable to reproduce this on my machine.
What we have found though is also fascinating. After lots of sharing of IR we've found that this particular snippet of IR will optimize differently on his machine than on mine (using the same version of LLVM!). To me this points to a memory access violation in LLVM or something like that. @eddyb, however, shared the literal LLVM binaries with me and I was again unable to reproduce on my machine!
I did find out, however, that valgrind reported no violations on the LLVM I compiled myself whereas it did report a violation on the binary that @eddyb shared with me. This, being very suspicious, pointed at maybe a miscompile of LLVM itself. (or maybe undefined behavior in LLVM?)
@eddyb was locally using clang 4 for compiling LLVM and I was using gcc 5.4.0. After switching to clang 4 locally I was able to reproduce the valgrind violation, although I'm still unable to reproduce the miscompile of the Rust code itself.
Given that we're seeing this failure across three builders, the two MinGW ones and 64-bit OSX, my current suspicion is that this is basically just undefined behavior in LLVM itself, only exploited by newer versions of the compilers we use to compile it than what I was using locally. This I think would explain the various symptoms of:
Unfortunately this still isn't a huge amount to go on. I'm going to try to home in on the valgrind error here and see where that leads me. I'm sort of just praying at this point that it leads to the cause of these bugs.
Bisection of the valgrind error locally points to https://github.com/llvm-mirror/llvm/commit/f45aefe37d39215e57c493313ab8727bf4f3c055.
Unfortunately that LLVM commit does not cleanly revert, so I've manually reverted it instead. Reverting that commit makes the valgrind error go away locally for me, and we're currently confirming with @eddyb whether it fixes the miscompile locally.
In the meantime though @eddyb also found that removing this assume fixed the miscompile, so we should probably do that anyway.
Ok @kennytm has removed the assume in #48209, and maybe that will fix this issue? If not we can try to pursue the LLVM fix perhaps.
The valgrind violation seems harmless. It's from assert((getNumBuckets() & (getNumBuckets()-1)) == 0 and getNumBuckets() is:
    unsigned getNumBuckets() const {
      return Small ? InlineBuckets : getLargeRep()->NumBuckets;
    }
What happens here is that clang happens to ultimately put the branch on Small after the branch that compares the results from getLargeRep()->NumBuckets. That means that the first branch may depend on uninitialized memory, because if Small is true, the large rep part never gets initialized.
That should be fine though, because InlineBuckets is a template parameter (8 here), and for the case that Small is true, the assertion could statically be proven to hold. So all you need is to check whether Small is true, or else whether the condition holds for getLargeRep()->NumBuckets. So the result for NumBuckets only matters if Small is false, and in that case, the memory is initialized.
See also the remark under "Preparing your program" on http://valgrind.org/docs/manual/quick-start.html which says that such bogus violations are expected at higher optimization levels.
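To make the argument above concrete, here is a minimal sketch in C++. It is not LLVM's code: `BucketCount`, `LargeNumBuckets`, `checkAsWritten` and `checkReordered` are hypothetical names, and the "garbage" value is passed in explicitly rather than read from uninitialized memory, so the sketch itself has no undefined behavior. It only illustrates that evaluating the large-rep comparison first cannot change the result when Small is true.

```cpp
#include <cassert>

// Hypothetical stand-in for the small/large representation (not LLVM's class).
template <unsigned InlineBuckets>
struct BucketCount {
    bool Small;
    unsigned LargeNumBuckets;  // only meaningful when Small is false

    unsigned getNumBuckets() const {
        return Small ? InlineBuckets : LargeNumBuckets;
    }
};

// True for zero and for powers of two, like (n & (n - 1)) == 0 in the assert.
inline bool isPowerOfTwoOrZero(unsigned n) { return (n & (n - 1)) == 0; }

// The check in source order: select the bucket count, then test it.
template <unsigned N>
bool checkAsWritten(const BucketCount<N>& b) {
    return isPowerOfTwoOrZero(b.getNumBuckets());
}

// The same check in the order described above: the large-rep comparison is
// computed first, and the decision on Small comes afterwards.
template <unsigned N>
bool checkReordered(const BucketCount<N>& b) {
    bool largeOk = isPowerOfTwoOrZero(b.LargeNumBuckets);  // may see garbage
    bool inlineOk = isPowerOfTwoOrZero(N);                 // compile-time fact
    return b.Small ? inlineOk : largeOk;  // garbage is discarded when Small
}

// Mirrors the shape of the original assertion.
template <unsigned N>
void assertBucketsValid(const BucketCount<N>& b) {
    assert(checkAsWritten(b));
}
```

With `BucketCount<8> b{true, garbage}`, both checks return true for any value of `garbage`, which is why the flagged read should not affect the outcome even though valgrind reports it.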
@alexcrichton I'm pretty sure that commit just made the jump threading pass find more opportunities to perform optimizations. So without that commit, it just never gets to hit the code path where the (bogus) violation is reported.
@dotdash gah bummer! I was wondering if it'd be something like that... (I know we have tons of those "errors" in rustc with valgrind)
If that's the case though it's sort of fascinating because it means that a binary on @eddyb's machine optimizes IR differently than when it was on my machine... That still implies to me some level of undefined behavior but maybe not the kind flagged by valgrind?
@dotdash hm so fascinatingly it looks like the patch I gisted above actually does fix the compile on @eddyb's machine...
Ok some more information on this. I've finally been able to reproduce the error that @eddyb was seeing. Given the exact binaries from @eddyb I was originally unable to reproduce the issue he had on his machine. He realized, though, that our machines differed in glibc versions. Notably @eddyb had glibc 2.26 and I had glibc 2.23. Testing more versions revealed that with the binaries @eddyb gave me glibc 2.25 worked ok and glibc 2.26 was the first bad one.
With this new information I reran bisection and it fascinatingly pointed at the same commit. I truly have no idea what is going on here.
This may be a bug that only "just happens" to show up on glibc 2.26 though. The bots that are reproducing this, OSX and MinGW, are not using glibc. Still digging...
I've filed https://bugs.llvm.org/show_bug.cgi?id=36386 upstream to hopefully address this issue. I don't think we can be 100% sure that this is the exact same issue that we're seeing on OSX/Windows but I think it's as close as we're gonna get for the time being.
This comment seems to hint at the same issue: https://bugs.llvm.org/show_bug.cgi?id=32981#c2
That bug is also related to JumpThreading, and non-determinism due to pointer comparisons would tally with the glibc changes (some change could result in malloc giving out chunks of memory in a different order) and would also explain why it affects platforms without glibc.
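As a rough illustration of how pointer comparisons can make output depend on the allocator, here is a small C++ sketch. It is not taken from LLVM or JumpThreading: `Node`, `idsInAddressOrder` and `idsInStableOrder` are hypothetical names, and array placement stands in for whatever order malloc happens to hand out memory.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical object with a stable identity that is independent of its address.
struct Node {
    int id;
};

// Orders nodes by address. The result depends on where each node happens to
// live in memory, so a different allocation pattern gives a different order.
std::vector<int> idsInAddressOrder(std::vector<const Node*> nodes) {
    std::sort(nodes.begin(), nodes.end(), std::less<const Node*>());
    std::vector<int> ids;
    for (const Node* n : nodes) {
        assert(n != nullptr);
        ids.push_back(n->id);
    }
    return ids;
}

// Orders nodes by their stable id. The result is the same no matter where
// the nodes were allocated.
std::vector<int> idsInStableOrder(std::vector<const Node*> nodes) {
    std::sort(nodes.begin(), nodes.end(),
              [](const Node* a, const Node* b) { return a->id < b->id; });
    std::vector<int> ids;
    for (const Node* n : nodes) {
        assert(n != nullptr);
        ids.push_back(n->id);
    }
    return ids;
}
```

Two arrays holding the same ids in different slots yield different sequences from `idsInAddressOrder` but the same sequence from `idsInStableOrder`. Any pass whose decisions follow the address order would inherit that dependence on memory layout.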
I'm going to preemptively close this as I believe the symptom has been fixed for us by commenting out the assume, and I think further investigation should probably be coordinated with https://bugs.llvm.org/show_bug.cgi?id=36386.