Bots are very frequently nowadays failing with:
note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006BA)'
Normally on some run-pass test. The test that fails is almost always different.
No idea what's going on, and the internet seems to think that this is either one of:
I'm pretty sure the bots aren't hitting OOM as these are tiny tests, I'm also pretty sure that nothing is stale b/c we clean everything (although I could be wrong), and I find it hard to believe that you can't run link.exe in parallel.
No idea what's going on :(
cc @vadimcn @retep998
Theory: We have two concurrent builds running on the same machine. They both hit the run-pass test suite at the same time. They then try to compile the same test case at the same time. Finally, this chokes mspdbsrv.
That seems to align with the vague descriptions on the internet about parallel builds, explains why we started seeing this recently (an uptick in the amount of things we test), explains the nondeterminism, and explains the different test cases.
Testing locally seems to confirm this as well. A bunch of parallel compilers compiling the same fn main() {} in main.rs will fault eventually on 32-bit MSVC. They do not, however, fault on 64-bit MSVC. Confirming now we've never retried 64-bit MSVC because of this bug and then I'll just limit at most one 32-bit MSVC build per buildslave.
To update, looks like the 32-bit linker on MSVC will simply choke if it is ever invoked in parallel. We only have one 32-bit MSVC bot which confused me for a bit, but this'd explain why we're failing in run-pass, we're invoking a lot of things in parallel...
For posterity, this was my test program. This fails on 32-bit MSVC today (the 1.8 release) and will hopefully pass once #33155 lands.
use std::process::*;
use std::io::*;
use std::path::*;
use std::fs::File;
use std::fs;
use std::thread;
const N: usize = 2;
const M: usize = 100;
fn main() {
let dirs = (0..N).map(|n| {
PathBuf::from(format!("dir{}", n))
}).collect::<Vec<_>>();
for dir in dirs.iter() {
fs::create_dir_all(&dir).unwrap();
File::create(dir.join("main.rs")).unwrap()
.write_all(b"fn main() {}").unwrap();
}
let threads = dirs.iter().cloned().map(|m| {
thread::spawn(move || {
for _ in 0..M {
let mut cmd = Command::new("rustc");
cmd.current_dir(&m).arg("--crate-name").arg(&m).arg("main.rs");
println!("{:?}", cmd);
let status = cmd.status().unwrap();
assert!(status.success());
}
})
}).collect::<Vec<_>>();
for thread in threads {
thread.join().unwrap();
}
}
From what I could glean on the internet, it seems that the culprit is mspdbsrv running out of memory. Why only on 32-bit builder? Because it probably uses 32-bit tools, so mspdbsrv is limited to 2G of memory. Care to check? (I don't have a 32-bit Windows handy).
From what I could glean on the internet, it seems that the culprit is mspdbsrv running out of memory.
Yeah unfortunately I found the internet pretty unreliably in diagnosis here. I could definitely repro locally and didn't see OOM anywhere (maybe I misssed it?).
Why only on 32-bit builder?
No idea! But actually, I'm not sure :(. The program above will succeed for x86_64-pc-windows-msvc and fail for i686-pc-windows-msvc, though.
Care to check?
Unfortunately I don't have access to a 32-bit only machine either :(
Did you know, the point of mspdbsrv.exe is actually to serialize requests to PDB files to _avoid_ issues that arising when invoking the compiler in parallel?
Ok, after some more investigation, I think I see what's going on now. While the observations here have been, true, they've basically all been red herrings unfortunately. I can safely say that link.exe can indeed be run in parallel, it just apparently handles crashes and exits of mspdbsrv.exe really badly.
So as a diagnosis, here's what I think is happening:
link.exe is run, it wants to talk to mspdbsrv.exe for some reason (presumably to generate a pdb file).mspdbsrv.exe isn't running, link.exe appears to spawn it. Not only that, it appears to do so in a synchronized fashion. That is if two concurrent link.exe invocations are run in parallel on a system only one mspdbsrv.exe seems to get spawned. You can notice this in that the first link.exe command will take a lot longer than the next.mspdbsrv.exe is killed, all "connected" link.exe instances will die horribly. I've seen lots of the RPC errors mentioned here and even segfaults.Given that information, why did I think that concurrent link.exe invocations are bad? Well it turns out that I was always using rustc via rustup.rs (even when I was testing on the bots), which uses job objects. This means that the entire process tree was killed whenever rustup's rustc wrapper exited. So what I was witnessing was:
rustc in parallel.link.exe.link.exe instances decided to spawn mspdbsrv.exe, and all the others patiently waited while that happened.link.exe finished, causing the first rustc to finish, which in turn caused the rustup rustc wrapper to finish, when then _tore down the spawned mspdbsrv.exe_.mspdbsrv.exe exiting very well.Ok, so that explains why I thought you can't run rustc in parallel. It also explains why sometimes I could reproduce the bug locally and others I could not (sometimes I had mspdbsrv.exe running around and sometimes I didn't). So how do we explain the buildbots? And how do we explain why this has all been happening recently? Let's take a look!
make process and following process tree are all in one job object to make sure that processes are not persistent between builds.mspdbsrv.exe, if spawned during our build, will get killed when a build finishes. We currently have only 2 builds that I know of that spawn link.exe in 32-bit mode auto-win-msvc-32-opt and auto-win-msvc-32-cross-opt. I think the cargotest builder _would_, but it's primarily 64-bit and disables cross-compiled tests. Looking at some logs, a curious trend emerges. Both the opt and cross-opt builders are always running on the same machine when opt fails. Even more interestingly, they always fail at the same time!Presumably what's happening is that the cross-opt builder process spawned mspdbsrv.exe, and then opt started using it. When cross-opt exits it kills mspdbsrv.exe and then opt dies along with it as it was trying to use link.exe. The failure doesn't happen if opt spawns mspdbsrv.exe (it takes longer) or if cross-opt exits between tests on opt.
Note that this also explains why we just started seeing these failures recently. The cross-opt builder is relatively new.
Funny story, first hit on google for "isolate mspdbsrv.exe" is this article which is hilariously exactly what's happening to us. Unfortunately there aren't many replies...
As for a solution... well... I guess we can run one 32-bit build per machine? Have a whole slave dedicated to the cross-opt builder? Drawing some blanks...
Could we just leave mspdbsrv.exe running? It seems to me like we are killing it, so... (note, I know nothing)
@alexcrichton well done. wow that's tricky.
Apparently, there is an undocumented environment variable _MSPDBSRV_ENDPOINT_, which allows to start multiple instances of mspdbsrv. I did a quick test, and it appears to work.
(found here https://codereview.chromium.org/83803003)
Well it turns out that I was always using rustc via rustup.rs (even when I was testing on the bots), which uses job objects. This means that the entire process tree was killed whenever rustup's rustc wrapper exited.
A couple days ago on #rust-internals...
(13:58:16) WindowsBunnyEngineer: if mspdbsrv is shared between concurrent builds
(13:58:35) WindowsBunnyEngineer: since mspdbsrv is added to the job object of the process that started it...
(13:58:57) WindowsBunnyEngineer: when that build terminates, the mspdbsrv it started will get killed
(13:59:03) WindowsBunnyEngineer: while another build could be relying on it still
Thanks for the link @vadimcn! For now I've opted to modify our job object management to just "leak" mspdbsrv.exe (basically what @ubsan mentioned), but it's good to have a solution like that in our back pocket.
This has been deployed to the bots, and the current build passed, so let's just pray we never see this again.
I'm seeing this when running cargo test (link fails when compiling hyper).
cargo 0.18.0-nightly (5db6d64 2017-03-03)
rustc 1.17.0-nightly (b1e31766d 2017-03-03)
cargo test --target i686-pc-windows-msvc
error: linking with `link.exe` failed: exit code: 1318
|
= note: "C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\link.
exe" "/NOLOGO" "/NXCOMPAT" "/LARGEADDRESSAWARE" "/SAFESEH" "/LIBPATH:C:\\Users\\
n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i
686-pc-windows-msvc\\lib" "C:\\Users\\n\\Documents\\imageflow\\target\\debug\\bu
ild\\hyper-d6d1e087e41dda9f\\build_script_build-d6d1e087e41dda9f.build_script_bu
ild.o" "/OUT:C:\\Users\\n\\Documents\\imageflow\\target\\debug\\build\\hyper-d6d
1e087e41dda9f\\build_script_build-d6d1e087e41dda9f.exe" "/OPT:REF,NOICF" "/DEBUG
" "/LIBPATH:C:\\Users\\n\\Documents\\imageflow\\target\\debug\\deps" "/LIBPATH:C
:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\
rustlib\\i686-pc-windows-msvc\\lib" "C:\\Users\\n\\Documents\\imageflow\\target\
\debug\\deps\\librustc_version-1324ff5c3ace67a8.rlib" "C:\\Users\\n\\Documents\\
imageflow\\target\\debug\\deps\\libsemver-f260b12f97c06fb2.rlib" "C:\\Users\\n\\
.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686
-pc-windows-msvc\\lib\\libstd-13ee61d82f393677.rlib" "C:\\Users\\n\\.rustup\\too
lchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-
msvc\\lib\\libpanic_unwind-48ea9b793aac0c7c.rlib" "C:\\Users\\n\\.rustup\\toolch
ains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-msv
c\\lib\\libunwind-d47ebac29f9bd14f.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nig
htly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\l
iblibc-c9fd286480cea820.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-0
3-04-i686-pc-windows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\librand-83ea
7d0fdec3fdde.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-p
c-windows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\libcollections-6abea59f
9a3a2e81.rlib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-wi
ndows-msvc\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\liballoc-18b5b7bd96a82034.r
lib" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc
\\lib\\rustlib\\i686-pc-windows-msvc\\lib\\liballoc_system-a1b08813244b60c5.rlib
" "C:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\l
ib\\rustlib\\i686-pc-windows-msvc\\lib\\libstd_unicode-1caf05b3f5a5e05c.rlib" "C
:\\Users\\n\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\
rustlib\\i686-pc-windows-msvc\\lib\\libcore-046c133680ee59d2.rlib" "C:\\Users\\n
\\.rustup\\toolchains\\nightly-2017-03-04-i686-pc-windows-msvc\\lib\\rustlib\\i6
86-pc-windows-msvc\\lib\\libcompiler_builtins-056daeb08ac9c69e.rlib" "advapi32.l
ib" "ws2_32.lib" "userenv.lib" "shell32.lib" "msvcrt.lib"
= note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006B
A)'
error: aborting due to previous error
error: Could not compile `hyper`.
Build failed, waiting for other jobs to finish...
error: build failed
This appears to be part of a larger pattern of issues when --target is specified, but has the same value as the default target displayed by rustup show.
Likely a separate bug, but is has the same manifestation.
@nathanaeljones are you using rustup? If so you may be running into https://github.com/rust-lang-nursery/rustup.rs/issues/746
Yes, I'm using rustup.
On Mar 11, 2017 9:20 AM, "Alex Crichton" notifications@github.com wrote:
@nathanaeljones https://github.com/nathanaeljones are you using rustup?
If so you may be running into rust-lang-nursery/rustup.rs#746
https://github.com/rust-lang-nursery/rustup.rs/issues/746—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/rust-lang/rust/issues/33145#issuecomment-285878406,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAGln8nM8MzFLjcFze-D9Z4_MVGFnccvks5rksnTgaJpZM4INPEc
.
FWIW, this goes away if I drop --target.
I see the same errors with rustc 1.16.0 (30cf806ef 2017-03-10) and cargo-0.17.0-nightly (f9e5481 2017-03-03) when doing a simple cargo build, cargo run, or cargo publish. Does not happen always and on the second or third try it mostly works.
error: linking with
link.exefailed: exit code: 1318
= note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x000006BA)
= note: LINK : fatal error LNK1318: Unexpected PDB error; RPC (23) '(0x00000005) (sometimes)
We're having this issue with Rust 1.24, 1.25 and 1.26 at least, on Cobalt and Gutenberg
@alexcrichton
More details: https://ci.appveyor.com/project/Keats/gutenberg/build/1.0.540
I only had this problem recently, after 1.26 was released and it was happening to 100% of my Windows builds.
From the matrix above, it seems that it does work for Gutenberg for 1.24 and 1.25 at least (I'm re-running that PR now to ensure it's repeatable).
The appveyor config is https://github.com/Keats/gutenberg/blob/6b99fce795c373575493455a83e5abcfa5bbe653/appveyor.yml and it does use the --target, I'll try removing it once the first re-run is done
Edit: second run fails only on 1.26 as well https://ci.appveyor.com/project/Keats/gutenberg/build/1.0.541
Sorry I don't have a lot of time to dig into this right now, can you open a new dedicated issue for this with updated information? Any way to reproduce it locally and/or minimization on AppVeyor would also be greatly appreciated!
Done as https://github.com/rust-lang/rust/issues/50778
I'll try a bit more tomorrow to figure it out
Most helpful comment
Ok, after some more investigation, I think I see what's going on now. While the observations here have been, true, they've basically all been red herrings unfortunately. I can safely say that
link.execan indeed be run in parallel, it just apparently handles crashes and exits ofmspdbsrv.exereally badly.So as a diagnosis, here's what I think is happening:
link.exeis run, it wants to talk tomspdbsrv.exefor some reason (presumably to generate a pdb file).mspdbsrv.exeisn't running,link.exeappears to spawn it. Not only that, it appears to do so in a synchronized fashion. That is if two concurrentlink.exeinvocations are run in parallel on a system only onemspdbsrv.exeseems to get spawned. You can notice this in that the firstlink.execommand will take a lot longer than the next.mspdbsrv.exeis killed, all "connected"link.exeinstances will die horribly. I've seen lots of the RPC errors mentioned here and even segfaults.Given that information, why did I think that concurrent
link.exeinvocations are bad? Well it turns out that I was always usingrustcvia rustup.rs (even when I was testing on the bots), which uses job objects. This means that the entire process tree was killed whenever rustup'srustcwrapper exited. So what I was witnessing was:rustcin parallel.link.exe.link.exeinstances decided to spawnmspdbsrv.exe, and all the others patiently waited while that happened.link.exefinished, causing the firstrustcto finish, which in turn caused the rustuprustcwrapper to finish, when then _tore down the spawnedmspdbsrv.exe_.mspdbsrv.exeexiting very well.Ok, so that explains why I thought you can't run rustc in parallel. It also explains why sometimes I could reproduce the bug locally and others I could not (sometimes I had
mspdbsrv.exerunning around and sometimes I didn't). So how do we explain the buildbots? And how do we explain why this has all been happening recently? Let's take a look!makeprocess and following process tree are all in one job object to make sure that processes are not persistent between builds.mspdbsrv.exe, if spawned during our build, will get killed when a build finishes. We currently have only 2 builds that I know of that spawnlink.exein 32-bit modeauto-win-msvc-32-optandauto-win-msvc-32-cross-opt. I think the cargotest builder _would_, but it's primarily 64-bit and disables cross-compiled tests. Looking at some logs, a curious trend emerges. Both theoptandcross-optbuilders are always running on the same machine whenoptfails. Even more interestingly, they always fail at the same time!Presumably what's happening is that the
cross-optbuilder process spawnedmspdbsrv.exe, and thenoptstarted using it. Whencross-optexits it killsmspdbsrv.exeand thenoptdies along with it as it was trying to uselink.exe. The failure doesn't happen ifoptspawnsmspdbsrv.exe(it takes longer) or ifcross-optexits between tests onopt.Note that this also explains why we just started seeing these failures recently. The
cross-optbuilder is relatively new.Funny story, first hit on google for "isolate mspdbsrv.exe" is this article which is hilariously exactly what's happening to us. Unfortunately there aren't many replies...
As for a solution... well... I guess we can run one 32-bit build per machine? Have a whole slave dedicated to the cross-opt builder? Drawing some blanks...