As I've developed this Parallel application, I've noticed that everyone who uses it with an AMD processor reports significantly slower runtime performance than those using Intel processors, as in my personal benchmark listed within. The difference is drastic: my AMD FX-8120 is orders of magnitude slower than my mobile Intel i5, to the point that the original Perl implementation is much faster than the Rust implementation on AMD hardware, while on Intel hardware the Perl implementation is 40x slower than the Rust one.
I can confirm this on rustc 1.11.0 (9b21dcd6a 2016-08-15)
Sysprof recording showing where the time is spent is attached at https://github.com/mmstick/parallel/issues/22
Also can confirm this on rustc 1.13.0-nightly (4f9812a59 2016-09-21)
sysprofs-intel.zip: another sysprof recording, this time from my Intel laptop (ignore Chrome streaming in the background). Looking at the sysprof provided by @Shnatsel, there's certainly something wonky happening on AMD hardware.
I would need to know two things: with perf, are the same functions hot on both systems? And have you tried target-cpu=native? (It might well be that the assembly generated by LLVM's generic target benefits Intel in your particular case.)
Here's the result of perf record -F 99 with both the debug and release builds on my Intel laptop. The command executed was seq 1 10000 | parallel echo > /dev/null.
Here are the results from my AMD FX-8120 desktop.
Compiling with -C target-cpu=native makes no difference to performance. Compilation is done simply with cargo build --release, and these measurements were taken using just four threads instead of eight on the FX-8120.
You should also use perf archive
to collect the objects used in your run, because perf.data
alone doesn't contain any symbol information.
Anyway, even without symbols I can see that your AMD machine is mostly busy within just one particular function in the kernel:
Overhead Command Shared Object Symbol
82.29% parallel [kernel.kallsyms] [k] 0xffffffff812f8c37
1.23% parallel [kernel.kallsyms] [k] 0xffffffff812f8c3a
0.58% parallel libpthread-2.24.so [.] 0x0000000000009c0b
0.41% parallel [kernel.kallsyms] [k] 0xffffffff810fdf43
0.41% parallel [kernel.kallsyms] [k] 0xffffffff812f8adc
0.33% parallel [kernel.kallsyms] [k] 0xffffffff811e48bb
0.29% sh [kernel.kallsyms] [k] 0xffffffff810c3cba
0.27% parallel [kernel.kallsyms] [k] 0xffffffff8117a771
0.26% parallel [kernel.kallsyms] [k] 0xffffffff81301729
0.26% parallel [kernel.kallsyms] [k] 0xffffffff815d7337
Whereas the Intel machine is much more spread out:
Overhead Command Shared Object Symbol
10.70% parallel [kernel.vmlinux] [k] 0xffffffffaa0378e4
10.24% parallel libpthread-2.22.so [.] 0x0000000000009a47
9.28% sh ld-2.22.so [.] 0x0000000000017273
5.82% sh [kernel.vmlinux] [k] 0xffffffffaa49df6a
5.24% parallel ld-2.22.so [.] 0x0000000000015d68
4.99% parallel [kernel.vmlinux] [k] 0xffffffffaa105044
4.93% parallel [kernel.vmlinux] [k] 0xffffffffaa0b9057
4.85% parallel [kernel.vmlinux] [k] 0xffffffffaa0d760f
4.68% sh libc-2.22.so [.] 0x000000000007c55a
4.59% sh [kernel.vmlinux] [k] 0xffffffffaa0ae454
But even the full Intel report doesn't show _any_ samples in the parallel object itself, so it doesn't seem your benchmark is even stressing Rust-generated code. With so much time spent in the kernel and ld.so, I think you're mostly just hammering on process creation. If I understand right, you're launching 10,000 distinct echo processes, so that probably dwarfs the actual parallel work.
(And note that if you're doing this through sh -c
, then it will actually get the shell's echo
builtin, which would explain why the report only shows sh
. Try with /bin/echo
if you want that to be a truly distinct process too.)
Perhaps that's hitting a pathological case on AMD with something like page table setup, for instance. But we shouldn't just guess -- run perf report yourself to see what that 82% symbol is.
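To make the process-creation point concrete, here's a tiny illustrative Rust program (hypothetical, not from this thread): it spawns a batch of /bin/echo processes, and nearly all of its wall time ends up in process creation rather than in the parent's own code.
use std::process::{Command, Stdio};
use std::time::Instant;

fn main() {
    // Spawn many short-lived processes; the parent does almost no work itself,
    // so fork/exec overhead dominates the measured time.
    let start = Instant::now();
    for _ in 0..1_000 {
        Command::new("/bin/echo")
            .arg("hello")
            .stdout(Stdio::null())
            .status()
            .expect("failed to run /bin/echo");
    }
    println!("1000 spawns took {:?}", start.elapsed());
}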
Does that mean that it's an issue with the Linux kernel being poorly optimized for AMD processors with that specific function?
Possibly, but let the data guide you. See what that symbol is first. Then you can also hit 'a' in perf report to annotate that function with the actual instructions that recorded samples.
Have you tried the same benchmark with GNU parallel? I'll bet it will face the same problem, and if not you can try to figure out what it's doing differently.
Using the same benchmark, GNU Parallel takes 30s compared to the 36s spent with the Rust implementation on my AMD FX-8120 using four cores. With my i5 2410M laptop, GNU Parallel takes 54 seconds to do what the Rust implementation does in 5 seconds.
Not exactly sure what I'm looking at, but I have a new perf data with the archive data alongside it.
amd_perf_data.zip
Hmm, that new profile doesn't have the same 82% cluster. Is it still slow compared to Intel?
Overhead Command Shared Object Symbol
14.20% parallel libpthread-2.24.so [.] pthread_mutex_lock
10.69% sh libc-2.24.so [.] __strchr_sse2
5.04% sh libc-2.24.so [.] strlen
4.61% parallel libpthread-2.24.so [.] __pthread_mutex_unlock_usercnt
4.13% parallel [kernel.vmlinux] [k] entry_SYSCALL_64
3.60% sh ld-2.24.so [.] strcmp
3.49% sh ld-2.24.so [.] do_lookup_x
3.47% parallel [kernel.vmlinux] [k] page_fault
3.13% parallel parallel [.] sdallocx
2.74% sh libc-2.24.so [.] malloc
I can also see your kernel was updated, from 4.7.3-2-ARCH before to 4.7.5-1-ARCH now. (Hit 'i' in perf report.) If there was a kernel problem, maybe they've fixed it...
Indeed, it's still slow compared to Intel. This is an eight core FX @ 4GHz that's taking 36s to do what an i5 2410M @ 2.3 GHz does in 5s, and manages to run slower than GNU Parallel which does the same job in 30s. The wall time should be 3s to be comparable to the performance improvement that Intel sees (from 54s with GNU to 5s with Rust).
Sadly I have no clue how to use perf (links to how-tos would be appreciated), but sysprof shows a 70% cluster in copy_page in the kernel.
Here's the sysprof capture so you can inspect it yourself.
I am on Linux 4.4.0-38-generic #57-Ubuntu
For whatever reason the perf archive command seems to be missing on Ubuntu despite being listed in perf --help; I get the error: perf: 'archive' is not a perf-command. See 'perf --help'.
However, I could get perf report to work, and it points at copy_page as well: 81.58% parallel [kernel.kallsyms] [k] copy_page, with everything else under 1%.
Your sysprof shows that copy_page
is occurring under do_huge_pmd_page
, which is related to the kernel's transparent huge pages. You might cat /sys/kernel/mm/transparent_hugepage/enabled
and compare that between your Intel and AMD machines. On Fedora 24 I see always [madvise] never
, meaning it's currently only using huge pages if advised to. Try never
and see what happens.
# echo never >/sys/kernel/mm/transparent_hugepage/enabled
(run that as root)
If that helps you, it doesn't _necessarily_ mean that you need to leave that disabled all the time. Spawning so many processes in such a short time is a pretty unrealistic workload. Do some performance testing with more typical use cases before deciding how to tune your system.
We're pretty far afield from anything Rust can do about this, unless you find evidence of something else that Rust may be doing suboptimally. This doesn't look like a codegen issue to me.
[always] madvise never
for me by default.
# echo never >/sys/kernel/mm/transparent_hugepage/enabled
helped, it is fast even on AMD now. Thanks!
To confirm that the CPU is a red herring: If you echo always >/sys/kernel/mm/transparent_hugepage/enabled
on an Intel system, do you see an equivalent slow down?
Tried it in "madvise" mode and it's still fast, so Rust is probably doing everything right here, i.e. not advising the kernel to use huge pages.
Seems that the issue was unrelated to rustc. Closing.
I started looking into it and apparently this isn't the first case of a jemalloc-using application interacting poorly with transparent huge pages. Here's one from a little over a year ago:
https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/
(In the linked case, it was leak-like behaviour because THP was forcing 2MiB pages on jemalloc and it was assuming that applying MADV_DONTNEED
to ranges smaller than 2MiB was enough for them to be returned to the OS.)
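To make that allocator-side assumption concrete, here's a minimal sketch (my own illustration using the libc crate, not code from jemalloc or from this thread) of the MADV_DONTNEED pattern the article describes: advising a sub-2 MiB range of a mapping that transparent huge pages may have backed with a single 2 MiB page.
extern crate libc;

use std::ptr;

fn main() {
    unsafe {
        // One huge-page-sized anonymous mapping; with THP set to "always" the
        // kernel may back it with a single 2 MiB page.
        let len = 2 * 1024 * 1024;
        let addr = libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert!(addr != libc::MAP_FAILED);

        // Fault the pages in so the kernel actually allocates them.
        ptr::write_bytes(addr as *mut u8, 0xAA, len);

        // An allocator returning a small run to the OS does roughly this; per
        // the summary above, jemalloc assumed this was enough for the range to
        // be returned to the OS, which does not hold when THP has backed it
        // with a 2 MiB page.
        let ret = libc::madvise(addr as *mut libc::c_void, 4096, libc::MADV_DONTNEED);
        assert!(ret == 0);

        libc::munmap(addr, len);
    }
}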
I compiled my kernel with support for transparent huge pages and tried out always on my Intel laptop. The result was that my Intel processor became just as slow, so the problem isn't specific to AMD. It would also appear that Solus OS already knows about the issues with this being set to always, as it has defaulted to madvise for a while now.
@Shnatsel You can use perf by running sudo perf record -F 99 <COMMAND>
followed by sudo perf archive
. You'll get a perf.data
file from record
and symbols from archive
. Then you can use perf script
to view it.
I can confirm that this is a bug with jemalloc.
Forcing use of the system default allocator, as described in https://doc.rust-lang.org/nightly/book/custom-allocators.html, gives good fork performance even with transparent huge pages enabled.
Too bad it is an unstable language feature. Is there any way to get jemalloc off my lawn without using unstable language features, like a compiler switch?
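For reference, here's a minimal sketch of the nightly-only approach described in that chapter as it existed at the time (the unstable alloc_system feature is exactly the part being complained about):
// Nightly-only: opt out of the bundled jemalloc and link the system
// allocator instead, as described in the custom-allocators chapter.
#![feature(alloc_system)]
extern crate alloc_system;

fn main() {
    // All heap allocations in this binary now go through the platform's
    // default malloc rather than jemalloc.
    let data = vec![1u32, 2, 3];
    println!("{:?}", data);
}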
cc @sfackler, another data point for jemalloc...
@Shnatsel is there a tracking issue in jemallocs repo for this?
I think you should report a bug against your Linux distribution that it should be using madvise, not always.
@gnzlbg I have not opened one. I have no clue where the jemalloc bug tracker is