Wgpu-rs: It seems slow, am I doing something wrong?

Created on 25 Aug 2020 · 8 Comments · Source: gfx-rs/wgpu-rs

Doing some benchmarks, wgpu-rs seems to be quite slow.

Benchmark of multiplying every value in a vector (n=10,000,000) by a scalar (run with cargo test --release -- --test-threads=1):

---- tests::cpu_sscal stdout ----
6725 micros
thread 'main' panicked at 'assertion failed: false', src\lib.rs:72:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- tests::vulkano_sscal stdout ----
593 micros
thread 'main' panicked at 'assertion failed: false', src\lib.rs:135:9

---- tests::webgpu_sscal stdout ----
143451 micros
thread 'main' panicked at 'assertion failed: false', src\lib.rs:237:9
Panic in Arbiter thread.

wgpu-rs is drastically slower than vulkano, and roughly 20 times slower than doing the operation on the CPU.
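For reference, the CPU side of a benchmark like this can be sketched as below. This is a minimal sketch and not the code from the attached project: the names cpu_sscal and bench_cpu_sscal are hypothetical, and the vector is filled with ones purely for illustration. The allocation happens before the timer starts, so only the scaling itself is measured.

```rust
use std::time::Instant;

/// Multiply every element of `x` by `alpha`, in place (the BLAS "sscal" operation).
fn cpu_sscal(alpha: f32, x: &mut [f32]) {
    for v in x.iter_mut() {
        *v *= alpha;
    }
}

/// Hypothetical helper: scale a vector of `n` ones and return the result
/// together with the elapsed time in microseconds.
/// The vector is allocated before the timer starts, so only the scaling is timed.
fn bench_cpu_sscal(n: usize, alpha: f32) -> (Vec<f32>, u128) {
    let mut x = vec![1.0_f32; n];
    let start = Instant::now();
    cpu_sscal(alpha, &mut x);
    let micros = start.elapsed().as_micros();
    (x, micros)
}
```

Calling bench_cpu_sscal(10_000_000, 2.0) would match the vector size used in the post; the measured time depends entirely on the machine.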

Since this seems pretty drastic I wonder if I am missing something.

Project: webgpu_vs_vulkano.zip (it will take quite a while to build)

JonathanWoollett-Light

All 8 comments

Thank you for providing the test case! I had a look, and I found the following issues:

1. The dispatch work was different. You had dispatch(1,1,1) in Vulkano, but dispatch(x.len() as u32, 1,1) in wgpu. That's the main problem with the comparison.
2. The number of CPU copies was excessive. In wgpu, you collected the data first into a vec, then did the work, then mapped the buffer, then collected the data into another heap-allocated buffer. The first and last steps are not necessary. When uploading data, if you are very concerned about the cost of the initial copy, you can create a staging buffer manually and write directly into its mapping. I modified the code to do that in webgpu_vs_vulkano-fixed.zip
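To keep the dispatch comparable between the two APIs, one option is to compute the workgroup count once and pass the same value to both. The helper below is a sketch under assumptions: the name workgroup_count is hypothetical, and the shader's local size is not stated in this thread, so the values used with it are illustrative only.

```rust
use std::convert::TryFrom;

/// Number of workgroups needed to cover `len` elements when each workgroup
/// handles `local_size` elements (ceiling division).
/// `local_size` must match the local size declared in the compute shader,
/// which is an assumption here since the shader is not shown.
fn workgroup_count(len: usize, local_size: u32) -> u32 {
    assert!(local_size > 0, "local_size must be non-zero");
    let local = local_size as usize;
    // Full groups, plus one partial group if there is a remainder.
    let groups = len / local + usize::from(len % local != 0);
    u32::try_from(groups).expect("workgroup count does not fit in u32")
}
```

The same result would then be used as the first dispatch argument in both the Vulkano and the wgpu path, so that both do the same amount of GPU work.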

Results on my machine:

---- tests::cpu_sscal stdout ----
3604 micros
thread 'main' panicked at 'assertion failed: false', src/lib.rs:88:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- tests::vulkano_sscal stdout ----
1688 micros
thread 'main' panicked at 'assertion failed: false', src/lib.rs:152:9

---- tests::webgpu_sscal stdout ----
8740 micros
thread 'main' panicked at 'assertion failed: false', src/lib.rs:274:9

Vulkano uses a CPU-side buffer that's visible to the GPU, while in wgpu you are using a separate GPU-side buffer, which gets copied to and then from. The copies take as much time as the actual work in your case, if not more. This can't be fixed if you are targeting pure WebGPU and want to run on the Web. However, if you only care about native targets, you can enable the MAPPABLE_PRIMARY_BUFFERS extension and get the same behavior as your Vulkano code.

All in all, there is nothing that WebGPU does today to make your code slower, minus the limitations on how you get the data in and out. In the future, we'll do the bounds checking in the shaders, so that will introduce a little bit of overhead for this case specifically.

Going to close the issue now that the numbers are clarified. Please feel free to continue discussion!

kvark on 25 Aug 2020

1. Using the same values for all tests (ARRAY_SIZE=2097152 and dispatch number 2028), I get:

test tests::cpu_saxpy ... 1931 micros
ok
test tests::cpu_sscal ... 1579 micros
ok
test tests::vulkano_saxpy ... 2296 micros
ok
test tests::vulkano_sscal ... 2034 micros
ok
test tests::webgpu_saxpy ... 6956 micros
ok
test tests::webgpu_sscal ... 5630 micros
ok

(changed to using --nocapture instead of failing the tests and added an additional test)

2. When implementing the change you illustrated with staging_buffer_in, I can't see any effect on the resulting times.

3. I have enabled this feature, but I do not know what changes to make to the code after this. What code should I change to take advantage of it?

JonathanWoollett-Light on 26 Aug 2020

> When implementing the change you illustrated with staging_buffer_in, I can't see any effect on the resulting times.

The difference should be the following. If going through create_buffer_init, you are providing data upfront, and it gets copied into the mapped buffer. When doing it with staging_buffer_in, you are mapping the buffer and writing to it directly, so that's technically one less CPU copy.
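The difference between the two upload paths can be illustrated without any GPU API. In the sketch below, a plain byte slice stands in for the mapped staging buffer, and both function names are hypothetical. The first function writes each value straight into the destination; the second collects into a temporary Vec first, which costs one extra allocation and one extra pass over the data.

```rust
/// Write each f32 straight into the destination bytes, with no intermediate Vec.
/// `mapped` stands in for a mapped staging buffer (an assumption for illustration).
/// Returns the number of values written.
fn write_direct(values: impl Iterator<Item = f32>, mapped: &mut [u8]) -> usize {
    let mut written = 0;
    // Each f32 occupies exactly 4 bytes of the destination.
    for (chunk, v) in mapped.chunks_exact_mut(4).zip(values) {
        chunk.copy_from_slice(&v.to_ne_bytes());
        written += 1;
    }
    written
}

/// Same result, but the data is collected into a temporary Vec first,
/// which adds one heap allocation and one extra CPU copy.
fn write_via_vec(values: impl Iterator<Item = f32>, mapped: &mut [u8]) -> usize {
    let tmp: Vec<f32> = values.collect();
    write_direct(tmp.iter().copied(), mapped)
}
```

Both functions leave identical bytes in the destination; whether the saved copy is measurable depends on how large the data is relative to the rest of the work.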

> I have enabled this feature, but I do not know what changes to make to the code after this. What code should I change to take advantage of it?

With this change, you do everything on a single buffer (with MAP_READ | MAP_WRITE | STORAGE usage), instead of having 3 separate buffers.

kvark on 26 Aug 2020

Just reran using a single buffer and it seems to be a lot slower:

test tests::cpu_saxpy ... 2315 micros
ok
test tests::cpu_sscal ... 884 micros
ok
test tests::vulkano_saxpy ... 2626 micros
ok
test tests::vulkano_sscal ... 2007 micros
ok
test tests::webgpu_saxpy ... 7043 micros
ok
test tests::webgpu_sscal ... 3782188 micros
ok

webgpu_sscal is using a single buffer here.

Project: webgpu_vs_vulkano.zip

Anything I'm missing here?

JonathanWoollett-Light on 26 Aug 2020

Thank you for pushing this further! I'm equally interested to see what the problem is, although I'm not expecting any surprises.

1) In terms of testing, cargo test -- --nocapture --test-threads=1 should be used, since otherwise you get too much noise trying to access the same GPU from multiple tests.

2) Focusing on the webgpu_sscal case, where MAPPABLE_PRIMARY_BUFFERS is used, the main differentiating factor is the chosen adapter. Your platform has 2 GPUs: an integrated one and a discrete one. By default, wgpu currently picks the integrated one, while Vulkano picks the discrete one. That's your main cause of slowdown. Changing the power preference to wgpu::PowerPreference::HighPerformance produces equal numbers (I modified some logging):

cpu_sscal: 50428 micros
vulkano_sscal: 2351 micros
webgpu_sscal: 2289 micros

3) I also see that the code still heap-allocates the results vector in the wgpu path, which is a waste (and also does not match the vulkano test).

4) Another minor thing that makes a difference on Windows - initialization takes time. So a fair comparison would use BackendBit::VULKAN instead of BackendBit::PRIMARY, to avoid waiting for D3D12 libraries to link.

5) The order of tests matters. It appears that later tests tend to be slower, possibly because the driver needs to do some maintenance after the previous tests. For example, changing the name of tests from "webgpu_xxx" to "awebgpu_xxx" makes it show results on par with or better than vulkano.

kvark on 26 Aug 2020

Think I've got something a bit weird going on here looking at your results.
Running the webgpu_sscal test in my case, even with wgpu::PowerPreference::HighPerformance, still runs very slowly. I specifically ran it using cargo test webgpu_sscal --release -- --nocapture in case issues like the one you mention in point 5 were affecting it (still getting times like 4104752 micros).

> I also see that the code still heap-allocates the results vector in the wgpu path, which is a waste (and also does not match the vulkano test).

I'm not sure what I should change to fix this?

I'd really appreciate it if you could take a look at this, or send your version of the project where you got the times you posted. Once again, apologies for being a little spammy, and thank you so much for helping with this stuff; it's hard to emphasize this enough.

Project: webgpu_vs_vulkano.zip

JonathanWoollett-Light on 26 Aug 2020

Thank you again for persisting (and not minding me closing the issue), this is turning out to be quite an adventure!

Here come a few more gotchas.

(1) If you are using MAPPABLE_PRIMARY_BUFFERS and you want to achieve the same code path as Vulkano, you generally don't want to use the create_buffer_init helper (which is meant for convenience and doesn't take into account the needs of specific extensions).
create_buffer_init (or BufferDescriptor::mapped_at_creation) is smart enough to know that it can avoid a temporary staging buffer if the destination can already be mapped for writing. But if your destination is STORAGE | MAP_READ, then it still creates a temporary buffer, and that becomes a waste (note, again, that the ability to do that is an extension, so the original mapping logic doesn't try to accommodate this case). So one thing I did with your code was to add MAP_WRITE to the buffers, so that I know there is not going to be an extra copy.
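The rule described here can be modelled in a few lines. This is only a toy model of the decision as explained in this comment, not wgpu's implementation: the bit values below are hypothetical and do not match wgpu's real usage flags, and needs_temporary_staging is an illustrative name.

```rust
// Hypothetical usage bits for illustration only; these are NOT wgpu's real values.
const MAP_READ: u32 = 1 << 0;
const MAP_WRITE: u32 = 1 << 1;
const STORAGE: u32 = 1 << 2;

/// Toy model of the rule above: initial data can be written in place only if
/// the destination buffer itself can be mapped for writing. Otherwise a
/// temporary staging buffer (and therefore an extra copy) is needed.
fn needs_temporary_staging(usage: u32) -> bool {
    (usage & MAP_WRITE) == 0
}
```

Under this model, STORAGE | MAP_READ needs the temporary buffer, while STORAGE | MAP_READ | MAP_WRITE does not, which is why adding MAP_WRITE removes the extra copy.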

(2) I also noticed that the measured time didn't include the full workload, but specifically only the submission part. The critical difference with Vulkano was that in the wgpu path you combined the encoder finalization with submission:

queue.submit(Some(encoder.finish()));

This was unfair, because finish() takes time, and in the Vulkano path you had the finalization outside of the timer. I fixed it by introducing this line prior to starting the timer:

let cmd_buf = encoder.finish();
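The same idea, keeping preparation out of the timed region, can be expressed as a small generic helper. This is a sketch using only the standard library; time_after_prepare is a hypothetical name, and the closures stand in for the real wgpu or Vulkano calls.

```rust
use std::time::Instant;

/// Run `prepare` outside the timed region, then time only `run`.
/// Returns `run`'s result and the elapsed time in microseconds.
fn time_after_prepare<T, R>(
    prepare: impl FnOnce() -> T,
    run: impl FnOnce(T) -> R,
) -> (R, u128) {
    // Not timed: in the wgpu path this would be `encoder.finish()`.
    let prepared = prepare();
    let start = Instant::now();
    // Timed: in the wgpu path this would be `queue.submit(Some(cmd_buf))`.
    let result = run(prepared);
    (result, start.elapsed().as_micros())
}
```

Using the same helper for both the wgpu and the Vulkano test makes it harder to accidentally time different stages in each.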

(3) Finally, I switched to readonly: true for the first buffer in the "saxpy" test, since it's not written by the shader.

> I'm not sure what I should change to fix this?

Just remove the let result line.

With these fixes (as well as the changes mentioned in the previous comments), and running with:

cargo test --release -- --nocapture --test-threads=1

I got the following numbers:

test tests::awebgpu_saxpy ... 1560 micros
test tests::awebgpu_sscal ... 1256 micros
test tests::cpu_saxpy ... 1116 micros
test tests::cpu_sscal ... 958 micros
test tests::vulkano_saxpy ... 5229 micros
test tests::vulkano_sscal ... 2022 micros

The fact that we can appear 2-3 times faster than Vulkano just shows how poor the testing methodology is. But it proves the point that wgpu can be made to run at least as fast. The modified code is here:
webgpu_vs_vulkano-fixed2.zip

kvark on 27 Aug 2020

I'm now getting good performance; all these changes seem to have worked. Thank you.

JonathanWoollett-Light on 28 Aug 2020
