Wgpu-rs: Compute shader performance gains missing from examples.

Created on 28 Apr 2020 · 6 Comments · Source: gfx-rs/wgpu-rs

There are a lot of use cases for GPU acceleration in general-purpose computing. The principles used to calculate fractals such as the Mandelbrot set transfer directly to areas such as physics simulations, neural network iterations, video compression, etc.

One use case where I see wgpu showing a lot of promise is distributed computing. Folding@home produces incredible value for medical research, with people freely donating phenomenal amounts of GPU computing power to COVID-19 research efforts. Enabling these GPU-intensive general computing tasks in the browser opens some very interesting doors.

Demonstrating wgpu being used to make better use of the GPU through browsers would showcase its utility more comprehensively and empower devs to make better use of this tool.

enhancement

Ben-PH

Most helpful comment

Hi 👋 we're definitely interested in adding some more compute shader examples to showcase the full potential of compute.

So far we've mostly been porting across rendering examples from an older graphics API (https://github.com/gfx-rs/gfx/tree/pre-ll/examples) with a focus towards simple examples as references for people learning how to use wgpu.

PRs for more compute examples would be very welcome :) Are there any specific compute examples you had in mind?

grovesNL on 28 Apr 2020

👍2 ❤1

All 6 comments

@raphlinus is looking at high performance prefix sum/scan, and they were wondering if anything like this would be even implementable on WebGPU.
So it sounds like a good candidate to me - even if it's a basic implementation.

kvark on 28 Apr 2020
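
For readers unfamiliar with prefix sum/scan, here is a minimal CPU sketch in Rust of the plain three-pass, tiled formulation that a basic compute example could mirror. It is not the implementation discussed in this thread; the function name `blocked_inclusive_scan` and the tile width `WORKGROUP_SIZE` are made up for illustration. Each tile stands in for one workgroup that can share data locally.

```rust
// CPU reference for a workgroup-style inclusive prefix sum (scan).
// WORKGROUP_SIZE is a hypothetical tile width chosen for illustration only.
const WORKGROUP_SIZE: usize = 4;

/// Inclusive scan computed in three passes that mirror a basic GPU layout.
/// Additions wrap on overflow, as unsigned shader arithmetic would.
fn blocked_inclusive_scan(input: &[u32]) -> Vec<u32> {
    let mut out = input.to_vec();

    // Pass 1: each "workgroup" scans its own tile independently
    // and records the tile's total.
    let mut tile_totals = Vec::new();
    for tile in out.chunks_mut(WORKGROUP_SIZE) {
        for i in 1..tile.len() {
            tile[i] = tile[i].wrapping_add(tile[i - 1]);
        }
        // chunks_mut never yields an empty tile, so last() is always Some.
        tile_totals.push(*tile.last().unwrap());
    }

    // Pass 2: an exclusive scan over the tile totals gives each tile's offset.
    let mut offsets = Vec::with_capacity(tile_totals.len());
    let mut running = 0u32;
    for total in &tile_totals {
        offsets.push(running);
        running = running.wrapping_add(*total);
    }

    // Pass 3: every element adds the offset of the tile it belongs to.
    for (tile, offset) in out.chunks_mut(WORKGROUP_SIZE).zip(offsets) {
        for value in tile.iter_mut() {
            *value = value.wrapping_add(offset);
        }
    }
    out
}
```

On a GPU, pass 1 and pass 3 would each be one dispatch with one workgroup per tile, which is where sharing data inside a workgroup pays off. A CPU version like this is also useful as the reference that a GPU result gets checked against.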

My ash-based implementation is at https://github.com/linebender/piet-gpu/tree/prefix . There are several challenges to porting this to wgpu.

One, reasonably straightforward, is that this uses the Vulkan memory model. See this commit for a conversion to memory barriers. The missing piece is changing line 22 to read layout(set = 0, binding = 2) volatile buffer WorkBuf { (i.e. adding a volatile storage qualifier).

The far more difficult issue is dealing with subgroup size, which is assumed to be 32 in this code. I've successfully run it on Intel (using quite new drivers) using the VK_EXT_subgroup_size_control extension, but it won't give correct results otherwise. It's also not tuned for size 64 subgroups (AMD) though I think it would be reasonably straightforward to adapt.

I'll share a very rough draft / outline of a blog post, as it will probably help give some context.

raphlinus on 28 Apr 2020

👍1
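
To illustrate why a hard-coded subgroup size can produce wrong results, here is a small CPU simulation in Rust. It is only a sketch: `subgroup_inclusive_add` and `block_totals` are hypothetical names, and the code is a simplified stand-in for this class of bug, not the shader linked above. The simulated subgroup operation only sees lanes inside one hardware subgroup, while the calling code assumes a fixed width.

```rust
/// Simulates a subgroup-wide inclusive add: each lane only sees the lanes in
/// its own hardware subgroup, whose width is `hardware_size` (must be > 0).
fn subgroup_inclusive_add(values: &[u32], hardware_size: usize) -> Vec<u32> {
    let mut out = values.to_vec();
    for group in out.chunks_mut(hardware_size) {
        for i in 1..group.len() {
            group[i] = group[i].wrapping_add(group[i - 1]);
        }
    }
    out
}

/// Shader-style code that hard-codes `assumed_size` (must be > 0): it reads the
/// last lane of every `assumed_size`-wide block as that block's total.
/// The answer is only right when `assumed_size == hardware_size`.
fn block_totals(values: &[u32], assumed_size: usize, hardware_size: usize) -> Vec<u32> {
    let scanned = subgroup_inclusive_add(values, hardware_size);
    scanned
        .chunks(assumed_size)
        .map(|block| *block.last().unwrap())
        .collect()
}
```

With 64 inputs all equal to 1 and an assumed size of 32, the expected totals are 32 and 32. The simulation returns those only when the hardware size is also 32. A hardware size of 16 yields totals that are too small, and a size of 64 makes the second total absorb the first block.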

@grovesNL

If I thought myself able to do it, I'd be happy to do so myself. Not there yet, though.

For example purposes, I was thinking of something visual, even though compute shaders don't normally do that, in order to provide more _impact_ to the viewer. Don't worry, I describe a strategy to provide raw data afterwards. Essentially, re-create a mouse-interactive Julia set, but delegate to the GPU only the parts that are _strictly_ compute shader work:

struct Pixel {
    r: u8,
    g: u8,
    b: u8,
}

struct Image<'a> {
    // mutable slice, so the render functions can write pixels in place
    pixels: &'a mut [Pixel],
}

// Make two distinct pixel buffers. One for cpu, one for gpu
let cpu_image = Image::new(cpu_px_buf);
let gpu_image = Image::new(gpu_px_buf);

// common function to draw image to screen. Allows GPU to be used only 
// for mutating the Image struct
fn draw(img: &Image, canvas: /*... */) {/* ... */ }

// function to calculate pixels using the CPU only, then uses the common draw() fn
fn cpu_rend(cpu_img: &mut Image, real: f32, imaginary: f32) {
    // snipped - likely to look like:
    // cpu_img.pixels.iter_mut().enumerate().for_each(|(i, px)| px.calc_color(i, real, imaginary));
}

// function to calculate and render pixels using gpu compute 
// shader or equivalent
fn gpu_rend(gpu_image: &mut Image) {/*send gpu_image.pixels to the gpu to calculate, then use common draw() fn */ }

// snipped: make two canvases. Each to be drawn to screen by the cpu by 
// an Image struct
let cpu_canvas = /* snipped */;
let gpu_canvas = /* snipped */;

/* add events */
cpu_canvas.add_event_listener_with_callback("mousemove", /*uses cpu_rend */);
gpu_canvas.add_event_listener_with_callback("mousemove", /*uses gpu_rend */);

Optionally:

Add another canvas that does the drawing on mouse-move, but without any calculations, so as to emphasize how much of the perf difference can be attributed to just the GPU calculation.
On mouse-click inside a given canvas, start recording performance. A second click stops recording and prints out some stats (e.g. average time per calculation, average per draw, total, etc.).
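
A minimal sketch of how such stats might be collected, assuming a hypothetical `PerfRecorder` that is created on the first click and summarized on the second (none of these names come from wgpu). It uses `std::time::Instant`, which works natively; on wasm32-unknown-unknown in a browser, a timer such as `performance.now()` would have to take its place.

```rust
use std::time::{Duration, Instant};

/// Stats printed when recording stops.
#[derive(Debug, PartialEq)]
struct Summary {
    frames: usize,
    avg_calc: Duration,
    avg_draw: Duration,
    total: Duration,
}

/// Hypothetical recorder holding one calculation time and one draw time per frame.
#[derive(Default)]
struct PerfRecorder {
    calc_times: Vec<Duration>,
    draw_times: Vec<Duration>,
}

impl PerfRecorder {
    /// Stores timings that were measured elsewhere.
    fn push_frame(&mut self, calc: Duration, draw: Duration) {
        self.calc_times.push(calc);
        self.draw_times.push(draw);
    }

    /// Runs `calc` and then `draw`, timing each one separately.
    fn record_frame(&mut self, calc: impl FnOnce(), draw: impl FnOnce()) {
        let start = Instant::now();
        calc();
        let after_calc = Instant::now();
        draw();
        let after_draw = Instant::now();
        self.push_frame(after_calc - start, after_draw - after_calc);
    }

    /// Returns None when nothing has been recorded yet.
    fn summary(&self) -> Option<Summary> {
        let frames = self.calc_times.len();
        if frames == 0 {
            return None;
        }
        let calc_total: Duration = self.calc_times.iter().sum();
        let draw_total: Duration = self.draw_times.iter().sum();
        Some(Summary {
            frames,
            avg_calc: calc_total / frames as u32,
            avg_draw: draw_total / frames as u32,
            total: calc_total + draw_total,
        })
    }
}
```

One caveat for the GPU canvas: GPU work is submitted asynchronously, so the `calc` closure would need to wait for the results to be read back. Otherwise the timing only covers the CPU-side submission.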

Intended key take-away for user:

Moving the mouse over the cpu_canvas will demo a reference - about 1fps, depending
Moving the mouse over gpu_canvas will demo performance gains (should be significantly better)

Ben-PH on 28 Apr 2020
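
The per-pixel calculation that the outline above leaves snipped could look roughly like the following escape-time sketch. Everything here is an assumption for illustration: the image size, the iteration cap, the grayscale coloring, and `calc_color` being a free function rather than a method on `Pixel`.

```rust
/// Mirrors the `Pixel` struct from the outline above.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Pixel {
    r: u8,
    g: u8,
    b: u8,
}

// Hypothetical image size and iteration cap, chosen for illustration.
const WIDTH: usize = 64;
const HEIGHT: usize = 64;
const MAX_ITER: u32 = 255; // fits in a u8 so it can double as a gray level

/// Escape-time iteration count for z -> z^2 + c, starting at z = (zx, zy).
fn julia_iterations(mut zx: f32, mut zy: f32, cx: f32, cy: f32) -> u32 {
    let mut i = 0;
    while i < MAX_ITER && zx * zx + zy * zy <= 4.0 {
        let next_zx = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = next_zx;
        i += 1;
    }
    i
}

/// Colors the pixel at flat index `i`; `real` and `imaginary` form the
/// Julia constant c (for example, derived from the mouse position).
fn calc_color(i: usize, real: f32, imaginary: f32) -> Pixel {
    // Map pixel coordinates onto the square [-2, 2) x [-2, 2).
    let x = (i % WIDTH) as f32 / WIDTH as f32 * 4.0 - 2.0;
    let y = (i / WIDTH) as f32 / HEIGHT as f32 * 4.0 - 2.0;
    let shade = julia_iterations(x, y, real, imaginary) as u8;
    Pixel { r: shade, g: shade, b: shade }
}

/// CPU reference: every pixel is computed independently of the others.
fn cpu_render(pixels: &mut [Pixel], real: f32, imaginary: f32) {
    for (i, px) in pixels.iter_mut().enumerate() {
        *px = calc_color(i, real, imaginary);
    }
}
```

Because no pixel depends on any other, the loop body maps directly onto one GPU invocation per pixel. That same independence is why there is little data to share within a workgroup for this workload.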

Yeah we could probably add some kind of GPU vs. CPU performance comparison if it would be useful for people.

What would be the benefit of calculating the Julia set per-fragment in a compute shader vs. a fragment shader (e.g. https://www.shadertoy.com/view/4dfGRn)? Compute shaders might have an advantage over fragment shaders when it's possible to share some data per workgroup (like the prefix sum implementation mentioned above), but I'm not sure we can share much for the Julia set.

grovesNL on 28 Apr 2020

Julia is just the one that came to mind. It's a popular demo project, needing to make it performant is what got me into shader programming in the first place, and it would probably make a strong first impression both qualitatively and quantitatively.

There's really no benefit to calculating with a compute shader over rendering with a fragment shader, other than to visually and interactively show off what a compute shader can do. I would argue that this is worth discussing in an example use case, as it will allow someone to go from a fragment-shader implementation to a compute shader much more smoothly. Once they've done that, they're on the road to general-purpose GPU programming.

Ben-PH on 28 Apr 2020
