This is a tracking issue for the unstable `core_simd` feature in the standard library. SIMD support is quite a thorny topic, but it's largely covered in https://github.com/rust-lang/rfcs/pull/1199, being implemented in https://github.com/rust-lang/rust/pull/27169, and @huonw will be creating an external crate for full-fledged SIMD support.
cc @huonw
Note that #26403 may well be a blocker issue for 100% safe + composable SIMD
This issue now tracks the `simd`, `simd_ffi`, and `repr_simd` features.
Status update:

In the compiler:

- `cfg(target_feature = "...")` detection is subpar: e.g. it doesn't detect features when an explicit `-C target-cpu="..."` is set, it doesn't handle disabling features like `-C target-feature="-sse2"`, nor does it handle (AFAICT) custom target specs.

In https://github.com/huonw/simd:

- the autogenerator (…etc.) should be upgraded to handle emitting the actual wrappers as well as the raw `extern` blocks that it can currently do.

I'm intending to work on the `simd` crate first, starting with the rewrite of the autogenerator, but I've currently got a thesis to work on.
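For context on the detection problem, here is a minimal sketch of what `cfg(target_feature)` gives user code (the function name is made up for illustration); the bug reports above are about cases where the compiler computes these flags incorrectly:

```rust
// cfg!(target_feature = "...") is resolved at compile time from the
// target-feature set. Per the status update above, this can be wrong when
// features come implicitly from `-C target-cpu=...` or are disabled with
// `-C target-feature=-sse2`.
fn have_sse2() -> bool {
    cfg!(target_feature = "sse2")
}

fn main() {
    println!("sse2 enabled at compile time: {}", have_sse2());
}
```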
@huonw How's the thesis going? :) Any progress on this issue, to relay to those interested in SIMD stabilization?
@BurntSushi, @nikomatsakis, and I talked about this recently at the work week, and our thoughts are:
These may not cover everything needed by the `simd` crate, but those are at least the bare bones for moving forward:

```rust
#[simd_intrinsic = "name"]
pub fn foo(a: A, b: B) -> C {
    foo(a, b)
}
```

That is, we'd probably stabilize a new `#[simd_intrinsic]` (ish) attribute whose name would closely follow the standard naming conventions (e.g. those in C compilers). The function would look like normal Rust and look like it recurses into itself, but the compiler would understand that _direct calls to the function_ are actually implemented inline, so this isn't actually infinite recursion.

A generic intrinsic like `simd_lt` can have any number of SIMD types instantiated, but a pairing like `String` and `Vec<u8>` would be nonsensical. This may be difficult to solve in a "pure" fashion but may be worth stomaching to stabilize the SIMD intrinsics (which in general should never be called in favor of the `simd` crate itself).

For `#[repr(simd)]`, we may want to remove support for tagging a generic structure. This does not appear to be used in the `simd` crate today and may not be necessary at all, in which case it's probably just complications that we don't want to have to think about today.

All of this was discussed hopefully with an eye to start the process of stabilization soon-ish, and then we can all get SIMD on stable Rust!
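For reference, this is the kind of "tagging a generic structure" in question; a minimal sketch as accepted on nightlies of that era (`feature(repr_simd)` assumed, type name made up):

```rust
#![feature(repr_simd)]

// A generic #[repr(simd)] struct: whether a given instantiation is a valid
// SIMD vector can only be judged per-monomorphization, which is the
// complication being discussed.
#[repr(simd)]
struct Vector4<T>(T, T, T, T);

type F32x4 = Vector4<f32>; // fine: f32 is a valid lane type
// type Bad = Vector4<String>; // nonsensical as a SIMD vector, rejected late
```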
cc @eddyb, you likely have many opinions as well!
@alexcrichton Ahh, I ignored the multiple-definition option in my recent comment. I think it's a great solution for integer and floating-point intrinsics, but I didn't consider stabilization of any intrinsic to be possible, hence why I tried to only think of options where `libcore` hosts all intrinsics.
I am still wary about stabilizing intrinsics, but `#[simd_intrinsic]` seems focused in scope, so I can see how that works. Although, would it be restricted to things that are _definitely_ about SIMD? There are various platform intrinsics that don't do anything with vectors, such as `prefetch`.
Other than that, this seems like a good move forward, without the complexities I was worried about.
@eddyb hm yeah, I'm not sure if `#[simd_intrinsic]` is the best name, certainly up for debate! I would figure that all intrinsics would be defined through a similar mechanism, but the SIMD intrinsics would be namespaced somehow so they're the only ones that we stabilize. I wouldn't want to stabilize, for example, intrinsics like `prefetch` (for now).
There are other useful intrinsics like `crc32` that are explicitly part of SSE 4.2 but aren't necessarily SIMD.
Oh interesting! I'd be ok punting on those for now in favor of just dealing with the SIMD pieces, but we can relatively easily reevaluate to do something different though.
So I had a really interesting conversation with @sunfishcode on the topic of SIMD, and in particular the design of SIMD in WASM. The high-level summary was two points, along with some other interesting observations he raised:
1. SIMD has many audiences with diverse needs, and you can't necessarily accommodate them all very well with just one API:
- codec authors want the raw intrinsics because they use them in clever and unexpected ways;
- HPC people want higher-level abstractions but don't need access to every trick in the book;
   - people with high-performance demands also want the raw intrinsics, because they don't mind investing the time to reshape the algorithm for each platform.
2. One way to support these diverse needs, which has been considered for WASM, is to offer the "union" of features across platforms, but offer a way to query which features are "fast" (the idea is that the "slow" features will be emulated). In Rust I would expect we may want similar things, though perhaps the "slow" paths would just trap? (It's probably a bug if you actually wind up executing one of them.)
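For a concrete flavor of such a query, here is a hedged sketch using the runtime detection x86 eventually got on stable (`is_x86_feature_detected!`); the "is this feature fast?" query discussed for WASM would be analogous:

```rust
// Runtime dispatch between a "fast" vector path and a portable fallback.
fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // an AVX2-accelerated implementation would be dispatched here
        }
    }
    // Portable fallback (or the "slow"/emulated path in the WASM design).
    xs.iter().sum()
}
```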
On the topic of intrinsics, I feel overall pretty good about some kind of attribute that can be applied to a `fn` to indicate that the compiler should compile it via pure instructions. Such functions would have to have appropriate argument/return types (roughly like today). If the argument/return types are not generic, this seems very harmless to me, as we can check it purely at the definition site (as @alexcrichton noted).
However, I feel mildly less good about the generic versions, since these cannot be checked until trans time, which means we have to face two annoying choices.
However, it does seem that there is a third way out: we could remove all support for generic intrinsics, and instead have people define their own traits that map to these operations. For example, today the `simd` crate does something roughly like this:
```rust
#[simd_intrinsic(...)]
fn simd_eq<T, U>(t: T, u: T) -> U;

unsafe trait Simd {
    type EqType;
}

fn generic_eq<T: Simd>(t: T, u: T) -> T::EqType {
    simd_eq(t, u)
}

unsafe impl Simd for u32x4 { ... } // etc.
```
It seems like we could instead do:
```rust
trait Simd { // no longer an unsafe trait
    type EqType;

    // we now include a method for the various simd operations we might want to do:
    fn eq(x: &Self, y: &Self) -> Self::EqType;
    ...
}

#[simd_intrinsic]
fn eq_u32x4(x: u32x4, y: u32x4) -> boolx4 { ... }

impl Simd for u32x4 {
    #[inline(always)]
    fn eq(x: &Self, y: &Self) -> Self::EqType {
        eq_u32x4(*x, *y)
    }
}
```
I'm probably getting some of the details wrong (have to consult the crate for the precise names involved) but hopefully you get the idea. Basically, the compiler only supports monotype intrinsics, and the wrapper crate adds (using normal trait methods) any generic dispatch needed.
> The function would look like normal Rust and look like it recurses into itself, but the compiler would understand that direct calls to the function are actually implemented inline, so this isn't actually infinite recursion.

Is there a good reason for making the function recurse into itself? It seems like unnecessary repetition to me. Would a macro like `intrinsic!()`, similar to `unreachable!()`, be possible?
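To make the suggestion concrete, a sketch of the shape being proposed; `intrinsic!()` is hypothetical and no compiler support for it exists, so here it is stubbed as a diverging macro:

```rust
// Hypothetical marker macro: the compiler would replace any body that is
// just `intrinsic!()`, the same way `unreachable!()` types as `!`.
macro_rules! intrinsic {
    () => {
        unreachable!("this body would be replaced by the compiler")
    };
}

#[allow(dead_code)]
fn simd_add_sketch(_a: u32, _b: u32) -> u32 {
    intrinsic!()
}
```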
> - codec authors want the raw intrinsics because they use them in clever and unexpected ways;
> - HPC people want higher-level abstractions but don't need access to every trick in the book;
> - people with high-performance demands also want the raw intrinsics, because they don't mind investing the time to reshape the algorithm for each platform.
I agree. This is one of the papercuts of the current state: most of the platform-specific intrinsics are there with their usual names, except for a few basic arithmetic operations, which are `simd_add` and such. I think it would be better to expose all of the raw platform intrinsics and build a higher-level cross-platform `simd_add` on top of that with `#[cfg(target_feature)]`. A crate like `simd` could build on top of that by providing fallbacks (e.g. two SSE adds if AVX is not available). It wouldn't be generic, but does it need to be? I can't think of a `#[repr(simd)]` type that is not just an n-tuple of the scalar type. And for the low-level intrinsics the types have little meaning anyway (e.g. `_mm256_cmp_ps` returns a vector of floats, but actually they are bitmasks).
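A minimal sketch of that layering, assuming the raw x86 intrinsics are exposed under their usual names (as `core::arch` later did on stable); the scalar branch stands in for the fallback a crate like `simd` would provide:

```rust
// Cross-platform `simd_add` sketch: compile-time dispatch to a raw SSE
// intrinsic when available, with a scalar stand-in otherwise.
#[cfg(all(target_arch = "x86_64", target_feature = "sse"))]
fn simd_add(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};
    let mut out = [0.0f32; 4];
    // Safe to call because the cfg above guarantees SSE is enabled.
    unsafe {
        let v = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        _mm_storeu_ps(out.as_mut_ptr(), v);
    }
    out
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "sse")))]
fn simd_add(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // Scalar fallback for targets without the vector extension.
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}
```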
> Is there a good reason for making the function recurse into itself?
Maybe it's contrived, but casting the function to a function pointer would naturally give you a pointer to a function which contains the intrinsic operation.
> except for a few basic arithmetic operations, which are `simd_add` and such

There's a very good reason for keeping those that way: they're basic LLVM operations (i.e. `simd_add` is just the integer/float `add` you get for `+` but with vector arguments) and LLVM can optimize them, _unlike_ arbitrary intrinsics, which are function calls and get lowered in target codegen.
Can anyone provide an overview of the status of this? I was talking with someone whose GitHub name I don't know on IRC, and there was some indication that no one is handling further development of this feature. I have enough experience with X86 SIMD that I could probably help.
I like @nikomatsakis's approach, except that sometimes you need to be able to treat `f32x4` as `i32x4` or similar, on at least x86. This is because some of the shuffles aren't implemented for `f32`. If the compiler provides intrinsics for all possible vector types for this case, then it should be fine.
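What "treating `f32x4` as `i32x4`" amounts to is a lane-preserving bitcast; a sketch with made-up type names, as written on nightlies of that era with `feature(repr_simd)`:

```rust
#![feature(repr_simd)]
#![allow(non_camel_case_types)]

#[repr(simd)]
#[derive(Clone, Copy)]
struct f32x4(f32, f32, f32, f32);

#[repr(simd)]
#[derive(Clone, Copy)]
struct i32x4(i32, i32, i32, i32);

// Both types are 128 bits wide, so this is a plain bitcast; it lets an
// integer-only shuffle operate on float data.
fn as_i32x4(v: f32x4) -> i32x4 {
    unsafe { core::mem::transmute(v) }
}
```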
One other possibility that comes to mind now that we're close to it is to finish type-level integers, then make generic intrinsics with declarations like this:
```rust
fn simd_mul<T>(v1: T, v2: T) -> T
    where std::mem::size_of::<T>(): platform_simd_size,
          std::mem::align_of::<T>(): platform_simd_align
{
    // magic code
}
```
This of course depends on how close we are to having type-level integers, but it should be checkable well before trans in any sane implementation of type-level integers I can think of. Just a thought.
> This is because some of the shuffles aren't implemented for `f32`.
LLVM shuffles don't care what the element types are, and neither do the Rust intrinsics exposing them.
@eddyb People were talking about exposing the platform intrinsics explicitly, which was my point here.
If you drop the cross-platform shuffles in favor of putting it all in a crate and also drop the weird semi-generic nature of the original RFC, this does indeed become a problem.
@camlorn afaik, nobody is carrying this forward, but I would very much like to see progress! I still basically stand by my previous comment, though I think @eddyb suggested (perhaps on IRC) the idea of applying the special attribute directly to the method in the impl, and that seems even better (perhaps just making it a lang item -- it would mean though that this lang item can be applied multiple times).
I have no objection to exposing the platform intrinsics explicitly, but it also doesn't seem like a _required_ ingredient. It'd be great to make progress on the wrapper library, and adding in platform-specific names feels orthogonal to me. (Right? This is a bit out of cache.)
I'm not exactly sure what the best next step is. Perhaps a new RFC is warranted, just to lay out the plan clearly? At minimum some kind of canonical write-up feels appropriate. Hopefully the changes vis-a-vis today are relatively minimal.
@nikomatsakis I like the idea of cross-platform intrinsics a great deal, and tbh I need to read the whole thread before I'm at full understanding.
It seems to me that you could provide only the platform-specific intrinsics, get the optimizer doing a good job with eliminating temporary moves, get type-level integers, and then add an `#[inline(force)]` that libs can use to make the code efficient.
As I understand it, we almost have type-level integers. And @pcwalton is working on the needed optimizer stuff.
But that said, I have no problem with the original RFC. I started at the bottom of this thread and read up, however, and it seems to me that people are no longer convinced that this is a good way. Perhaps this impression changes once I read the whole thing.
@BurntSushi I knew I saw something somewhere! See https://github.com/rust-lang/rust/issues/27731#issuecomment-226792671 above.
Hate to just jump in out of the blue, but since there hasn't been an update in a while, is there any news on getting simd support in?
@jonathandturner No big updates, but @BurntSushi continues to plug away at it. If he follows his typical pattern, one morning he'll show up, open a massive, beautiful PR, and we'll be totally set :-)
FYI this issue is a bit of a dumping ground for SIMD features that we don't know what to do with yet:

- `simd_ffi`, which is being discussed here: https://github.com/rust-lang/rust/issues/53346
- `repr(simd)`, which we currently never plan to stabilize, but instead expose through a std library API like the one proposed in https://github.com/rust-lang/rfcs/pull/2366
- `stdsimd`, which tracks the `core::arch` APIs for platforms that are not available on stable yet (ARM, NVPTX, WASM, etc.)

I don't think it is worth it to clean up this issue. As parts of the above get proposed for stabilization they will get their own tracking issues. The only thing that might be worth doing is splitting `repr(simd)` into its own feature and making a clear statement that it is not planned for stabilization.
@gnzlbg want to file some follow-up tracking issues and re-point unstable features to the focused tracking issues? Agreed that this tracking issue isn't really serving much purpose nowadays!
Did anything significant change between 1.37.0 and current nightly concerning this feature? I have been playing with using `repr(simd)` in cgmath and find that it works as expected building with 1.37.0, but when switching to nightly-2019-09-04 without changing anything else, `repr(simd)` appears to be completely ignored. I see the difference in the LLVM code, which is "align 16" with 1.37.0 but "align 4" in nightly.
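A minimal sketch for observing the reported difference, assuming a nightly toolchain with `feature(repr_simd)` (the vector type is a stand-in, not cgmath's actual type):

```rust
#![feature(repr_simd)]

#[repr(simd)]
struct F32x4(f32, f32, f32, f32);

fn main() {
    // 1.37.0 reportedly prints 16 here; the nightlies in question print 4,
    // matching the "align 16" vs "align 4" difference in the emitted LLVM IR.
    println!("{}", core::mem::align_of::<F32x4>());
}
```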
There was an LLVM update.
That might well be the reason. Are there any CI builds that would detect this kind of issue? Has anyone else observed problems with repr(simd) in nightly? Or can anyone confirm that it still works for them? I see the problem at least back to 2019-08-01, but I have no idea whether I am doing anything special that might trigger the problem.
Just confirming for the sake of documentation: we don't actually publicly (and stably?) expose any `repr(simd)` types, right?
False, the SIMD types that we currently have stable are all `repr(simd)`.
Which are those? All the references to simd types on stable are dead links or closed RFCs, and nothing shows up in std's rustdoc instance.
The structs here, but not cpuid; just the `__mFOO` types.
> Just confirming for the sake of documentation: we don't actually publicly (and stably?) expose any repr(simd) types, right?
@Gankra We don't really document what `repr(simd)` means, nor document it as a public property of those types. We just provide these types, and they happen to be `repr(simd)` as a private implementation detail.
It has calling-convention implications in the sysv x64 ABI (it gives the type the SSE/SSEUP class), so it's kinda important for FFI. Not hugely important, but it feels in scope for the new FFI section I'm writing for the rustonomicon. (E.g. gcc's __float128 is treated equivalently to `repr(simd)`, and I believe can be bridged to our __m128 type.)
It's plausible that we don't want to admit this is true, but by default I'm inclined to document it.
@Gankra notice that these types are not allowed in FFI - their usage there errors on declarations, and there is a bug in the checker that allows them on definitions, but that bug should be closed.
Ah, excellent!

```
   Compiling playground v0.0.1 (/playground)
error: use of SIMD type `Test` in FFI is highly experimental and may result in invalid code
 --> src/main.rs:9:19
  |
9 |     fn hello(val: Test);
  |                   ^^^^
  |
  = help: add `#![feature(simd_ffi)]` to the crate attributes to enable
```
Ok then I think it's ok to pretend repr-simd doesn't exist for the purposes of ffi/abi docs.
There is a merged RFC that allows SIMD types in FFI, but only when doing so is "ok" w.r.t. the ABI: https://github.com/rust-lang/rust/issues/63068 There is currently no implementation of this RFC, but according to it, e.g., `__m128` and `__m256` are only usable in FFI when the corresponding features are available, and they are guaranteed to map in FFI to a 128-bit and 256-bit wide vector, respectively.
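For illustration, a nightly sketch of what such FFI looks like under the current `simd_ffi` gate; per the RFC this is only "ok" when SSE is enabled, so that `__m128` is guaranteed to be a 128-bit vector on the C side too (the extern function is made up):

```rust
#![feature(simd_ffi)]

use core::arch::x86_64::__m128;

extern "C" {
    // Hypothetical C function taking/returning a 128-bit vector.
    // Without feature(simd_ffi), this declaration is rejected as shown above.
    fn process(v: __m128) -> __m128;
}
```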
What are blockers for stabilizing intrinsics for other targets? In RustCrypto we are particularly interested in using AArch64 crypto extension.
The Rust Embedded Working Group is trying to push some foundational crates to 1.0 in 2020. Having `core::asm::nop` and some sort of `core::asm::memory_barrier` would really help with that for both Cortex-M and MSP430.
@newpavlov for the `crypto` extension, a pre-RFC on internals would be the right place to start, since there is some "design" work to do there (e.g. whether to add an overarching `crypto` feature, or just one feature each for crc32, sha, etc.).
@thejpster pretty much the same, a pre-RFC on internals. I'm not sure why `core::asm::nop` is necessary, nor what semantics it would have in the abstract machine (e.g. `fn nop() {}`?), and for `memory_barrier` you probably want to argue why `compiler_barrier`, `atomic::fence`, etc. aren't enough. Also, for Cortex-M some of these operations are already available in `core::arch::arm`, but none of them are on the path towards stabilization due to other design issues with the API of the ACLE intrinsics.
@gnzlbg a NOP is required, for example, on the TI TM4C123 when initialising peripherals, as it is a hard fault to access the peripheral within three clock cycles of it being enabled. We have a workaround involving linking with pre-compiled assembly, but it's ugly and introduces an extra jump. I'd suggest WFI, WFE and SEV operations are equally important on Thumb ARM targets. Perhaps also SVC. I wouldn't want to see these held up by discussions on the more exotic stuff as these opcodes are used by all Thumb ARM projects.
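As a sketch of the inline-assembly route (shown with the `asm!` syntax that later stabilized as `core::arch::asm!`; at the time of this discussion it was nightly-only):

```rust
use core::arch::asm;

// Emit a literal NOP that the optimizer will not delete; three of these
// cover the "no peripheral access within three clock cycles" requirement.
#[inline(always)]
pub fn nop() {
    unsafe { asm!("nop", options(nomem, nostack, preserves_flags)) }
}
```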
The request for a barrier (https://docs.rs/msp430/0.2.0/msp430/asm/fn.barrier.html) came from the MSP430 team. I don't know why the existing core barrier functions aren't suitable. Paging @cr1901.
The barrier seems to just be a compiler fence. You should be able to do it on stable with `compiler_fence(SeqCst)`.
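That is, on stable (a minimal sketch):

```rust
use core::sync::atomic::{compiler_fence, Ordering};

fn barrier() {
    // Prevents the compiler from reordering memory accesses across this
    // point; it emits no machine instruction by itself.
    compiler_fence(Ordering::SeqCst);
}
```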
I have also had need of a 2 cycle stall before. Write some MMIO and then wait for the DMA to kick in.
> @gnzlbg a NOP is required, for example, on the TI TM4C123 when initialising peripherals, as it is a hard fault to access the peripheral within three clock cycles of it being enabled.
Presumably this should be named `volatile_nop`?
Would it be more expensive to use volatile loads/stores from/to a stack variable? (Presumably LLVM can't optimize those away, or at least I hope it can't.)
@eddyb `volatile` operations are not re-ordered across other `volatile` operations (but other operations can be). IIUC the use case, @thejpster wants absolutely nothing to be reordered across this `nop`. This particular case is probably better suited to inline assembly than to a specific compiler intrinsic (EDIT: for the barrier, the already-stable `compiler_fence` is probably what they want).
"Absolutely nothing reordered" is unimplementable, but IIUC that is not required: only access to peripherals needs to wait three cycles, and those accesses are presumably volatile operations (MMIO).
Yeah, I've fudged it with:

```rust
// `volatile_mmio_write_to_DMA_activation` is the MMIO write in question.
volatile_mmio_write_to_DMA_activation();
let x = 0_i32;
unsafe {
    (&x as *const i32).read_volatile();
    (&x as *const i32).read_volatile();
}
```
@Lokathor did that work? I imagine that the compiler will emit two asm instructions for the two loads after the call to `volatile_mmio_write_to_DMA_activation()`, but you are not guaranteed that those two instructions will take two cycles in total to complete: e.g., since there are no data dependencies between them, the CPU can do some instruction-level parallelism for loads, and both could complete in a single cycle.
The CPU of the project does not have ILP, it's ancient. I said it was fudging it :3
@thejpster Apologies for taking so long to get back to you.
I don't think I need `core::asm::nop`, as I was using it in place of a barrier. `core::asm::volatile_nop` may be valuable. `core::sync::atomic::compiler_fence` is currently broken on msp430, but will _probably_ work fine as a barrier once it works.
Looks like I made offhand comments about how putting `nop` and `barrier` into `core::asm` would remove all the remaining assembly in msp430 crates. This is one of the last standing issues left for me to attempt getting msp430 into stable.
I did some testing tonight; I've been only using `nop` for the purposes of barriers, it seems. I think your use case of a `volatile_nop` to force a peripheral delay may be valid for msp430 as well, but I don't have any examples offhand. _It appears that `barrier` can be used in any place where I was using a generic `nop`._ They _are_ in fact nearly the same code :).
As @Amanieu mentions:
> The barrier seems to just be a compiler fence. You should be able to do it on stable with compiler_fence(SeqCst).
Unfortunately, the LLVM backend for MSP430 can't handle `compiler_fence` for some reason, while it _can_ handle the `barrier` we defined in the `msp430` crate. Consider the following example:
```rust
#![no_main]
#![no_std]

extern crate panic_msp430; // For now, we only have an infinitely-looping panic handler.

use core::sync::atomic::{compiler_fence, Ordering};
use msp430_rt::entry;

// Bring interrupt vectors into scope so the linker can see them; enabling the "rt"
// feature of msp430g2553 transitively enables the "device" feature of msp430-rt.
// This prevents default interrupt vectors from being generated.
#[allow(unused)]
use msp430g2211;

#[entry]
fn main() -> ! {
    compiler_fence(Ordering::SeqCst);

    loop {
        // Application begins here.
    }
}
```
This will die with a semi-recent nightly Rust:
```
William@DESKTOP-H0PMN4M MINGW64 ~/Projects/MSP430/msp430g2211-quickstart
$ xargo build --target=msp430-none-elf --release
   Compiling msp430g2211-quickstart v0.1.0 (C:\msys64\home\William\Projects\MSP430\msp430g2211-quickstart)
LLVM ERROR: Cannot select: 0x5052560: ch = AtomicFence 0x29a76d8, Constant:i16<7>, Constant:i16<0>
  0x5052490: i16 = Constant<7>
  0x50524f8: i16 = Constant<0>
In function: main
error: could not compile `msp430g2211-quickstart`.

To learn more, run the command again with --verbose.

William@DESKTOP-H0PMN4M MINGW64 ~/Projects/MSP430/msp430g2211-quickstart
$ rustc -V
rustc 1.42.0-nightly (760ce94c6 2020-01-04)
```
I don't normally handle the LLVM side of things, but I may have to this time. Once I know the issue, I'm guessing the stable `compiler_fence` will work fine.
Is there a smaller subset of intrinsics for non-x86 platforms that could be fast-tracked for stabilization? Like vector math stuff.
@jethrogb What platform do you need intrinsics for? The main issue at the moment is that non-x86 intrinsics are not fully implemented yet.
Myself, I would appreciate more vector math / NEON stuff being available on stable.
@Amanieu what do you mean by "not fully implemented"? I can imagine the following:

1. an intrinsic is missing entirely;
2. an intrinsic exists but is implemented incorrectly;
3. the set of intrinsics implemented for the platform is incomplete.

Reasons 1 and 2 are clearly blockers to stabilization of those specific functions. I don't think reason 3 should necessarily block stabilization of the functions that do exist. Reason 3 also applies to x86, which does have a lot of stable functions.
And yeah I was thinking mostly about ARM vector stuff.
From this comment it seems that only about 500 out of 4300 NEON intrinsics have actually been implemented. This is very incomplete and I would be against stabilizing the intrinsics in their current state.
On NEON, even basic functionality such as `vdupq_n_u32` and all of the memory store instructions (!) are missing.
We're at about 12% NEON coverage. I'm not aware that people are actively working on it, but some of the ARM devs said that they would like to get to it "soon".