On x86 hardware, for SIMD instructions, there are effectively two encoding formats: "legacy" and "VEX".
There are some minor differences between these two encodings, including the number of parameters they take and whether the memory operand has aligned or unaligned semantics.
As a brief summary:
* Legacy-encoded instructions generally take two operands (ins reg, reg/[mem]): the first serves as both a source and destination, and the last as a source which can generally be a register or memory address. The memory address form has "aligned" semantics and will cause an exception to be thrown if the address is not "naturally aligned" (generally 16-bytes).
* VEX-encoded instructions generally take three operands (ins reg, reg, reg/[mem]): the first being the destination, the second being a source, and the last a source which can generally be a register or memory address. The memory address form has "unaligned" semantics and will not cause an exception to be thrown, regardless of the input.

Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics). Given that a user will often want to generate the "most efficient" code possible, and that the JIT, in order to preserve semantics (not silently get rid of an exception that would otherwise be thrown), changes which method it will "fold" depending on the encoding it is currently emitting, it may be beneficial to expose an intrinsic which allows the user to specify "This address is aligned, do whatever load is most efficient". This would ensure that it can be folded regardless of the current encoding.
namespace System.Runtime.Intrinsics.X86
{
public abstract partial class Sse
{
// New APIs:
public Vector128<float> LoadVector128Unsafe(float* address);
// Existing APIs:
public Vector128<float> LoadVector128(float* address);
public Vector128<float> LoadAlignedVector128(float* address);
}
public abstract partial class Sse2
{
// New APIs:
public Vector128<double> LoadVector128Unsafe(double* address);
public Vector128<byte> LoadVector128Unsafe(byte* address);
public Vector128<sbyte> LoadVector128Unsafe(sbyte* address);
public Vector128<short> LoadVector128Unsafe(short* address);
public Vector128<ushort> LoadVector128Unsafe(ushort* address);
public Vector128<int> LoadVector128Unsafe(int* address);
public Vector128<uint> LoadVector128Unsafe(uint* address);
public Vector128<long> LoadVector128Unsafe(long* address);
public Vector128<ulong> LoadVector128Unsafe(ulong* address);
// Existing APIs:
public Vector128<double> LoadVector128(double* address);
public Vector128<byte> LoadVector128(byte* address);
public Vector128<sbyte> LoadVector128(sbyte* address);
public Vector128<short> LoadVector128(short* address);
public Vector128<ushort> LoadVector128(ushort* address);
public Vector128<int> LoadVector128(int* address);
public Vector128<uint> LoadVector128(uint* address);
public Vector128<long> LoadVector128(long* address);
public Vector128<ulong> LoadVector128(ulong* address);
public Vector128<double> LoadAlignedVector128(double* address);
public Vector128<byte> LoadAlignedVector128(byte* address);
public Vector128<sbyte> LoadAlignedVector128(sbyte* address);
public Vector128<short> LoadAlignedVector128(short* address);
public Vector128<ushort> LoadAlignedVector128(ushort* address);
public Vector128<int> LoadAlignedVector128(int* address);
public Vector128<uint> LoadAlignedVector128(uint* address);
public Vector128<long> LoadAlignedVector128(long* address);
public Vector128<ulong> LoadAlignedVector128(ulong* address);
}
}
The following tables show the semantics of each API under the legacy and VEX encodings.
Most instructions support having the last operand be either a register or a memory operand: ins reg, reg/[mem]
When folding does not happen the load is an explicit separate instruction:
For example, Sse.Add(input, Sse.Load(address)) would generate
* movups xmm1, [address]
* addps xmm0, xmm1
When folding does happen, the load is merged into the consuming instruction:
For example, Sse.Add(input, Sse.Load(address)) would generate
* addps xmm0, [address]
On Legacy hardware, the folded-form of the instructions assert that address is aligned (generally this means (address % 16) == 0).
On VEX hardware, the folded-form does no such validation and allows any input.
LoadVector128 (unaligned semantics):

|        | Unaligned Input | Aligned Input |
| ------ | --------------- | ------------- |
| Legacy | No Folding      | No Folding    |
| VEX    | Folds           | Folds         |
LoadAlignedVector128 (aligned semantics):

|        | Unaligned Input    | Aligned Input |
| ------ | ------------------ | ------------- |
| Legacy | Folds, Throws      | Folds         |
| VEX    | No Folding, Throws | No Folding    |
LoadVector128Unsafe (proposed):

|        | Unaligned Input        | Aligned Input |
| ------ | ---------------------- | ------------- |
| Legacy | Folds, Throws          | Folds         |
| VEX    | Folds, Throws in Debug | Folds         |
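As a concrete illustration of the LoadVector128Unsafe table, here is a hedged sketch of what a caller could expect from the proposed API (LoadVector128Unsafe does not exist yet; the asm in the comments mirrors the table above and is not guaranteed output; assumes the usual System.Runtime.Intrinsics usings):

```C#
static unsafe Vector128<float> AddFromAligned(Vector128<float> left, float* pRight)
{
    // Caller promises `pRight` is 16-byte aligned; under the proposal, the
    // claim is only validated when optimizations are disabled.
    return Sse.Add(left, Sse.LoadVector128Unsafe(pRight));
    // Legacy encoding (folds, hardware enforces alignment): addps  xmm0, [pRight]
    // VEX encoding    (folds, no hardware check):           vaddps xmm0, xmm0, [pRight]
}
```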
Some open-ended questions:

* Should these instructions have a Debug.Assert((address % 16) == 0)? This is more a nicety for people who are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0).
* Should we also expose Store, StoreUnaligned, and StoreAligned counterparts, or just keep Store and StoreAligned?
* Should we also expose equivalents on Avx/Avx2? It is worth noting that, for Avx/Avx2 instructions, they require the VEX encoding and therefore always have "unaligned" semantics, but Avx/Avx2 define and expose LoadAlignedVector256/StoreAligned instructions anyways.

CC. @eerhardt, @fiigii, @CarolEidt for thoughts
Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics).
I would rename the unaligned load to what it is, LoadUnaligned; then this proposed method can be called just Load, as the proposed name is quite long.
"This address is aligned, do whatever load is most efficient".
The name Load would match that.
Should we also expose StoreAlignedOrUnaligned methods?
If load is exposed, the same should be done for store. Hence rename Store to StoreUnaligned and use the name Store for the method analogous to this proposal.
Should we also expose equivalents on Avx/Avx2?
In my proposal for method-names this wouldn't be in symmetry with Avx/Avx2 anymore, so yes there should be the same methods -- even if they just forward to the unaligned methods.
I would rename the unaligned load to that what it is LoadUnaligned, then this proposed method can be called just Load,
I was thinking the same thing.
Updated to be LoadVector128, LoadAlignedVector128, and LoadUnalignedVector128.
Sorry for the delay to reply.
I don't quite understand this proposal. Currently, LoadAligned can be folded into the consumer instruction with SSE and VEX encoding, but Load (LoadUnaligned in this proposal) only can be folded with VEX encoding. This behavior also matches C++ compilers.
So, what is the difference between the proposed Load intrinsic and the existing LoadAligned?
Meanwhile, guaranteeing memory alignment should be the user's responsibility, and the dynamic assertion ("Asserts (address % 16) == 0 when optimizations are disabled") does not always work.
Additionally, on modern Intel CPUs, an unaligned load/store won't be slower than an aligned load/store (if the memory access hits in one cache line), so we suggest always using unaligned load/store. That is why the Load intrinsic maps to the unaligned load instruction.
I don't quite understand this proposal.
The JIT has to preserve the semantics of the underlying instruction. Currently LoadAligned will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned (presently named just Load) currently has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.
When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned.
When using the VEX encoding, the reverse is true, and the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.
This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we can emit: addps xmm0, [pRight]; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1
And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit: movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit addps xmm0, xmm0, [pRight]
We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).
So, what is the difference between the proposed Load intrinsic and the existing LoadAligned?
We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data).
This leads users to writing algorithms that do the following (where possible):
// Process first few elements to become aligned
// Process data as aligned
// Process trailing elements that don't fit in a single `VectorXXX<T>`
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).
However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Game", which runs on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:
public static Vector128<float> EfficientLoad(float* address)
{
if (Avx.IsSupported)
{
// Must assert, since this would fail for non-AVX otherwise
Debug.Assert((address % 16) == 0);
return Sse.LoadUnaligned(address);
}
else
{
return Sse.LoadAligned(address);
}
}
Which also depends on the JIT being able to successfully recognize and fold this code.
So, this proposal suggests providing this helper method built-in by renaming the existing Load method to LoadUnaligned and exposing a new method Load (I had originally suggested something like LoadAlignedOrUnaligned, but it was suggested I change it to the current name) which does effectively the above. That is, on hardware with VEX encoding support, it uses LoadUnaligned, and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned. This ensures that the code can always be folded efficiently and it validates that the semantics remain the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned and LoadAligned methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).
And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit: movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit addps xmm0, xmm0, [pRight]
We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).
Ah, I see the problem now. But we should fold LoadAligned with the VEX encoding, as it does not throw an exception.
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).
This assert is useless. address could be either aligned or unaligned in different runs.
as it does not throw an exception.
That is the problem and why we can't fold it. LoadAligned emits movaps which always throws if the address is unaligned. If you fold it into a VEX-encoded instruction, it will no longer throw. This means that if you passed in an unaligned address, the exception would silently disappear whenever the JIT folded it.
This assert is useless. address could be either aligned or unaligned in different runs.
The point of the assert is to help catch places where the user is calling Load without pinning the data and ensuring that it is properly aligned. It will still be possible for a user to do this incorrectly, but it would be an explicit opt-in.
We could potentially use another name, such as LoadUnsafe, or something that conveys that the data must be aligned but we won't always validate that it is.
I like this proposal, and agree that the assert, while not guaranteeing to catch all potential dynamic data configurations, is very likely to catch nearly all cases.
Given that there is a great deal of expertise required to use the hardware intrinsics effectively, I don't think it's necessary to encumber this with the Unsafe suffix. Rather, documenting its behavior and asserting in the unoptimized case should be sufficient.
Ah, looked at the manual again; yes, vmovaps can still throw exceptions with VEX-encoding, but other instructions won't. Thanks.
I don't think it's necessary to encumber this with the Unsafe suffix
Agree, actually the other Load* are not safe either 😄
and this is more a nicety for if people are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0)
Are there enough such people to justify adding more intrinsics?
Should we also expose equivalents on Avx/Avx2?
It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic. But that would imply adding more intrinsics. Or perhaps AVX Load should simply be renamed to LoadUnaligned, though the longer name sucks. Same for stores.
But to quote @CarolEidt "Given that there is a great deal of expertise required to use the hardware intrinsics effectively", I don't think such difference in semantics justifies adding more intrinsics.
Should these instructions have a Debug.Assert((address % 16) == 0)?
Where does this assert go exactly? Is Load a true intrinsic or a helper? Who's using debug builds of corelib, besides the CoreCLR team?
I don't think such difference in semantics justifies adding more intrinsics.
Agree, I suggest that we can fold more existing load intrinsics rather than adding more intrinsics:

* With the legacy (SSE) encoding, fold only LoadAligned, because folding LoadUnaligned would throw exceptions that the users did not expect.
* With the VEX encoding, fold LoadAligned and LoadUnaligned both. The individual LoadAligned may throw exceptions but folded ones won't, and I think this is fine. We assume that users "expect" exceptions from LoadAligned and handle them in their own code, and a folded LoadAligned just throws exceptions with 0% possibility.

The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
Are there enough such people to justify adding more intrinsics?
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic
Yes, if this API is approved, we would (at a minimum) rename the AVX intrinsic to be LoadUnaligned. I already have a note for the API review about whether we should have Store mirror these names.
Where does this assert go exactly? Is Load a true intrinsic or a helper?
A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
and folded LoadAligned just throws exceptions with 0% possibility
I don't think this is acceptable without an explicit user opt-in. The runtime has had a very strict policy on not silently removing side-effects.
The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
C/C++ compilers make a number of optimizations that we have, at this point, decided not to do and to instead address via Analyzers or look at in more depth later.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).
The runtime has had a very strict policy on not silently removing side-effects
It is not removing side-effects; it has no side-effects (hardware exceptions).
It is not removing side-effects; it has no side-effects (hardware exceptions).
All exceptions, including HardwareExceptions, are considered side-effects.
The movaps raises the hardware #GP exception if its input is not properly aligned, and the JIT intercepts that and converts it into a System.AccessViolationException.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
I didn't ask about cases where you tried to do that. I asked if there are enough people who are using .NET Core on old or gimped CPUs and expect things to work perfectly rather than just reasonably.
A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned. Whether it's a good idea to do that in debug builds is debatable. Some people may end up disabling optimizations to work around a JIT bug or to be able to get a correct stacktrace for a hard-to-track-down bug. Now they risk getting exceptions that would not occur with optimizations enabled.
The runtime has had a very strict policy on not silently removing side-effects
Please do not mislead people. The runtime does not impose any semantics on intrinsics, it's the other way around. LoadAligned may be defined as "throws an exception if the address is unaligned" or as "may throw an exception if the address is unaligned". That may or may not be a good idea. But it doesn't have anything to do with any runtime policy.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).
Indeed. Squabbles over names resulted in string intrinsic being removed from .NET 3. Yet adding non-deterministic intrinsics to support old CPUs is doable.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned.
I believe the result of this proposal is just forcing most people to change Load to LoadUnaligned in their programs.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
I want to say again: Debug.Assert((address % 16) == 0); is useless, or rather, relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two approaches:
for (; IsUnaligned(addr); addr++)
{
    // processing data until the memory gets aligned
    // (IsUnaligned is a hypothetical helper; see the sketch below)
}

for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadAlignedVector128(addr);
    // or LoadUnalignedVector128(addr), whatever; we recommend always using LoadUnalignedVector128
    // ...
}
Or
for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadUnalignedVector128(addr);
    // ...
}
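(IsUnaligned in the first loop is not an existing API; a minimal sketch of such a helper, assuming 16-byte Vector128 alignment:)

```C#
// Hypothetical helper: true while `addr` has not yet reached 16-byte alignment.
static unsafe bool IsUnaligned(int* addr) => ((nint)addr % 16) != 0;
```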
Which approach to use should be based on profiling, and that choice is much more important than folding loads for the SSE encoding.
I didn't ask about cases where you tried to do that
I wasn't saying that with regards to personal attempts.
Now they risk getting exceptions that with optimizations enabled would not occur
Correct, but the API is explicitly documented to do as such and they have the option to enforce the desired behavior by explicitly using one of the other two APIs (LoadAligned or LoadUnaligned).
By not having all three, users have to write their own wrappers for something that we could (and in my opinion should) reasonably handle.
Additionally, if we only expose two APIs and make one of them "sometimes validate alignment", then we can't provide a guarantee to the end user that they can enforce the semantics where required.
Please do not mislead people. The runtime does not impose any semantics on intrinsics
I don't think this is misleading at all. We have had a very clear contract and design for the platform-specific hardware intrinsics that each is tied to a very particular instruction.
LoadAligned is documented to emit movaps and LoadUnaligned (currently just called Load) will emit movups. This gives a very clear story and guarantees that, if someone uses this API, they will get exactly that instruction.
Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.
It is worth noting, however, that we have also tried to determine common patterns, use-cases, and limitations with the decision to tie a given hardware intrinsic to a particular instruction. When we have found such limitations, we have discussed what we could do about it (both the "ideal" and realistic scenarios) and determined an appropriate solution. So far, out of all the scenarios that have been brought up, we have determined to deviate only for CreateScalarUnsafe and ToVector256Unsafe (where we say that the upper bits will be non-deterministic, rather than explicitly zeroed; but for which we also expose "safe" versions that do explicitly zero the upper bits). This may be another case, that is more trivial to resolve via an additional API and that is worthwhile to do.
Squabbles over names resulted in string intrinsic being removed from .NET 3
They have not necessarily been removed from netcoreapp3.0; instead, we decided that we would pull them until we can determine the appropriate way to expose these APIs, as there were concerns over the usability and understandability of said APIs. The underlying instructions behave quite differently from most other intrinsics we've exposed and so they require some additional thought. This is no different from any other API that we decide to expose, in that we don't want to ship something that has a bad surface area, and we are still working towards making sure these APIs ship in a good and usable state.
Yet adding non-deterministic intrinsics to support old CPUs is doable.
As mentioned above, both cases where we exposed an additional non-deterministic API were discussed in length.
I want to say again: Debug.Assert((address % 16) == 0); is useless, or rather, relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two approaches
Once you have pinned the memory, it is guaranteed not to move, and you can rely on the assert. Anyone dealing with perf-sensitive scenarios and larger blocks of memory (for example, ML.NET) should be pinning their data anyway, as the GC could otherwise move the data, and that can mess with the cache, alignment, etc.
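For illustration, a minimal sketch of that pattern, assuming a float[] input (the method name and the elided loop bodies are hypothetical):

```C#
static unsafe void ProcessPinned(float[] data)
{
    fixed (float* pData = data) // pinned: the GC can no longer move (and re-misalign) the buffer
    {
        float* p = pData;

        // ... scalar prologue advances p until ((nint)p % 16) == 0 ...

        Debug.Assert(((nint)p % 16) == 0); // reliable from here on, because the memory is pinned

        // ... vectorized loop over the aligned middle portion ...
    }
}
```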
Which approach to use should be based on profiling, and it is much more important than folding loads.
The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" has multiple pages/sections all discussing this. The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
All this additional API does is give the user the option to choose between:
LoadUnaligned)LoadAligned)Load)A good number of vectorizable algorithms (especially the ones used in cases like ML.NET) can be handled with a worst-case scenario of two-unaligned loads maximum, regardless of the alignment of the input data. For such algorithms, the only question of whether to align or not really comes down to how much data is being processed in total and the cases where you don't want to make the data aligned can generally be special-cased.
The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
Always align data when possible
Generally prefer aligned stores over aligned loads, if you have to choose
Consider just using two 128-bit unaligned loads if you can't align your 256-bit data
etc
Yes, that is what I meant. And aligned loads != movap*.
I can't align my data (they use LoadUnaligned)
My data is aligned and should never be unaligned (they use LoadAligned)
My data is aligned, but you don't have to check (they use Load)
All these three situations are suggested to use LoadUnaligned.
Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.
For the fortieth and two time: the runtime has nothing to do with this. I give up. We're probably speaking different languages or something.
@mikedn, I think the issue is a runtime issue when we start talking about folding the loads into the operations, in ways that might cause a runtime failure to be masked (i.e. if the load would throw on an unaligned address, but the operation into which it is folded would not). And I think what @tannergooding is after here is the ability to have a single load semantic that supports folding.
All that said, I'm still not sure where I fall on this issue.
Importing @CarolEidt's comment from https://github.com/dotnet/coreclr/issues/21308:
I agree that we don't want to silently remove an exception. It is not just that it violates the "strict" behavior that developers expect from the CLR, but could also cause sudden unexpected failures if something changed in the user code (or the optimization settings used) that caused the JIT to no longer fold the instruction.
If users write LoadAligned, they should be responsible to handle the possible exceptions, whatever it is from JIT optimization or actual memory misalignment.
And aligned loads != movap*.
I would disagree here. We have two instructions that perform explicit loads (including movupd and movapd here):

* movups, which works for any data
* movaps, which only works for aligned data

We additionally have the ability to support load semantics in most instructions (ins reg, reg/[mem]).

I would say that the unaligned load could just be called a load, as it loads data with no special semantics (and it is still just a load even if the data happens to be aligned). I would say that an aligned load must also validate that the data is aligned (which only movaps does).
All these three situations are suggested to use LoadUnaligned.
LoadUnaligned does not successfully fulfill all three scenarios by itself. It definitely fills the first scenario, but the second scenario requires an additional validation step (and must fail if the requirements are not met), and the third scenario has an optional validation step (that would fail if the requirements are not met and optimizations are disabled; otherwise, when optimizations are enabled, it would skip the validation check).
I think the issue is a runtime issue when we start talking about folding the loads into the operations
@CarolEidt It most definitely isn't. As far as the runtime/ECMA spec is concerned, intrinsics are just method calls. There's nothing anywhere in the ECMA spec that says that LoadAligned must always throw an exception if the data is not aligned. That's solely in the intrinsic's spec courtyard.
Also, I don't claim that we should change the current LoadAligned semantic. I just want to see reasonable argumentation for adding a bunch of new intrinsics. And I'm not seeing it.
And aligned loads != movap*.
I would disagree here.
LoadUnaligned does not successfully fulfill all three scenarios by itself.
movups and movaps have the same performance, and movups is always safe. Why not use it?
The words "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issues via cache-line splits or page faults, rather than anything about movups itself.
movups and movaps have the same performance and movups is always safe. Why not use it
Because we can't always safely fold a movups. Some examples are being on older/legacy hardware, working around a JIT bug (so you have disabled VEX-encoding support), using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...
The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.
Yes, and the purpose of the proposed intrinsic is to:

1. Provide some debug-mode validation that you are not causing cache-line splits or page faults
2. Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding
It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults; and they will not make any attempt to align the input data. As such, you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?
using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...
It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults;
No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned).
Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.
- Provide some debug-mode validation that you are not causing cache-line splits or page faults
- Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding
If users do not guarantee the alignment, the compiler-generated validation (1) is not reliable, and that makes (2) unacceptable on older CPUs.
If users already guarantee the alignment, the compiler-generated validation (1) is useless; then either i) use LoadUnaligned normally, or ii) use LoadAligned if the code-size on older CPUs is critical (needs https://github.com/dotnet/coreclr/issues/21308)
you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?
Right.
the compiler-generated validation (1) is not reliable
Given the low-level nature of the code, that the intrinsic is explicitly documented to do the validation, and that the GC currently only guarantees 8-byte alignment, a JIT-inserted check for debug builds should catch essentially every case (especially across multiple runs).
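Conceptually, the proposed intrinsic behaves like the following sketch (this is not the actual JIT mechanism; Debug.Assert stands in for the proposed minopts-only check):

```C#
static unsafe Vector128<float> LoadVector128UnsafeSketch(float* address)
{
    // Debug/minopts: validate the caller's alignment claim.
    Debug.Assert(((nint)address % 16) == 0);

    // Optimized: use whichever load folds under the current encoding.
    return Avx.IsSupported
        ? Sse.LoadVector128(address)          // VEX: movups, foldable
        : Sse.LoadAlignedVector128(address);  // legacy: movaps, foldable
}
```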
(My original proposal was to call this something like LoadAlignedOrUnaligned to explicitly call attention to the semantics, but it was suggested I change that to the current name.)

using LoadAligned if the code-size on older CPUs is critical (needs dotnet/coreclr#21308)
I think that is a non-starter; not only would it silently remove the exception; but it removes the ability for users to guarantee the generation of a movaps (which will always validate the alignment). The current proposal allows users to continue having explicit aligned or unaligned semantics, if desired; but also gives them the ability to do the efficient thing.
No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned).
Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.
I believe that for "most" workloads, dev's will be concerned with cache-line splits and page faults, so they will attempt to align their data if the input is unaligned (these intrinsics are geared towards high-perf and hotspot scenarios, after all). They will want to additionally validate that they will not accidentally cause a cache-line split in production (their alignment logic is correct, they pinned the data, etc) and they will want to generate efficient code.
Therefore, it is my belief that they would write something like (this is using today's APIs and ignoring this proposal):
int misalignment = (int)((nint)pCurrent % 16);

// note: may need to handle unnaturally aligned data, such as if `float`
// (which should be `((nint)pCurrent % 4) == 0`) is something else.

if (misalignment != 0)
{
    // Vector128<T>.Count is a power of two and a constant, so this should optimize to a shift
    misalignment /= Vector128<T>.Count;
    misalignment = Vector128<T>.Count - misalignment;

    for (int i = 0; i < misalignment; i++)
    {
        // Process the individual elements, non-vectorized
    }

    count -= misalignment;
}

if (count >= Vector128<T>.Count)
{
    remainder = count % Vector128<T>.Count;
    count -= remainder;

    for (int i = 0; i < count; i++)
    {
        // Assert we are still aligned
        Debug.Assert(((nint)pCurrent % 16) == 0);

        // Process elements using vectorization and the appropriate Load(pCurrent)
    }
}
else
{
    remainder = count;
}

for (int i = 0; i < remainder; i++)
{
    // Process the individual elements, non-vectorized
}
This proposal does a couple of things:

* It removes the need for the user to write their own Debug.Assert((address % 16) == 0), as we will do the equivalent automatically
* The code the user writes remains the same, and if they want to have explicit Unaligned semantics or explicit Aligned semantics, they still have the option to use one of the other two overloads and get the correct/expected behavior for those
We just repeat our words again and again. I would like to stop the discussion and leave the proposal to the API review.
My original proposal was to call this something like LoadAlignedOrUnaligned to explicitly call attention to the semantics, but it was suggested I change that to the current name.
At least, do not name the new APIs Load.
At least, do not name the new APIs Load.
Right now I also think this shouldn't be named Load, as it hides something.
Maybe (just thinking loud): LoadSafe (as opposed to Unsafe), LoadAutomatic, LoadFast, ...
On recent x86 hardware there is no need for this method -- I think here we have consensus, because always using unaligned reads has no penalty, and is recommended by Intel, too.
But as @tannergooding tries to point out, there exists hardware on which the same code should run optimally without any need for change, where unaligned reads do have a penalty, so aligned reads should be used -- depending also on whether the JIT will emit VEX or not.
This case could be handled with this new method. Therefore I think it is reasonable to add it.
But the name should be simple and obvious, not to overwhelm users and leave them with open questions about which Load-method to use.
If these methods have to exist, I agree Load is probably a bad name for them. On the other hand, it's better than the self-documenting LoadAlignedOnOldCpusOrUnalignedOnNewCpusOrBetterYetJustFoldTheLoadIfPossible.
I think @mikedn is right here, though. This is a new API, and if the documentation says LoadAligned may throw an exception on unaligned data, that's good enough. Is there really ever a scenario where you need to guarantee a movaps will be emitted for the intrinsic method? I like the idea of a 1:1 mapping of intrinsics to instructions, but since we can't access registers directly or specify the reg/mem targets of the instructions directly, there's a limit to how far that philosophy can go.
For high-performance scenarios, people will be ensuring the data is aligned anyway, and if that's the only way to get the load folded, that just further encourages the best practice.
Thinking about this more, I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:
```C#
Vector128<float> v;
if (Avx.IsSupported) {
    v = Sse.LoadAligned(...) // Data isn't really aligned but this allows folding the load
} else {
    v = Sse.LoadUnaligned(...)
}
```
...
Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method. What if we had Sse.LoadUnaligned, Sse.LoadAligned, and Avx.LoadVector128?
Sse.LoadAligned could be folded in both encoding forms (again with the caveat that the docs don't guarantee that unaligned addresses would throw on all CPUs), and Avx.LoadVector128 could always be folded because it would only be available when VEX encoding is supported.
I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:
No, always using LoadUnaligned. But if you worry about the SSE encoding codegen, use Sse.LoadAligned.
Selecting a different load based on if (Avx.IsSupported) is unnecessary because you have to align your data for LoadAligned like https://github.com/dotnet/corefx/issues/33566#issuecomment-443362604, so the Sse.LoadUnaligned(...) branch is meaningless and just complicates your code.
Right, that was in reference to my previous suggestion that LoadAligned be required for folding in the API even though it's not required in all encodings. But people would end up calling the aligned method to get the folding even with unaligned data when they know it's supported by the CPU. I was just shooting down my first suggestion and then offering another.
Oh, I see, thanks for the explanation.
people writing code like this:
The code is backwards. When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address. If the user opts-in, we could additionally fold some method that sometimes checks alignment (whether that is via a 3rd, separate API; or via some other mechanism).
When not using the VEX encoding, we can only fold LoadAligned because the instructions themselves will assert alignment and will fail if the address is not aligned (there is no way to workaround this).
Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method
That is what this proposal is trying to do. It is trying to keep the two existing LoadUnaligned (currently just called Load) and LoadAligned methods, so that users who need to can explicitly use the required behavior.
It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
I think doing anything without an explicit opt-in is a non-starter and the general rule has been that we don't expose a non-deterministic API without also exposing a deterministic one; so I don't think we can or should just relax LoadAligned.
I am not convinced that just calling the new API Load is right, as I don't think it conveys all the necessary semantics (which is why I originally had it named differently).
I am convinced that exposing an additional API for this behavior would be beneficial. We already know of cases where users will be running certain types of ML.NET code on low-power/IOT devices (such as Raspberry PI). Not everyone is buying the latest CPUs; and sometimes when they do, they go for budget (which may mean they purchase a modern CPU without AVX support, and there are several of them that get released each generation).
such as Raspberry PI
is running on ARM, so it's out of the game here.
is running on ARM, so it's out of the game here.
Right, but there are also modern low-power/budget x86 computers without AVX support (from both AMD and Intel). I am unsure if we have any specific numbers on usage, but given that there are other low-power/IOT devices targeting ML.NET, I feel they shouldn't be excluded.
That's my opinion too.
The code is backwards.
My sample refers to my proposed implementation where LoadAligned doesn't guarantee an exception on unaligned data and is the only load type that is considered for folding by the JIT because it would work in both encodings. My point was that if the developer knows that method is the only one that can be folded and knows that on VEX processors, the data doesn't actually have to be aligned, they'd end up calling LoadAligned specifically on unaligned data but only on VEX processors, which is even more confusing than your proposal. I talked myself out of that.
When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address
I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data. Intel guarantees the fault in the movaps instruction, but then they recommend against using that instruction on newer processors. I'm saying that if you never guarantee that movaps will actually be emitted, then you're not breaking anything by not throwing.
It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned. An Avx.LoadVector128 could be safely folded in all cases, whereas the too-clever Load would work on VEX processors and crash on legacy with unaligned data. The proposed assert in minopts mitigates that possibility but then creates an unnecessary restriction on the VEX processors.
Not everyone is buying the latest CPU's
I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths. Not being able to completely optimize the AVX path because of a restriction on the older/lower-end processors is bad too.
I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths.
Libraries will likely have both 128-bit and 256-bit optimized code paths. For the 128-bit code path, they may consider both the VEX-encoding and the legacy encoding scenarios.
I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data.
An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.
This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned.
Correct. And if your data is not going to be aligned or if it is not beneficial for you to handle leading/trailing elements to make the rest of the handling aligned; you would call the explicit LoadUnaligned method.
However, as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned" (so that they don't have the cache-line penalty every 4 reads/writes or a page fault every X reads/writes). They just won't use LoadAligned because the JIT won't emit the efficient thing on modern CPUs. They will, however, likely want to have validation in Debug mode that their input is actually aligned.
but then they recommend against using that instruction on newer processors
The optimization manual makes no recommendations about using or not using movaps; rather they recommend that you align your data when possible and have a number of recommendations on what to do if you can't guarantee data alignment (such as preferring aligned stores over aligned reads). They also recommend using the more efficient encoding (rather than an explicit load) when possible.
The only reason not to use LoadAligned is because it will not be folded (emit the more efficient encoding) on VEX-enabled hardware. Yes, this is currently an API limitation, but no, I don't think it will be relaxed.
The reason to use LoadUnaligned is because it will be folded (emit the more efficient encoding) on VEX-enabled hardware. The downside is that, when you are doing performance optimizations and your data is aligned, you get no validation of that fact and you must add explicit Debug.Asserts to your own code.
This proposal came from the belief that needing to use LoadUnaligned and additionally assert the alignment was a common enough case. It has the additional benefit that it will be folded under either the legacy or VEX encoding, while still providing explicit APIs for the Aligned vs Unaligned cases.
An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.
I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.
as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned"
This was my initial thought as well, but then I thought back to the challenge of re-implementing System.Numerics using S.R.I. On a granular method like operator+, you can't make any kind of assumption about the alignment, but the folding would still be beneficial. I don't know if some of the plans around first class structs and containment will address that through some other trickery, but limiting reg/mem encoding to only aligned memory scenarios is bound to cause problems elsewhere.
They also recommend using the more efficient encoding (rather than an explicit load) when possible.
This is the part that's most important to me. I can figure out whether I've aligned my pointers, but I want to get the most efficient encoding in every possible case.
I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.
Right. I expect the API review team will lean towards another name (like LoadUnsafe, due to its non-deterministic nature).
you can't make any kind of assumption about the alignment, but the folding would still be beneficial.
Right, and in this case (where you are doing one-off operations) you would have to use LoadUnaligned. It will do the efficient encoding on modern hardware, but will involve an additional instruction for older hardware. -- However, you also likely won't be doing explicit loads here. I would imagine you would just use Unsafe.As to go from Vector4 to Vector128 and the JIT would do the efficient conversion (as we will often already be in register).
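For reference, a sketch of that reinterpretation (Unsafe.As is System.Runtime.CompilerServices.Unsafe; both types are 16 bytes, so with the value already enregistered the JIT can treat this as a no-op):

```C#
static Vector128<float> ToVector128(System.Numerics.Vector4 value)
{
    // Reinterpret the 16-byte Vector4 as a Vector128<float>; no explicit load needed.
    return Unsafe.As<System.Numerics.Vector4, Vector128<float>>(ref value);
}
```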
Ah, ok. As long as LoadUnaligned can still be folded on VEX hardware, that's fine. I was under the impression you were proposing LoadAligned always emitted movaps, LoadUnaligned always emitted movups, and that only Load (or LoadUnsafe or whatever) could be re-encoded into to another instruction.
I was under the impression you were proposing
Ah, definitely not. LoadAligned and LoadUnaligned would continue on with the exact same semantics as we have today (and would get folded in the same scenarios). The new API would behave as a LoadAligned and would validate the address under minopts, but would be folded under both VEX and legacy encodings when optimizations are enabled.
It's also worth noting that this proposal matches basically what the native compilers do.
For example (see https://godbolt.org/z/oITxgp), clang will emit exclusively movaps for _mm_load_ps and movups for _mm_loadu_ps in debug mode. However, in Release mode it will fold either for the VEX encoding and only _mm_load_ps for the legacy encoding.
The proposed API here does the same, but still exposes an explicit LoadAligned for when users explicitly want movaps.
-- cc. @fiigii, since he suggested matching native behavior in the folding
It's also worth noting that this proposal matches basically what the native compilers do.
The problem is why not let the current APIs (Load and LoadAligned) match native compilers. Adding new APIs is unnecessary, in my opinion.
Thanks for the explanation. I'm still not sure why .NET developers would need a way to explicitly emit movaps if native developers don't, which is I think what @fiigii is saying too. I'm not too bothered one way or another as long as there is some way to get the JIT to emit efficient code.
The problem is why not let the current APIs (Load and LoadAligned) match native compilers.
Simply because the previous feedback from the API review and others (like @CarolEidt) has been that we don't want to expose non-deterministic APIs without good reason, and if we do, we should also expose a deterministic version.
I am just trying to meet the requirements that I have frequently heard and received back on other proposals while exposing an API that provides the desired semantics for something that is believed to be a common enough scenario.
I do not understand why "a possible exception not happening" is non-deterministic...
It is, by definition, non-deterministic because it:

* does one thing on some hardware (VEX) and a different thing on other hardware (legacy)
* does one thing or another depending on how the user compiled their code (Release or Debug mode)

AFAIK, we don't expose any other APIs today where an external library (this includes S.P.Corelib, which is external to the user/application-code) will do something different based on how the calling assembly was compiled. The closest we have is System.Diagnostics.ConditionalAttribute, but that impacts whether the compiler emits the call or not and only impacts code in the calling library; it does not impact the code in the external library or the functionality/behavior of said code if it is called (such as via Reflection).
This was briefly discussed in API review today and we will come back to it in a future review meeting.
The suggestion was that:

* The existing APIs (Load and LoadAligned) should remain as they are, with the existing semantics
* The new API should still be named LoadVector128, but with some additional moniker that signifies the non-deterministic nature

I have updated the original post to include the truth tables and to change the name to LoadVector128Unsafe, which relays the special non-deterministic nature of the additional API and which follows the previous naming we've done for other non-deterministic APIs.
CC. @bartonjs, @terrajobst
@tannergooding - Is this still under discussion? It seems unlikely that we will be able to incorporate the Unsafe forms for 3.0, with JIT support. I'm moving this to Future - let me know if you think this needs to be in for 3.0. Since the latest proposal doesn't impact existing APIs, it seems that we can introduce it later without breaking existing code.
Yes, this is still under discussion.
I agree that addressing this post 3.0, if it is needed, is fine. Based on the last feedback received, these methods (if exposed) would have new names containing Unsafe and the existing APIs would remain untouched.
Notes from the API review:

* LoadAlignedVector128Unsafe will throw an AccessViolationException when optimizations are disabled and the data isn't aligned. When optimizations are on, it depends on the hardware: on newer hardware, the access will work but be slower; older hardware will AV.
* Keep Aligned in the API name.
* Unsafe should be a prefix, rather than a suffix (to make sure it pops), but to be consistent with other APIs we'll go with a suffix here.

```C#
namespace System.Runtime.Intrinsics.X86
{
public abstract partial class Sse
{
// New APIs:
public unsafe Vector128<float> LoadAlignedVector128Unsafe(float* address);
// Existing APIs:
// public unsafe Vector128<float> LoadVector128(float* address);
// public unsafe Vector128<float> LoadAlignedVector128(float* address);
}
public abstract partial class Sse2
{
// New APIs:
public unsafe Vector128<double> LoadAlignedVector128Unsafe(double* address);
public unsafe Vector128<byte> LoadAlignedVector128Unsafe(byte* address);
public unsafe Vector128<sbyte> LoadAlignedVector128Unsafe(sbyte* address);
public unsafe Vector128<short> LoadAlignedVector128Unsafe(short* address);
public unsafe Vector128<ushort> LoadAlignedVector128Unsafe(ushort* address);
public unsafe Vector128<int> LoadAlignedVector128Unsafe(int* address);
public unsafe Vector128<uint> LoadAlignedVector128Unsafe(uint* address);
public unsafe Vector128<long> LoadAlignedVector128Unsafe(long* address);
public unsafe Vector128<ulong> LoadAlignedVector128Unsafe(ulong* address);
// Existing APIs:
// public unsafe Vector128<double> LoadVector128(double* address);
// public unsafe Vector128<byte> LoadVector128(byte* address);
// public unsafe Vector128<sbyte> LoadVector128(sbyte* address);
// public unsafe Vector128<short> LoadVector128(short* address);
// public unsafe Vector128<ushort> LoadVector128(ushort* address);
// public unsafe Vector128<int> LoadVector128(int* address);
// public unsafe Vector128<uint> LoadVector128(uint* address);
// public unsafe Vector128<long> LoadVector128(long* address);
// public unsafe Vector128<ulong> LoadVector128(ulong* address);
//
// public unsafe Vector128<double> LoadAlignedVector128(double* address);
// public unsafe Vector128<byte> LoadAlignedVector128(byte* address);
// public unsafe Vector128<sbyte> LoadAlignedVector128(sbyte* address);
// public unsafe Vector128<short> LoadAlignedVector128(short* address);
// public unsafe Vector128<ushort> LoadAlignedVector128(ushort* address);
// public unsafe Vector128<int> LoadAlignedVector128(int* address);
// public unsafe Vector128<uint> LoadAlignedVector128(uint* address);
// public unsafe Vector128<long> LoadAlignedVector128(long* address);
// public unsafe Vector128<ulong> LoadAlignedVector128(ulong* address);
}
}
```
Is the whole purpose of these APIs to generate a bit more efficient code for old hardware? Do we have any numbers about the benefits?
These APIs smell like over-engineering to me.
It's not just older hardware. It's also for ready-to-run and AOT code, which needs to be compiled for non-VEX hardware by default.
I can get more concrete numbers but it can reduce the number of instructions by up to half (as the load can be folded into the consuming instruction, rather than needing to be separate).
It's also for ready to run and AOT code which needs to be compiled for non-vex hardware by default.
We do not have R2R for hardware intrinsics today (except a fragile hack for CoreLib). The proper fix for the hardware intrinsics AOT problem is to compile multiple versions of the method for different types of hardware and then pick the right one at runtime.
That's the same scenario, ultimately. The only difference between targeting VEX and non-VEX enabled hardware (for the 128-bit code path) is whether you use LoadVector128 (VEX) or LoadAlignedVector128 (non-VEX).
In native land, the C++ compilers expose _mm_load_ps and _mm_loadu_ps (with other equivalents for integer and double types). loadu always does unaligned loads and will only be folded on VEX hardware. load does an aligned load in unoptimized code and does an appropriately folded load in optimized code (whether VEX or non-VEX).
We opted to expose LoadVector128 (unaligned) and LoadAlignedVector128 (aligned) because it more clearly states the contract. We also said that LoadAlignedVector128 will only be folded on non-VEX hardware because we don't want to have silently differing/non-deterministic behavior for these APIs. We likewise said that for the few cases where we expose differing/non-deterministic behavior, we would do that with Unsafe (e.g. Vector128.CreateScalarUnsafe).
This API fills a gap where we weren't providing functionality available in native code. It provides the ability for users to write a single algorithm that supports both VEX and non-VEX encodings and for the JIT/AOT to always emit the appropriate codegen. It also ensures that users continue getting the appropriate validation when optimizations are disabled, to help prevent bugs.
It's also worth pointing out that non-VEX doesn't necessarily mean old. Intel's Atom line still doesn't support VEX, but it's current technology in every other respect.
This API fills a gap where we weren't providing functionality available in native code.
We are not trying to match native code 100%. We are trying to get close enough while staying true to .NET values like simplicity and productivity. Having all of LoadAlignedVector128, LoadAlignedVector128Unsafe, and LoadVector128 is anything but simple.
At the same time, the hardware intrinsic APIs weren't exactly designed with simplicity in mind. We are expecting users to pin their data, understand the hardware they are targeting, and write multiple code paths that do functionally the same thing but with different underlying instructions depending on the hardware they are running against.
This makes one aspect of the coding pattern better by allowing them to cleanly write Sse.LoadAlignedVector128Unsafe(pAddress) rather than the following (which involves another helper method, the JIT cleanly inlining the call, dropping the dead code paths, and folding the load; which we've already determined can be problematic in some cases):
```C#
if (Avx.IsSupported)
{
    Debug.Assert((((nint)pAddress) % 16) == 0);
    return Sse.LoadVector128(pAddress);
}
else
{
    return Sse.LoadAlignedVector128(pAddress);
}
```
It is also something these code paths commonly need to consider, especially in places like S.P.Corelib where we do have existing AOT/R2R support and which defaults (and will continue to default) to supporting non-VEX enabled hardware. Users not wanting to deal with the complexity of hardware intrinsics have the option of using the type-safe and portable Vector&lt;T&gt; type instead.
I was just writing the same thing. I actually have this code in places to ensure I get folded loads:
```C#
// ensure pointer is aligned
if (Avx.IsSupported)
    Sse.Add(vec, Sse.LoadVector128(p));
else
    Sse.Add(vec, Sse.LoadAlignedVector128(p));
```
… which is far less simple and far less discoverable than LoadAlignedVector128Unsafe. If we could only have the one pair of methods, I'd want the LoadAlignedVector128Unsafe behavior. It's the current LoadAlignedVector that's more difficult to reason about given an understanding of how the native intrinsics work.
Are there any cases where one would want to use LoadAlignedVector128 instead of LoadAlignedVector128Unsafe?
I can't imagine one, particularly if the Unsafe variant can be implemented so it throws in minopts. I assume it would just emit movaps and never fold the load, exactly as LoadAlignedVector128 does on VEX processors today? I can see the reasoning behind wanting deterministic behavior for the .NET implementation, but I can't imagine an actual user of these APIs making the performance tradeoff to get the determinism if the choice is there.
I just had a chance to watch the review meeting video, and it does seem like the API shape and behavior with the two aligned load variants is confusing. I was equally confused until @tannergooding walked through it earlier in this thread.
So we should just fix the LoadAlignedVector128 to do the right thing. It does not make sense to tell everybody to change LoadAlignedVector128 to LoadAlignedVector128Unsafe in their code for best results.
I do not buy the argument about deterministic behavior for alignment faults. The platform does not guarantee deterministic alignment faults. For example, accessing misaligned double* may throw on some architectures in some situations. We do not specify when exactly nor try to ensure that it stays the same from version to version.
The platform does not guarantee deterministic alignment faults.
x86, however, does guarantee deterministic alignment faults, and the hardware intrinsics expose hardware-specific functionality. The aligned load/store operations will always raise a #GP(0) (General Protection) fault if the memory address is not correctly aligned (16 bytes for Vector128, 32 bytes for Vector256).
Thus far, we have a 1-to-1 mapping of the exposed API surface to an underlying instruction for the safe APIs. That is, Sse.Shuffle will always emit shufps, Sse.LoadVector128 will always emit movups, and Sse.LoadAlignedVector128 will always emit movaps. We've only exposed non-deterministic behavior under the "unsafe" APIs (such as Vector128.CreateScalarUnsafe, which is equivalent to CreateScalar but leaves the upper bits non-deterministic rather than forcing them to zero). This was an explicit design decision so users could retain control over the emitted instructions and be sure they got deterministic behavior. This in turn helps ensure more deterministic perf results and allows them to micro-tune their code to get the best results possible.
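For example (a small self-contained illustration of the existing convention; the exact upper-element contents from CreateScalarUnsafe are unspecified by design):
```C#
using System.Runtime.Intrinsics;

internal static class CreateScalarExample
{
    public static void Demo()
    {
        // CreateScalar deterministically zeroes the upper three elements.
        Vector128<float> deterministic = Vector128.CreateScalar(1.0f);

        // CreateScalarUnsafe leaves the upper three elements undefined, which
        // can save a zeroing instruction when they get overwritten anyway.
        Vector128<float> nonDeterministic = Vector128.CreateScalarUnsafe(1.0f);
    }
}
```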
It does not make sense to tell everybody to change LoadAlignedVector128 to LoadAlignedVector128Unsafe in their code for best results.
I don't believe this is what most users will have to do. There are only a few categories of code that you can have right now.
You have people who don't care what the alignment is and are explicitly using LoadVector. These users are getting efficient codegen on newer hardware and will need to make no changes.
Then you have users who are explicitly taking alignment into account (this has to be explicit since CoreCLR provides no way to guarantee 16-byte or higher alignment today).
Among the users who are taking alignment into account, they are either doing it because they want alignment enforcement or they are doing it for performance.
If they are doing it for alignment enforcement they will only be using LoadAlignedVector and changing the behavior of LoadAlignedVector will be breaking.
If they are instead doing it for performance, then they will either have written two code paths or have written a small helper method like I gave above, which uses LoadVector on VEX enabled hardware and LoadAlignedVector on non-VEX enabled hardware.
These new APIs target the latter category of user and give them an API which bridges that gap without regressing any other scenario. It only reduces the amount of code they have to write, and it makes it an explicit opt-in that you are willing to trade strict alignment checking for performance.
changing the behavior of LoadAlignedVector will be breaking.
Throwing fewer exceptions is not a breaking change.
explicit opt-in that you are willing to not have strict alignment checking in favor of performance
There is very little value in this explicit opt-in. Alignment bugs are like threading bugs to a large degree. If you have an alignment bug in your code, you often get lucky because some other part of the code guarantees alignment.
Throwing fewer exceptions is not a breaking change.
I think the difference is that this is a hardware exception raised as the side-effect of a given instruction. It is used to ensure correctness and sometimes because not all hardware has fast unaligned loads (particularly older hardware and sometimes non-VEX enabled hardware).
Alignment bugs are like threading bugs to a large degree.
I don't think I'd agree, particularly when the expectation is that users explicitly pin and check alignment first (such that they can specially handle leading/trailing elements via an unaligned and masked memory operation).
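For reference, the "pin and check alignment" pattern being described usually looks something like the following (a minimal sketch; AlignUp16 and the buffer handling are illustrative, and a real implementation still has to process the leading/trailing elements separately):
```C#
using System.Diagnostics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class AlignmentSketch
{
    // Illustrative helper: round an address up to the next 16-byte boundary.
    private static float* AlignUp16(float* p)
        => (float*)(((nuint)p + 15) & ~(nuint)15);

    // Assumes 'values' has enough elements past the first aligned address.
    public static Vector128<float> LoadFirstAlignedBlock(float[] values)
    {
        fixed (float* pValues = values)
        {
            float* pAligned = AlignUp16(pValues);
            Debug.Assert(((nuint)pAligned % 16) == 0);

            // A complete implementation would handle the elements before
            // pAligned (and any trailing elements) with unaligned loads.
            return Sse.LoadAlignedVector128(pAligned);
        }
    }
}
```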
That being said... @GrabYourPitchforks, @terrajobst, @CarolEidt. Do you have a problem with changing the behavior of LoadAlignedVector* to match the desired behavior rather than exposing the new API?
The LoadAlignedVector* APIs are documented to map exactly to [v]movdqa, so as a consumer I'd certainly be surprised if the JIT ended up emitting any other instruction such as [v]movdqu. As we discussed yesterday I have relied on the fact that [v]movdqa AVs as a way to check my implementation while running under a test environment and while fuzzing. (Generally, fuzzing is meant to be run over a fully-optimized application, but I suppose it's possible to kick off two runs: one with optimizations enabled and one without. It's just more work to set up.)
As long as there exists some setting which will force this API to emit [v]movdqa, and as long as we can ensure we have unit tests running under such an environment, my scenario is satisfied.
As long as there exists some setting which will force this API to emit [v]movdqa,
The configuration switches to explicitly set the different hardware support levels (e.g. COMPlus_EnableAVX, etc.) should do the trick, no?
APIs are documented to map exactly to [v]movdqa, so as a consumer I'd certainly be surprised if the JIT ended up emitting any other instruction such as [v]movdqu
It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all. So where you would have gotten a fault if vmovdqa were emitted, you won't because the load isn't present at all.
The real question is, if given a choice between a method that always emits vmovdqa and one that potentially eliminates that instruction by folding it into another, would you ever use the one that guarantees never to fold? If the purpose is to avoid a performance hit due to a load that crosses cache line boundaries, how is a load instruction that can't be eliminated/folded an improvement? As long as minopts emits the aligned load always so you can test, I don't see anyone choosing the other method.
Edit:
It's also worth noting that on non-VEX hardware, that method isn't guaranteed to emit movdqa today because it may fold that. The only difference is that non-VEX will fault even if the load is folded.
Also, in the Intel intrinsics guide, movdqa is documented as
Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Note that it says may there
It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all.
Which is why the original proposal exposes a new API.
As it is today, if you call LoadAlignedVector128 you will always get codegen that will load 16-bytes from a given address and will generate a #GP(0) if that address is not also 16-byte aligned. Normally this is movaps/movapd/movdqa but it can also safely be folded into the consuming instruction for the non-VEX encoding. This is "safe" to do as most SIMD instructions (all except 2, iirc) that can also encode a memory operand require the address to be aligned and therefore will preserve the fault behavior.
When using the VEX-encoding, the instructions were changed to no longer fault if the address wasn't aligned. This means it is no longer "safe" to fold because doing so could remove a #GP(0) that would have otherwise been raised.
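Concretely, the difference looks like this (a hypothetical snippet; the assembly in the comments illustrates the folding behavior described above rather than guaranteed codegen):
```C#
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class AlignedFoldingExample
{
    public static Vector128<float> AddFromMemory(Vector128<float> left, float* pRight)
    {
        // Non-VEX: the load can be folded into   addps xmm0, [pRight]
        //   (the legacy memory form itself requires alignment, so the
        //   #GP(0) fault behavior is preserved even when folded).
        // VEX:     the JIT must keep the load separate,
        //              vmovaps xmm1, [pRight]
        //              vaddps  xmm0, xmm0, xmm1
        //   because folding into vaddps would silently drop the check.
        return Sse.Add(left, Sse.LoadAlignedVector128(pRight));
    }
}
```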
It's also worth noting that on non-VEX hardware, that method isn't guaranteed to emit movdqa today because it may fold that
The key point being that we were still providing deterministic behavior and preserving the semantics of the original code.
Note that it says may there
The native intrinsics don't guarantee that _mm_load_* will emit a movaps/movapd/movdqa. Their intrinsic behaves like the proposed LoadAlignedVector128Unsafe. They also don't provide deterministic APIs for things like CreateScalar and expect that you will explicitly zero the bits yourself if required, which is something that we previously said was undesirable for .NET.
The configuration switches to explicitly set the different hardware support levels (e.g. COMPlus_EnableAVX, etc.) should do the trick, no?
By itself it won't do the trick. The behavior would be:
| COMPlus_EnableAVX | Optimizations Enabled | Result |
| -- | -- | -- |
| 0 | 0 | #GP(0) if unaligned |
| 0 | 1 | #GP(0) if unaligned |
| 1 | 0 | #GP(0) if unaligned |
| 1 | 1 | No exception raised |
The real question is, if given a choice between a method that always emits vmovdqa and one that potentially eliminates that instruction by folding it into another, would you ever use the one that guarantees never to fold? If the purpose is to avoid a performance hit due to a load that crosses cache line boundaries, how is a load instruction that can't be eliminated/folded an improvement? As long as minopts emits the aligned load always so you can test, I don't see anyone choosing the other method.
This question hinges on whether you want "fast code" or "strict code". If you want "fast code", then agreed: there is never a reason to want the movaps/movapd/movdqa emitted. You need data to be aligned to be fast on non-VEX hardware, but you still want the load to be folded to get the smaller codegen under the newer VEX encoding.
If you want "strict" code, then you would strictly want the alignment checking always, even under the newer VEX encoding and even if it provides a minimal perf hit. Likewise, you would probably not want to do the alignment checking yourself when there is a single instruction that will efficiently do it.
I tend to prefer "fast code" myself, and I can't currently think of any scenario where you would want both the VEX encoding and "strict" mode. However, at the same time, I don't think exposing a non-deterministic API (one whose result differs between optimizations on/off) is good API design, and I feel it goes against what we've previously chosen to do for these APIs and for .NET in general.
(To clarify, I mean exposing a non-deterministic API without a deterministic equivalent).
There are two different reasons why you may want to do fuzzing: with optimizations enabled (the code you actually ship) and with optimizations disabled (where stricter checks can surface latent bugs).
You want to do both of these for different hardware support levels.
The hardware architectures out there, and .NET specifically, do not provide strong guarantees around alignment faults across the board, so designing a fuzzing strategy around the fact that a few x86 instructions emit alignment faults is not a viable cross-architecture strategy.
Do you have a problem with changing the behavior of LoadAlignedVector* to match the desired behavior rather than exposing the new API?
I don't have a problem with that change, and the arguments for seem much more compelling than the arguments against.
In general, we expect developers to use HW Intrinsics in scenarios where perf matters, and we expect them to have a great deal of sophistication with regard to their usage. That said, perhaps the subtleties here merit some guidance on this topic (blog post, anyone?)
Closing this. We've updated the containment handling to efficiently do this instead.
Most helpful comment
The JIT has to preserve the semantics of the underlying instruction. Currently, LoadAligned will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned (presently named just Load) has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.

When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned.

When using the VEX encoding, the reverse is true: the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.
This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we can emit addps xmm0, [pRight]; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1.

And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit addps xmm0, xmm0, [pRight].

We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. if you fold LoadAligned when emitting the VEX encoding) or risk raising an exception that would otherwise not have been (e.g. if you fold LoadUnaligned when emitting the legacy encoding).

We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data).
This leads users to writing algorithms that do the following (where possible):
- Handle the leading elements, up to the first aligned address, via unaligned or masked memory operations
- Process data as aligned
- Handle any trailing elements via unaligned or masked memory operations
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the "Process data as aligned" section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).

However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Games", which run on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:
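A sketch of such a helper, following the pattern shown earlier in the thread (the name LoadAligned128 is illustrative):
```C#
public static unsafe Vector128<float> LoadAligned128(float* pAddress)
{
    if (Avx.IsSupported)
    {
        // VEX: the unaligned load folds regardless of alignment,
        // so validate alignment in debug builds only.
        Debug.Assert((((nint)pAddress) % 16) == 0);
        return Sse.LoadVector128(pAddress);
    }
    else
    {
        // Legacy encoding: the aligned load both validates and folds.
        return Sse.LoadAlignedVector128(pAddress);
    }
}
```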
Which also depends on the JIT being able to successfully recognize and fold this code.
So, this proposal suggests providing this helper method built-in by renaming the existing Load method to LoadUnaligned and exposing a new method Load (I had originally suggested something like LoadAlignedOrUnaligned, but it was suggested I change it to the current name) which effectively does the above. That is, on hardware with VEX encoding support it uses LoadUnaligned, and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned. This ensures that the code can always be folded efficiently and that the semantics remain the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned and LoadAligned methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).