On x86 hardware, for SIMD instructions, there are effectively two encoding formats: "legacy" and "VEX".
There are some minor differences between these two encodings, including the number of parameters they take and whether the memory operand has aligned or unaligned semantics.
As a brief summary:
* Legacy-encoded instructions generally take two operands (ins reg, reg/[mem]): the first serves as both a source and destination, and the last as a source which can generally be a register or memory address. The memory address form has "aligned" semantics and will cause an exception to be thrown if the address is not "naturally aligned" (generally 16-bytes).
* VEX-encoded instructions generally take three operands (ins reg, reg, reg/[mem]): the first being the destination, the second being a source, and the last a source which can generally be a register or memory address. The memory address form has "unaligned" semantics and will not cause an exception to be thrown, regardless of the input.

Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics). Given that a user will often want to generate the "most efficient" code possible, and that the JIT, in order to preserve semantics (not silently get rid of an exception that would otherwise be thrown), changes which method it will "fold" depending on the encoding it is currently emitting, it may be beneficial to expose an intrinsic which allows the user to specify "This address is aligned, do whatever load is most efficient". This would ensure that it can be folded regardless of the current encoding.
namespace System.Runtime.Intrinsics.X86
{
public abstract partial class Sse
{
// New APIs:
public Vector128<float> LoadVector128Unsafe(float* address);
// Existing APIs:
public Vector128<float> LoadVector128(float* address);
public Vector128<float> LoadAlignedVector128(float* address);
}
public abstract partial class Sse2
{
// New APIs:
public Vector128<double> LoadVector128Unsafe(double* address);
public Vector128<byte> LoadVector128Unsafe(byte* address);
public Vector128<sbyte> LoadVector128Unsafe(sbyte* address);
public Vector128<short> LoadVector128Unsafe(short* address);
public Vector128<ushort> LoadVector128Unsafe(ushort* address);
public Vector128<int> LoadVector128Unsafe(int* address);
public Vector128<uint> LoadVector128Unsafe(uint* address);
public Vector128<long> LoadVector128Unsafe(long* address);
public Vector128<ulong> LoadVector128Unsafe(ulong* address);
// Existing APIs:
public Vector128<double> LoadVector128(double* address);
public Vector128<byte> LoadVector128(byte* address);
public Vector128<sbyte> LoadVector128(sbyte* address);
public Vector128<short> LoadVector128(short* address);
public Vector128<ushort> LoadVector128(ushort* address);
public Vector128<int> LoadVector128(int* address);
public Vector128<uint> LoadVector128(uint* address);
public Vector128<long> LoadVector128(long* address);
public Vector128<ulong> LoadVector128(ulong* address);
public Vector128<double> LoadAlignedVector128(double* address);
public Vector128<byte> LoadAlignedVector128(byte* address);
public Vector128<sbyte> LoadAlignedVector128(sbyte* address);
public Vector128<short> LoadAlignedVector128(short* address);
public Vector128<ushort> LoadAlignedVector128(ushort* address);
public Vector128<int> LoadAlignedVector128(int* address);
public Vector128<uint> LoadAlignedVector128(uint* address);
public Vector128<long> LoadAlignedVector128(long* address);
public Vector128<ulong> LoadAlignedVector128(ulong* address);
}
}
The following tables show the semantics of each API under the legacy and VEX encodings.
Most instructions support having the last operand be either a register or a memory operand: ins reg, reg/[mem]
When folding does not happen the load is an explicit separate instruction:
For example, Sse.Add(input, Sse.Load(address)) would generate
* movups xmm1, [address]
* addps xmm0, xmm1
When folding does happen, the load is merged into the consuming instruction:
For example, Sse.Add(input, Sse.Load(address)) would generate
* addps xmm0, [address]
On Legacy hardware, the folded-form of the instructions assert that address is aligned (generally this means (address % 16) == 0).
On VEX hardware, the folded-form does no such validation and allows any input.
LoadVector128 (unaligned semantics):

|        | Unaligned Input | Aligned Input |
| ------ | --------------- | ------------- |
| Legacy | No Folding      | No Folding    |
| VEX    | Folds           | Folds         |
LoadAlignedVector128 (aligned semantics):

|        | Unaligned Input    | Aligned Input |
| ------ | ------------------ | ------------- |
| Legacy | Folds, Throws      | Folds         |
| VEX    | No Folding, Throws | No Folding    |
LoadVector128Unsafe (proposed):

|        | Unaligned Input        | Aligned Input |
| ------ | ---------------------- | ------------- |
| Legacy | Folds, Throws          | Folds         |
| VEX    | Folds, Throws in Debug | Folds         |
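As a concrete illustration of the LoadVector128Unsafe table, here is a hedged sketch of what a caller could expect from the proposed API (LoadVector128Unsafe does not exist yet; the asm in the comments mirrors the table above and is not guaranteed output; assumes the usual System.Runtime.Intrinsics usings):

```C#
static unsafe Vector128<float> AddFromAligned(Vector128<float> left, float* pRight)
{
    // Caller promises `pRight` is 16-byte aligned; under the proposal, the
    // claim is only validated when optimizations are disabled.
    return Sse.Add(left, Sse.LoadVector128Unsafe(pRight));
    // Legacy encoding (folds, hardware enforces alignment): addps  xmm0, [pRight]
    // VEX encoding    (folds, no hardware check):           vaddps xmm0, xmm0, [pRight]
}
```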
Some open-ended questions:

* Should these instructions have a Debug.Assert((address % 16) == 0)? This is more a nicety for people who are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0).
* Should we also expose Store, StoreUnaligned, and StoreAligned counterparts, or just keep Store and StoreAligned?
* Should we also expose equivalents on Avx/Avx2? It is worth noting that, for Avx/Avx2 instructions, they require the VEX encoding and therefore always have "unaligned" semantics, but Avx/Avx2 define and expose LoadAlignedVector256/StoreAligned instructions anyways.

CC. @eerhardt, @fiigii, @CarolEidt for thoughts
Today, we expose both Load (which has unaligned semantics) and LoadAligned (which has aligned semantics).
I would rename the unaligned load to what it is, LoadUnaligned; then this proposed method can be called just Load, as the proposed name is quite long.
"This address is aligned, do whatever load is most efficient".
The name Load would match that.
Should we also expose StoreAlignedOrUnaligned methods?
If load is exposed, the same should be done for store. Hence rename Store to StoreUnaligned and use the name Store for the method analogous to this proposal.
Should we also expose equivalents on Avx/Avx2?
In my proposal for method-names this wouldn't be in symmetry with Avx/Avx2 anymore, so yes there should be the same methods -- even if they just forward to the unaligned methods.
I would rename the unaligned load to that what it is LoadUnaligned, then this proposed method can be called just Load,
I was thinking the same thing.
Updated to be LoadVector128, LoadAlignedVector128, and LoadUnalignedVector128.
Sorry for the delay to reply.
I don't quite understand this proposal. Currently, LoadAligned can be folded into the consumer instruction with SSE and VEX encoding, but Load (LoadUnaligned in this proposal) only can be folded with VEX encoding. This behavior also matches C++ compilers.
So, what is the difference between the proposed Load intrinsic and the existing LoadAligned?
Meanwhile, guaranteeing memory alignment should be the user's responsibility, and the dynamic assertion ("Asserts (address % 16) == 0 when optimizations are disabled") does not always work.
Additionally, on modern Intel CPUs, an unaligned load/store won't be slower than an aligned load/store (if the memory access hits in one cache line), so we suggest always using unaligned load/store. That is why the Load intrinsic maps to the unaligned load instruction.
I don't quite understand this proposal.
The JIT has to preserve the semantics of the underlying instruction. Currently LoadAligned will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned (presently named just Load) currently has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.
When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned.
When using the VEX encoding, the reverse is true, and the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.
This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we can emit: addps xmm0, [pRight]; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1
And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit: movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit addps xmm0, xmm0, [pRight]
We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).
So, what is the difference between the proposed Load intrinsic and the existing LoadAligned?
We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data).
This leads users to writing algorithms that do the following (where possible):
// Process first few elements to become aligned
// Process data as aligned
// Process trailing elements that don't fit in a single `VectorXXX<T>`
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).
However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Game", which runs on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:
public static Vector128<float> EfficientLoad(float* address)
{
if (Avx.IsSupported)
{
// Must assert, since this would fail for non-AVX otherwise
Debug.Assert((address % 16) == 0);
return Sse.LoadUnaligned(address);
}
else
{
return Sse.LoadAligned(address);
}
}
Which also depends on the JIT being able to successfully recognize and fold this code.
So, this proposal suggests providing this helper method built-in by renaming the existing Load method to LoadUnaligned and exposing a new method Load (I had originally suggested something like LoadAlignedOrUnaligned, but it was suggested I change it to the current name) which does effectively the above. That is, on hardware with VEX encoding support, it uses LoadUnaligned, and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned. This ensures that the code can always be folded efficiently and it validates that the semantics remain the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned and LoadAligned methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).
And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit: movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; However, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit addps xmm0, xmm0, [pRight]
We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. you fold LoadAligned when emitting the VEX encoding) or you risk raising an exception that would otherwise not have been (e.g. you fold LoadUnaligned when emitting the legacy encoding).
Ah, I see the problem now. But we should fold LoadAligned with the VEX encoding, as it does not throw an exception.
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the Process data as aligned section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).
This assert is useless. address could be either aligned or unaligned in different runs.
as it does not throw an exception.
That is the problem and why we can't fold it. LoadAligned emits movaps which always throws if the address is unaligned. If you fold it into a VEX-encoded instruction, it will no longer throw. This means that if you passed in an unaligned address, the exception would silently disappear whenever the JIT folded it.
This assert is useless. address could be either aligned or unaligned in different runs.
The point of the assert is to help catch places where the user is calling Load without pinning the data and ensuring that it is properly aligned. It will still be possible for a user to do this incorrectly, but it would be an explicit opt-in.
We could potentially use another name, such as LoadUnsafe, or something that conveys that the data must be aligned but we won't always validate that it is.
I like this proposal, and agree that the assert, while not guaranteeing to catch all potential dynamic data configurations, is very likely to catch nearly all cases.
Given that there is a great deal of expertise required to use the hardware intrinsics effectively, I don't think it's necessary to encumber this with the Unsafe suffix. Rather, documenting its behavior and asserting in the unoptimized case should be sufficient.
Ah, looked at the manual again; yes, vmovaps can still throw exceptions with VEX-encoding, but other instructions won't. Thanks.
I don't think it's necessary to encumber this with the Unsafe suffix
Agree, actually the other Load* are not safe either 😄
and this is more a nicety for if people are supporting older hardware or working around a JIT bug (COMPlus_EnableAVX=0)
Are there enough such people to justify adding more intrinsics?
Should we also expose equivalents on Avx/Avx2?
It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic. But that would imply adding more intrinsics. Or perhaps AVX Load should simply be renamed to LoadUnaligned, though the longer name sucks. Same for stores.
But to quote @CarolEidt "Given that there is a great deal of expertise required to use the hardware intrinsics effectively", I don't think such difference in semantics justifies adding more intrinsics.
Should these instructions have a Debug.Assert((address % 16) == 0)?
Where does this assert go exactly? Is Load a true intrinsic or a helper? Who's using debug builds of corelib, besides the CoreCLR team?
I don't think such difference in semantics justifies adding more intrinsics.
Agree, I suggest that we can fold more existing load intrinsics rather than adding more intrinsics:

* With the legacy (SSE) encoding, fold only LoadAligned, because folding LoadUnaligned would throw exceptions that the users did not expect.
* With the VEX encoding, fold LoadAligned and LoadUnaligned both. The individual LoadAligned may throw exceptions but folded ones won't, and I think this is fine. We assume that users "expect" exceptions from LoadAligned and handle them in their own code, and a folded LoadAligned just throws exceptions with 0% possibility.

The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
Are there enough such people to justify adding more intrinsics?
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
It's perhaps weird for SSE Load to have one semantic and AVX Load to have a different semantic
Yes, if this API is approved, we would (at a minimum) rename the AVX intrinsic to be LoadUnaligned. I already have a note for the API review about whether we should have Store mirror these names.
Where does this assert go exactly? Is Load a true intrinsic or a helper?
A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
and folded LoadAligned just throws exceptions with 0% possibility
I don't think this is acceptable without an explicit user opt-in. The runtime has had a very strict policy on not silently removing side-effects.
The above codegen strategy is already adopted by all the mainstream C/C++ compilers.
C/C++ compilers make a number of optimizations that we have, at this point, decided not to do and to instead address via Analyzers or look at in more depth later.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).
The runtime has had a very strict policy on not silently removing side-effects
It is not removing side-effects; it has no side-effects (hardware exceptions).
It is not removing side-effects; it has no side-effects (hardware exceptions).
All exceptions, including HardwareExceptions, are considered side-effects.
The movaps raises the hardware #GP exception if its input is not properly aligned, and the JIT intercepts that and converts it into a System.AccessViolationException.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
I didn't ask about cases where you tried to do that. I asked if there are enough people who are using .NET Core on old or gimped CPUs and expect things to work perfectly rather than just reasonably.
A true intrinsic and the JIT would add the additional check when optimizations are disabled. This is the only way to get it to work in end-user code for release builds of the CoreCLR.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned. Whether it's a good idea to do that in debug builds is debatable. Some people may end up disabling optimizations to work around a JIT bug or to be able to get a correct stacktrace for a hard-to-track-down bug. Now they risk getting exceptions that would not occur with optimizations enabled.
The runtime has had a very strict policy on not silently removing side-effects
Please do not mislead people. The runtime does not impose any semantics on intrinsics, it's the other way around. LoadAligned may be defined as "throws an exception if the address is unaligned" or as "may throw an exception if the address is unaligned". That may or may not be a good idea. But it doesn't have anything to do with any runtime policy.
We have, from the start, made a lot of effort to ensure that the APIs are deterministic, straightforward, and that they "fit in" with the .NET look, feel, expectations. And, when we want to additionally expose the "less safe" mechanism that C/C++ had, we have discussed whether we believe it is worthwhile and how to expose it independently (CreateScalarUnsafe and ToVector256Unsafe are both good examples of this).
Indeed. Squabbles over names resulted in string intrinsic being removed from .NET 3. Yet adding non-deterministic intrinsics to support old CPUs is doable.
Sounds to me that you don't need some assert or other kind of check, you just need to import Load as LoadAligned.
I believe the result of this proposal is just forcing most people to change Load to LoadUnaligned in their programs.
Yes, we already have cases of trying to use intrinsics and needing to write additional logic to ensure downlevel hardware would also be efficient.
I want to say again: Debug.Assert((address % 16) == 0); is useless, or rather, relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two approaches:
for (; IsUnaligned(addr); addr++)
{
    // processing data until the memory gets aligned
    // (IsUnaligned is a hypothetical helper; see the sketch below)
}

for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadAlignedVector128(addr);
    // or LoadUnalignedVector128(addr), whatever; we recommend always using LoadUnalignedVector128
    // ...
}
Or
for (; addr < end; addr += Vector128<int>.Count)
{
    var v = LoadUnalignedVector128(addr);
    // ...
}
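(IsUnaligned in the first loop is not an existing API; a minimal sketch of such a helper, assuming 16-byte Vector128 alignment:)

```C#
// Hypothetical helper: true while `addr` has not yet reached 16-byte alignment.
static unsafe bool IsUnaligned(int* addr) => ((nint)addr % 16) != 0;
```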
Which approach to use should be based on profiling, and that choice is much more important than folding loads for the SSE encoding.
I didn't ask about cases where you tried to do that
I wasn't saying that with regards to personal attempts.
Now they risk getting exceptions that with optimizations enabled would not occur
Correct, but the API is explicitly documented to do as such and they have the option to enforce the desired behavior by explicitly using one of the other two APIs (LoadAligned or LoadUnaligned).
By not having all three, users have to write their own wrappers for something that we could (and in my opinion should) reasonably handle.
Additionally, if we only expose two APIs and make one of them "sometimes validate alignment", then we can't provide a guarantee to the end user that they can enforce the semantics where required.
Please do not mislead people. The runtime does not impose any semantics on intrinsics
I don't think this is misleading at all. We have had a very clear contract and design for the platform-specific hardware intrinsics that each is tied to a very particular instruction.
LoadAligned is documented to emit movaps and LoadUnaligned (currently just called Load) will emit movups. This gives a very clear story and guarantees that, if someone uses this API, they will get exactly that instruction.
Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.
It is worth noting, however, that we have also tried to determine common patterns, use-cases, and limitations with the decision to tie a given hardware intrinsic to a particular instruction. When we have found such limitations, we have discussed what we could do about it (both the "ideal" and realistic scenarios) and determined an appropriate solution. So far, out of all the scenarios that have been brought up, we have determined to deviate only for CreateScalarUnsafe and ToVector256Unsafe (where we say that the upper bits will be non-deterministic, rather than explicitly zeroed; but for which we also expose "safe" versions that do explicitly zero the upper bits). This may be another case, that is more trivial to resolve via an additional API and that is worthwhile to do.
Squabbles over names resulted in string intrinsic being removed from .NET 3
They have not necessarily been removed from netcoreapp3.0; instead, we decided that we would pull them until we can determine the appropriate way to expose these APIs, as there were concerns over the usability and understandability of said APIs. The underlying instructions behave quite differently from most other intrinsics we've exposed and so they require some additional thought. This is no different from any other API that we decide to expose, in that we don't want to ship something that has a bad surface area, and we are still working towards making sure these APIs ship in a good and usable state.
Yet adding non-deterministic intrinsics to support old CPUs is doable.
As mentioned above, both cases where we exposed an additional non-deterministic API were discussed in length.
I want to say again: Debug.Assert((address % 16) == 0); is useless, or rather, relying on this assert is an incorrect way to treat memory alignment. .NET Core does not have "aligned allocation", so we have to deal with memory alignment in one of two approaches
Once you have pinned the memory, it is guaranteed not to move, and you can rely on the assert. Anyone dealing with perf-sensitive scenarios and larger blocks of memory (for example, ML.NET) should be pinning their data anyway, as the GC could otherwise move the data, and that can mess with the cache, alignment, etc.
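For illustration, a minimal sketch of that pattern, assuming a float[] input (the method name and the elided loop bodies are hypothetical):

```C#
static unsafe void ProcessPinned(float[] data)
{
    fixed (float* pData = data) // pinned: the GC can no longer move (and re-misalign) the buffer
    {
        float* p = pData;

        // ... scalar prologue advances p until ((nint)p % 16) == 0 ...

        Debug.Assert(((nint)p % 16) == 0); // reliable from here on, because the memory is pinned

        // ... vectorized loop over the aligned middle portion ...
    }
}
```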
Which approach to use should be based on profiling, and it is much more important than folding loads.
The "Intel® 64 and IA-32 Architectures Optimization Reference Manual" has multiple pages/sections all discussing this. The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
All this additional API does is give the user the option to choose between:
LoadUnaligned)LoadAligned)Load)A good number of vectorizable algorithms (especially the ones used in cases like ML.NET) can be handled with a worst-case scenario of two-unaligned loads maximum, regardless of the alignment of the input data. For such algorithms, the only question of whether to align or not really comes down to how much data is being processed in total and the cases where you don't want to make the data aligned can generally be special-cased.
The performance impact of performing an unaligned load/store across a cache-line or page boundary is made very clear and there are very clear recommendations to do things like:
Always align data when possible
Generally prefer aligned stores over aligned loads, if you have to choose
Consider just using two 128-bit unaligned loads if you can't align your 256-bit data
etc
Yes, that is what I meant. And aligned loads != movap*.
I can't align my data (they use LoadUnaligned)
My data is aligned and should never be unaligned (they use LoadAligned)
My data is aligned, but you don't have to check (they use Load)
All these three situations are suggested to use LoadUnaligned.
Because of this explicit design decision and because the runtime says side-effects must be preserved, we can't (and shouldn't) just fold LoadAligned under the VEX-encoding.
For the fortieth and two time: the runtime has nothing to do with this. I give up. We're probably speaking different languages or something.
@mikedn, I think the issue is a runtime issue when we start talking about folding the loads into the operations, in ways that might cause a runtime failure to be masked (i.e. if the load would throw on an unaligned address, but the operation into which it is folded would not). And I think what @tannergooding is after here is the ability to have a single load semantic that supports folding.
All that said, I'm still not sure where I fall on this issue.
Importing @CarolEidt's comment from https://github.com/dotnet/coreclr/issues/21308:
I agree that we don't want to silently remove an exception. It is not just that it violates the "strict" behavior that developers expect from the CLR, but could also cause sudden unexpected failures if something changed in the user code (or the optimization settings used) that caused the JIT to no longer fold the instruction.
If users write LoadAligned, they should be responsible to handle the possible exceptions, whatever it is from JIT optimization or actual memory misalignment.
And aligned loads != movap*.
I would disagree here. We have two instructions that perform explicit loads (including movupd and movapd here):

* movups, which works for any data
* movaps, which only works for aligned data

We additionally have the ability to support load semantics in most instructions (ins reg, reg/[mem]).

I would say that the unaligned load could just be called a load, as it loads data with no special semantics (and it is still just a load even if the data happens to be aligned). I would say that an aligned load must also validate that the data is aligned (which only movaps does).
All these three situations are suggested to use LoadUnaligned.
LoadUnaligned does not successfully fulfill all three scenarios by itself. It definitely fills the first scenario, but the second scenario requires an additional validation step (and must fail if the requirements are not met), and the third scenario has an optional validation step (that would fail if the requirements are not met and optimizations are disabled; otherwise, when optimizations are enabled, it would skip the validation check).
I think the issue is a runtime issue when we start talking about folding the loads into the operations
@CarolEidt It most definitely isn't. As far as the runtime/ECMA spec is concerned, intrinsics are just method calls. There's nothing anywhere in the ECMA spec that says that LoadAligned must always throw an exception if the data is not aligned. That's solely in the intrinsic's spec courtyard.
Also, I don't claim that we should change the current LoadAligned semantic. I just want to see reasonable argumentation for adding a bunch of new intrinsics. And I'm not seeing it.
And aligned loads != movap*.
I would disagree here.
LoadUnaligned does not successfully fulfill all three scenarios by itself.
movups and movaps have the same performance, and movups is always safe. Why not use it?
The words "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issues via cache-line splits or page faults, rather than anything about movups itself.
movups and movaps have the same performance and movups is always safe. Why not use it
Because we can't always safely fold a movups. Some examples are being on older/legacy hardware, working around a JIT bug (so you have disabled VEX-encoding support), using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...
The words of "aligned loads" in the optimization manual mean that "unaligned loads" would cause performance issue by cache-line split or page fault rather than movups itself.
Yes, and the purpose of the proposed intrinsic is to:

1. Provide some debug-mode validation that you are not causing cache-line splits or page faults
2. Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding
It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults; and they will not make any attempt to align the input data. As such, you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?
using modern CPUs that don't have AVX support because you are in a low-power, IOT, or budget scenario, etc...
It sounds like your concern is that "most" workloads will just use movups and will not care about cache-line splits or page-faults;
No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned).
Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.
- Provide some debug-mode validation that you are not causing cache-line splits or page faults
- Allow the intrinsic to be folded, regardless of whether you are using the legacy or VEX encoding
If users do not guarantee the alignment, the compiler-generated validation (1) is not reliable, and that makes (2) unacceptable on older CPUs.
If users already guarantee the alignment, the compiler-generated validation (1) is useless; then either i) use LoadUnaligned normally, or ii) use LoadAligned if the code-size on older CPUs is critical (needs https://github.com/dotnet/coreclr/issues/21308)
you think that calling this intrinsic just Load is undesirable because most people will actually use/want LoadUnaligned. Is that about right?
Right.
the compiler-generated validation (1) is not reliable
Given the low-level nature of the code, that the intrinsic is explicitly documented to do the validation, and that the GC currently only guarantees 8-byte alignment, a JIT-inserted check for debug builds should catch essentially every case (especially across multiple runs).
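Conceptually, the proposed intrinsic behaves like the following sketch (this is not the actual JIT mechanism; Debug.Assert stands in for the proposed minopts-only check):

```C#
static unsafe Vector128<float> LoadVector128UnsafeSketch(float* address)
{
    // Debug/minopts: validate the caller's alignment claim.
    Debug.Assert(((nint)address % 16) == 0);

    // Optimized: use whichever load folds under the current encoding.
    return Avx.IsSupported
        ? Sse.LoadVector128(address)          // VEX: movups, foldable
        : Sse.LoadAlignedVector128(address);  // legacy: movaps, foldable
}
```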
(My original proposal was to call this something like LoadAlignedOrUnaligned to explicitly call attention to the semantics, but it was suggested I change that to the current name.)

using LoadAligned if the code-size on older CPUs is critical (needs dotnet/coreclr#21308)
I think that is a non-starter; not only would it silently remove the exception; but it removes the ability for users to guarantee the generation of a movaps (which will always validate the alignment). The current proposal allows users to continue having explicit aligned or unaligned semantics, if desired; but also gives them the ability to do the efficient thing.
No, my concern is that "most" workloads will just use movups and will not care about folding on older CPUs (or other low-power devices you mentioned).
Handling cache-line splits and page faults is another story. Again, aligned loads != using movaps.
I believe that for "most" workloads, dev's will be concerned with cache-line splits and page faults, so they will attempt to align their data if the input is unaligned (these intrinsics are geared towards high-perf and hotspot scenarios, after all). They will want to additionally validate that they will not accidentally cause a cache-line split in production (their alignment logic is correct, they pinned the data, etc) and they will want to generate efficient code.
Therefore, it is my belief that they would write something like (this is using today's APIs and ignoring this proposal):
int misalignment = (int)((nint)pCurrent % 16);

// note: may need to handle unnaturally aligned data, such as if `float`
// (which should be `((nint)pCurrent % 4) == 0`) is something else.

if (misalignment != 0)
{
    // Vector128<T>.Count is a power of two and a constant, so this should optimize to a shift
    misalignment /= Vector128<T>.Count;
    misalignment = Vector128<T>.Count - misalignment;

    for (int i = 0; i < misalignment; i++)
    {
        // Process the individual elements, non-vectorized
    }

    count -= misalignment;
}

if (count >= Vector128<T>.Count)
{
    remainder = count % Vector128<T>.Count;
    count -= remainder;

    for (int i = 0; i < count; i++)
    {
        // Assert we are still aligned
        Debug.Assert(((nint)pCurrent % 16) == 0);

        // Process elements using vectorization and the appropriate Load(pCurrent)
    }
}
else
{
    remainder = count;
}

for (int i = 0; i < remainder; i++)
{
    // Process the individual elements, non-vectorized
}
This proposal does a couple of things:

* It removes the need for the user to write their own Debug.Assert((address % 16) == 0), as we will do the equivalent automatically
* The code the user writes remains the same, and if they want to have explicit Unaligned semantics or explicit Aligned semantics, they still have the option to use one of the other two overloads and get the correct/expected behavior for those
We just repeat our words again and again. I would like to stop the discussion and leave the proposal to the API review.
My original proposal was to call this something like LoadAlignedOrUnaligned to explicitly call attention to the semantics, but it was suggested I change that to the current name.
At least, do not name the new APIs Load.
At least, do not name the new APIs Load.
Right now I also think this shouldn't be named Load, as it hides something.
Maybe (just thinking loud): LoadSafe (as opposed to Unsafe), LoadAutomatic, LoadFast, ...
On recent x86 hardware there is no need for this method -- I think here we have consensus, because always using unaligned reads has no penalty, and is recommended by Intel, too.
But as @tannergooding tries to point out, there exists hardware on which the same code should run optimally without any need for change, where unaligned reads do have a penalty, so aligned reads should be used -- depending also on whether the JIT will emit VEX or not.
This case could be handled with this new method. Therefore I think it is reasonable to add it.
But the name should be simple and obvious, not to overwhelm users and leave them with open questions about which Load-method to use.
If these methods have to exist, I agree Load is probably a bad name for them. On the other hand, it's better than the self-documenting LoadAlignedOnOldCpusOrUnalignedOnNewCpusOrBetterYetJustFoldTheLoadIfPossible.
I think @mikedn is right here, though. This is a new API, and if the documentation says LoadAligned may throw an exception on unaligned data, that's good enough. Is there really ever a scenario where you need to guarantee a movaps will be emitted for the intrinsic method? I like the idea of a 1:1 mapping of intrinsics to instructions, but since we can't access registers directly or specify the reg/mem targets of the instructions directly, there's a limit to how far that philosophy can go.
For high-performance scenarios, people will be ensuring the data is aligned anyway, and if that's the only way to get the load folded, that just further encourages the best practice.
Thinking about this more, I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:
```C#
Vector128<float> v;
if (Avx.IsSupported) {
    v = Sse.LoadAligned(...) // Data isn't really aligned but this allows folding the load
} else {
    v = Sse.LoadUnaligned(...)
}
```
...
Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method. What if we had Sse.LoadUnaligned, Sse.LoadAligned, and Avx.LoadVector128?
Sse.LoadAligned could be folded in both encoding forms (again with the caveat that the docs don't guarantee that unaligned addresses would throw on all CPUs), and Avx.LoadVector128 could always be folded because it would only be available when VEX encoding is supported.
I reckon that if LoadAligned were required for folding, you'd end up with people writing code like this:
No, always using LoadUnaligned. But if you worry about the SSE encoding codegen, use Sse.LoadAligned.
Selecting a different load based on if (Avx.IsSupported) is unnecessary because you have to align your data for LoadAligned like https://github.com/dotnet/corefx/issues/33566#issuecomment-443362604, so the Sse.LoadUnaligned(...) branch is meaningless and just complicates your code.
Right, that was in reference to my previous suggestion that LoadAligned be required for folding in the API even though it's not required in all encodings. But people would end up calling the aligned method to get the folding even with unaligned data when they know it's supported by the CPU. I was just shooting down my first suggestion and then offering another.
Oh, I see, thanks for the explanation.
people writing code like this:
The code is backwards. When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address. If the user opts-in, we could additionally fold some method that sometimes checks alignment (whether that is via a 3rd, separate API; or via some other mechanism).
When not using the VEX encoding, we can only fold LoadAligned because the instructions themselves will assert alignment and will fail if the address is not aligned (there is no way to workaround this).
Since this ultimately comes down to differences in CPU capabilities/encodings, maybe the right thing is to fit the API shape to that rather than trying to make a single magic adaptive method
That is what this proposal is trying to do. It is trying to keep the two existing LoadUnaligned (currently just called Load) and LoadAligned methods, so that users who need to can explicitly use the required behavior.
It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
I think doing anything without an explicit opt-in is a non-starter and the general rule has been that we don't expose a non-deterministic API without also exposing a deterministic one; so I don't think we can or should just relax LoadAligned.
I am not convinced that just calling the new API Load is right, as I don't think it conveys all the necessary semantics (which is why I originally had it named differently).
I am convinced that exposing an additional API for this behavior would be beneficial. We already know of cases where users will be running certain types of ML.NET code on low-power/IOT devices (such as Raspberry PI). Not everyone is buying the latest CPUs; and sometimes when they do, they go for budget (which may mean they purchase a modern CPU without AVX support, and there are several of them that get released each generation).
such as Raspberry PI
is running on ARM, so it's out of the game here.
is running on ARM, so it's out of the game here.
Right, but there are also modern low-power/budget x86 computers without AVX support (from both AMD and Intel). I am unsure if we have any specific numbers on usage, but given that there are other low-power/IOT devices targeting ML.NET, I feel they shouldn't be excluded.
That's my opinion too.
The code is backwards.
My sample refers to my proposed implementation where LoadAligned doesn't guarantee an exception on unaligned data and is the only load type that is considered for folding by the JIT because it would work in both encodings. My point was that if the developer knows that method is the only one that can be folded and knows that on VEX processors, the data doesn't actually have to be aligned, they'd end up calling LoadAligned specifically on unaligned data but only on VEX processors, which is even more confusing than your proposal. I talked myself out of that.
When using the VEX encoding (Avx.IsSupported is true), we can only fold LoadUnaligned because the instructions do not validate the alignment of the address
I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data. Intel guarantees the fault in the movaps instruction, but then they recommend against using that instruction on newer processors. I'm saying that if you never guarantee that movaps will actually be emitted, then you're not breaking anything by not throwing.
It is additionally trying to expose a 3rd API that allows users to explicitly opt into doing the "efficient" thing (which basically requires saying: "Hey, my data is aligned, validate my assertion in debug but do the efficient thing in release").
This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned. An Avx.LoadVector128 could be safely folded in all cases, whereas the too-clever Load would work on VEX processors and crash on legacy with unaligned data. The proposed assert in minopts mitigates that possibility but then creates an unnecessary restriction on the VEX processors.
Not everyone is buying the latest CPU's
I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths. Not being able to completely optimize the AVX path because of a restriction on the older/lower-end processors is bad too.
I totally agree, and that's why I think there will be a large number of apps/libraries that have both SSE and AVX-optimized code paths.
Libraries will likely have both 128-bit and 256-bit optimized code paths. For the 128-bit code path, they may consider both the VEX-encoding and the legacy encoding scenarios.
I'm still not following your reasoning for believing you owe the user an exception if they call an aligned method with unaligned data.
An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.
This still doesn't match up with the capabilities of the processor, though, because under VEX encoding the data doesn't have to be aligned.
Correct. And if your data is not going to be aligned or if it is not beneficial for you to handle leading/trailing elements to make the rest of the handling aligned; you would call the explicit LoadUnaligned method.
However, as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned" (so that they don't have the cache-line penalty every 4 reads/writes or a page fault every X reads/writes). They just won't use LoadAligned because the JIT won't emit the efficient thing on modern CPUs. They will, however, likely want to have validation in Debug mode that their input is actually aligned.
but then they recommend against using that instruction on newer processors
The optimization manual makes no recommendations about using or not using movaps; rather they recommend that you align your data when possible and have a number of recommendations on what to do if you can't guarantee data alignment (such as preferring aligned stores over aligned reads). They also recommend using the more efficient encoding (rather than an explicit load) when possible.
The only reason not to use LoadAligned is because it will not be folded (emit the more efficient encoding) on VEX-enabled hardware. Yes, this is currently an API limitation, but no, I don't think it will be relaxed.
The reason to use LoadUnaligned is because it will be folded (emit the more efficient encoding) on VEX-enabled hardware. The downside is that, when you are doing performance optimizations and your data is aligned, you get no validation of that fact and you must add explicit Debug.Asserts to your own code.
This proposal came from the belief that needing to use LoadUnaligned and additionally assert the alignment was a common enough case. It has the additional benefit that it will be folded under either the legacy or VEX encoding, while still providing explicit APIs for the Aligned vs Unaligned cases.
An API called LoadAligned that does not validate that the data is actually aligned is confusing and would likely be shot down in API review.
I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.
as I listed above, it is my expectation that for all but the smallest input cases; users will be pinning their data and using various well-known "tricks" to get the majority of the data "aligned"
This was my initial thought as well, but then I thought back to the challenge of re-implementing System.Numerics using S.R.I. On a granular method like operator+, you can't make any kind of assumption about the alignment, but the folding would still be beneficial. I don't know if some of the plans around first class structs and containment will address that through some other trickery, but limiting reg/mem encoding to only aligned memory scenarios is bound to cause problems elsewhere.
They also recommend using the more efficient encoding (rather than an explicit load) when possible.
This is the part that's most important to me. I can figure out whether I've aligned my pointers, but I want to get the most efficient encoding in every possible case.
I see your point, but a method called Load that requires aligned data on some hardware but doesn't really on other hardware and enforces that only through an assert isn't really a lot better.
Right. I expect the API review team will lean towards another name (like LoadUnsafe, due to its non-deterministic nature).
you can't make any kind of assumption about the alignment, but the folding would still be beneficial.
Right, and in this case (where you are doing one-off operations) you would have to use LoadUnaligned. It will do the efficient encoding on modern hardware, but will involve an additional instruction for older hardware. -- However, you also likely won't be doing explicit loads here. I would imagine you would just use Unsafe.As to go from Vector4 to Vector128 and the JIT would do the efficient conversion (as we will often already be in register).
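For reference, a sketch of that reinterpretation (Unsafe.As is System.Runtime.CompilerServices.Unsafe; both types are 16 bytes, so with the value already enregistered the JIT can treat this as a no-op):

```C#
static Vector128<float> ToVector128(System.Numerics.Vector4 value)
{
    // Reinterpret the 16-byte Vector4 as a Vector128<float>; no explicit load needed.
    return Unsafe.As<System.Numerics.Vector4, Vector128<float>>(ref value);
}
```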
Ah, ok. As long as LoadUnaligned can still be folded on VEX hardware, that's fine. I was under the impression you were proposing LoadAligned always emitted movaps, LoadUnaligned always emitted movups, and that only Load (or LoadUnsafe or whatever) could be re-encoded into to another instruction.
I was under the impression you were proposing
Ah, definitely not. LoadAligned and LoadUnaligned would continue on with the exact same semantics as we have today (and would get folded in the same scenarios). The new API would behave as a LoadAligned and would validate the address under minopts, but would be folded under both VEX and legacy encodings when optimizations are enabled.
It's also worth noting that this proposal matches basically what the native compilers do.
For example (see https://godbolt.org/z/oITxgp), clang will emit exclusively movaps for _mm_load_ps and movups for _mm_loadu_ps in debug mode. However, in Release mode it will fold either for the VEX encoding and only _mm_load_ps for the legacy encoding.
The proposed API here does the same, but still exposes an explicit LoadAligned for when users explicitly want movaps.
-- cc. @fiigii, since he suggested matching native behavior in the folding
It's also worth noting that this proposal matches basically what the native compilers do.
The problem is why not let the current APIs (Load and LoadAligned) match native compilers. Adding new APIs is unnecessary, in my opinion.
Thanks for the explanation. I'm still not sure why .NET developers would need a way to explicitly emit movaps if native developers don't, which is I think what @fiigii is saying too. I'm not too bothered one way or another as long as there is some way to get the JIT to emit efficient code.
The problem is why not let the current APIs (Load and LoadAligned) match native compilers.
Simply because the previous feedback from the API review and others (like @CarolEidt) has been that we don't want to expose non-deterministic APIs without good reason, and if we do, we should also expose a deterministic version.
I am just trying to meet the requirements that I have frequently heard and received back on other proposals while exposing an API that provides the desired semantics for something that is believed to be a common enough scenario.
I do not understand why "a possible exception not happening" is non-deterministic...
It is, by definition, non-deterministic because it:

* does one thing on some hardware (VEX) and a different thing on other hardware (legacy)
* does one thing or another depending on how the user compiled their code (Release or Debug mode)

AFAIK, we don't expose any other APIs today where an external library (this includes S.P.Corelib, which is external to the user/application-code) will do something different based on how the calling assembly was compiled. The closest we have is System.Diagnostics.ConditionalAttribute, but that impacts whether the compiler emits the call or not and only impacts code in the calling library; it does not impact the code in the external library or the functionality/behavior of said code if it is called (such as via Reflection).
This was briefly discussed in API review today and we will come back to it in a future review meeting.
The suggestion was that:

* The existing APIs (Load and LoadAligned) should remain as they are, with the existing semantics
* The new API should still be named LoadVector128, but with some additional moniker that signifies the non-deterministic nature

I have updated the original post to include the truth tables and to change the name to LoadVector128Unsafe, which relays the special non-deterministic nature of the additional API and which follows the previous naming we've done for other non-deterministic APIs.
CC. @bartonjs, @terrajobst
@tannergooding - Is this still under discussion? It seems unlikely that we will be able to incorporate the Unsafe forms for 3.0, with JIT support. I'm moving this to Future - let me know if you think this needs to be in for 3.0. Since the latest proposal doesn't impact existing APIs, it seems that we can introduce it later without breaking existing code.
Yes, this is still under discussion.
I agree that addressing this post 3.0, if it is needed, is fine. Based on the last feedback received, these methods (if exposed) would have new names containing Unsafe and the existing APIs would remain untouched.
Notes from the API review:

* LoadAlignedVector128Unsafe will throw an AccessViolationException when optimizations are disabled and the data isn't aligned. When optimizations are on, it depends on the hardware: on newer hardware, the access will work but be slower; older hardware will AV.
* Keep Aligned in the API name.
* Unsafe should be a prefix, rather than a suffix (to make sure it pops), but to be consistent with other APIs we'll go with a suffix here.

```C#
namespace System.Runtime.Intrinsics.X86
{
public abstract partial class Sse
{
// New APIs:
public unsafe Vector128<float> LoadAlignedVector128Unsafe(float* address);
// Existing APIs:
// public unsafe Vector128<float> LoadVector128(float* address);
// public unsafe Vector128<float> LoadAlignedVector128(float* address);
}
public abstract partial class Sse2
{
// New APIs:
public unsafe Vector128<double> LoadAlignedVector128Unsafe(double* address);
public unsafe Vector128<byte> LoadAlignedVector128Unsafe(byte* address);
public unsafe Vector128<sbyte> LoadAlignedVector128Unsafe(sbyte* address);
public unsafe Vector128<short> LoadAlignedVector128Unsafe(short* address);
public unsafe Vector128<ushort> LoadAlignedVector128Unsafe(ushort* address);
public unsafe Vector128<int> LoadAlignedVector128Unsafe(int* address);
public unsafe Vector128<uint> LoadAlignedVector128Unsafe(uint* address);
public unsafe Vector128<long> LoadAlignedVector128Unsafe(long* address);
public unsafe Vector128<ulong> LoadAlignedVector128Unsafe(ulong* address);
// Existing APIs:
// public unsafe Vector128<double> LoadVector128(double* address);
// public unsafe Vector128<byte> LoadVector128(byte* address);
// public unsafe Vector128<sbyte> LoadVector128(sbyte* address);
// public unsafe Vector128<short> LoadVector128(short* address);
// public unsafe Vector128<ushort> LoadVector128(ushort* address);
// public unsafe Vector128<int> LoadVector128(int* address);
// public unsafe Vector128<uint> LoadVector128(uint* address);
// public unsafe Vector128<long> LoadVector128(long* address);
// public unsafe Vector128<ulong> LoadVector128(ulong* address);
//
// public unsafe Vector128<double> LoadAlignedVector128(double* address);
// public unsafe Vector128<byte> LoadAlignedVector128(byte* address);
// public unsafe Vector128<sbyte> LoadAlignedVector128(sbyte* address);
// public unsafe Vector128<short> LoadAlignedVector128(short* address);
// public unsafe Vector128<ushort> LoadAlignedVector128(ushort* address);
// public unsafe Vector128<int> LoadAlignedVector128(int* address);
// public unsafe Vector128<uint> LoadAlignedVector128(uint* address);
// public unsafe Vector128<long> LoadAlignedVector128(long* address);
// public unsafe Vector128<ulong> LoadAlignedVector128(ulong* address);
}
}
```
Is the whole purpose of these APIs to generate a bit more efficient code for old hardware? Do we have any numbers about the benefits?
These APIs smell like over-engineering to me.
It's not just older hardware. It's also for ready-to-run and AOT code, which needs to be compiled for non-VEX hardware by default.
I can get more concrete numbers but it can reduce the number of instructions by up to half (as the load can be folded into the consuming instruction, rather than needing to be separate).
It's also for ready to run and AOT code which needs to be compiled for non-vex hardware by default.
We do not have R2R for hardware intrinsics today (except a fragile hack for CoreLib). The proper fix for the hardware intrinsics AOT problem is to compile multiple versions of the method for different types of hardware and then pick the right one at runtime.
That's the same scenario, ultimately. The only difference between targeting VEX and non-VEX enabled hardware (for the 128-bit code path) is whether you use LoadVector128 (VEX) or LoadAlignedVector128 (non-VEX).
In native land, the C++ compilers expose _mm_load_ps and _mm_loadu_ps (with other equivalents for integer and double types). loadu always does unaligned loads and will only be folded on VEX hardware. load does an aligned load in unoptimized code and does an appropriately folded load in optimized code (whether VEX or non-VEX).
We opted to expose LoadVector128 (unaligned) and LoadAlignedVector128 (aligned) because it more clearly states the contract. We also said that LoadAlignedVector128 will only be folded on non-VEX hardware because we don't want to have silently differing/non-deterministic behavior for these APIs. We likewise said that for the few cases where we expose differing/non-deterministic behavior, we would do that with Unsafe (e.g. Vector128.CreateScalarUnsafe).
This API fills a gap where we weren't providing functionality available in native code. It provides the ability for users to write a single algorithm that supports both VEX and non-VEX encodings and for the JIT/AOT to always emit the appropriate codegen. It also ensures that users continue getting the appropriate validation when optimizations are disabled, to help prevent bugs.
It's also worth pointing out that non-VEX doesn't necessarily mean old. Intel's Atom line still doesn't support VEX, but it's current technology in every other respect.
This API fills a gap where we weren't providing functionality available in native code.
We are not trying to match native code 100%. We are trying to get close enough while staying true to .NET values like simplicity and productivity. Having all of LoadAlignedVector128, LoadAlignedVector128Unsafe, and LoadVector128 is anything but simple.
At the same time, the hardware intrinsic APIs weren't exactly designed with simplicity in mind. We are expecting users to pin their data, understand the hardware they are targeting, and write multiple code paths that do functionally the same thing but with different underlying instructions depending on the hardware they are running against.
This makes one aspect of the coding pattern better by allowing them to cleanly write Sse.LoadAlignedVector128Unsafe(pAddress) rather than the following (which involves another helper method, the JIT cleanly inlining the call, dropping the dead code paths, and folding the load; which we've already determined can be problematic in some cases):
```C#
if (Avx.IsSupported)
{
    Debug.Assert((((nint)pAddress) % 16) == 0);
    return Sse.LoadVector128(pAddress);
}
else
{
    return Sse.LoadAlignedVector128(pAddress);
}
```
It is also something these code paths commonly need to consider, especially in places like S.P.Corelib where we do have existing AOT/R2R support and which defaults (and will continue to default) to supporting non-VEX enabled hardware. Users not wanting to deal with the complexity of hardware intrinsics have the option of using the type-safe and portable Vector&lt;T&gt; type instead.
I was just writing the same thing. I actually have this code in places to ensure I get folded loads:
```C#
// ensure pointer is aligned
if (Avx.IsSupported)
    Sse.Add(vec, Sse.LoadVector128(p));
else
    Sse.Add(vec, Sse.LoadAlignedVector128(p));
```
… which is far less simple and far less discoverable than LoadAlignedVector128Unsafe. If we could only have the one pair of methods, I'd want the LoadAlignedVector128Unsafe behavior. It's the current LoadAlignedVector that's more difficult to reason about given an understanding of how the native intrinsics work.
Are there any cases where one would want to use LoadAlignedVector128 instead of LoadAlignedVector128Unsafe?
I can't imagine one, particularly if the Unsafe variant can be implemented so it throws in minopts. I assume it would just emit movaps and never fold the load, exactly as LoadAlignedVector128 does on VEX processors today? I can see the reasoning behind wanting deterministic behavior for the .NET implementation, but I can't imagine an actual user of these APIs making the performance tradeoff to get the determinism if the choice is there.
I just had a chance to watch the review meeting video, and it does seem like the API shape and behavior with the two aligned load variants is confusing. I was equally confused until @tannergooding walked through it earlier in this thread.
So we should just fix the LoadAlignedVector128 to do the right thing. It does not make sense to tell everybody to change LoadAlignedVector128 to LoadAlignedVector128Unsafe in their code for best results.
I do not buy the argument about deterministic behavior for alignment faults. The platform does not guarantee deterministic alignment faults. For example, accessing misaligned double* may throw on some architectures in some situations. We do not specify when exactly nor try to ensure that it stays the same from version to version.
The platform does not guarantee deterministic alignment faults.
x86, however, does guarantee deterministic alignment faults, and the hardware intrinsics expose hardware-specific functionality. The aligned load/store operations will always raise a #GP(0) (General Protection) fault if the memory address is not correctly aligned (16 bytes for Vector128, 32 bytes for Vector256).
Thus far, we have a 1-to-1 mapping of the exposed API surface to an underlying instruction for the safe APIs. That is, Sse.Shuffle will always emit shufps, Sse.LoadVector128 will always emit movups, and Sse.LoadAlignedVector128 will always emit movaps. We've only exposed non-deterministic behavior under the "unsafe" APIs (such as Vector128.CreateScalarUnsafe, which is equivalent to CreateScalar but leaves the upper bits non-deterministic rather than forcing them to zero). This was an explicit design decision so users could retain control over the emitted instructions and be sure they got deterministic behavior. This in turn helps ensure more deterministic perf results and allows them to micro-tune their code to get the best results possible.
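For example (a small self-contained illustration of the existing convention; the exact upper-element contents from CreateScalarUnsafe are unspecified by design):
```C#
using System.Runtime.Intrinsics;

internal static class CreateScalarExample
{
    public static void Demo()
    {
        // CreateScalar deterministically zeroes the upper three elements.
        Vector128<float> deterministic = Vector128.CreateScalar(1.0f);

        // CreateScalarUnsafe leaves the upper three elements undefined, which
        // can save a zeroing instruction when they get overwritten anyway.
        Vector128<float> nonDeterministic = Vector128.CreateScalarUnsafe(1.0f);
    }
}
```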
It does not make sense to tell everybody to change LoadAlignedVector128 to LoadAlignedVector128Unsafe in their code for best results.
I don't believe this is what most users will have to do. There are only a few categories of code that you can have right now.
You have people who don't care what the alignment is and are explicitly using LoadVector. These users are getting efficient codegen on newer hardware and will need to make no changes.
Then you have users who are explicitly taking alignment into account (this has to be explicit since CoreCLR provides no way to guarantee 16-byte or higher alignment today).
Among the users who are taking alignment into account, they are either doing it because they want alignment enforcement or they are doing it for performance.
If they are doing it for alignment enforcement they will only be using LoadAlignedVector and changing the behavior of LoadAlignedVector will be breaking.
If they are instead doing it for performance, then they will either have written two code paths or have written a small helper method like I gave above, which uses LoadVector on VEX enabled hardware and LoadAlignedVector on non-VEX enabled hardware.
These new APIs target the latter category of user and give them an API which bridges that gap without regressing any other scenario. It only reduces the amount of code they have to write, and it makes it an explicit opt-in that you are willing to trade strict alignment checking for performance.
changing the behavior of LoadAlignedVector will be breaking.
Throwing fewer exceptions is not a breaking change.
explicit opt-in that you are willing to not have strict alignment checking in favor of performance
There is very little value in this explicit opt-in. Alignment bugs are like threading bugs to a large degree. If you have an alignment bug in your code, you often get lucky because some other part of the code guarantees alignment.
Throwing fewer exceptions is not a breaking change.
I think the difference is that this is a hardware exception raised as the side-effect of a given instruction. It is used to ensure correctness and sometimes because not all hardware has fast unaligned loads (particularly older hardware and sometimes non-VEX enabled hardware).
Alignment bugs are like threading bugs to a large degree.
I don't think I'd agree, particularly when the expectation is that users explicitly pin and check alignment first (such that they can specially handle leading/trailing elements via an unaligned and masked memory operation).
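For reference, the "pin and check alignment" pattern being described usually looks something like the following (a minimal sketch; AlignUp16 and the buffer handling are illustrative, and a real implementation still has to process the leading/trailing elements separately):
```C#
using System.Diagnostics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class AlignmentSketch
{
    // Illustrative helper: round an address up to the next 16-byte boundary.
    private static float* AlignUp16(float* p)
        => (float*)(((nuint)p + 15) & ~(nuint)15);

    // Assumes 'values' has enough elements past the first aligned address.
    public static Vector128<float> LoadFirstAlignedBlock(float[] values)
    {
        fixed (float* pValues = values)
        {
            float* pAligned = AlignUp16(pValues);
            Debug.Assert(((nuint)pAligned % 16) == 0);

            // A complete implementation would handle the elements before
            // pAligned (and any trailing elements) with unaligned loads.
            return Sse.LoadAlignedVector128(pAligned);
        }
    }
}
```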
That being said... @GrabYourPitchforks, @terrajobst, @CarolEidt. Do you have a problem with changing the behavior of LoadAlignedVector* to match the desired behavior rather than exposing the new API?
The LoadAlignedVector* APIs are documented to map exactly to [v]movdqa, so as a consumer I'd certainly be surprised if the JIT ended up emitting any other instruction such as [v]movdqu. As we discussed yesterday I have relied on the fact that [v]movdqa AVs as a way to check my implementation while running under a test environment and while fuzzing. (Generally, fuzzing is meant to be run over a fully-optimized application, but I suppose it's possible to kick off two runs: one with optimizations enabled and one without. It's just more work to set up.)
As long as there exists some setting which will force this API to emit [v]movdqa, and as long as we can ensure we have unit tests running under such an environment, my scenario is satisfied.
As long as there exists some setting which will force this API to emit [v]movdqa,
The configuration switches to explicitly set the different hardware support levels (e.g. COMPlus_EnableAVX, etc.) should do the trick, no?
APIs are documented to map exactly to [v]movdqa, so as a consumer I'd certainly be surprised if the JIT ended up emitting any other instruction such as [v]movdqu
It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all. So where you would have gotten a fault if vmovdqa were emitted, you won't because the load isn't present at all.
The real question is, if given a choice between a method that always emits vmovdqa and one that potentially eliminates that instruction by folding it into another, would you ever use the one that guarantees never to fold? If the purpose is to avoid a performance hit due to a load that crosses cache line boundaries, how is a load instruction that can't be eliminated/folded an improvement? As long as minopts emits the aligned load always so you can test, I don't see anyone choosing the other method.
Edit:
It's also worth noting that on non-VEX hardware, that method isn't guaranteed to emit movdqa today because it may fold that. The only difference is that non-VEX will fault even if the load is folded.
Also, in the Intel intrinsics guide, movdqa is documented as
Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
Note that it says may there
It's not a question of it emitting a different instruction than you requested, it's that the load would be folded so it's not emitted at all.
Which is why the original proposal exposes a new API.
As it is today, if you call LoadAlignedVector128 you will always get codegen that will load 16-bytes from a given address and will generate a #GP(0) if that address is not also 16-byte aligned. Normally this is movaps/movapd/movdqa but it can also safely be folded into the consuming instruction for the non-VEX encoding. This is "safe" to do as most SIMD instructions (all except 2, iirc) that can also encode a memory operand require the address to be aligned and therefore will preserve the fault behavior.
When using the VEX-encoding, the instructions were changed to no longer fault if the address wasn't aligned. This means it is no longer "safe" to fold because doing so could remove a #GP(0) that would have otherwise been raised.
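Concretely, the difference looks like this (a hypothetical snippet; the assembly in the comments illustrates the folding behavior described above rather than guaranteed codegen):
```C#
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class AlignedFoldingExample
{
    public static Vector128<float> AddFromMemory(Vector128<float> left, float* pRight)
    {
        // Non-VEX: the load can be folded into   addps xmm0, [pRight]
        //   (the legacy memory form itself requires alignment, so the
        //   #GP(0) fault behavior is preserved even when folded).
        // VEX:     the JIT must keep the load separate,
        //              vmovaps xmm1, [pRight]
        //              vaddps  xmm0, xmm0, xmm1
        //   because folding into vaddps would silently drop the check.
        return Sse.Add(left, Sse.LoadAlignedVector128(pRight));
    }
}
```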
It's also worth noting that on non-VEX hardware, that method isn't guaranteed to emit movdqa today because it may fold that
The key point being that we were still providing deterministic behavior and preserving the semantics of the original code.
Note that it says may there
The native intrinsics don't guarantee that _mm_load_* will emit a movaps/movapd/movdqa. Their intrinsic behaves like the proposed LoadAlignedVector128Unsafe. They also don't provide deterministic APIs for things like CreateScalar and expect that you will explicitly zero the bits yourself if required, which is something that we previously said was undesirable for .NET.
The configuration switches to explicitly set the different hardware support levels (e.g. COMPlus_EnableAVX, etc.) should do the trick, no?
By itself it won't do the trick. The behavior would be:
| COMPlus_EnableAVX | Optimizations Enabled | Result |
| -- | -- | -- |
| 0 | 0 | #GP(0) if unaligned |
| 0 | 1 | #GP(0) if unaligned |
| 1 | 0 | #GP(0) if unaligned |
| 1 | 1 | No exception raised |
The real question is, if given a choice between a method that always emits vmovdqa and one that potentially eliminates that instruction by folding it into another, would you ever use the one that guarantees never to fold? If the purpose is to avoid a performance hit due to a load that crosses cache line boundaries, how is a load instruction that can't be eliminated/folded an improvement? As long as minopts emits the aligned load always so you can test, I don't see anyone choosing the other method.
This question hinges on whether you want "fast code" or "strict code". If you want "fast code", then agreed: there is never a reason to want the movaps/movapd/movdqa emitted. You need data to be aligned to be fast on non-VEX hardware, but you still want the load to be folded to get the smaller codegen under the newer VEX encoding.
If you want "strict" code, then you would strictly want the alignment checking always, even under the newer VEX encoding and even if it provides a minimal perf hit. Likewise, you would probably not want to do the alignment checking yourself when there is a single instruction that will efficiently do it.
I tend to prefer "fast code" myself, and I can't currently think of any scenario where you would want both the VEX encoding and "strict" mode. However, at the same time, I don't think exposing a non-deterministic API (one whose result differs between optimizations on/off) is good API design, and I feel it goes against what we've previously chosen to do for these APIs and for .NET in general.
(To clarify, I mean exposing a non-deterministic API without a deterministic equivalent).
There are two different reasons why you may want to do fuzzing: with optimizations enabled (the code you actually ship) and with optimizations disabled (where stricter checks can surface latent bugs).
You want to do both of these for different hardware support levels.
The hardware architectures out there, and .NET specifically, do not provide strong guarantees around alignment faults across the board, so designing a fuzzing strategy around the fact that a few x86 instructions emit alignment faults is not a viable cross-architecture strategy.
Do you have a problem with changing the behavior of LoadAlignedVector* to match the desired behavior rather than exposing the new API?
I don't have a problem with that change, and the arguments for seem much more compelling than the arguments against.
In general, we expect developers to use HW Intrinsics in scenarios where perf matters, and we expect them to have a great deal of sophistication with regard to their usage. That said, perhaps the subtleties here merit some guidance on this topic (blog post, anyone?)
Closing this. We've updated the containment handling to efficiently do this instead.
Most helpful comment
The JIT has to preserve the semantics of the underlying instruction. Currently, LoadAligned will cause a hardware exception to be raised if the input address is not aligned; LoadUnaligned (presently named just Load) has unaligned semantics and will not raise an exception, regardless of the alignment of the input address.

When using the legacy encoding, the vast majority of the instructions which allow a memory operand have "aligned" semantics and will cause a hardware exception to be raised if the input address is not aligned.

When using the VEX encoding, the reverse is true: the vast majority of the instructions which allow a memory operand have "unaligned" semantics and will not cause a hardware exception to be raised if the input address is not aligned.
This means that, when we emit using the legacy encoding, we can only fold and emit efficient code for LoadAligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we can emit addps xmm0, [pRight]; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we have to instead emit movups xmm1, [pRight]; addps xmm0, xmm1.

And when using the VEX encoding, we can only fold and emit efficient code for LoadUnaligned. That is, if you see var result = Sse.Add(left, Sse.LoadAligned(pRight)) we have to emit movaps xmm1, [pRight]; addps xmm0, xmm0, xmm1; however, if you see var result = Sse.Add(left, Sse.LoadUnaligned(pRight)) we can just emit addps xmm0, xmm0, [pRight].

We have to do it this way because you would otherwise risk losing an exception that would have been raised (e.g. if you fold LoadAligned when emitting the VEX encoding) or risk raising an exception that would otherwise not have been (e.g. if you fold LoadUnaligned when emitting the legacy encoding).

We can expect that users will want to emit the most efficient code possible; however, we can also expect that users, where possible, will be using various well-known tricks to get their data aligned so that they don't incur the penalty for reading/writing across a cache-line or page boundary (which will occur every 4 reads/writes on most modern processors, when working with 128-bit data).
This leads users to writing algorithms that do the following (where possible):
- Handle the leading elements, up to the first aligned address, via unaligned or masked memory operations
- Process data as aligned
- Handle any trailing elements via unaligned or masked memory operations
They will naturally want to assert their code is correct and ensure codegen is efficient, so they will (in the "Process data as aligned" section) likely have a Debug.Assert((address % 16) == 0) and will be using LoadUnaligned (presently just called Load).

However, it is also not unreasonable that they will want to emit efficient codegen if the user happens to be running on some pre-AVX hardware (as is the case with the "Benchmarks Games", which run on a Q6600 -- which it is also worth noting doesn't have fast unaligned loads). If the user wants to additionally support this scenario, they need to write a helper method which effectively does the following:
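A sketch of such a helper, following the pattern shown earlier in the thread (the name LoadAligned128 is illustrative):
```C#
public static unsafe Vector128<float> LoadAligned128(float* pAddress)
{
    if (Avx.IsSupported)
    {
        // VEX: the unaligned load folds regardless of alignment,
        // so validate alignment in debug builds only.
        Debug.Assert((((nint)pAddress) % 16) == 0);
        return Sse.LoadVector128(pAddress);
    }
    else
    {
        // Legacy encoding: the aligned load both validates and folds.
        return Sse.LoadAlignedVector128(pAddress);
    }
}
```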
Which also depends on the JIT being able to successfully recognize and fold this code.
So, this proposal suggests providing this helper method built-in by renaming the existing Load method to LoadUnaligned and exposing a new method Load (I had originally suggested something like LoadAlignedOrUnaligned, but it was suggested I change it to the current name) which effectively does the above. That is, on hardware with VEX encoding support it uses LoadUnaligned, and when optimizations are disabled it will validate that the input address is aligned; while on older hardware that uses the legacy encoding, it will use LoadAligned. This ensures that the code can always be folded efficiently and that the semantics remain the same (without hurting perf when users compile their executables with optimizations enabled). We also still have the LoadUnaligned and LoadAligned methods explicitly for the cases where the user can't always do the "optimal" thing (sometimes you just have to work with data that can be unaligned).