A really interesting concept is function multi-versioning. The general idea is to support implementing multiple versions of a function for different hardware and having the correct version of the function selected at run time. Made up sample code:
pub fn someMathFunction(vec: Vector) Vector [target: sse4.2]
{
// optimized for SSE 4.2
}
pub fn someMathFunction(vec: Vector) Vector [target: avx2]
{
// optimized for avx2
}
pub fn someMathFunction(vec: Vector) Vector [target: default]
{
// no asm/intrinsics optimization
}
// later on
const v = giveMeAVect();
const v2 = someMathFunction(v); // calls the best version based on run time selection
There are ways to simulate this using function pointers, but the compiler would be better at optimizing this, plus implementing that over and over by hand would suck.
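For reference, a rough sketch of what that hand-rolled function-pointer dispatch could look like today (the Vector alias, the feature check, and the per-target bodies are placeholders, not a real implementation):
const Vector = [4]f32; // placeholder type
fn cpuHasAvx2() bool {
    return false; // placeholder for real runtime feature detection
}
fn someMathFunctionGeneric(vec: Vector) Vector {
    return vec; // no asm/intrinsics optimization
}
fn someMathFunctionAvx2(vec: Vector) Vector {
    return vec; // avx2-optimized path would go here
}
// Selected once at startup; every call pays an indirect call and loses inlining.
var someMathFunctionImpl = someMathFunctionGeneric;
pub fn initMathDispatch() void {
    if (cpuHasAvx2()) someMathFunctionImpl = someMathFunctionAvx2;
}
pub fn someMathFunction(vec: Vector) Vector {
    return someMathFunctionImpl(vec);
}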
LLVM https://llvm.org/docs/LangRef.html#ifuncs
GCC https://lwn.net/Articles/691932/
This use case is in scope of Zig. Whether we use LLVM IFuncs or something else is still up for research, as well as the syntax and how it fits in with comptime and other features.
Ifuncs only work on ELF platforms (and not all of them either) so Zig would need a fallback for platforms where they're not supported, such as Windows, FreeBSD < 12, etc.
I'd attack this bottom-up: start with a generic solution and later on add ifuncs as a compiler optimization.
Using hidden function pointers set by the compiler for individual functions would hurt cache friendliness and destroy the potential for inlining.
Edit: Never mind. Thanks for clarifying @tiehuis!
The way this is solved in Go is with a build tag at the file level. When the build runs, some build properties are defined depending on the platform and optional build arguments. The entire file is then either included or excluded depending on its build tags. Build tags such as 386,!darwin are supported, which means the file will be built only for 386 AND NOT darwin, etc.
This is simple to use, extensible with custom tags, and forces the user to put all platform-specific stuff in files having a tag immediately at the top, so there's no custom build stuff intermixed with normal code.
@binary132 That is different from the issue here, since that is compile-time selection. It is possible in Zig right now using standard if statements with comptime values. This issue is about function selection at runtime, based on runtime cpu feature detection.
This is useful for producing a single portable binary that runs on a range of CPUs (e.g. a Core 2 Duo vs. a Skylake i7), while still allowing cpu-specific performance optimizations for hot functions.
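For contrast, a minimal sketch of that compile-time selection, switching on a comptime-known builtin value (fastPath/genericPath are hypothetical helpers); the untaken prong is discarded at compile time and no runtime check happens:
const std = @import("std");
const builtin = @import("builtin");
fn fastPath() void {
    std.debug.warn("built for x86_64\n");
}
fn genericPath() void {
    std.debug.warn("generic build\n");
}
pub fn doWork() void {
    // builtin.arch is comptime-known, so selection happens per build target,
    // not per running cpu.
    switch (builtin.arch) {
        builtin.Arch.x86_64 => fastPath(),
        else => genericPath(),
    }
}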
function selection at runtime, based on runtime cpu feature detection
Or when using Vulkan? :smirk:
I propose the following way to solve this, to make the code easier to read and write:
pub fn someMathFunction(vec: Vector) Vector
{
@useLLVMIFuncs
if (builtin.hasSse42){
...
}else if (builtin.hasAvx2){
...
}else{
...
}
}
The compiler can then generate 3 versions of this function and emit the LLVM IFuncs based on that information.
Solving the multi-versioning problem the way golang does (build tags, https://golang.org/pkg/go/build/#hdr-Build_Constraints) makes the code difficult to read, write and refactor. (Did I define all the versions I need? Did I rename every version of the function? Where is the linux version of that function?)
I think the switch as specified by @bronze1man seems most in line with how zig would work, given it uses all standard syntax. Ignoring the ifunc optimization, first, we need cpu feature detection of sorts. I've implemented below some example code which provides runtime cpu feature detection to get a better idea of potential issues.
// initialized globally somewhere, could do locally but minor cost
var cpu = CpuFeatures.init();
pub fn memchr(buf: *const u8, c: u8) usize {
if (cpu.hasFeature(FeatureX86.avx512)) {
return memchr_avx512(buf, c);
} else if (cpu.hasFeature(FeatureX86.sse2)) {
return memchr_sse2(buf, c);
} else {
return memchr_generic(buf, c);
}
}
cpuid.c
// This is done separately for now since zig's multi-return inline asm was a pain.
#include <cpuid.h>
int cpuid(unsigned int leaf, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx)
{
return __get_cpuid(leaf, eax, ebx, ecx, edx);
}
cpu.zig
const std = @import("std");
const builtin = @import("builtin");
// Runtime cpu feature detection.
//
// This is currently implemented for x86/x64 targets. For generic targets, the features
// returned will be compile-time false and will not use any code space.
comptime {
std.debug.assert(@memberCount(FeatureX86) == 224);
}
// See https://en.wikipedia.org/wiki/CPUID
pub const FeatureX86 = enum {
// eax = 1, output: edx
fpu,
vme,
de,
pse,
tsc,
msr,
pae,
mce,
cx8,
apic,
_reserved1,
sep,
mtrr,
pge,
mca,
cmov,
pat,
pse_36,
psn,
clfsh,
_reserved2,
ds,
acpi,
mmx,
fxsr,
sse,
sse2,
ss,
htt,
tm,
ia64,
pbe,
// omitted remaining features for brevity
....
};
// Implemented in C until multi-output asm is easier. See: #215.
extern fn cpuid(leaf: c_uint, eax: *c_uint, ebx: *c_uint, ecx: *c_uint, edx: *c_uint) c_int;
pub const CpuFeatures = struct {
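// Feature bit words filled in by init():
// buf[0] = leaf 1 edx, buf[1] = leaf 1 ecx,
// buf[2] = leaf 7 ebx, buf[3] = leaf 7 ecx, buf[4] = leaf 7 edx,
// buf[5] = leaf 0x80000001 edx, buf[6] = leaf 0x80000001 ecx.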
buf: [7]u32,
pub fn init() CpuFeatures {
var self = CpuFeatures{ .buf = []u32{0} ** 7 };
switch (builtin.arch) {
builtin.Arch.i386, builtin.Arch.x86_64 => {
var eax: c_uint = undefined;
var ebx: c_uint = undefined;
var ecx: c_uint = undefined;
var edx: c_uint = undefined;
// We don't strictly need to check this since __get_cpuid does but our
// implementation may not.
// cpuid returns 1 on success; after querying leaf 0, the maximum supported
// basic leaf is in eax.
var max_basic_cpu_leaf: c_uint = 0;
if (cpuid(0, &eax, &ebx, &ecx, &edx) == 1) {
max_basic_cpu_leaf = eax;
}
if (max_basic_cpu_leaf >= 1) {
if (cpuid(1, &eax, &ebx, &ecx, &edx) == 1) {
self.buf[0] = edx;
self.buf[1] = ecx;
}
}
if (max_basic_cpu_leaf >= 7) {
if (cpuid(7, &eax, &ebx, &ecx, &edx) == 1) {
self.buf[2] = ebx;
self.buf[3] = ecx;
self.buf[4] = edx;
}
}
// Likewise, leaf 0x80000000 reports the maximum supported extended leaf in eax.
var max_ext_cpu_leaf: c_uint = 0;
if (cpuid(0x80000000, &eax, &ebx, &ecx, &edx) == 1) {
max_ext_cpu_leaf = eax;
}
if (max_ext_cpu_leaf >= 0x80000001) {
if (cpuid(0x80000001, &eax, &ebx, &ecx, &edx) == 1) {
self.buf[5] = edx;
self.buf[6] = ecx;
}
}
},
else => {
// other targets return false for `hasFeature`
},
}
return self;
}
// This would actually take a var, which would accept any platform Feature, e.g.
// FeatureX86 or FeatureARM (is that applicable?).
//
// If compiling for a target without this feature we know at comptime that we can never
// execute that feature and no arch-specific code is included.
pub inline fn hasFeature(self: *const CpuFeatures, feature: FeatureX86) bool {
// We require #868 for allowing runtime-selected functions to be allowed to run at
// compile-time. Assuming something like D and its __ctfe for the moment.
if (__ctfe) {
return false;
}
switch (builtin.arch) {
builtin.Arch.i386, builtin.Arch.x86_64 => {
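// Feature bits are packed 32 per word in buf, in enum order:
// n >> 5 selects the word, n & 0x1f the bit within it.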
const n = @enumToInt(feature);
return (self.buf[n >> 5] & (u32(1) << @intCast(u5, n & 0x1f))) != 0;
},
else => {
return false;
},
}
}
};
Whether an architecture can support a given feature is compile-time known, which allows us to avoid compiling in incompatible branches.
For the following example:
pub fn main() void {
const cpu = CpuFeatures.init();
if (cpu.hasFeature(FeatureX86.sse42)) {
@compileError("not allowed to compile to target which may have sse");
} else {
std.debug.warn("no sse\n");
}
}
$ zig build-exe cpu.zig
/home/me/src/cpuid/cpu.zig:356:9: error: not allowed to compile to target which may have sse
@compileError("not allowed to compile to target which may have sse");
$ zig build-exe cpu.zig --target-arch armv7
# all ok
We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.
does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?
does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?
yes, but it is x86-specific and we already have it in the zig tree as ./c_headers/cpuid.h. For other architectures it may be (Linux-specific) /proc/cpuinfo. IFUNC has yet to be ported to non-x86 architectures.
We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.
We should probably model this after Linux's AT_HWCAP and AT_HWCAP2, which is not needed on the CISCy x86, but is needed on most other arches.
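A minimal sketch of what reading those bits could look like on Linux when linking libc (getauxval is from glibc, the AT_* values match <elf.h>, and the NEON bit shown is the 32-bit ARM hwcap, purely as an illustration):
extern fn getauxval(which: c_ulong) c_ulong;
const AT_HWCAP: c_ulong = 16;
const AT_HWCAP2: c_ulong = 26;
const ARM_HWCAP_NEON: c_ulong = 1 << 12; // bit 12 in the 32-bit ARM hwcap word
pub fn hasNeon() bool {
    return (getauxval(AT_HWCAP) & ARM_HWCAP_NEON) != 0;
}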
See also the GCC target_clones function attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute
The target_clones attribute is used to specify that a function be cloned into multiple versions compiled with different target options than specified on the command line. The supported options and restrictions are the same as for target attribute.
For instance, on an x86, you could compile a function with target_clones("sse4.1,avx"). GCC creates two function clones, one compiled with -msse4.1 and another with -mavx.
On a PowerPC, you can compile a function with target_clones("cpu=power9,default"). GCC will create two function clones, one compiled with -mcpu=power9 and another with the default options. GCC must be configured to use GLIBC 2.23 or newer in order to use the target_clones attribute.
It also creates a resolver function (see the ifunc attribute above) that dynamically selects a clone suitable for current architecture. The resolver is created only if there is a usage of a function with target_clones attribute.
Note that any subsequent call of a function without target_clone from a target_clone caller will not lead to copying (target clone) of the called function. If you want to enforce such behaviour, we recommend declaring the calling function with the flatten attribute.