A really interesting concept is function multi-versioning. The general idea is to support implementing multiple versions of a function for different hardware and having the correct version of the function selected at run time. Made up sample code:
pub fn someMathFunction(vec: Vector) Vector [target: sse4.2]
{
// optimized for SSE 4.2
}
pub fn someMathFunction(vec: Vector) Vector [target: avx2]
{
// optimized for avx2
}
pub fn someMathFunction(vec: Vector) Vector [target: default]
{
// no asm/intrinsics optimization
}
// later on
const v = giveMeAVect();
const v2 = someMathFunction(v); // calls the best version based on run time selection
There are ways to simulate this using function pointers, but the compiler would be better at optimizing this, plus implementing that over and over by hand would suck.
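For reference, a rough sketch of what that hand-rolled function-pointer dispatch could look like today (the Vector alias, the feature check, and the per-target bodies are placeholders, not a real implementation):
const Vector = [4]f32; // placeholder type
fn cpuHasAvx2() bool {
    return false; // placeholder for real runtime feature detection
}
fn someMathFunctionGeneric(vec: Vector) Vector {
    return vec; // no asm/intrinsics optimization
}
fn someMathFunctionAvx2(vec: Vector) Vector {
    return vec; // avx2-optimized path would go here
}
// Selected once at startup; every call pays an indirect call and loses inlining.
var someMathFunctionImpl = someMathFunctionGeneric;
pub fn initMathDispatch() void {
    if (cpuHasAvx2()) someMathFunctionImpl = someMathFunctionAvx2;
}
pub fn someMathFunction(vec: Vector) Vector {
    return someMathFunctionImpl(vec);
}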
LLVM https://llvm.org/docs/LangRef.html#ifuncs
GCC https://lwn.net/Articles/691932/
This use case is in scope of Zig. Whether we use LLVM IFuncs or something else is still up for research, as well as the syntax and how it fits in with comptime and other features.
Ifuncs only work on ELF platforms (and not all of them either) so Zig would need a fallback for platforms where they're not supported, such as Windows, FreeBSD < 12, etc.
I'd attack this bottom-up: start with a generic solution and later on add ifuncs as a compiler optimization.
Using hidden function pointers set by the compiler for individual functions would hurt cache friendliness and destroy the potential for inlining.
Edit: Never mind. Thanks for clarifying @tiehuis!
The way this is solved in Go is with a build tag at the file level. When the build runs, some build properties are defined depending on the platform and optional build arguments. The entire file is then either included or excluded depending on its build tags. Build tags such as 386,!darwin are supported, which means the file will be built only for 386 AND NOT darwin, etc.
This is simple to use, extensible with custom tags, and forces the user to put all platform-specific stuff in files having a tag immediately at the top, so there's no custom build stuff intermixed with normal code.
@binary132 That is different from the issue here, since that is compile-time selection. It is possible in Zig right now using standard if statements with comptime values. This issue is about function selection at runtime, based on runtime cpu feature detection.
This is useful for producing a single portable binary that runs on a range of CPUs (e.g. a Core 2 Duo vs. a Skylake i7), while still allowing cpu-specific performance optimizations for hot functions.
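For contrast, a minimal sketch of that compile-time selection, switching on a comptime-known builtin value (fastPath/genericPath are hypothetical helpers); the untaken prong is discarded at compile time and no runtime check happens:
const std = @import("std");
const builtin = @import("builtin");
fn fastPath() void {
    std.debug.warn("built for x86_64\n");
}
fn genericPath() void {
    std.debug.warn("generic build\n");
}
pub fn doWork() void {
    // builtin.arch is comptime-known, so selection happens per build target,
    // not per running cpu.
    switch (builtin.arch) {
        builtin.Arch.x86_64 => fastPath(),
        else => genericPath(),
    }
}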
function selection at runtime, based on runtime cpu feature detection
Or when using Vulkan? :smirk:
I propose the following way to solve this, to make the code easier to read and write:
pub fn someMathFunction(vec: Vector) Vector
{
@useLLVMIFuncs
if (builtin.hasSse42){
...
}else if (builtin.hasAvx2){
...
}else{
...
}
}
The compiler can then generate 3 versions of this function and emit the LLVM IFuncs based on that information.
Solving the multi-versioning problem the way golang does (build tags, https://golang.org/pkg/go/build/#hdr-Build_Constraints) makes the code difficult to read, write and refactor. (Did I define all the versions I need? Did I rename every version of the function? Where is the linux version of that function?)
I think the switch as specified by @bronze1man seems most in line with how zig would work, given it uses all standard syntax. Ignoring the ifunc optimization, first, we need cpu feature detection of sorts. I've implemented below some example code which provides runtime cpu feature detection to get a better idea of potential issues.
// initialized globally somewhere, could do locally but minor cost
var cpu = CpuFeatures.init();
pub fn memchr(buf: *const u8, c: u8) usize {
if (cpu.hasFeature(FeatureX86.avx512)) {
return memchr_avx512(buf, c);
} else if (cpu.hasFeature(FeatureX86.sse2)) {
return memchr_sse2(buf, c);
} else {
return memchr_generic(buf, c);
}
}
cpuid.c
// This is done separately for now since zig's multi-return inline asm was a pain.
#include <cpuid.h>
int cpuid(unsigned int leaf, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx)
{
return __get_cpuid(leaf, eax, ebx, ecx, edx);
}
cpu.zig
const std = @import("std");
const builtin = @import("builtin");
// Runtime cpu feature detection.
//
// This is currently implemented for x86/x64 targets. For generic targets, the features
// returned will be compile-time false and will not use any code space.
comptime {
std.debug.assert(@memberCount(FeatureX86) == 224);
}
// See https://en.wikipedia.org/wiki/CPUID
pub const FeatureX86 = enum {
// eax = 1, output: edx
fpu,
vme,
de,
pse,
tsc,
msr,
pae,
mce,
cx8,
apic,
_reserved1,
sep,
mtrr,
pge,
mca,
cmov,
pat,
pse_36,
psn,
clfsh,
_reserved2,
ds,
acpi,
mmx,
fxsr,
sse,
sse2,
ss,
htt,
tm,
ia64,
pbe,
// omitted remaining features for brevity
....
};
// Implemented in C until multi-output asm is easier. See: #215.
extern fn cpuid(leaf: c_uint, eax: *c_uint, ebx: *c_uint, ecx: *c_uint, edx: *c_uint) c_int;
pub const CpuFeatures = struct {
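// Feature bit words filled in by init():
// buf[0] = leaf 1 edx, buf[1] = leaf 1 ecx,
// buf[2] = leaf 7 ebx, buf[3] = leaf 7 ecx, buf[4] = leaf 7 edx,
// buf[5] = leaf 0x80000001 edx, buf[6] = leaf 0x80000001 ecx.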
buf: [7]u32,
pub fn init() CpuFeatures {
var self = CpuFeatures{ .buf = []u32{0} ** 7 };
switch (builtin.arch) {
builtin.Arch.i386, builtin.Arch.x86_64 => {
var eax: c_uint = undefined;
var ebx: c_uint = undefined;
var ecx: c_uint = undefined;
var edx: c_uint = undefined;
// We don't strictly need to check this since __get_cpuid does but our
// implementation may not.
// cpuid returns 1 on success; after querying leaf 0, the maximum supported
// basic leaf is in eax.
var max_basic_cpu_leaf: c_uint = 0;
if (cpuid(0, &eax, &ebx, &ecx, &edx) == 1) {
max_basic_cpu_leaf = eax;
}
if (max_basic_cpu_leaf >= 1) {
if (cpuid(1, &eax, &ebx, &ecx, &edx) == 1) {
self.buf[0] = edx;
self.buf[1] = ecx;
}
}
if (max_basic_cpu_leaf >= 7) {
if (cpuid(7, &eax, &ebx, &ecx, &edx) == 1) {
self.buf[2] = ebx;
self.buf[3] = ecx;
self.buf[4] = edx;
}
}
// Likewise, leaf 0x80000000 reports the maximum supported extended leaf in eax.
var max_ext_cpu_leaf: c_uint = 0;
if (cpuid(0x80000000, &eax, &ebx, &ecx, &edx) == 1) {
max_ext_cpu_leaf = eax;
}
if (max_ext_cpu_leaf >= 0x80000001) {
if (cpuid(0x80000001, &eax, &ebx, &ecx, &edx) == 1) {
self.buf[5] = edx;
self.buf[6] = ecx;
}
}
},
else => {
// other targets return false for `hasFeature`
},
}
return self;
}
// This would actually take a var, which would accept any platform Feature, e.g.
// FeatureX86 or FeatureARM (is that applicable?).
//
// If compiling for a target without this feature we know at comptime that we can never
// execute that feature and no arch-specific code is included.
pub inline fn hasFeature(self: *const CpuFeatures, feature: FeatureX86) bool {
// We require #868 for allowing runtime-selected functions to be allowed to run at
// compile-time. Assuming something like D and its __ctfe for the moment.
if (__ctfe) {
return false;
}
switch (builtin.arch) {
builtin.Arch.i386, builtin.Arch.x86_64 => {
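// Feature bits are packed 32 per word in buf, in enum order:
// n >> 5 selects the word, n & 0x1f the bit within it.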
const n = @enumToInt(feature);
return (self.buf[n >> 5] & (u32(1) << @intCast(u5, n & 0x1f))) != 0;
},
else => {
return false;
},
}
}
};
Whether an architecture can support a given feature is compile-time known, which allows us to avoid compiling in incompatible branches.
For the following example:
pub fn main() void {
const cpu = CpuFeatures.init();
if (cpu.hasFeature(FeatureX86.sse42)) {
@compileError("not allowed to compile to target which may have sse");
} else {
std.debug.warn("no sse\n");
}
}
$ zig build-exe cpu.zig
/home/me/src/cpuid/cpu.zig:356:9: error: not allowed to compile to target which may have sse
@compileError("not allowed to compile to target which may have sse");
$ zig build-exe cpu.zig --target-arch armv7
# all ok
We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.
does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?
does llvm have any cpu feature capability detection features we could leverage instead of rolling our own?
yes, but it is x86-specific and we already have it in the zig tree as ./c_headers/cpuid.h. For other architectures it may be (Linux-specific) /proc/cpuinfo. IFUNC has yet to be ported to non-x86 architectures.
We require #868 to allow for suitable fallback implementations where not otherwise specified. Unfortunately I don't think this can be avoided in this case.
We should probably model this after Linux's AT_HWCAP and AT_HWCAP2, which is not needed on the CISCy x86, but is needed on most other arches.
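A minimal sketch of what reading those bits could look like on Linux when linking libc (getauxval is from glibc, the AT_* values match <elf.h>, and the NEON bit shown is the 32-bit ARM hwcap, purely as an illustration):
extern fn getauxval(which: c_ulong) c_ulong;
const AT_HWCAP: c_ulong = 16;
const AT_HWCAP2: c_ulong = 26;
const ARM_HWCAP_NEON: c_ulong = 1 << 12; // bit 12 in the 32-bit ARM hwcap word
pub fn hasNeon() bool {
    return (getauxval(AT_HWCAP) & ARM_HWCAP_NEON) != 0;
}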
See also the GCC target_clones function attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute
The target_clones attribute is used to specify that a function be cloned into multiple versions compiled with different target options than specified on the command line. The supported options and restrictions are the same as for target attribute.
For instance, on an x86, you could compile a function with target_clones("sse4.1,avx"). GCC creates two function clones, one compiled with -msse4.1 and another with -mavx.
On a PowerPC, you can compile a function with target_clones("cpu=power9,default"). GCC will create two function clones, one compiled with -mcpu=power9 and another with the default options. GCC must be configured to use GLIBC 2.23 or newer in order to use the target_clones attribute.
It also creates a resolver function (see the ifunc attribute above) that dynamically selects a clone suitable for current architecture. The resolver is created only if there is a usage of a function with target_clones attribute.
Note that any subsequent call of a function without target_clone from a target_clone caller will not lead to copying (target clone) of the called function. If you want to enforce such behaviour, we recommend declaring the calling function with the flatten attribute.