It would be nice to have feature detection for ARM (32-bit) in x/sys/cpu.
Concretely, a new cpu.ARM struct that closely resembles the existing cpu.ARM64 struct, tailored to the ARM specific hardware capabilities. The following fields are proposed, which map directly to the {HWCAP, HWCAP2} auxiliary vector values on Linux and FreeBSD:
HasSWP bool // SWP instruction support
HasHALF bool // Half-word load and store support
HasTHUMB bool // ARM Thumb instruction set
Has26BIT bool // Address space limited to 26-bits
HasFASTMUL bool // 32-bit operand, 64-bit result multiplication support
HasFPA bool // Floating point arithmetic support
HasVFP bool // Vector floating point support
HasEDSP bool // DSP Extensions support
HasJAVA bool // Java instruction set
HasIWMMXT bool // Intel Wireless MMX technology support
HasCRUNCH bool // MaverickCrunch context switching and handling
HasTHUMBEE bool // Thumb EE instruction set
HasNEON bool // NEON instruction set
HasVFPv3 bool // Vector floating point version 3 support
HasVFPv3D16 bool // Vector floating point version 3 D8-D15
HasTLS bool // Thread local storage support
HasVFPv4 bool // Vector floating point version 4 support
HasIDIVA bool // Integer divide instruction support in ARM mode
HasIDIVT bool // Integer divide instruction support in Thumb mode
HasIDIV bool // Integer divide instruction support in ARM and Thumb mode
HasVFPD32 bool // Vector floating point version 3 D15-D31
HasLPAE bool // Large Physical Address Extensions
HasEVTSTRM bool // Event stream support
HasAES bool // AES hardware implementation
HasPMULL bool // Polynomial multiplication instruction set
HasSHA1 bool // SHA1 hardware implementation
HasSHA2 bool // SHA2 hardware implementation
HasCRC32 bool // CRC32 hardware implementation
As I look around, I see code detecting CPU features based on the runtime.goarm value (which is set by the GOARM environment variable at link time), rather than a runtime check. This means that:
runtime.goarm is not const, the fast-path (e.g. using NEON) and slow-path fallback are being compiled into the binary, but only one path can ever be used. It would be nice if both paths can be used via run-time detection instead.In one of my projects, I have resorted to parsing /proc/cpuinfo for run-time detection of NEON, which only works on Linux. I'd love to instead use the standard library.
x/sys/cpu has still not been totally updated to match internal/cpu, which has some of the feature flags you are looking for.
Note that so far, we have only exposed flags that are needed for the runtime and standard library. That is, CL 114826 exposes HasIDIVA for use in the runtime, and CL 126315 exposes HasVFPv4 for use in the math package. I could see the argument for exposing all of the HWCAP flags listed in linux, for use in external code.
I will defer to @martisch's judgement on that.
Note that changing internal/cpu seems a separate topic (it also already supports arm) and there are no plans to export those variables outside the runtime/standard library.
If x/sys/cpu support for linux/arm is enough (not *BSD or windows/arm) then it should be as easy as implementing the hwCap/hwCap2 bit checks in https://github.com/golang/sys/blob/master/cpu/cpu_arm.go since hwCap/hwCap2 should already be populated on Linux by: https://github.com/golang/sys/blob/master/cpu/cpu_linux.go. I think its fine to expose all linux supported but general arm applicable flags in x/sys/cpu.
Note that the above would not support android as /proc/self/auxv might not be accessible and it wont change the runtime and standard libs detection of arm features (since that uses internal/cpu and not x/sys/cpu) and the effects of setting goarm are not effected as they override any cpu package .e.g.:
https://github.com/golang/go/blob/3b6216ed0601c81fe42c2a4738d419afccb62163/src/runtime/os_freebsd_arm.go#L14
https://github.com/golang/go/blob/0df9fa2ebec975359c8ee1150ecf7f28f12b39ee/src/math/sqrt_arm.s#L9
Thanks for the detailed response, Martin.
Note that changing internal/cpu seems a separate topic (it also already supports arm) and there are no plans to export those variables outside the runtime/standard library.
That's right; I'm happy to open a new issue to discuss that separately if you wish. I would propose that the internal implementation be separate (a "copy") of the x/sys/cpu version.
Since you mention it, I was reading the CL for #28148 (see this source file), and noticed how the use of detection of NEON could be improved by checking the hardware-caps instead of using the static runtime.goarm value. I did notice other similar uses throughout the Go internal/standard library, as you have linked to. While it might be some work to review all of them for a possible switch to runtime hardware-cap checks (esp where we can use a NEON code-path), it might be beneficial to do so over time?
(As said above, I am doing run-time detection at present by parsing /proc/cpuinfo which could also be improved.)
If x/sys/cpu support for linux/arm is enough (not *BSD or windows/arm) then it should be as easy as implementing the hwCap/hwCap2 bit checks in https://github.com/golang/sys/blob/master/cpu/cpu_arm.go since hwCap/hwCap2 should already be populated on Linux by: https://github.com/golang/sys/blob/master/cpu/cpu_linux.go. I think its fine to expose all linux supported but general arm applicable flags in x/sys/cpu.
I am mostly interested in Android, so support for Linux would be a win. (Apple has moved exclusively to aarch64 for some time now.)
It looks like FreeBSD has added support back in 2017 (see D12290 and D12291), so it might be possible to do something similar there also.
As far as I can tell, Go bring-up for windows/arm has stalled (c.f. #26148), so that would need to wait regardless?
Note that the above would not support android as /proc/self/auxv might not be accessible and it wont change the runtime and standard libs detection of arm features (since that uses internal/cpu and not x/sys/cpu) and the effects of setting goarm are not effected as they override any cpu package .e.g.:
BoringSSL provides a good reference implementation: they weak-link getauxval which is available on Android from API level 20, and fall back to /proc/self/auxv and /proc/cpuinfo in that order. So it appears we might be able to improve on the implementation you have linked to.
That's right; I'm happy to open a new issue to discuss that separately if you wish. I would propose that the internal implementation be separate (a "copy") of the
x/sys/cpuversion.
The scope of the feature request is currently unclear to me when reading the title and first post content. From the comments I would infer it is to add arm support for android. It also does not seem restricted to x/sys/cpu but part of the proposal is to change the runtimes use of goarm with dynamic detection. As internal/cpu IRC already has support for arm the runtime could already use that in some places where goarm is used (no need to copy from x/sys/cpu). However this would be a behaviour change and some platforms would still need to use goarm (to populate the cpu variables). I think that is better discussed decoupled from any x/sys/cpu support.
The Go compiler uses the goarm setting to make decisions about the emitted instructions and thereby does not produce "universal binaries". Changing this would be a larger change than x/sys/cpu support and likely have larger performance and binary size implications when switched to runtime detection:
https://github.com/golang/go/blob/f125b32d19cdb0e2650e8b7ae7b909b4bd0ae2a2/src/cmd/compile/internal/arm/ssa.go#L644
https://github.com/golang/go/blob/8a317ebc0f50339628c003bf06107cd865406dd4/src/cmd/compile/internal/ssa/regalloc.go#L599
Since you mention it, I was reading the CL for #28148 (see this source file), and noticed how the use of detection of NEON could be improved by checking the hardware-caps instead of using the static
runtime.goarmvalue.
Seems like a good candidate for using x/sys/cpu but it seems it would need a fallback in x/sys/cpu to goarm for platforms not supporting CPU feature detection via AUXV or syscalls/proc.
It looks like FreeBSD has added support back in 2017 (see D12290 and D12291), so it might be possible to do something similar there also.
The Go runtime sets hwcap on arm in internal/cpu on FreeBSD by reading the auxv thats provided after argv. I dont think x/sys/cpu has direct access to that. So I think we can not simply copy that approach to x/sys/cpu. https://github.com/golang/go/blob/a62b5723be15849656279c244834064b951801fc/src/runtime/os_freebsd.go#L385
As far as I can tell, Go bring-up for windows/arm has stalled (c.f. #26148), so that would need to wait regardless?
No having only partial support in x/sys/cpu is fine. However to not regress in performance for other platforms we would seem to need a fallback to setting the cpu features based on goarm on those not supporting runtime detection.
BoringSSL provides a good reference implementation: they weak-link
getauxvalwhich is available on Android from API level 20, and fall back to/proc/self/auxvand/proc/cpuinfoin that order. So it appears we might be able to improve on the implementation you have linked to.
Thank you. Thats a nice reference. If we can make the same work under go then that looks like an option to gain feature detection support on android in x/sys/cpu.
The scope of the feature request is currently unclear to me when reading the title and first post content. From the comments I would infer it is to add arm support for android. It also does not seem restricted to x/sys/cpu but part of the proposal is to change the runtimes use of goarm with dynamic detection. As internal/cpu IRC already has support for arm the runtime could already use that in some places where goarm is used (no need to copy from x/sys/cpu). However this would be a behaviour change and some platforms would still need to use goarm (to populate the cpu variables). I think that is better discussed decoupled from any x/sys/cpu support.
Let me be more clear...
This issue is about:
x/sys/cpu for third parties.internal/cpu.elf32_freebsd_sysvec symbol in libc; see posted links above to the FreeBSD site for further details.This issue is not about,
internal/cpu and how CPU hardware-cap detection is done internally in the Go standard library.however a separate issue can be created that proposes:
internal/cpu; for example, to determine whether Advanced SIMD (NEON) is supported on the device on Linux. The BoringSSL implementation for Linux could be used as a reference here also.runtime.goarm. After a quick look, it appears there is no support for NEON-accelerated algorithms on arm; only a reference to "will want to revision NEON, when support is better", which is actually on arm64 where NEON is always supported. (That reference is to Plan 9 assembly support, I am guessing.)The Go compiler uses the goarm setting to make decisions about the emitted instructions and thereby does not produce "universal binaries". Changing this would be a larger change than x/sys/cpu support and likely have larger performance and binary size implications when switched to runtime detection:
I understand this issue, but I would never propose it. My reference to a "universal binary" was with reference to one of my projects where I detect NEON, and take a fast-path accordingly. A true universal binary -- that is, a program supporting multiple architectures -- is often achieved though shipping separate texts for each architecture in a sandwiched "fat binary" as is possible on macOS, and only exists as a proof-of-concept on Linux. I am not proposing that here.
Seems like a good candidate for using x/sys/cpu but it seems it would need a fallback in x/sys/cpu to goarm for platforms not supporting CPU feature detection via AUXV or syscalls/proc.
Yes -- and sure beats developers individually using go:linkname to punch a hole through to runtime.goarm.
The Go runtime sets hwcap on arm in internal/cpu on FreeBSD by reading the auxv thats provided after argv. I dont think x/sys/cpu has direct access to that. So I think we can not simply copy that approach to x/sys/cpu.
I have not tried it, but it appears a symbol exists on libc that gets you access to the information without having to go via auxv.
Change https://golang.org/cl/190525 mentions this issue: cpu: add support for ARM (32-bit) feature detection.
@martisch, I've submitted a CL for this issue as described by my past post, and have tested it on linux/arm and freebsd/arm. Are you able to review it?
@martisch, I've submitted a CL for this issue as described by my past post, and have tested it on linux/arm and freebsd/arm. Are you able to review it?
Sure. Thanks for working on it. I added a first round of high level comments.
I been meaning to reply more in depth about how we can approach this but unfortunately did not find the time earlier.
There is already support for linux auxv reading in cpu_linux. We should first leverage that existing detection to add the hwcap bits to variable mapping and then I think basic linux/arm support should already work. Continuing from there we can add additional support for android to work around /proc/self/auxv not necessary being accessible to fill hwcap in x/sys/cpu. Afterwards we can extend the support for arm on other platforms and wheel in help from testers with hardware on those.
What makes arm special in x/sys/cpu vs other archs is that absent hwcap detection we should fall back to the minimal set of hwcap bits set that is mandated by the goarm variable.
As far as the API is concerned, which is what the proposal process would care about, it seems that the proposal is to add a cpu.ARM that is very similar to cpu.ARM64 with appropriate modifications for the actual CPU features that might be present.
/cc @tklauser @martisch @ianlancetaylor for feedback
For the API I think we should stay consistent with the other architectures and internal/cpu which means to expose a cpu.ARM struct with fields named HasNAME where NAME is the corresponding HWCAP name of the feature. If needed the client of x/sys/cpu can create combinations such as HasIDIVA | HasIDIVT outside of x/sys/cpu but we do not need to create these as fields inside the cpu.ARM struct.
In general (for adding other architectures) I think if the existing naming schema is followed we do not need to have an separate proposal for each API addition for additional CPU feature structs/variables in x/sys/cpu.
@martisch, I have updated the original post to include a concrete list of fields.
If needed the client of x/sys/cpu can create combinations such as HasIDIVA | HasIDIVT outside of x/sys/cpu but we do not need to create these as fields inside the cpu.ARM struct.
I would argue that having "derived" fields, such as the proposed HasIDIV makes the API easier to consume by the user. Anyone else keen to chime in?
Based on the discussion, this is doing for ARM what has already been done for other architectures in the package, and there have been no objections. This sounds like a likely accept.
Leaving open for a week for any objections to accepting.
/cc @randall77
Sounds good.
I agree with Martin, we should just provide the raw hardware specs and let users make their own compound flags if needed. The set needed for any particular assembly is highly dependent on what that assembly needs, so even if there are a few standard ones people will be rolling their own anyway.
Some down sides that I think combined outweigh the usefulness of providing combined feature flags:
Let us try to avoid conflating logical grouping of homogeneous hardware capabilities with one that is purely arbitrary. Saying "I always support integer division" (the topic of discussion above) is very different to "I support vector processing and bit manipulation", or "I support encryption and subset X of SIMD".
Let's consider for a moment how these hardware-capability flags are used in practice.
An algorithm needs to run and one implementation is selected from an available set based on hardware features (e.g. cpu.ARM.HasNEON), so that the runtime or power consumption is most favorable.
The amount of "work done" is often non-trivial: otherwise the developer could just let the compiler do its best without hand-tuning the instruction text. This also means that by the time the algorithm completes, the cache has been polluted and the hardware-capability flags have most likely been evicted. Think of image processing, compression, hashing, or cipher operation: they typically operate on lots of input data.
Let's say the algorithm instead relied on two hardware features. Regardless of whether the API under discussion reflected the situation in two fields or one "derived" field, the end result is the same: the cache will likely have evicted them all. Whether the fields were laid out in two or more cache lines to begin with are unlikely to impact the "setup overhead" of the algorithm as a whole.
Trying to optimize the hardware-capabilities cache line in the previous discussion seems rather over the top in general. When selecting hardware-dependent code paths, it is also often used practice to memoize the result in the form of a function pointer, avoiding a conditional branch. (It often, perhaps more importantly, makes the code more readable.)
The suggestion to include a derived field for ARM- and thumb-supported integer division (HasIDIV) was intended to simplify consumption of the API and nothing more. Even the Linux kernel implementation defines a convenience preprocessor symbol HWCAP_IDIV for the same purpose, and is where this all originated.
Either way, I'm not wedded to including this one derived field in the proposed API change. In fact, I have no intention of even using this field at all -- my interest is in consuming HasNEON. Please, let's not overcomplicate the situation here.
Let us try to avoid conflating logical grouping of homogeneous hardware capabilities with one that is purely arbitrary.
Let's consider for a moment how these hardware-capability flags are used in practice.
Some of the examples given are not arbitrary but real use cases of the existing API mirrored in internal/cpu. e.g.:
https://github.com/golang/go/blob/3aa7bbdbcb432f78462fa816ba7c63cb7f3991fe/src/runtime/alg.go#L290
https://github.com/golang/crypto/blob/614d502a4dac94afa3a6ce146bd1736da82514c6/chacha20poly1305/chacha20poly1305_amd64.go#L23
The suggestion to include a derived field for ARM- and thumb-supported integer division (HasIDIV) was intended to simplify consumption of the API and nothing more. Even the Linux kernel implementation defines a convenience preprocessor symbol HWCAP_IDIV for the same purpose, and is where this all originated.
That is a good point that would make the problem of name collisions less likely and does set IDIV apart from other combinations. Note that as far as I perceived the discussion it was generally about "derived" fields not only about IDIV.
Either way, I'm not wedded to including this one derived field in the proposed API change. In fact, I have no intention of even using this field at all -- my interest is in consuming
HasNEON. Please, let's not overcomplicate the situation here.
Then it would seem we can just leave it out for now and wait for a use case.
Let's say the algorithm instead relied on two hardware features. Regardless of whether the API under discussion reflected the situation in two fields or one "derived" field, the end result is the same: the cache will likely have evicted them all. Whether the fields were laid out in two or more cache lines to begin with are unlikely to impact the "setup overhead" of the algorithm as a whole.
This may very well be a extreme scenario and therefore should only be a secondary consideration but for something like a memcpy implementation that is frequently accessed in different parts of a programming pulling in one more cache line besides the copied data (which is often small) can be shown to matter measurably in production.
Given that one can always add more but not easily remove feature variables I would still suggest to stay with a leaner set for now until its known what a good criteria for the potential inclusion of "derived" flags would be.
Marked this last week as likely accept w/ call for last comments (https://github.com/golang/go/issues/33508#issuecomment-525473442).
No dissenting comments, so accepting.
Now that the proposal has been accepted, I will update my PR (CL) shortly.
The CL has been updated and split into a chain of commits.
@tklauser I've invited you as a reviewer and hope that you've got a little time to look at this contribution. Thank you in advance.
Change https://golang.org/cl/197541 mentions this issue: cpu: protect ARM feature detection from broken device
Change https://golang.org/cl/197540 mentions this issue: cpu: fallback to using /proc/{self/auxv, cpuinfo} for ARM feature detection
Change https://golang.org/cl/197542 mentions this issue: cpu: add support for FreeBSD ARM feature detection
(I'm one of the BoringSSL maintainers and the author of our ARM CPU detect bits.)
32-bit ARM CPU feature detection on Linux is kind of a headache with older Androids, yeah. :-(
I probably wouldn't recommend trying to detect that one broken CPU (https://golang.org/cl/197541). The Android cpu-features library doesn't do this and we only ever had issues with one function. At this point the affected CPU is rare enough that I'm hoping to just remove it soon. (Chrome for Android already requires NEON support. The workaround results in us carrying extra crypto implementations for just that CPU.)
As for whether you want the the tower of /proc fallbacks, I guess it depends on what versions of Android you care about and how much you care about getting NEON on those older devices. Android L and up have getauxval. I expect other Linux ARMs have getauxval by now. https://developer.android.com/about/dashboards has some Android usage numbers.
I'll also note that BoringSSL only pays attention to NEON and ARMv8 crypto-related bits, so some of our fallbacks may not be a good template for the other features. In particular, I don't think the ARMv8 logic in https://golang.org/cl/197540's /proc/cpuinfo parser is quite right.
@davidben, thanks for chiming in; some comments inline below.
I probably wouldn't recommend trying to detect that one broken CPU (https://golang.org/cl/197541). The Android cpu-features library doesn't do this and we only ever had issues with one function. At this point the affected CPU is rare enough that I'm hoping to just remove it soon. (Chrome for Android already requires NEON support. The workaround results in us carrying extra crypto implementations for just that CPU.)
That's fair enough. My interest is in NEON detection on Android. However it would be nice if we also had accurate crypto detection, so that we might later introduce accelerated TLS for arm32 in Go's stdlib. It looks like support is currently limited to arm64.
As for whether you want the the tower of /proc fallbacks, I guess it depends on what versions of Android you care about and how much you care about getting NEON on those older devices. Android L and up have getauxval.
I would ideally like to target KitKat (API level 19) and up. How good is the /proc/self/auxv fallback there? If we can't land the fallback support into x/sys/cpu, then I might just use a vendored approach. What would you recommend?
I'll also note that BoringSSL only pays attention to NEON and ARMv8 crypto-related bits, so some of our fallbacks may not be a good template for the other features. In particular, I don't think the ARMv8 logic in https://golang.org/cl/197540's /proc/cpuinfo parser is quite right.
On the ARMv8 check, I lifted it directly from this BoringSSL implementation. If you can outline what's not quite right here, and what could be improved, I'd appreciate it.
I would ideally like to target KitKat (API level 19) and up. How good is the /proc/self/auxv fallback there? If we can't land the fallback support into x/sys/cpu, then I might just use a vendored approach. What would you recommend?
I don't remember off-hand when the /proc/self/auxv works vs. /proc/cpuinfo. I vaguely recall it was something weird though, where some Android version or device accidentally took away /proc/self/auxv without adding getauxval?
On the ARMv8 check, I lifted it directly from this BoringSSL implementation. If you can outline what's not quite right here, and what could be improved, I'd appreciate it.
As I said, BoringSSL only cares about NEON and ARMv8 crypto-related bits. I expect there are other optional ARMv7 features that became mandatory in ARMv8 other than just NEON. Since BoringSSL doesn't care about any ARMv7 feature other than NEON, our code is not a good template for those features.
It probably makes sense to check how the kernel actually fills in /proc/cpuinfo and review against that.
I don't remember off-hand when the /proc/self/auxv works vs. /proc/cpuinfo. I vaguely recall it was something weird though, where some Android version or device accidentally took away /proc/self/auxv without adding getauxval?
In that case, would you recommend that if you had any fallback whatsoever, you include both /proc/self/auxv and /proc/cpuinfo as you have in the BoringSSL implementation, and as I have done in the CL under discussion?
On the ARMv8 check, I lifted it directly from this BoringSSL implementation. If you can outline what's not quite right here, and what could be improved, I'd appreciate it.
As I said, BoringSSL only cares about NEON and ARMv8 crypto-related bits. I expect there are other optional ARMv7 features that became mandatory in ARMv8 other than just NEON. Since BoringSSL doesn't care about any ARMv7 feature other than NEON, our code is not a good template for those features.
That makes sense.
Looking at the Armv8 Architecture Reference Manual, in the AArch32 execution state:
As you've stated, there could be more features that are mandatory in ARMv8, but I would say that the above are the ones we would generally care about (NEON + crypto).
I'll update the CL to reflect this. If I've missed a feature that you care about, please let me know.