Go: cmd/asm: ARM64 instructions FLDPQ and FSTPQ not supported

Created on 12 Aug 2020  ·  40Comments  ·  Source: golang/go

What version of Go are you using (go version)?

$ go version
go version go1.15 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build668753232=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Assemble the following ARM64 assembly file:

TEXT foo(SB),0,$0-0
        FLDPQ (R0), (F0, F1)
        FSTPQ (F0, F1), (R0)

What did you expect to see?

An object file with instructions that disassemble to the GNU-style instructions

   0:   ad400400    ldp q0, q1, [x0]
   4:   ad000400    stp q0, q1, [x0]

What did you see instead?

$ go tool asm test.s
test.s:2: unrecognized instruction "FLDPQ"
test.s:3: unrecognized instruction "FSTPQ"
asm: assembly of test.s failed

It appears that only FLDPS and FLDPD are supported. Going through the instruction lists, it appears that no FP instructions with 128 bit instruction size seem to be supported at all. It would be great to have them added as I need them for a project.

FeatureRequest NeedsInvestigation arch-arm64

Most helpful comment

Thanks, will do so. I'll proceed and prototype the algorithm in the GNU assembler so I can tell you which instructions exactly I'm going to need.

All 40 comments

I found a bunch more issues with the assembler. Is it okay if I make a large issue just gathering all of them?

Some more valid (I hope) instructions that don't encode:

VLD1R.P 1(R1), [V9.B8] // displacement 8 is incorrectly accepted
FMOVD $0x8040201008040201, F10 // won't generate a literal pool load, won't encode
VCMTST V8.B8, V9.B8, V9.B8 // unknown instruction
VSUB V10.B8, V9.B8, V10.B8 // invalid rrr, VADD works

There's probably a bunch more. Right now, almost every single SIMD instruction I want to use is broken in the assembler. This is not a sustainable state of affairs.

@fuzxxl Thank you for the observation.
Yes, the current go assembler on ARM64 is not complemented. We added assembler support for most instructions because these instructions need to be used. Can you tell us which instructions you want to use now? We are happy to enable them, or you can try to do it if you are interested. 😊

@zhangfannie Best would be to have support for all SIMD and FP instructions, but fixing those I mentioned might be a good start. I have not written the code yet, so I can't tell for sure what exactly it is going to involve. If you like, I can try to implement the algorithm using the GNU assembler and then tell you what instructions I used. How does that sound?

Best would be to have support for all SIMD and FP instructions, but fixing those I mentioned might be a good start.

Please don't take this as a critisism, but as an observer of a number of this class of request, the best results are obtained when the OP, you in this case, can enumerate exactly which instructions to add. I have no explanation why requests for _all XXX instructions_ are unsuccessful, but encourage you to list precisely the instructions you would like to see added as there is anecdotal evidence that requests formed in this way are resolved faster.

Thanks, will do so. I'll proceed and prototype the algorithm in the GNU assembler so I can tell you which instructions exactly I'm going to need.

So, I've now sketched out a prototype using the UNIX assembler. The following new instructions are required:

  • VEOR (possibly already implemented)
  • VBIT
  • VBSL
  • VLD1 (possibly already implemented)
  • VLD1.P (might already be implemented)
  • FMOVD $imm64, Fr (translates to a load from a literal pool; having FMOVQ too would be nice)
  • VMOVI (seems like that one is already implemented)
  • VLD1R.P (implemented, but incorrectly; cf. example above)
  • VCMTST
  • VSUB (already implemented but defective)
  • VAND (possibly already implemented)
  • VUSHR (possibly already implemented)
  • VSHL (possibly already implemented)
  • UADDLV (possibly already implemented)
  • VMOV Vx.y[z], Vx.y[z] (possibly already implemented)
  • VUXTL
  • VUXTL2
  • VADD (possibly already implemented)
  • VST1 (possibly already implemented)
  • PRFM (possibly already implemented; nice to have)

Especially important is the literal-pool FMOVD $imm64 as I can't replace that one with a WORD directive or similar due to the impossibility of manipulating literal pools directly.

@fuzxxl We will enable the above instructions ASAP. 🙂

Thanks! Very much appreciated.

@fuzxxl The current go assembler have enabled _FMOV (scalar, immediate)_ instruction, the go syntax is FMOVS and FMOVD. And according to ARM reference, there are no instructions like FMOVQ .
image

The instruction I want is listed in the section “LDR (literal, SIMD&FP)” of the ARM reference:

LDR <St>, label
LDR <Dt>, label
LDR <Qt>, label

here, label has a PC-relative addressing mode referring to a literal in a nearby literal pool. These instructions should therefore be translated to Go syntax as

FMOVS $imm32, Fr
FMOVD $imm64, Fr
FMOVQ $imm128, Fr

The assembler needs to place the immediate in a nearby literal pool and substitute its address.

Of course, the assembler is free to use the “FMOV (scalar, immediate)” variant if the immediate fits. Similarly, it should translate these to movi or mvni if the immediate is of the appropriate form.

The desired usage is to load a bit mask into a SIMD register:

FMOVD $0x8040201008040201, F0

Perhaps the mnemonics MOVS, MOVD, and MOVQ might indeed be better here as the instructions do not directly deal with floating point numbers.

Change https://golang.org/cl/249757 mentions this issue: cmd/asm: fix the error of checking the post-index offset of VLD[1-4]R instructions of arm64

Change https://golang.org/cl/249758 mentions this issue: cmd/internal/obj/arm64: enable some SIMD instructions

Thanks for the changes. You made a typo in changeset 249758, file asm7.go:5111:

    case 101: // /FMOVS/FMOVD/FMOVQ(liternal, SIMD&FP)

I will modify it. 😅 Thank you.

Due to some further optimisations in the code, I might also need the following instructions:

  • VUZP1 and VUZP2
  • VUXTL and VUXTL2 (or their generic variants, USHLL and USHLL2)
  • VADDP (might already be implemented)
  • VBIF

It would be great if you could add them, too. I hope this covers all of them.

Thanks for all your great help! I'm feeling like we're getting somewhere with this.

@fuzxxl You are welcome. We will enable the assembler's support for these instructions ASAP. 🙂

Thanks a lot for your hard work!

Change https://golang.org/cl/251519 mentions this issue: cmd/internal/obj/arm64: add support for arm64 instruction FLDPQ and FSTPQ

Change https://golang.org/cl/253659 mentions this issue: cmd/asm: add more SIMD instructions on arm64

@zhangfannie I've tried your new instructions, thanks a lot! One thing I noticed is that FMOVD and FMOVQ take a vector register for an argument (e.g. V0) despite having an F prefix in the mnemonic. Taking a FP register (e.g. F0) might have been more idiomatic. I also noticed that it is unfortunately not possible to specify 128 bit immediates. For example,

FMOVQ $0x123456789abcdef0123456789abcdef, V0

fails to assemble with an “immediate out of range” message. And even if I choose a suitably sized immediate like

FMOVQ $0x123456789abcdef0, V0

the assembler only generates a 64 bit literal in the literal pool despite this instruction reading 128 bit from the literal pool. I would be great to have this improved.

@fuzxxl For the issue of "immediate out of range". This issue is reported by strconv.ParseUint(str, 0, 64) called by p.atoi() in the process of parsing immediate. The current go only supports bit sizes 0, 8, 16, 32 and 64, if bit size is below 0 or above 64, an error is returned (Please see https://golang.org/pkg/strconv/#ParseUint). If we want to parse a 128 bit immediate, we need to support 128 bit sizes. @cherrymui Can we do it? Thank you.

Go doesn't have 128-bit integers (for now), so I don't think we want to do that. Instead, can we let it take two 64-bit operands, for the high and low parts separately? E.g. FMOVQ $1, $2, V3 is V3 = 1<<64 + 2 ? This doesn't look as nice but maybe fine.

Or, we only support loading from memory, and let user write large constants to memory, something like FMOVQ x(SB), V3 with

DATA x+0(SB)/8, $lo64
DATA x+8(SB)/8, $hi64
GLOBL x(SB),RODATA,$16

GNU assembler doesn't seem to support directly loading large constant either. It needs a label. We could use a global data symbol.

@cherrymui The problem with a global symbol (as opposed to a literal pool) is that instructions are wasted generating the address of the global symbol. This incurs needless cost in high-performance code. Having two separate operands would be fine with me and I suppose it's a good alternative. Another idea would be to support a form that loads from a local label such that the programmer can manually create a literal pool.

Any resolution on the inconsistency of using V0 instead of F0 for the register we load into? After all, the instruction mnemonic indicates an FP-type instruction, not a SIMD-type instruction.

How many instructions are there like this, where it takes a constant that doesn't fit into 64-bit? Is this the only case or there are still many? I'm less sure about supporting manual constant pool management, mainly for consistency (there have been several instructions that use the constant pool managed by the assembler).

I agree that the line between FP instruction and vector instruction is blurry. Does this instruction have anything specific to FP? If not, maybe we should just use VMOVQ (or VMOV) and take a V register?

@cherrymui I am unaware of any other instructions operating on 128 bit constants in a literal pool. I think having to write this one with two immediates is a workable choice.

The instruction is coded as taking a FP-style register in the ARM reference manual, so FMOVQ is consistent with that choice. I am inclined to say that both VMOVQ $imm, V0 and FMOVQ $imm, F0 should be supported as aliases to establish consistency in both SIMD and FP code and to make writing macros easier. In any case, the same choice should be made for literal-pool-loading FMOVD, too (this one appears to be somewhat more common in FP code for loading float64 constants, but it's also useful in vector code). It's not my call to make though.

@cherrymui @clausecker Thank you for the solution. I will post a patch to write 128 bit with two immediates.

@cherrymui @clausecker I spent some time trying to find a SIMD-FP instruction to do the "Dn << 64 + Dm = Qd" operation we need. Unfortunately, I did not find such an instruction 😥. Do you have any ideas? Thank you.

@zhangfannie I'm not sure what you mean. The idea was to have the literal-loading form of FMOVQ be

    FMOVQ $foo, $bar, F0

where foo and bar will be placed in a nearby literal pool such that they are adjacent. This then compiles to something like

    FMOVQ pool(PC), F0
    ...
pool:
    DWORD $foo
    DWORD $bar

Or perhaps I misunderstood something here?

@clausecker Sorry, my understanding is wrong. The patch will be submitted soon. Thank you for the explanation. 😊

@cherrymui @zhangfannie While we're add it, I have difficulty understanding what the Go syntax for a load or store instruction like these is:

ldr q0, [r5, #32]
str q0, [r5, #32]

I'm sure they are implemented (at least I hope), but neither of these variants seem to work:

FMOVQ 32(R5), F0
VMOVQ 32(R5), V0
MOVQ 32(R5), F0
MOVQ 32(R5), V0

Have I missed anything?

FMOVQ/VMOVQ seem the right syntax (MOVQ is not). They don't seem to be implemented yet, though. For now, FMOVQ only supports constant source operand, and VMOVQ does not exist. (We have VMOV, but that is currently used only for moves between V registers.)

@cherrymui Being able to load/store SIMD and FP registers would be a useful addition I suppose. I can use VLDR1 in many cases, but the addressing mode is pretty restrictive.

I agree it is useful. It just needs to be implemented.

@cherrymui

Thanks! If you're adding this one, could you also add VBIC and VUSRA? It turns out these ones can be used to shave off some instructions in my code.

@clausecker We will do it. Thank you.

@zhangfannie Super cool! Reading the manual further, I believe VSRI, VSLI, VUADDW, and VUADDW2 might be very helpful, too.

Change https://golang.org/cl/255900 mentions this issue: cmd/asm: fix the issue of moving 128-bit integers to vector registers on arm64

@clausecker The patch is ready and is under internal review. I will submit it as soon as possible. Thank you.

Change https://golang.org/cl/258397 mentions this issue: cmd/asm: add several arm64 SIMD instructions

Was this page helpful?
0 / 5 - 0 ratings