Glow: Add support for FP16 and bfloat

Created on 26 Jul 2018 · 40 Comments · Source: pytorch/glow

Working with accelerators, we are going to need support for 16-bit floating point types.

FP16 is the binary16 format defined in IEEE 754-2008.
bfloat is the TensorFlow floating-point type with the same exponent range as a float.

| Type | binary16 | bfloat | binary32 |
|--------|--------:|-------:|-------:|
| k, storage width in bits |16| 16| 32|
| w, exponent field width in bits| 5|8|8|
| t, trailing significand field width in bits|10|7|23|
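
As a rough illustration of the table: because bfloat shares the sign bit and the 8 exponent bits with binary32, a bfloat value can be seen as the upper 16 bits of a float. The sketch below converts by plain truncation, which is only one possible choice (a rounding conversion is also possible). The helper names are hypothetical and are not part of Glow or TensorFlow.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; assumes IEEE binary32 floats.

// Keep the sign, the 8 exponent bits and the top 7 significand bits.
inline uint16_t floatToBfloatTruncate(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16); // drop the low 16 significand bits
}

// Widen back to binary32 by filling the dropped significand bits with zeros.
inline float bfloatToFloat(uint16_t b) {
  const uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

Values near the top of the float range survive the round trip (with reduced precision), whereas they would overflow in binary16, whose largest finite value is 65504.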

We should at a minimum add these types to glow::ElemKind so we can represent low-precision tensors in these types. Ideally, also add interpreter support.
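
A minimal sketch of what such an addition could look like. The enum below is a hypothetical, trimmed-down stand-in and not the actual definition of glow::ElemKind; the member names and the helper function are assumptions made for illustration.

```cpp
#include <cstddef>

namespace sketch {

// Hypothetical stand-in for glow::ElemKind. The real enum has other members
// and the names finally chosen for the 16-bit types may differ.
enum class ElemKind { FloatTy, Float16Ty, BFloat16Ty };

// Storage width in bytes, following row k of the table above.
constexpr std::size_t elementSizeInBytes(ElemKind kind) {
  return kind == ElemKind::FloatTy ? 4 : 2;
}

} // namespace sketch
```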


stoklund

👍3

Most helpful comment

Turns out it was an easy fix: I forgot to change the handle that prints the result of the inference from float to float16_t. With that fixed, resnet50 works like a charm in half precision:
./bin/image-classifier ../build_release/tests/images/imagenet/*.png -image_mode=0to1 -m=resnet50 -model_input_name=gpu_0/data -use-fp16 -run-fp16
Model: resnet50
File: ../build_release/tests/images/imagenet/cat_285.png Result:285
File: ../build_release/tests/images/imagenet/dog_207.png Result:207
File: ../build_release/tests/images/imagenet/zebra_340.png Result:340

Attaching the IR in fp16 for the record.
resnet50_fp16.txt

qcolombet on 4 Oct 2018

🎉4 ❤1

All 40 comments

I did some research on how to implement FP16 arithmetic in Glow. FP16 is not defined in the C++ language standard. There are a few libraries that implement software emulation for FP16; half.sf.net is the most popular.

Both clang and gcc support the __fp16 type extension. This allows us to implement the most basic operations, such as type conversion and basic arithmetic. The code below works with clang:

float foo(__fp16 *a, __fp16 *b) { return *a+*b; }

I couldn't find any mention of __fp16 in msvc.

My recommendation would be to alias __fp16 to some other name (such as glow::float16) and use that type to implement fp16 support. If we ever need to support msvc we could replace the type alias with some platform specific implementation.
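
A minimal sketch of the alias idea, under some assumptions: the namespace `sketch` and the name `kHasHalfStorage` are made up for illustration, the alias is only enabled where the toolchain advertises IEEE half storage through the ACLE macro `__ARM_FP16_FORMAT_IEEE`, and the fallback branch is a plain float placeholder rather than a real platform-specific implementation.

```cpp
namespace sketch {

#if defined(__ARM_FP16_FORMAT_IEEE)
// Toolchains that advertise IEEE binary16 storage (ARM C Language Extensions).
using float16 = __fp16;
constexpr bool kHasHalfStorage = true;
#else
// Placeholder so the sketch builds elsewhere. This is NOT half precision;
// a real port would substitute a platform-specific 16-bit type here.
using float16 = float;
constexpr bool kHasHalfStorage = false;
#endif

// Same shape as the snippet above: the operands are read through the alias
// and the addition itself is carried out in float.
inline float addAsFloat(const float16 *a, const float16 *b) { return *a + *b; }

} // namespace sketch
```

Keeping every use behind one alias means only this block has to change when a compiler without `__fp16` needs to be supported.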

nadavrot on 21 Aug 2018

👍1

Will look at this soon.
Assigning to myself.

qcolombet on 14 Sep 2018

👍1

High level plan:

1. Add HalfTy type
2. Implement the support in the interpreter
3. Add a conversion mode FP32 -> FP16 for testing purposes
4. Add a FP16 quantization scheme
5. Add FP16 in the importer
6. Implement the support in the CPU

qcolombet on 17 Sep 2018

@qcolombet Sounds like a great plan!

I think that for item-4 the process should just be a conversion of the different nodes (1:1 conversion without profile guided stuff). For item-6 (CPU support) we could just "legalize" the graph from F16 to F32 (essentially reverse item-4). What do you think?

nadavrot on 17 Sep 2018

> I think that for item-4 the process should just be a conversion of the different nodes (1:1 conversion without profile guided stuff)

Agreed, let us stick to the basics first :).

> For item-6 (CPU support) we could just "legalize" the graph from F16 to F32 (essentially reverse item-4)

I think that's a good first step. We may want to go to a full FP16 implementation if we see precision discrepancies between the interpreter and the CPU.

qcolombet on 17 Sep 2018

👍1

TL;DR I believe we want to go with the library approach. The compiler support seems to be too sporadic for now (unless we want to bump our requirement to LLVM 6.0). I took a look and half.sf.net looks good to me.

I don't think we want to use __fp16, since it is basically an ARM thing and its operations are promoted to single precision [1, 2].
Instead _Float16 is somewhat the standard today for pure half-precision support, but the support is still being worked out [3].

As of today _Float16 is natively supported on AArch64 and ARM only [3, 4] and I am not sure we want to be the early adopter on this. In particular, _Float16 support requires LLVM 6.0 or later [5]. Thus, we would need to bump our LLVM requirements. Right now we say we support LLVM 5.0.

[1] http://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
"[...] __fp16 is defined in the ARM C Language Extensions [...]
__fp16 is a storage and interchange format only. This means that values of __fp16 promote to (at least) float when used in arithmetic operations."

[2] https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html
"For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float."

[3] http://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
"_Float16 arithmetic and operations will directly map on half-precision instructions when they are available (e.g. Armv8.2-A) [...]
If half-precision instructions are unavailable, values will be promoted to single-precision, similar to the semantics of __fp16 except that the results will be stored in single-precision."

[4] https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html#Floating-Types
"The _Float16 type is supported on AArch64 systems by default, and on ARM systems when the IEEE format for 16-bit floating-point types is selected with -mfp16-format=ieee"

[5] https://releases.llvm.org/6.0.1/tools/clang/docs/LanguageExtensions.html#half-precision-floating-point

qcolombet on 18 Sep 2018

FYI, looking more into half.sf.net, it turns out that some of the computations are performed in single precision under the hood [1], so using _Float16 may provide the same kind of functionality. Anyhow, at first let us see how far we get with this library.

[1] https://sourceforge.net/p/half/code/HEAD/tree/tags/release-1.12.0/README.txt
"For performance reasons (and ease of implementation) many of the mathematical functions provided by the library as well as all arithmetic operations are actually carried out in single-precision under the hood"

qcolombet on 18 Sep 2018

Should we use the same FP16 library as PyTorch/C2 to avoid potential numeric discrepancies? See https://github.com/pytorch/pytorch/tree/master/third_party.

rdzhabarov on 18 Sep 2018

@rdzhabarov That's a good point!

I had a quick look and, unless I am missing something, this library is more invasive and would require more work on our side than the half.sf.net one.
The main difference is that the half.sf.net one is purely C++ with operator overloading and so on, so it fits nicely in our code base. The one that C2 uses is C-based, and we would either need to patch up all the call sites (invasive) or build our own wrapper.

Having just one library would be nice though so the amount of work upfront shouldn't be an excuse. That said, I am leaning towards the sourceforge lib because it seems that the C2 stuff only provides conversions and we would still have to "implement" the other operators (i.e., use the single precision ones). Long term, I believe we would rather bump the compiler requirements and use _Float16. That way, we would be able to have native support on the platforms that have it. Given this is temporary, using another library is probably not a big deal.

What do you think?

qcolombet on 18 Sep 2018

👍1

Yeah, I looked quickly into the FP16 lib used by PyTorch as well, and it seems that it does not really have any operator support, only some conversion mechanisms between fp32 and fp16 (int16_t).
Although it would be good to use something unified between C2/PyTorch and Glow, I'm not sure what a good way to adapt FP16 would be.

cc: @Maratyszcza for any insights/details on the FP16 lib.

rdzhabarov on 18 Sep 2018

Most hardware architectures today provide instructions for conversion between half-precision and single-precision, e.g.:

- ARM in ARMv7+VFPv3-FP16 or ARMv7+NEON-FP16 (i.e. nearly all ARMv7 CPUs, except Cortex-A8)
- x86 since F16C instruction set (Intel Ivy Bridge, AMD Piledriver)
- Nvidia GPUs since Maxwell

Few hardware architectures, however, provide instructions for computations in half-precision:

- ARMv8.2 CPUs with NEON-FP16-Compute extension (Cortex-A55, Cortex-A75, and Cortex-A76 cores, e.g. Snapdragon 845)
- Selected Nvidia GPUs (Maxwell-based Tegra, Pascal-based Tesla P100, Volta-based Tesla V100)

Typically, when dealing with FP16, inputs are upconverted to FP32, all computations are performed in FP32, and then results are converted back into FP16. This must be done on high-level ops, e.g. matrix multiplication, or convolution. Note that for primitive floating-point operations, e.g. addition, converting inputs to FP32, performing addition in FP32, and then converting result to FP16, generally produces different (and less accurate) results than a direct FP16 operation (this issue is called double rounding in the literature).

Now, if you want to support FP16 ops on targets which don't provide hardware instructions for conversion, you can perform this conversion purely in software. There are half a dozen libraries to do such conversion, but Maratyszcza/FP16 is the most performant (it includes a comparative benchmark, so you can check it yourself).
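
To make the two ideas above concrete, here is a minimal sketch of a pure-software conversion between binary16 bit patterns and float, followed by a high-level op (a dot product) that stores FP16 but computes in FP32. The function names are hypothetical and this is not the API or the implementation of Maratyszcza/FP16 or half.sf.net; it assumes IEEE binary32 floats and the default round-to-nearest rounding mode, and it favours readability over speed.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>

// Decode IEEE binary16 bits (1 sign, 5 exponent, 10 significand bits).
inline float halfBitsToFloat(uint16_t h) {
  const int exp = (h >> 10) & 0x1F;
  const int man = h & 0x3FF;
  float magnitude;
  if (exp == 0x1F) { // infinity or NaN
    magnitude = man ? std::numeric_limits<float>::quiet_NaN()
                    : std::numeric_limits<float>::infinity();
  } else if (exp == 0) { // zero or subnormal: man * 2^-24
    magnitude = std::ldexp(static_cast<float>(man), -24);
  } else { // normal: (1024 + man) * 2^(exp - 25)
    magnitude = std::ldexp(static_cast<float>(man | 0x400), exp - 25);
  }
  return (h & 0x8000) ? -magnitude : magnitude;
}

// Encode a float as binary16 bits, rounding to nearest with ties to even.
inline uint16_t floatToHalfBits(float f) {
  uint32_t x;
  std::memcpy(&x, &f, sizeof(x));
  const uint32_t sign = (x >> 16) & 0x8000u;
  x &= 0x7FFFFFFFu; // magnitude bits only
  if (x > 0x7F800000u) // NaN: return a quiet NaN
    return static_cast<uint16_t>(sign | 0x7E00u);
  if (x >= 0x477FF000u) // infinity, or >= 65520 which rounds to infinity
    return static_cast<uint16_t>(sign | 0x7C00u);
  if (x < 0x38800000u) { // below 2^-14: subnormal half or zero
    float magnitude;
    std::memcpy(&magnitude, &x, sizeof(magnitude));
    // Scaling by 2^24 is exact; nearbyint then rounds to a count of 2^-24.
    const float units = std::nearbyint(std::ldexp(magnitude, 24));
    return static_cast<uint16_t>(sign | static_cast<uint32_t>(units));
  }
  // Normal range: round away the low 13 significand bits, rebias exponent.
  const uint32_t rounded = x + 0x0FFFu + ((x >> 13) & 1u);
  return static_cast<uint16_t>(sign | ((rounded - 0x38000000u) >> 13));
}

// FP16 storage, FP32 arithmetic: inputs are upconverted, the accumulation
// happens in float, and only the final result is converted back to FP16.
inline uint16_t dotHalf(const uint16_t *a, const uint16_t *b, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    acc += halfBitsToFloat(a[i]) * halfBitsToFloat(b[i]);
  return floatToHalfBits(acc);
}
```

On targets with F16C or the ARM FP16 conversion instructions, the two conversion helpers would normally be replaced by the hardware instructions, while the structure of `dotHalf` would stay the same.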