Glow: Add support for FP16 and bfloat

Created on 26 Jul 2018 · 40 Comments · Source: pytorch/glow

Working with accelerators, we are going to need support for 16-bit floating point types.

  • FP16 is the binary16 format defined in IEEE 754-2008.
  • bfloat is the TensorFlow floating point type with the same exponent range as a 32-bit float.

| Type | binary16 | bfloat | binary32 |
|--------|--------:|-------:|-------:|
| k, storage width in bits |16| 16| 32|
| w, exponent field width in bits| 5|8|8|
| t, trailing significand field width in bits|10|7|23|

We should at a minimum add these types to glow::ElemKind so we can represent low-precision tensors in these types. Ideally, also add interpreter support.
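To make the field widths in the table concrete, here is a minimal sketch that decodes raw 16-bit patterns of each format into a `float`. The function names `halfBitsToFloat` and `bfloat16BitsToFloat` are made up for illustration and are not Glow APIs; the sketch assumes `float` is IEEE 754 binary32.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Decode a binary16 bit pattern: 1 sign bit, 5 exponent bits, 10 fraction bits.
// (Hypothetical helper for illustration, not part of Glow.)
float halfBitsToFloat(std::uint16_t h) {
  const int sign = (h >> 15) & 0x1;
  const int exponent = (h >> 10) & 0x1F;
  const int fraction = h & 0x3FF;
  float magnitude;
  if (exponent == 0) {
    // Zero or subnormal: fraction * 2^-24.
    magnitude = std::ldexp(static_cast<float>(fraction), -24);
  } else if (exponent == 0x1F) {
    // All-ones exponent encodes infinity (fraction == 0) or NaN.
    magnitude = fraction ? std::numeric_limits<float>::quiet_NaN()
                         : std::numeric_limits<float>::infinity();
  } else {
    // Normal: (1 + fraction / 2^10) * 2^(exponent - 15).
    magnitude = std::ldexp(static_cast<float>(fraction + 1024), exponent - 25);
  }
  return sign ? -magnitude : magnitude;
}

// Decode a bfloat16 bit pattern: 1 sign bit, 8 exponent bits, 7 fraction bits.
// The layout matches the upper 16 bits of a binary32 value, so decoding is a
// shift followed by a bit copy.
float bfloat16BitsToFloat(std::uint16_t b) {
  const std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}
```

For example, `halfBitsToFloat(0x3C00)` and `bfloat16BitsToFloat(0x3F80)` both yield 1.0, and `halfBitsToFloat(0x7BFF)` yields 65504, the largest finite binary16 value.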

All 40 comments

I did some research and tried to find a way to implement fp16 arithmetic in Glow. FP16 is not defined in the C++ language specification. There are a few libraries that implement software emulation for FP16. half.sf.net is the most popular library.

Both clang and gcc support the __fp16 type extension. This allows us to implement the most basic operations, such as type conversion and basic arithmetic. The code below works with clang:

float foo(__fp16 *a, __fp16 *b) { return *a+*b; }

I couldn't find any mention of __fp16 in msvc.

My recommendation would be to alias __fp16 to some other name (such as glow::float16) and use that type to implement fp16 support. If we ever need to support msvc we could replace the type alias with some platform specific implementation.

Will look at this soon.
Assigning to myself.

High level plan:

  1. Add HalfTy type
  2. Implement the support in the interpreter
  3. Add a conversion mode FP32 -> FP16 for testing purposes
  4. Add a FP16 quantization scheme
  5. Add FP16 in the importer
  6. Implement the support in the CPU

@qcolombet Sounds like a great plan!

I think that for item-4 the process should just be a conversion of the different nodes (1:1 conversion without profile guided stuff). For item-6 (CPU support) we could just "legalize" the graph from F16 to F32 (essentially reverse item-4). What do you think?

> I think that for item-4 the process should just be a conversion of the different nodes (1:1 conversion without profile guided stuff)

Agreed, let us stick to the basics first :).

> For item-6 (CPU support) we could just "legalize" the graph from F16 to F32 (essentially reverse item-4)

I think that's a good first step. We may want to go to a full FP16 implementation if we see precision discrepancy between the interpreter and the CPU.

TL;DR I believe we want to go with the library approach. The compiler support seems to be too sporadic for now (unless we want to bump our requirement to LLVM 6.0). I took a look and half.sf.net looks good to me.

I don't think we want to use __fp16, since it is basically an ARM thing and its operations are promoted to single precision [1, 2].
Instead, _Float16 is more or less the standard today for pure half-precision support, but the support is still being worked out [3].

As of today _Float16 is natively supported on AArch64 and ARM only [3, 4] and I am not sure we want to be the early adopter on this. In particular, _Float16 support requires LLVM 6.0 or later [5]. Thus, we would need to bump our LLVM requirements. Right now we say we support LLVM 5.0.

[1] http://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
"[...] __fp16 is defined in the ARM C Language Extensions [...]
__fp16 is a storage and interchange format only. This means that values of __fp16 promote to (at least) float when used in arithmetic operations."

[2] https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html
"For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float."

[3] http://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
"_Float16 arithmetic and operations will directly map on half-precision instructions when they are available (e.g. Armv8.2-A) [...]
If half-precision instructions are unavailable, values will be promoted to single-precision, similar to the semantics of __fp16 except that the results will be stored in single-precision."

[4] https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html#Floating-Types
"The _Float16 type is supported on AArch64 systems by default, and on ARM systems when the IEEE format for 16-bit floating-point types is selected with -mfp16-format=ieee"

[5] https://releases.llvm.org/6.0.1/tools/clang/docs/LanguageExtensions.html#half-precision-floating-point

FYI, actually looking more into half.sf.net, it turns out that some of the computations are performed in single precision mode under the hood[1], thus using _Float16 may provide the same kind of functionality. Anyhow, at first let us see how far we go with this library.

[1] https://sourceforge.net/p/half/code/HEAD/tree/tags/release-1.12.0/README.txt
"For performance reasons (and ease of implementation) many of the mathematical functions provided by the library as well as all arithmetic operations are actually carried out in single-precision under the hood"

Should we use the same library as pytorch/C2 to avoid potential numeric discrepancies, see here, https://github.com/pytorch/pytorch/tree/master/third_party for FP16?

@rdzhabarov That's a good point!

I had a quick look and, unless I am missing something, this library is more invasive and would require more work on our side than the half.sf.net one.
The main difference is that the half.sf.net one is purely C++ with operator overloading and so on, thus it fits nicely in our code base. The one that C2 uses is C based, and we would either need to patch up all the call sites (invasive) or build our own wrapper.

Having just one library would be nice though so the amount of work upfront shouldn't be an excuse. That said, I am leaning towards the sourceforge lib because it seems that the C2 stuff only provides conversions and we would still have to "implement" the other operators (i.e., use the single precision ones). Long term, I believe we would rather bump the compiler requirements and use _Float16. That way, we would be able to have native support on the platforms that have it. Given this is temporary, using another library is probably not a big deal.

What do you think?

Yeah, I looked quickly into the FP16 lib used by PyTorch as well, and it seems that it does not really have any operator support, only some conversion mechanisms between fp32 and fp16 (int16_t).
Although it would be good to use something unified between C2/PyTorch and Glow, I'm not sure what a good way to adapt FP16 would be.

cc: @Maratyszcza for any insights/details on the FP16 lib.

Most hardware architectures today provide instructions for conversion between half-precision and single-precision, e.g.:

  • ARM in ARMv7+VFPv3-FP16 or ARMv7+NEON-FP16 (i.e. nearly all ARMv7 CPUs, except Cortex-A8)
  • x86 since F16C instruction set (Intel Ivy Bridge, AMD Piledriver)
  • Nvidia GPUs since Maxwell

Few hardware architectures, however, provide instructions for computations in half-precision:

  • ARMv8.2 CPUs with NEON-FP16-Compute extension (Cortex-A55, Cortex-A75, and Cortex-A76 cores, e.g. Snapdragon 845)
  • Selected Nvidia GPUs (Maxwell-based Tegra, Pascal-based Tesla P100, Volta-based Tesla V100)

Typically, when dealing with FP16, inputs are upconverted to FP32, all computations are performed in FP32, and then results are converted back into FP16. This must be done on high-level ops, e.g. matrix multiplication or convolution. Note that for primitive floating-point operations, e.g. addition, converting the inputs to FP32, performing the addition in FP32, and then converting the result to FP16 generally produces different (and less accurate) results than a direct FP16 operation (this issue is called double rounding in the literature).

Now, if you want to support FP16 ops on targets which don't provide hardware instructions for conversion, you can perform this conversion purely in software. There are half a dozen libraries to do such conversion, but Maratyszcza/FP16 is the most performant (it includes a comparative benchmark, so you can check it yourself).
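As a rough illustration of what a pure-software conversion involves, here is a minimal sketch of a float-to-binary16 conversion with round-to-nearest-even. It is not the code from Maratyszcza/FP16 or half.sf.net; the name `floatToHalfBits` is made up, and the sketch assumes `float` is IEEE 754 binary32.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Convert a binary32 value to a binary16 bit pattern, rounding to nearest
// with ties to even. (Hypothetical helper for illustration only.)
std::uint16_t floatToHalfBits(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  const std::uint32_t sign = (bits >> 16) & 0x8000u;
  const std::uint32_t absBits = bits & 0x7FFFFFFFu;

  // NaN: return a quiet NaN with the original sign.
  if (absBits > 0x7F800000u)
    return static_cast<std::uint16_t>(sign | 0x7E00u);

  // Re-bias the exponent from binary32 (bias 127) to binary16 (bias 15).
  std::int32_t exponent = static_cast<std::int32_t>(absBits >> 23) - 127 + 15;
  std::uint32_t mantissa = absBits & 0x007FFFFFu;

  // Infinity, or a magnitude too large for binary16.
  if (exponent >= 31)
    return static_cast<std::uint16_t>(sign | 0x7C00u);

  std::uint32_t shift = 13; // 23 fraction bits down to 10
  if (exponent <= 0) {
    // Too small even for the smallest subnormal: flush to signed zero.
    if (exponent < -10)
      return static_cast<std::uint16_t>(sign);
    // Subnormal result: make the implicit leading 1 explicit and shift more.
    mantissa |= 0x00800000u;
    shift = static_cast<std::uint32_t>(14 - exponent);
    exponent = 0;
  }

  std::uint32_t result =
      (static_cast<std::uint32_t>(exponent) << 10) | (mantissa >> shift);
  const std::uint32_t remainder = mantissa & ((1u << shift) - 1u);
  const std::uint32_t halfway = 1u << (shift - 1u);
  // Round to nearest, ties to even. A carry out of the fraction bumps the
  // exponent, which also turns the largest finite value into infinity.
  if (remainder > halfway || (remainder == halfway && (result & 1u)))
    ++result;
  return static_cast<std::uint16_t>(sign | result);
}
```

With this sketch, `floatToHalfBits(1.0f)` gives `0x3C00`, `floatToHalfBits(0.1f)` gives `0x2E66`, and `floatToHalfBits(65520.0f)` rounds up to infinity (`0x7C00`).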

@qcolombet @Maratyszcza I don't think that performance is a priority for us right now. Is there a reason not to just use __fp16? Clang is already a dependency of Glow. We could wrap __fp16 in a type (or start with typealias) to make sure that we can port to MSVC in the future. I'd like to avoid external dependencies as much as possible as it increases friction, makes it more difficult to test and deploy the system, increases the chance of breakage, etc.

AFAIK, __fp16 is ARM-specific

Actually, checked gcc.godbolt.org, and it works on x86 too since Clang 3.5

@nadavrot: do you want to actually do computations (or simulate computations) in FP16 accuracy as a preparation for inference on FP16 hardware, or do you want to enable running FP16 ONNX models on Glow? If the latter, it may be enough to add ops for conversion FP16->FP32 for inputs, conversion FP32->FP16 for outputs, and update all ops to operate in FP32. If the former, you'd need your own library for FP16 computations, because neither half.sf.net nor the __fp16 type in Clang will enable actual FP16 computations on targets without HW instructions.

BTW, ONNX doesn't currently support bfloat16. @bddppq: I think we should add it.

@nadavrot see my previous post for why we don't want __fp16. What we want is _Float16, but our clang doesn't support it.
https://github.com/pytorch/glow/issues/1329#issuecomment-422190508

@nadavrot Simply put, __fp16 has a bunch of restrictions. One of them is that __fp16 is a storage and interchange type only:
we cannot use it as a parameter or return type, which makes its integration far from smooth.

@Maratyszcza @qcolombet
Hardware accelerators and compilers have different implementations for low-precision operations. There are a dozen ways to implement the innermost part of matrix multiplication or convolution with different tradeoffs between accuracy and performance (for example, the size of the accumulator, or when you round). Within the Glow compiler itself we freely convert between multiply-add and FMAs, we vectorize, and we change the order in which we multiply and accumulate values. We also perform operator-level transformations; for example, we merge batch normalization with convolutions. It is impossible to write a single implementation that matches some hardware implementation or some pure reference. In the context of float16, if we had to choose one default implementation, then expanding fp16 to fp32 before performing the arithmetic operation is a pretty good default. We will need to pay attention to how we implement conv and matmul and probably accumulate into float32.
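To illustrate why the accumulator width matters, here is a small sketch that sums the same value repeatedly with a narrow accumulator and with a float32 accumulator. It uses bfloat16 rounding rather than fp16 only because the rounding helper is short; the function names are made up for illustration, and NaN/infinity handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round a float to the nearest bfloat16 value (ties to even) and return it
// as a float. Assumes a finite input; NaN and infinity are not handled.
float roundToBfloat16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  bits += 0x7FFFu + ((bits >> 16) & 1u); // round to nearest, ties to even
  bits &= 0xFFFF0000u;                   // keep sign, exponent, 7 fraction bits
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Sum n copies of value, rounding the accumulator to bfloat16 after each add.
float sumNarrowAccumulator(float value, int n) {
  assert(n >= 0);
  float acc = 0.0f;
  for (int i = 0; i < n; ++i)
    acc = roundToBfloat16(acc + value);
  return acc;
}

// Sum n copies of value in float32 and round to bfloat16 only once at the end.
float sumWideAccumulator(float value, int n) {
  assert(n >= 0);
  float acc = 0.0f;
  for (int i = 0; i < n; ++i)
    acc += value;
  return roundToBfloat16(acc);
}
```

Summing 1.0 a thousand times, `sumNarrowAccumulator(1.0f, 1000)` stalls at 256 because 257 is not representable with 8 bits of significand precision and the tie rounds back down, while `sumWideAccumulator(1.0f, 1000)` returns 1000.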

I wrote that the goal is not to implement bit-level accuracy to emulate some hardware, so what is the goal? We could just convert all fp16 ONNX buffers at load time, right? (no)
The goal here is to prepare the entire compile stack for fp16 compilation. This means that we need to be able to represent fp16 tensors. We need to be able to operate on fp16 tensors natively, optimize these tensors, allocate memory and let the backends generate code for fp16. Glow supports Int8 across the stack but the different backends have different implementations. Hardware backends can even implement this using unsigned integers, because int8 is only the canonical representation. Our interpreter uses floating point values for Int8 scale conversions, but the CPU backend generates efficient code using the shift-mul-shift technique. We could use the same approach in our fp16 implementation. It's totally okay to extend fp16 to fp32, especially in the interpreter. Our CPU backend could use efficient fp16 SIMD instructions one day, if it ever becomes a priority. Another solution would be to legalize fp16 graph nodes on the CPU (from fp16 to bfloat or fp32), but the interpreter has to support fp16 natively to actually test things.

Now let's talk about the __fp16 extension type. I checked that __fp16 compiles and runs locally on my x86 machine using clang. Marat also verified it using godbolt. Yes, we can't pass __fp16 as an argument, but I don't think that this is necessary. We could easily wrap __fp16 in a simple C++ class and implement a few operators. This would allow us to operate on a portable type (in case we need to support MSVC). Are there other problems with __fp16?

Here is a quick program that wraps __fp16 in a struct that allows us to move it around freely, cast, add, etc.

#include <iostream>

struct FP16 {
  __fp16 data_;
  FP16(float x = 0.) : data_(x) {}
  FP16 operator+(const FP16 &) const;
  operator float() const { return data_; }
};

FP16 FP16::operator+(const FP16 &c) const {
  FP16 result;
  result.data_ = (this->data_ + c.data_);
  return result;
}

void print(FP16 x) {
  std::cout << "result" << (float)x << "\n";
}

int main() {
  FP16 x(0.3);
  FP16 y(0.3);
  FP16 z = x + y;
  print(z);
}

The point with the lib approach (which is just a header) is that we don't have to write any code!

> Actually, checked gcc.godbolt.org, and it works on x86 too since Clang 3.5

@Maratyszcza what options did you use for gcc on godbolt?
gcc complains that __fp16 is not an actual type for me. I tried with -fp16-format -mfp16-format=ieee but it doesn't like those options either.

@qcolombet: it works only with clang 3.5+

gcc supports __fp16 only on ARM/AArch64

Thanks for double checking!
That rules out using it directly... like I suspected!

I've added a mode to the loader to be able to convert fp32 model into fp16 in qcolombet/glow@f2bfaa01.
I need to add tests to submit a proper pull request but the basic functionality should be there.

I am still missing the implementation of a bunch of operators needed to run resnet50, but we are getting there!

Note to myself, here are the instructions we need to support for resnet50:
avgpool
batchedadd
convolution [done]
elementadd
elementmax
matmul
maxpool
softmax
splat
tensorview
transpose

@qcolombet Can we also support the opposite direction, i.e. loading a network that uses FP16, but converting all values into fp32 internally? This would allow us to run those fp16 models without any further changes in Glow, right? Of course, the performance may be worse compared to the real fp16-based execution.

Sure thing. The conversion process is pretty simple, thanks to the copyWithCast and setType things that I added (either in that commit or recently).

In that sense, applying the result of quantization is just a special case of this kind of cast. I may not want to generalize the conversion code to support that at first though :).

I was growing bored of having to do the following for each instruction in the interpreter:

  • Hoisting the FP functionality into a function
  • Adding a template parameter
  • Dispatching to the right template based on ElemKind

This is basically what we have to do everywhere in the compiler because of the problem I was mentioning with our handle in #1720.

So instead, I hacked a template-free handle, demonstrated in action in qcolombet/glow@65b9e1c4216d3b65c0dc8f9056573cebe9db0d4e. (No need to review, I was just hacking.)
In that patch, the support of AvgPool for fp16 is just a matter of changing the type of the handle to this template-free handle. Similarly, the support for convolution was updated and we see that we can drop the template argument.

The advantage of this approach is that the actual manipulation of the tensor data is hidden in one place (we could likely support quantization in a similar fashion, but I haven't really thought about it) instead of being spread everywhere. I.e., if we add another type, we have this one class to touch instead of having to patch the interpreter, the optimizer, and so on.

The drawback of this approach is that the underlying type cannot be determined at compile time (of glow) and in particular, the compiler (compiling glow) wouldn't be able to understand that a loop always accesses the same type. Thus, it will break vectorization in the interpreter and in glow in general. That said, I don't think the goal of the project is to feature a super fast interpreter, and my gut tells me that not having vectorized code inside a compiler doesn't matter.
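To make the trade-off concrete, here is a minimal sketch of what a template-free (type-erased) handle could look like. It is not the code from the patch mentioned above; `ErasedHandle`, its methods, and the reduced `ElemKind` enum are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical subset of element kinds, for illustration only.
enum class ElemKind { FloatTy, Int32Ty };

// A handle that hides the element type behind a runtime tag. Every access
// goes through a switch on the tag, which is what keeps the compiler from
// proving that a loop always touches the same type.
class ErasedHandle {
public:
  ErasedHandle(void *data, std::size_t size, ElemKind kind)
      : data_(data), size_(size), kind_(kind) {}

  std::size_t size() const { return size_; }

  // Read element i, widened to double regardless of the stored type.
  double get(std::size_t i) const {
    assert(i < size_);
    switch (kind_) {
    case ElemKind::FloatTy:
      return static_cast<const float *>(data_)[i];
    case ElemKind::Int32Ty:
      return static_cast<const std::int32_t *>(data_)[i];
    }
    return 0.0; // not reached for valid kinds
  }

  // Write element i, narrowing from double to the stored type.
  void set(std::size_t i, double v) {
    assert(i < size_);
    switch (kind_) {
    case ElemKind::FloatTy:
      static_cast<float *>(data_)[i] = static_cast<float>(v);
      break;
    case ElemKind::Int32Ty:
      static_cast<std::int32_t *>(data_)[i] = static_cast<std::int32_t>(v);
      break;
    }
  }

private:
  void *data_;
  std::size_t size_;
  ElemKind kind_;
};

// A kernel written once, with no template parameter, that works for any kind.
void addScalar(ErasedHandle h, double s) {
  for (std::size_t i = 0; i < h.size(); ++i)
    h.set(i, h.get(i) + s);
}
```

In this sketch, supporting a new element type means adding one `case` to `get` and `set`, while `addScalar` and every other kernel stay unchanged; the price is a branch on every element access.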

I was wondering if people are interested in having this productized, or if I should just continue with the mechanical change, knowing that if we add new types, say Int23QTy (who knows :P), we would have to do the same thing again!

With that patch, status is:
avgpool [done]
batchedadd
convolution [done]
elementadd
elementmax
matmul
maxpool
softmax
splat [done]
tensorview
transpose

I think it'd be pretty nice to be able to add types without making a bunch of (mechanical, but still numerous) changes. I feel a bit weird about introducing a type-erased handle that wraps a typed handle that wraps an untyped tensor :-). And I don't feel like I have enough info w.r.t. performance implications to be able to judge. I like the existence of the interpreter (for correctness checking), and I'd like it not to be dog-slow, but I have no idea how this proposed patch would affect things.

How's that for a bunch of waffling :-D.

I wonder if we could write a generic dispatch method that could wrap a bunch of the functions in the appropriate handle type, with suitable template sorcery. Something like (warning, pseudo-C++):

template <typename Fn, typename... Args>
void dispatch(ElemKind k, Fn fn, Args... args) {
  switch (k) {
    case ElemKind::Float16:
      // Pseudo-C++: call fn specialized for this element kind.
      fn<ElemKind::Float16>(args...);
      break;
    /* etc */
  }
}

(But that may or may not work, and may be too much template magic to stomach)
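One way the idea above could be made to compile is to pass the element type as a tag argument to a functor with a templated call operator. This is only a sketch under that assumption: `TypeTag`, `dispatchByKind`, `SumElements`, `sumBuffer`, and the reduced `ElemKind` enum are hypothetical names and not Glow code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical subset of element kinds, for illustration only.
enum class ElemKind { FloatTy, Int8Ty, Int32Ty };

// Empty tag type that carries the C++ element type to the callee.
template <typename T> struct TypeTag { using type = T; };

// Map the runtime ElemKind to a compile-time type and invoke fn with it.
template <typename Fn, typename... Args>
void dispatchByKind(ElemKind k, Fn &&fn, Args &&...args) {
  switch (k) {
  case ElemKind::FloatTy:
    fn(TypeTag<float>{}, std::forward<Args>(args)...);
    break;
  case ElemKind::Int8Ty:
    fn(TypeTag<std::int8_t>{}, std::forward<Args>(args)...);
    break;
  case ElemKind::Int32Ty:
    fn(TypeTag<std::int32_t>{}, std::forward<Args>(args)...);
    break;
  }
}

// Example kernel: the templated call operator is instantiated per type.
struct SumElements {
  template <typename T>
  void operator()(TypeTag<T>, const void *data, std::size_t n,
                  double &out) const {
    const T *typed = static_cast<const T *>(data);
    out = 0.0;
    for (std::size_t i = 0; i < n; ++i)
      out += static_cast<double>(typed[i]);
  }
};

// Untyped entry point that picks the right instantiation at runtime.
double sumBuffer(ElemKind k, const void *data, std::size_t n) {
  double result = 0.0;
  dispatchByKind(k, SumElements{}, data, n, result);
  return result;
}
```

With this shape, the switch over `ElemKind` is written once; a Float16 kind would add one more `case` once a half-precision C++ type is available, and each kernel keeps a fully typed inner loop.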

@qcolombet The implementation of the quantized operation is in some cases different. For example, this[1]. There is added logic for type conversion, clipping, and accumulation into the correct type.

I understand that it's annoying to turn methods into template methods. I wonder, can you think of a clever way to improve the templated dispatch? Would a macro, or templatizing the whole loop, or maybe some other mechanism make the dispatch part of the code nicer? I guess that there are unusual operations that have different input and output types, like "cast". These operations complicate things.

Regarding the speed of the interpreter. How many tests do we want to be able to run in our CI in debug mode on each commit?

[1] - https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/libjit/libjit_matmul.cpp#L298

@bertmaher thanks for the feedback.

> How's that for a bunch of waffling :-D.

Haha :).

> I feel a bit weird about introducing a type-erased handle that wraps a typed handle that wraps an untyped tensor :-).

I agree. The patch was made with the intention of not changing the core representation. What I have in mind if we were to productize this is to kill the template parameter from the Handle class. I.e., not introducing the FPHandle of this patch.

> And I don't feel like I have enough info w.r.t. performance implications to be able to judge.

Performance-wise, with AVX2 we potentially slow down the highly vectorized kernel by a factor of 8.
That's bad, but maybe still usable(?)
That said, if performance is a concern, we can still resort to directly using the raw pointer with the actual type.

> (But that may or may not work, and may be too much template magic to stomach)

Same here (too much magic :P).

@nadavrot thanks for the feedback too!

> The implementation of the quantized operation is in some cases different. For example, this[1]. There is added logic for type conversion, clipping, and accumulation into the correct type.

Agreed, but that doesn't seem incompatible with what I am suggesting. Essentially the type conversion and clipping are just casting to/from the quantized types, which the FPData class abstracts away. Hand-waving a lot here :).

> I wonder, can you think of a clever way to improve the templated dispatch?

Yes, what I am suggesting :P.

> Would a macro, or templatizing the whole loop, or maybe some other mechanism would make the dispatch part of the code nicer?

Aside from the relative complexity of templates, I don't like the fact that each time we add a new type, the template instantiations grow the binary size (potentially by a non-trivial amount). I guess we are not too concerned about code size, plus it is mainly the interpreter code that is affected, and Glow could be shipped with that stripped away.
Anyway, I imagine what @bertmaher was pseudo-coding could work if we were to stay heavily templated on that front.

> Regarding the speed of the interpreter. How many tests do we want to be able to run in our CI in debug mode on each commit?

Not sure, but as far as I could tell things were not noticeably slower. I haven't done the full conversion though, nor full measurements.

Bottom line, this is something to keep in mind, but for now I will go back to the mechanical change. Indeed, I understand that revamping our handles would need more discussion and thorough analysis.

Alright, sent the PR for all the nodes required to support ResNet50:
avgpool [done]
batchedadd [done]
convolution [done]
elementadd [done]
elementmax [done]
matmul [done]
maxpool [done]
softmax [done]
splat [done]
tensorview [done]
transpose [done]

I did a quick FP16 run using my prototype from #1747 and saw some precision issues. That said, I may have a bug in the prototype because the results are random: different runs give different results.

Turns out it was an easy fix: I forgot to change the handle that prints the result of the inference from float to float16_t. With that fixed, resnet50 works like a charm in half precision:
./bin/image-classifier ../build_release/tests/images/imagenet/*.png -image_mode=0to1 -m=resnet50 -model_input_name=gpu_0/data -use-fp16 -run-fp16
Model: resnet50
File: ../build_release/tests/images/imagenet/cat_285.png Result:285
File: ../build_release/tests/images/imagenet/dog_207.png Result:207
File: ../build_release/tests/images/imagenet/zebra_340.png Result:340

Attaching the IR in fp16 for the record.
resnet50_fp16.txt

Forgot to update this.
With #1931 the FP16 support is now complete for inference, for the interpreter.
That means Glow supports FP16, and now each backend needs to implement whatever it wants to support.

Can we close this issue?

We are indeed done for FP16, but the issue also mentioned BFloat.
I don't know if we want to close this one and file a new one for BFloat or just keep this one around.

I think that we can file a new task for BFloat when the time is right. BFloat will deserve a clean task where we can discuss the design and argue over the implementation. ;)

Hehe :).
Sounds good to me!
