Glow: Refactor Backend interface

Created on 6 Jul 2018 · 8 Comments · Source: pytorch/glow

The glow::Backend interface is currently storing a lot of state, and it needs to be refactored. We want to clearly separate the hardware abstraction layer from the different kinds of state we're storing.

Over in #1176 we want to compile multiple functions to run on different CPUs. In the future, we also want to support multiple GPUs and other accelerators, so we need a backend interface that separates state related to different functions and different execution units.

I'll elaborate on a design in the comments below.

stoklund

Most helpful comment

Question: Should we consider the idea of a dynamic backend registration in the scope of the Backend interface re-factoring?

Currently, we need to statically enumerate all the backends in the BackendKind enum in Backend.h, and we have a createBackend method that creates new backend instances based on this information. This approach results in tight coupling and in the need to rebuild the whole project whenever a new backend is added. It also makes integrating new backends more complicated.

Maybe we should switch to a registry model, where backends register themselves (e.g. their name and the backend object responsible for creating new instances, etc.) and can then be created by calling something like createBackend(backendName)?

opti-mix on 17 Jul 2018

👍4

All 8 comments

Prior art

One responsibility of the Backend class is to provide a common hardware abstraction layer interface. Each supported target architecture provides a subclass of Backend that implements virtual functions which specialize compilation and execution to the particular target.

Old-school re-targetable compilers used #ifdefs sprinkled all over their source code to specialize the compiler for a target architecture. The compiler could only be built for one architecture at a time. This meant that unit tests would only test the architecture you built, and you couldn't even detect syntax errors if they happened to be invisible in your current configuration. This made continuous integration very painful.

LLVM can be re-targeted at runtime. It uses a Target class which has virtual functions that provide further information about a target architecture. LLVM's target configuration system is more complex than Glow needs.

The Cretonne code generator uses a TargetIsa trait which encapsulates both the target architecture and any compiler flags and settings that can affect the code generation. This means that the generated code is a pure function of the input IR and the TargetIsa instance. No secret command line flags can affect the code generation and cause hard-to-reproduce bugs.

Common to the LLVM and Cretonne designs is that their Target and TargetIsa objects are constant once they have been created. All the virtual methods are declared as const. This means that one target object can be reused for multiple (concurrent) compilations, and the state related to target configuration is clearly separated.
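
To make the "constant once created" idea concrete, here is a minimal sketch. TargetDesc, TargetSettings, ToyCPUTarget and lower() are hypothetical names invented for illustration; they are not LLVM, Cretonne or Glow APIs. The point is only that all settings are frozen at construction and every query is const, so the output depends on nothing but the input and the target object.

```cpp
#include <cassert>
#include <string>

// Hypothetical settings bundle: everything that may influence code generation.
struct TargetSettings {
  unsigned vectorWidth = 1; // Lanes per SIMD instruction.
  bool fastMath = false;    // Allow reassociation of floating point ops.
};

// Hypothetical immutable target description. Settings are copied in once and
// never change afterwards; every member function is const.
class TargetDesc {
public:
  explicit TargetDesc(TargetSettings settings) : settings_(settings) {
    assert(settings_.vectorWidth > 0 && "vector width must be positive");
  }
  virtual ~TargetDesc() = default;

  virtual std::string name() const = 0;

  // Toy "code generation": reads only its argument and the frozen settings,
  // so the result is a pure function of (op, *this).
  std::string lower(const std::string &op) const {
    return name() + ":" + op + "x" + std::to_string(settings_.vectorWidth) +
           (settings_.fastMath ? ":fast" : ":strict");
  }

private:
  const TargetSettings settings_;
};

// Hypothetical concrete target.
class ToyCPUTarget final : public TargetDesc {
public:
  using TargetDesc::TargetDesc;
  std::string name() const override { return "toycpu"; }
};
```

Because nothing in the object is mutated after construction, a single ToyCPUTarget could be handed by const reference to several compilations at once without locking.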

stoklund on 6 Jul 2018

❤4

Use cases

To inform the design, it's useful to look at a few use cases for Glow. These all assume that Glow is used as a JIT compiler, i.e. we're not concerned with cross compilation or saving compiled models to disk.

Parallel CPU inference

Running parallel inferences with a single fixed graph on a multicore CPU:

1. Compile graph once.
2. Execute compiled graph on many threads.

Parallel inference on NUMA CPU

Same as above, but running on a multi-socket NUMA server:

1. Compile graph once.
2. Make separate copy of weights for each socket.
3. Execute compiled graph on many threads per socket, using the weights local to the socket.

Pipelined inference on NUMA CPU

To save memory, we want to avoid multiple copies of the weights. Instead, we partition the graph and distribute it among the sockets. That way, each socket holds part of the weights.

1. Partition graph into multiple functions.
2. Compile each function once.
3. Set up multiple pipelines with semaphores, one per core in each socket.
4. Run multiple inference jobs through the pipelines in parallel.

Pipelined inference on two GPUs

Say the weights of our graph are too big to fit on one GPU's high-bandwidth memory, but we have two GPUs. We want to partition the graph into two parts that each fit on one GPU.

1. Partition graph into two functions.
2. Compile each function once.
3. Set up single pipeline transferring the intermediate output from one GPU to the next one.
4. Run inference jobs through the pipeline. Two jobs can be active at the same time.

Multiple graphs on single GPU

We want to run inference on multiple different graphs with low latency. The weights for all the graphs fit in the high-bandwidth memory (HBM) of a single GPU.

1. Compile each graph once.
2. Copy code and weights for all graphs to the GPU's HBM.
3. Run different types of inference jobs without needing to copy weights or code. Only input/output data is copied.
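
A rough sketch of the load-once, run-many split this use case implies. ToyDevice, load() and run() are hypothetical names, and the "device" is simulated with host memory; no real GPU API is used. The byte counter stands in for bus traffic so the difference between the two steps is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one accelerator with its own memory.
class ToyDevice {
public:
  using Handle = std::size_t;

  // Expensive step, done once per graph: copy the weights (and, in a real
  // backend, the code) into device memory and keep them resident.
  Handle load(const std::vector<float> &weights) {
    resident_.push_back(weights);
    bytesUploaded_ += weights.size() * sizeof(float);
    return resident_.size() - 1;
  }

  // Cheap step, done once per job: only the input crosses the bus.
  // The toy "graph" is a dot product of the input with the resident weights.
  float run(Handle graph, const std::vector<float> &input) {
    assert(graph < resident_.size() && "graph was never loaded");
    const std::vector<float> &w = resident_[graph];
    assert(input.size() == w.size() && "input shape mismatch");
    bytesUploaded_ += input.size() * sizeof(float);
    float dot = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i)
      dot += w[i] * input[i];
    return dot;
  }

  std::size_t bytesUploaded() const { return bytesUploaded_; }

private:
  std::vector<std::vector<float>> resident_; // Simulated device memory.
  std::size_t bytesUploaded_ = 0;            // Simulated bus traffic.
};
```

With two graphs loaded, jobs for either graph can be interleaved and each job only pays for its own input.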

Observations

We want the Backend design to be compatible with these kinds of use cases. This doesn't necessarily mean that all the backends can do all these things, but the design shouldn't prevent them.

The parallel CPU use case suggests that we need to distinguish between shared constant data (i.e., weights) and dynamic per-run data (inputs, outputs, and activations). If we want to reuse one compilation on multiple threads, dynamic data needs to be relocatable or stack-based.
The NUMA use case suggests that constant data also needs to be relocatable in some cases.
The GPU use cases need to separate copying weights and code to the device from executing a single inference job.
We don't want to hang on to temporary compiler data structures like the graph and IR after we have finished compilation. That wastes memory, and the IR can be large in some cases.

This all suggests that maybe variables should not belong to the Module along with the compiler IR. Such a change is not in scope for this issue, but it is worth keeping in mind when designing the Backend interface.

stoklund on 7 Jul 2018

👍1

As a first step, we can split the Backend state into three parts:

1. The hardware abstraction layer is a class with a number of const virtual functions. It contains no state other than configuration data, and it references no mutable state.
2. Temporary data used during compilation.
3. State representing a compiled function.

This first step does not address the issues with multithreading that the use cases bring up. A second refactoring step can handle this by distinguishing between a compiled function and a bound function which has fixed input and output locations.
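
The three-way split could look roughly like this. Only the names Backend and CompiledFunction come from this discussion; the signatures, CompilationScratch, and the Toy classes are assumptions made up for the sketch, and the "graph" is reduced to a list of node names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Part 3: state representing a compiled function. It outlives compilation.
class CompiledFunction {
public:
  virtual ~CompiledFunction() = default;
  virtual void execute() = 0;
};

// Part 2: temporary data used during compilation only (hypothetical stand-in
// for the graph and IR).
struct CompilationScratch {
  std::vector<std::string> loweredInstrs;
};

// Part 1: the hardware abstraction layer. All virtual functions are const and
// the class holds no mutable state.
class Backend {
public:
  virtual ~Backend() = default;
  virtual std::string name() const = 0;
  virtual std::unique_ptr<CompiledFunction>
  compile(const std::vector<std::string> &nodes) const = 0;
};

// Toy implementation, for illustration only.
class ToyCompiledFunction final : public CompiledFunction {
public:
  explicit ToyCompiledFunction(std::size_t numInstrs) : numInstrs_(numInstrs) {}
  void execute() override { ++runs_; }
  std::size_t numInstrs() const { return numInstrs_; }
  std::size_t runs() const { return runs_; }

private:
  std::size_t numInstrs_;
  std::size_t runs_ = 0;
};

class ToyBackend final : public Backend {
public:
  std::string name() const override { return "toy"; }

  std::unique_ptr<CompiledFunction>
  compile(const std::vector<std::string> &nodes) const override {
    assert(!nodes.empty() && "nothing to compile");
    CompilationScratch scratch; // Freed automatically when compile() returns.
    for (const std::string &node : nodes)
      scratch.loweredInstrs.push_back(name() + "." + node);
    return std::unique_ptr<CompiledFunction>(
        new ToyCompiledFunction(scratch.loweredInstrs.size()));
  }
};
```

Note that ToyCompiledFunction still mutates its own state in execute(), which is exactly the limitation the second refactoring step is meant to remove.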

stoklund on 9 Jul 2018

👍3

Phase 2: Execution environment

The initial incarnation of CompiledFunction above contains both the result of compilation and the state needed during execution, such as the input/output variables in the module. This means that a compiled function can't be run in two threads concurrently, for example.

We can separate these two kinds of data such that a single compilation can be reused for concurrent execution:

CompiledFunction owns the compiled code and possibly some constant data.
BoundFunction has bindings to specific input/output variable instances and it owns memory buffers that are mutated during execution, such as internal activations.

There can be multiple BoundFunction instances associated with a single CompiledFunction instance. This enables concurrent execution, whether on multiple threads or multiple hardware accelerators.

We can add a CompiledFunction::bind(...) method which returns a std::unique_ptr<BoundFunction>. Compare to the onnxSetGraphIO function; cc @rdzhabarov. Then move CompiledFunction::execute() to BoundFunction::execute().
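
A minimal sketch of what bind() might look like. The parameter list of bind() is left open above, so the raw input/output pointers here are an assumption, as is the toy computation; only the class names and the bind()/execute() split come from the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class CompiledFunction;

// Bindings to specific input/output buffers plus privately owned memory that
// is mutated during execution. Must not outlive its CompiledFunction.
class BoundFunction {
public:
  BoundFunction(const CompiledFunction &fn, const float *input, float *output);
  void execute();

private:
  const CompiledFunction &fn_;
  const float *input_;
  float *output_;
  std::vector<float> activations_; // Per-binding scratch, never shared.
};

// Owns the result of compilation: here just constant weights.
class CompiledFunction {
public:
  explicit CompiledFunction(std::vector<float> weights)
      : weights_(std::move(weights)) {}

  // const: binding does not change the compiled function, so any number of
  // bindings can coexist.
  std::unique_ptr<BoundFunction> bind(const float *input, float *output) const {
    assert(input && output && "bindings must not be null");
    return std::unique_ptr<BoundFunction>(
        new BoundFunction(*this, input, output));
  }

  const std::vector<float> &weights() const { return weights_; }

private:
  const std::vector<float> weights_;
};

inline BoundFunction::BoundFunction(const CompiledFunction &fn,
                                    const float *input, float *output)
    : fn_(fn), input_(input), output_(output),
      activations_(fn.weights().size()) {}

// Toy "model": output[i] = max(0, weight[i] * input[i]).
inline void BoundFunction::execute() {
  const std::vector<float> &w = fn_.weights();
  for (std::size_t i = 0; i < w.size(); ++i) {
    activations_[i] = w[i] * input_[i];
    output_[i] = activations_[i] > 0.0f ? activations_[i] : 0.0f;
  }
}
```

Two BoundFunction instances created from the same CompiledFunction touch disjoint mutable memory, so in this sketch they could be executed from different threads without synchronization.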

An unresolved issue is how we handle multiple hardware devices. It seems that a BoundFunction should also be associated with a specific device.

stoklund on 11 Jul 2018

👍2

Question: Should we consider the idea of a dynamic backend registration in the scope of the Backend interface re-factoring?

opti-mix on 17 Jul 2018

👍4

Good idea, registerBackend in global constructor?
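
For reference, a self-registering backend along these lines might be sketched as follows. The names registerBackend and createBackend(backendName) come from the comments above; BackendFactory, backendRegistry(), the return types and the ToyBackend are assumptions for illustration, not existing Glow code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

class Backend {
public:
  virtual ~Backend() = default;
  virtual std::string name() const = 0;
};

using BackendFactory = std::function<std::unique_ptr<Backend>()>;

// A function-local static is constructed on first use, so the map is ready
// even when registerBackend() is called from another global constructor.
inline std::map<std::string, BackendFactory> &backendRegistry() {
  static std::map<std::string, BackendFactory> registry;
  return registry;
}

// Returns false if a backend with this name was already registered.
inline bool registerBackend(const std::string &name, BackendFactory factory) {
  assert(factory && "factory must be callable");
  return backendRegistry().emplace(name, std::move(factory)).second;
}

// Returns nullptr for unknown names.
inline std::unique_ptr<Backend> createBackend(const std::string &name) {
  auto it = backendRegistry().find(name);
  if (it == backendRegistry().end())
    return nullptr;
  return it->second();
}

// Everything below would live in the backend's own .cpp file, so adding a
// backend would not require touching a central enum.
class ToyBackend final : public Backend {
public:
  std::string name() const override { return "Toy"; }
};

// Global object whose constructor performs the registration before main().
struct RegisterToyBackend {
  RegisterToyBackend() {
    registerBackend("Toy",
                    [] { return std::unique_ptr<Backend>(new ToyBackend()); });
  }
};
static RegisterToyBackend registerToyBackendInstance;
```

One known caveat with this pattern: when a backend is built into a static library, the linker may drop an object file that nothing references, and the registration then never runs.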