See this link for the discussion repo.
This discussion started from https://github.com/dmlc/minpy/issues/129 with @soumith. THC is the tensor library that backs Torch. I opened this issue in the MXNet repo so more developers can see it.
First of all, it is possible to reuse operator libraries between frameworks, for example
It is always interesting to see interchangeability happen: for example, scheduling pytorch operations in mxnet's async engine, or running mxnet's declarative API to directly share data with pytorch's array.
However, there are some engineering obstacles to doing so. I would like to explain what these obstacles are, in the hope that this motivates the community to move forward and make this easier.
An operator can mean many things; here are some basic components of what an operator is:
Why does such coupling prevent reuse? There are two reasons.
To resolve this problem, an operator library design should enable operators that accept user-managed memory resources: when possible, do not introduce an allocator or resource management, but give hints to the user (CuDNN's workspace-requirement query eliminates the need for an internal memory allocator).
From this point of view, CuDNN and cuBLAS are good examples. THC is nice, but still encapsulates a memory allocator (which is sometimes needed for dynamic operators).
The second obstacle is mainly the lack of a common operator interface. This is a problem of CuDNN and THC that prevents reuse. Take CuDNN for example: each CuDNN API is a C function with its own interface, so to adopt an operator there needs to be one (or more) adapting function per operator.
Consider instead a unified operator interface (the following is a mock design), where each TBlob is a reference to the data fields and shape, and every function gets registered to the registry with its name:
using FCompute = std::function<void (
    array_view<TBlob> ins, array_view<TBlob> outs, map kwargs, stream stream)>;
Then it takes only one function to extract and reuse all operators and automatically expose them to the front end. In MXNet, this even directly generates the symbolic counterpart from the same imperative operator, if a gradient is provided.
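To make the mock concrete, here is a minimal sketch of how a single operator could be written against such a signature and picked up generically. All the types and the registry below are hypothetical stand-ins (kwargs and stream are omitted for brevity):

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// hypothetical stand-ins for the mock design above
struct TBlob { float* data; std::vector<size_t> shape; };
using FCompute = std::function<void(const std::vector<TBlob>& ins,
                                    const std::vector<TBlob>& outs)>;

// a name -> function registry; a framework imports every entry with one loop
std::map<std::string, FCompute>& OpRegistry() {
  static std::map<std::string, FCompute> reg;
  return reg;
}

// one elementwise operator written once against the unified signature
void ElementwiseAdd(const std::vector<TBlob>& ins, const std::vector<TBlob>& outs) {
  size_t n = 1;
  for (size_t d : ins[0].shape) n *= d;
  for (size_t i = 0; i < n; ++i)
    outs[0].data[i] = ins[0].data[i] + ins[1].data[i];
}

// registration on the library side; any framework can enumerate OpRegistry()
bool reg_add = (OpRegistry()["elementwise_add"] = ElementwiseAdd, true);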
There is always a flip side of the coin. Assume that we go with a unified operator interface; as a matter of fact, that is what MXNet, TensorFlow and Caffe have done. The problem now becomes: what should the interface look like? One trap that framework designers always fall into is thinking we need one interface that rules them all.
Since one interface rules them all, we want to support all possible operators. What about the ones that need runtime memory allocation? Maybe add a memory allocator to the interface. What about the ones that are asynchronous? In the end, the interface has to include the memory allocator and the scheduling module in some way,
and that brings back the "coupled operator data structure components" problem. The operator interface becomes deeply coupled with the rest of the framework and is not reusable.
Can we get the best of both worlds: have as few data structures and interfaces as possible, while still not coupling to the allocator and scheduler? I think the answer is yes, and we need to step away from the ideal of one interface that rules all operators.
I can categorize the operators roughly into three categories:
If we design for a general operator interface, the answer will usually look like type 3. However, type 1 and type 2 dominate 90%+ of the major operators we are using.
If we design one operator interface for each type, this problem is solved, so that frameworks can pull in and interact with each type in their own way.
It is much easier to do things like static memory planning if type 1 and type 2 are explicitly introduced. This is the additional layer of wrapping on top of THC and CuDNN that is lacking so far.
A registry system like NNVM's could come in very handy to easily register this information and have it pulled out by the libraries.
I have always hoped that there would be a minimum set of operator interface standards in C++ that can be shared across libraries. I think we have a good idea of what the solution looks like. While most systems tend to become opaque and coupled, I think this kind of transparency can help the community evolve in a healthy way. That being said, it always takes effort to make these things happen: an open discussion on what the interfaces should be, and commitment from framework builders. I would really love to see this happen, and that is why I spent more than an hour writing this.
Unfortunately, most frameworks already have a "good enough" collection of operators, so a unified operator interface will contribute little to each framework in terms of usability in the short term. Naturally this would be given lower priority. That is why commitment is needed to bring this out for the longer-term benefit.
I also had similar discussion with @Yangqing and @ajtulloch before.
Great initiative! I think a lot of components can be shared if we refactor them in simple APIs. Would love to work together on this front.
The fundamental issue with having a unified interface is that it needs full buy-in. Anything short will make it a partial or full failure.
For this reason, I think what the CuDNN team did is actually correct.
For this reason, I think focusing on simplicity, reducing the friction of buy-in, and allowing a way to have partial buy-in will make more folks participate.
So, I think we should define:
Keeping it stupid and simple like this is the path of least resistance that will get us forward.
I don't feel confident that defining and maintaining a common registry will practically break ground, especially because it has a huge initial overhead for each of the framework writers (who are all busy with their own problems).
What do you guys think?
What I proposed only works for stateless operations initially, but I think that's where we should start. Defining statefulness right now will lead to disagreements and complications.
I think stateless is a good starting point (essentially the type 1 operator). I would, however, like to have a small set of unified interfaces in some way, and a registry that is decentralized.
So the scenario I hope for looks like:
#include <common_nn_op.h>

void InitMXNetOps() {
  for (auto reg : Registry::ListBinaryOps()) {
    // register is a reserved word in C++, so the framework-side hook needs another name
    RegisterOp(reg->name, reg->function);
  }
}
This enables one function to import all the operators that are provided in the operator library. It would indeed require a bit of registry code on the operator library side, for example a wrapper around THC or the library @soumith suggested.
This reduces the effort of importing and adapting new operators. However, the interface indeed needs to be simple enough, like one that contains only a few tensor data structures.
As I mentioned earlier, I do not agree on one unified operator interface, but I would like to see whether there are a few candidates we can agree on. For example, binary operators:
void BinaryOp(const Tensor& lhs, const Tensor& rhs, Tensor* out);
void BinaryOpShape(const Shape& lhs, const Shape& rhs, Shape* out);
The idea is to reduce the overhead of adaptation code, which would otherwise be needed for each operator and makes it harder for framework builders to opt in.
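As a sketch of what the library-side registry entry for such binary operators might look like (every name below is hypothetical, meant only to show the shape of the adapter; a framework would then import everything with one loop, as in the InitMXNetOps example above):

#include <cstddef>
#include <string>
#include <vector>

// hypothetical stand-ins for the shared data structures
struct Shape { std::vector<size_t> dims; };
struct Tensor { float* data; Shape shape; };

using BinaryOpFn    = void (*)(const Tensor& lhs, const Tensor& rhs, Tensor* out);
using BinaryShapeFn = void (*)(const Shape& lhs, const Shape& rhs, Shape* out);

struct BinaryOpEntry { std::string name; BinaryOpFn run; BinaryShapeFn infer_shape; };

std::vector<BinaryOpEntry>& BinaryOpRegistry() {
  static std::vector<BinaryOpEntry> reg;
  return reg;
}

// the operator library wraps each of its kernels once and registers it
void AddOp(const Tensor& lhs, const Tensor& rhs, Tensor* out) {
  size_t n = 1;
  for (size_t d : lhs.shape.dims) n *= d;
  for (size_t i = 0; i < n; ++i) out->data[i] = lhs.data[i] + rhs.data[i];
}
void AddShape(const Shape& lhs, const Shape& /*rhs*/, Shape* out) { *out = lhs; }

void RegisterBinaryOps() {
  BinaryOpRegistry().push_back({"add", AddOp, AddShape});
}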
The easiest way of doing this is copy-pasting kernels, which we have been doing for a while.
A BLAS-like interface is a good idea. But this is only worthwhile if the operator is complicated enough (i.e. longer than the code required to call it...). Sharing elementwise add probably isn't necessary.
Having a TensorDescriptor-like data structure further complicates this, since you need to spend 20+ lines constructing these descriptors.
Unified operator interface is in theory the right way to do things, but obviously we all think our own interface is the best interface. So not sure if this will go anywhere anytime soon.
Examples of operators that are worth sharing: broadcast-reduce ops, embedding.
One thing we can do without having to agree to anything is some "principled copy pasta" wiki page where we share operator implementations without necessarily using the same interface.
An easy pairwise testing framework for verifying correctness on top of that would also be good.
Also, instead of sharing compiled code, a header-only library where all data structure access and array indexing go through macros that can be redefined for each framework is much easier.
For example, mxnet doesn't support strides while torch does, so indexing works differently. This can be solved by defining different macros, as sketched below.
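A minimal sketch of what such framework-redefinable macros could look like (illustrative only):

// strided framework (e.g. Torch-style): indexing goes through explicit strides
#define TENSOR_AT_2D(t, i, j) ((t).data[(i) * (t).strides[0] + (j) * (t).strides[1]])

// a compact (non-strided, MXNet-style) framework would instead define:
// #define TENSOR_AT_2D(t, i, j) ((t).data[(i) * (t).shape[1] + (j)])

// shared header-only kernels are written only against TENSOR_AT_2D and are
// recompiled inside each framework with its own definition of the macro.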
I see two ends to the current discussion.
*Build one library that everybody will use*
This can use a simple common data structure, with each framework calling into the same functions. @soumith 's proposal is a better solution for this end: as long as the data structure is agreed on, there is no problem calling into the functions.
The problem is that it is hard to convince developers to fully commit to a shared core library.
*Being able to import operators from other libraries*
My major concern with doing this is the overhead of importing. That is why some simple common interface, along with the data structure, might be desirable: the cost of importing is then not an effort per operator, but one effort for importing all the operators that all frameworks currently define.
*What is the set of interfaces*
To be clear, I do not think MXNet's interface (nor the interface of any existing framework) is the best way to do operator sharing. But I do think there is a set of cleaner, minimal interfaces that we might agree on, just like we can agree on the data structures.
Interface-wise, if I may: the cudnn-type interface is a good start, and this is what I have been telling other vendors too. If there are implementations for e.g. OpenCL, OpenGL, Vulkan etc., this will make the frameworks' life much better.
I think the interface includes two parts: (1) the function routines, (2) the tensor data structure. For example, I remember THC has support for stride and offset, which is lacking in MXNet. If we use cudnn's way, then we need to include all this information in function arguments, which may be a problem for future extension.
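For concreteness, here is a small sketch of the addressing rule that stride and offset support implies (assuming a THC-style strided layout; the function name is illustrative):

#include <cstddef>

// element position of index (idx[0], ..., idx[ndim-1]) in a strided tensor with an offset;
// a compact tensor is the special case where strides are the row-major products of the
// trailing shape dimensions and offset is 0.
size_t ElementPosition(const size_t* strides, const size_t* idx, size_t ndim, size_t offset) {
  size_t pos = offset;
  for (size_t d = 0; d < ndim; ++d) pos += idx[d] * strides[d];
  return pos;
}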
It makes sense to start with a non-strided version, I think - Caffe/2 does not use stride either and assumes a 256-byte alignment for storage.
RE cudnn - I actually think that having a pure C interface is good for extension, since it would make cross-language integration much easier. For example, Python C extensions don't have very good C++ support.
(Oh by the way, pybind11 is awesome.)
I am all for a C interface for a stable ABI, on the other hand. As a matter of fact, almost all dmlc projects interface through a C API.
It is always possible to have an auxiliary c++ registry if we can categorize the functions, and return the function handles.
Is anybody interested in a Tensor API?
You can find the source code of an initial AMD implementation of the standard.
My concern with the Tensor API (and there are a few of its kind) is that they are opaque: not only is this a standardization of the Tensor, it is also a standardization of a graph-based DL framework.
Personally, I think what we really need is a separation of these things (Tensor data structure, the computation, memory management, scheduling): adopt the unix philosophy, do one module transparently and have it interact with the others, specifically:
I think what we are discussing here is not really how to support XYZ features (that is the job of deep learning frameworks), but how to come up with a minimum module that can be shared across frameworks.
As I may quote
"Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away"
It seems there is enough interest in this issue. What I may suggest is for some of us to post strawman designs of the Tensor structure and possible operators, and we can comment from there; this will help things move toward a concrete direction.
I largely agree with @soumith and @Yangqing on a C-language structured minimum tensor object (maybe design a preferred compact layout, with optional stride support).
How many hardware vendors are involved in the OpenVX neural network extension? Nvidia is in the team list, as you see, and had a strategy to implement an OpenVX computing graph over CUDA (see VisionWorks). Also Samsung, Intel, AMD and ARM are on the committee.
In tiny-dnn, which is strongly oriented toward c++1x features, we have also evaluated the array_ref proposal, but I think many here are not interested in the different C++ standardization efforts.
I also like the idea. Not sure how much time we can put into this. One thing that could help all teams find the time would be to identify one feature we don't have and others have. If we can tell ourselves "we need this, there is a relatively easy way to get it, and it is a long-term solution", then it will probably get done.
Otherwise I'm pretty sure this will get too low in most people's priority stack in a few weeks.
What do you think of trying to do that now?
About the registry of function pointers: it isn't enough. A big part of the time spent wrapping a lib is in error handling and other stuff like this. A registry would need to tell the signature of the function, etc. So it will get complicated and not used by many. So I think a blas/cudnn-like interface is best.
Ping @lamblin @abergeron @bartvm so they know about this.
@nouiz What is your opinion of the openvx kernel/node api?
I don't think it is that interesting since it doesn't have any support for looping or branching.
Also it seems heavily oriented towards image processing.
I think anything that tries to manage memory in an opaque way won't get much adoption.
The challenge of a blas like interface is how to minimize wrapping code. Currently you need to write hundreds of lines of code to call convolution with cudnn. For small ops it's completely not worth it.
A C++ interface using templates can still have a blas like philosophy. That sounds more promising.
This is the graph formalism and this is the neural network extension overview. It is still provisional and we could work upstream if we want.
/cc @naibaf7
If anybody wants to take a look, we have an internal header-only tensor and tensor storage under construction. Any feedback is appreciated.
Hi! I like this initiative. Simple C signatures seem fair for everybody. Having this as a header-only project could make it more portable and easy to plug into whatever framework. For the design, maybe we can start by sharing some UML prototypes.
@bhack Had a quick look. One thing it's missing is the ability to wrap around external memory.
Also, host data and device data should be in separate tensors. Not every tensor has a host mirror.
@piiswrong yes, both are on the roadmap.. we can already cover this with the upstream CLCudaAPI header using a start and end iterator of a c++ container, and we have distinct BufferHost and Buffer concepts.
What would be a minimal MVP for a Tensor in the scope of this issue? Is a Tensor interface the first step of a plan? Will each framework need to handle conversion from/to this "rosetta stone" Tensor?
Here is what I would recommend:
- A minimum C style Tensor object, which most functions wrap into, for example
typedef struct {
  void* data;
  size_t ndim;
  size_t* shape;
  size_t* strides;
} CTensor;
- Optionally, a header only C++ Tensor object that provides automatic conversion to/from the C tensor, which might provide some utilities like shape management (maybe not memory management)
class Tensor {
 public:
  operator CTensor() const;
};
As long as the operator CTensor() is provided, you can likely call a C API with the same signature without doing manual conversion.
I think a minimum tensor object should have support for strides.
Is there any reason for a pure C interface? I think most people will be happy with C++.
The problem with C is you have to encode things like data type with a flag field instead of template arguments.
So how will ops access the tensor's associated memory (i.e. device, framework, context)?
I think the MVP for each framework, at least for Theano for now, is to have this interface plus one operation we could reuse.
For example, CTC is in an external repo, not in Theano. If the source of CTC offered this interface, then using it while moving it into Theano would be a minimal MVP. But as it is already available, another operation would be better.
I vote for C++ and strides
Another thing we need to decide is whether ndim and dtype are template arguments or fields.
It depends on whether you want to switch over the type outside or inside the API.
The main advantage of C API is ABI stability. There is no standard C++ ABI, which means the compiled library can depend on compiler version even on the same platform.
For example, it is quite common to compile CUDA code with MSVC on Windows, while linking that library from MinGW if you are building an R binding (because R's Windows build is on MinGW). This is impossible if you use C++. If the code goes C++, essentially only source can be distributed instead of binaries, which might make it a bit vendor-unfriendly if they want to distribute binaries (like CuDNN).
On the other hand, c++1x is great and I think it would be great to have a header only library that wraps the C API, which allows simpler syntax.
In terms of context, device and resources, there are usually two ways.
The second way might be cleaner, but does have the overhead of fetching a TLS entry on each function call, which is negligible (at the microsecond level). As a matter of fact, most runtime APIs like CUDA use TLS to make calls thread-safe.
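A minimal sketch of that thread-local-context variant (the names here are hypothetical, not an existing API):

// hypothetical per-thread context the framework sets before dispatching an operator
struct RuntimeContext {
  int device_id = 0;
  void* cudnn_handle = nullptr;  // framework-owned resource handles
};

inline RuntimeContext& ThreadLocalContext() {
  thread_local RuntimeContext ctx;  // one TLS lookup per call
  return ctx;
}

// framework side: set once per dispatch; operator side: read inside the call
inline void SetCurrentDevice(int device_id) { ThreadLocalContext().device_id = device_id; }
inline int  CurrentDevice()                 { return ThreadLocalContext().device_id; }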
I think having strides is good, though many functions may only support the non-strided version, with a failure flag asking the framework to call MakeContiguous first.
The danger in a potential MakeContiguous() is that an allocator gets involved (or the library has its own private workspace), which needs a bit of careful consideration.
I agree with @edgarriba.
But I think in general operators should only be shared like a BLAS, leaving everything else up to the DNN framework. Stride and format of tensors should be kept open, but the DNN operators should specify which formats they support, like BLAS libraries do.
For the memory interface of complicated operators, I think many of them could be simpler: without an allocator, instead having two functions, which is actually what CuDNN does:
- A workspace requirement function that has the same signature as the execution function, but allows data fields to be nullptr, and returns the workspace requirement.
- An execution function that takes a workspace pointer as an additional argument.
This cannot cover all the complicated operators (there are some that depend on the content of the data) but already includes most cases. This removes the need for a memory allocator or lambda function.
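A sketch of what that two-function pattern could look like against the shared tensor struct (illustrative signatures, not CuDNN's actual API):

// 1) query phase: same arguments as execution, data pointers may be nullptr;
//    returns how many workspace bytes the execution will need at most
size_t ConvForwardWorkspaceBytes(const CTensor* in, const CTensor* filter, const CTensor* out);

// 2) execution phase: the caller passes workspace it allocated (or planned statically)
int ConvForward(const CTensor* in, const CTensor* filter, CTensor* out,
                void* workspace, size_t workspace_bytes);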
@tqchen It gets messy very fast with the multi-device operators that are coming up, or if workspace memory needs to be consolidated between multiple operators to save memory. I think it will be hard to get around the memory allocator duality and the additional life-cycle functions for stateful operators.
@naibaf7 I cannot speak for the multi-device operators. But the workspace consolidation problem can be handled easily from the framework side. As a matter of fact, it can even be done statically when a computational graph is available, without relying on dynamic memory allocation
@tqchen If the workspace memory for an operator is fixed maybe, but it's often not; also reshaping a network or operators that switch and autotune algorithms can have dynamic memory requirements.
I wouldn't want to take this possibility away for future operators that might come up.
The assumption is that the workspace for an operator is fixed for a fixed input tensor shape, while the requirement can be re-calculated when the shape changes. The requirement can be a rough estimate of the maximum space needed, as CuDNN does.
This can always fall back to the dynamic memory approach on the caller side, but that leaves the decision to the user of the library.
What's the argument against a lambda allocator?
It's cleaner and more flexible.
It somewhat prevents the chance of static allocation of the workspace. The workspace requirement interface is more restrictive, and enables the two-phase strategy (allocation then execution).
I think the argument is not against using an allocator when necessary - there are some cases where it is unavoidable - but instead to provide three categorizations.
The former ones can always be relaxed to the latter ones. In general, putting an operator into the most restrictive type it fits leaves the user the chance to decide what to do with it.
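Read that way, the three signature families might look roughly like this (a sketch of my reading of the thread, not an agreed interface):

// (1) fixed-resource ops: inputs and outputs only, no extra memory needed
int UnaryOp(const CTensor* in, CTensor* out);

// (2) workspace ops: requirement queried up front, workspace passed in by the caller
size_t OpWorkspaceBytes(const CTensor* in, const CTensor* out);
int OpWithWorkspace(const CTensor* in, CTensor* out, void* workspace, size_t bytes);

// (3) fully dynamic ops: only these need an allocator callback from the framework
typedef void* (*AllocFn)(size_t bytes, void* framework_state);
int DynamicOp(const CTensor* in, CTensor* out, AllocFn alloc, void* framework_state);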
@tqchen Yup, I agree with that last statement of yours. Those that do require an allocator can also be kept simple: they can use their internal allocator and destructor for device memory IF the framework does not care about having full memory-management control of the devices.
For more restricted operators, the two-phase workspace configuration and execution can be incorporated in the life-cycle functions.
@tqchen nice! go ahead and create a repo with simple C structures so that we can start to iterate.
I can start with the c++ wrapper once the baby starts to walk
I think a c++ wrapper on top of a pure C interface is overkill.
This library is intended to be used through DL frameworks, not standalone. You wouldn't create a frontend binding just for this library, which would essentially be reinventing Torch. The top priority for it should be easy compiling/linking. If possible it should be a header-only library with a c++11 interface from the ground up. Although if we want to support OpenCL it's hard to make it header-only.
This library is intended to be used through DL frameworks, not standalone.
@piiswrong It is? Apart from the name of the issue :D
A tensor should have the following attributes:
ndim,
dtype,
device (cpu/gpu/ocl),
data_ptr,
shape,
stride
We need to decide which ones go into fields and which go into template arguments. If ndim is a template argument, we enjoy the benefit of on-stack allocation of shape/strides. The downside is that callers need to switch over ndim.
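A small sketch to make that trade-off concrete (hypothetical, just contrasting the two choices):

#include <cstddef>

// ndim (and dtype) as template arguments: shape/strides live on the stack,
// but callers must switch over ndim to pick the right instantiation
template <typename DType, int ndim>
struct StaticTensor {
  DType* data;
  size_t shape[ndim];
  size_t strides[ndim];
};

// ndim/dtype as fields (as in the C struct proposals): a single runtime type,
// but shape/strides are pointers and every operator checks ndim/dtype at runtime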
@edgarriba I hope so. Otherwise it's reinventing Torch, which doesn't make much sense. I'm hoping we can solve, or at least mitigate, the problem that there is too much redundant work across DL frameworks, not add to it ;)
Honestly speaking, I would vote for a C interface, not because I personally like C, but because over the years, I have seen a lot of opinions around C++. Tianqi's argument about ABI compatibility is one prominent reason for using C as a compatibility layer. Also, languages such as Python and Torch would need a C FFI in any case, and having C++ on top of C is definitely better than having C on top of C++.
And the heavy use of templates, especially template metaprogramming (I am sure we are not talking about this here, but FWIW) is also pretty controversial. One can find a lot of similar cases in the argument for and against Eigen actually.
It depends on the purpose of the library.
If you want to use it standalone, then having a C interface is better.
But if you want to use it through frameworks, C++ is much more convenient, because you can pass the Tensor object around in your framework and eventually replace your own data blob with a standard Tensor. Using a C interface means you have to convert to/from your data blob and Tensor repeatedly in each operator. It's slow and needs a ton of wrapper code. For small operators it's not worth doing.
Take cub (https://github.com/NVlabs/cub) as an example: it's a header-only library with a c++ interface.
Torch's internal THC also uses C++. It's only the interface that's pure C.
Anyway, just my 2 cents :)
I also don't like template metaprogramming, because it's a pain to maintain. Very few people can understand it.
But simple template arguments like template <typename Device, int ndim>
are OK.
I'm against C++ for the ABI. Let's stick to C and build a C++ wrapper on top if needed. Hourglass interfaces are great.
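A minimal sketch of such an hourglass layering, reusing the CTensor struct proposed above (all function and type names here are hypothetical):

// C layer: the stable ABI that the shared library actually exports
extern "C" int CTensorBinaryAdd(const CTensor* lhs, const CTensor* rhs, CTensor* out);

// C++ layer: header-only sugar compiled inside each framework; it adds nothing
// to the ABI, it only converts and forwards
struct TensorRef {
  CTensor handle;
  operator const CTensor*() const { return &handle; }
  operator CTensor*() { return &handle; }
};

inline int Add(const TensorRef& lhs, const TensorRef& rhs, TensorRef* out) {
  return CTensorBinaryAdd(lhs, rhs, *out);  // implicit conversions do the plumbing
}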
Just to add more info on the ABI compatibility issue.
A resource for Hourglass
Just to make sure every one is on the same page:
I agree with @piiswrong. Probably an agreement on these kinds of general boundaries is required.
This reminds me of my first boss, but this is great.
Here is my opinion on @piiswrong 's points. To summarize, I think the advantage of such a shared belief is not only sharing between projects, but also encouraging others to help.
My guess is that wherever there is sharing, there are interface and compatibility issues, and we always need to consider what the minimum interface out there is.
I agree with @piiswrong that it might be helpful to have everyone agree on certain things before we proceed. One way to do so is for each of us to post a code snippet of what the data structure and interface should look like, then we summarize the points that need to be taken.
Once we have enough agreement on what the initial data structure and function signatures look like, we can open a repo and iterate from there.
Here is my version of a strawman design, based on my summary of the majority opinion here:
typedef struct {
  int device_id;
  int device_type;
} CContext;

typedef struct {
  // alignment can be checked directly from the address
  void* data;
  size_t ndim;
  size_t* shape;
  // can be nullptr, indicating the compact version.
  size_t* strides;
  // device
  CContext ctx;
  // data type
  int dtype;
} CTensor;

// utilize thread local storage to get the last error encountered.
const char* GetLastError();
// used by the implementer to set the error message.
void SetLastError(const char* msg);

// C API returns 0 on success, -1 on failure,
// -2 indicating contiguous input is needed, asking the user to convert before calling again
int BroadCastAdd(CTensor* x, CTensor* y, CTensor* out);
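A caller-side sketch of how a framework might honor that error convention (MakeContiguous and ReportError stand for whatever the framework provides internally; both are hypothetical helpers):

int CallBroadCastAdd(CTensor* x, CTensor* y, CTensor* out) {
  int ret = BroadCastAdd(x, y, out);
  if (ret == -2) {                    // op only handles contiguous input
    CTensor xc = MakeContiguous(x);   // hypothetical framework-side copy helper
    CTensor yc = MakeContiguous(y);
    ret = BroadCastAdd(&xc, &yc, out);
  }
  if (ret != 0) {
    ReportError(GetLastError());      // message set by the op via SetLastError
  }
  return ret;
}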
Do we need to handle non-contiguous cases?
@bhack I would rather not in most cases, and ask the user to get a contiguous version before calling (I guess @Yangqing shares the same opinion), but I think the majority opinion here wants stride as an argument.
Suppose you have a c++ interface and a source release from vendors, and now you need a new simple unary op: you can easily do it by defining a new inline function and passing it as a template argument.
But if you have a C interface and a binary from the vendor, there is no way to do it yourself.
I think we can push for source releases from vendors. Nvidia and Intel are already doing this with mkldnn and cub.
On the c++ side I was also thinking about what kind of limits we could have with a plugin-like cross-platform solution in c++.
Yeah, I agree with TQ that we prefer the non-strided version, but if strides are better for performance or other reasons, it is not a hard constraint - we will need to stride stuff sometimes anyway.
Per Eric's question about C++, I suspect we will never have the vendors give us all the source code for the high performance part :p
@Yangqing
But it should be easy in most of the cases to write the compatible wrapper around the high performance code... nice to see you here btw. :)
We can have both versions: strided and optionally non-strided for each function.
Let's see how much source they are willing to give out @ap-hynninen @glingyan @zhenlinluo
What is the device type and id in cases like https://github.com/CNugteren/CLCudaAPI/issues/11 and generally with unified memory devices?
@bhack In my understanding, UMA does not prevent you from having a primary device id; it just permits operators to take input from different device_ids and launch on a certain device.
The same thing goes for OpenCL kernels. Although OpenCL creates the illusion that a buffer can belong to any device, having a primary device id per memory is still helpful. The fact is that the underlying driver will create mirror copies on certain devices, so having a tensor primarily operate on one device will certainly have benefits (in terms of scheduling etc.).
@tqchen How should offsets inside memory objects/tensors be provided? What if multiple tensors are packed inside one OpenCL buffer object?
I like @tqchen 's structs and APIs.
@naibaf7 *data can be an offset pointer?
Yeah, an offset pointer would be great, and if I may, can we enforce things to be aligned to like 256 or 512? That might make things maximally compatible with multiple optimization mechanisms.
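Checking such an alignment requirement is cheap on the framework side (for raw CPU/CUDA pointers; opaque cl_mem handles would need a different path). A minimal sketch:

#include <cstdint>

// true if the data pointer satisfies a 256-byte alignment requirement
inline bool IsAligned256(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 256 == 0;
}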
To address @naibaf7 's concerns, and to take into account cleaner Tensor views, here's what I propose (we do this in Torch already):
typedef struct {
  int device_id;
  int device_type;
} CContext;

typedef struct {
  void* data;   // data pointer
  CContext ctx; // device
  int dtype;    // data type
  int refcount;
} Buffer;

typedef struct {
  Buffer data;
  int offset;      // offset in buffer
  size_t ndim;
  size_t* shape;
  size_t* strides; // can be nullptr, indicating compact version.
  int refcount;
} CTensor;
This way two problems are solved:
In Torch, we call them Storage and Tensor, but whatever naming is more standardized...
I'm not super particular about this change, just proposing it based on feedback.
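A sketch of how the Buffer/offset split covers views without copying, assuming a compact (non-strided) 2-D row-major tensor and the struct above (SliceRows and the caller-owned view_shape are purely illustrative):

// view rows [begin, end) of src: both tensors share the same Buffer,
// only the offset and shape metadata differ
CTensor SliceRows(const CTensor& src, size_t begin, size_t end, size_t view_shape[2]) {
  CTensor view = src;                                      // same Buffer, dtype, device
  view.offset = src.offset + (int)(begin * src.shape[1]);  // element offset into the buffer
  view_shape[0] = end - begin;                             // fewer rows, same row length
  view_shape[1] = src.shape[1];
  view.shape = view_shape;                                 // shape storage owned by the caller
  return view;
}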
@soumith Okay, however that's not going so nicely with OpenCL right now. Maybe a separate offset parameter should also be considered. EDIT: Too late, I like your updated suggestion :).
@Yangqing Should such alignment requirements be enforced or should there be metadata for operators that can specify optimal and necessary workspace conditions?
A tensor shouldn't own memory since it's only a wrapper around memory managed by the framework. So you won't need an offset.
You would need an offset even if the Tensor doesn't own memory. The Tensor can only be constructed on a slice of a larger Buffer. This is what @naibaf7 is saying we should cover, and I know from Torch's experience that this comes up very often.
@piiswrong I think that concept is also maintained by what @soumith says. Basically the CTensor and eventually CMatrix or CVector structs carry metadata of a tensor/matrix/vector, one of which is a CBuffer struct that specifies what memory from the host framework is being used.
CMatrix and CVector could also be used to specify dimension reduced areas of a tensor in order to simplify the use of BLAS libraries alongside the new DNN operators.
The operators can then internally combine the information, such as checking memory requirements and passing buffer+offset into the kernel (OpenCL) or compute the unified memory pointer (CUDA, CPU) and execute the compute kernel.
refcount is not needed if memory container is managed by the framework. Note that you can always add it back and maintain compatibility by sub-classing Tensor. If memory is not managed, maybe Buffer is also not needed (only have a Tensor with possible offset).
@piiswrong offset is useful for things like OpenCL, where the cl_mem object is opaque and you cannot do explicit addressing.
I think that we are basically converging on a solution. But with clSVMAlloc, what device id do we set?
@bhack I think this doesn't need to be of concern for the interface. That kind of memory can be associated with multiple CBuffer/CTensor objects of multiple devices, if the host framework knows that this is safe to do.
@naibaf7 OK, but then isn't the device id information more oriented toward the "executor"?
And what interfaces do you propose for constructors? Which are required and which optional? Besides, do we want to support tensors with dynamic size or just static?
FYI, in case you want to start to play I've setup this https://godbolt.org/g/YMnxHB
I think the data offset @soumith proposed looks good. Here is one more modification I propose (also incorporating @Yangqing's comment about the requirement of data alignment):
typedef struct {
  int device_id;
  int device_type;
} CContext;

typedef struct {
  void* data;      // data pointer (or cl_mem handle in OpenCL), aligned to 256
  CContext ctx;    // device
  int ndim;
  int dtype;       // data type
  size_t offset;   // offset in buffer
  size_t* shape;
  size_t* strides; // can be nullptr, indicating compact version.
} CTensor;
I propose to merge the buffer into the Tensor structure and remove ref counting, since CTensor is not a container type (one which manages memory).
As a matter of fact, if we really want the data structure to be able to do memory management, Buffer should not be a member of Tensor; instead we need a Buffer* member (so ref counting on the unique Buffer object can be done correctly). This is a bit of an overkill when we don't want memory management.
Here is what we can do on the framework side to bring memory management back. I use C++ style with shared_ptr for simplicity, allowing thread-safe destruction (one can also use a refcount).
// memory container
class TensorContainer : public CTensor {
 public:
  // This is debatable:
  // ideally this should be small_vector<4, size_t> shape_
  // so small shapes sit on the stack.
  // The dimension also duplicates the ndim field.
  std::vector<size_t> shape_;
  void Reshape(const std::vector<size_t>& shape) {
    shape_ = shape;
    this->shape = BeginPtr(shape_);  // point at the member copy, not the argument
    this->ndim = shape_.size();
    // reset memory
  }
  ~TensorContainer() {
    if (origin_ == nullptr) {
      // not a view: this container owns the memory, release it via the framework allocator
      allocator.Free(data);
    }
  }

 private:
  // If it is a view, origin_ points to the original container.
  // Otherwise origin_ is nullptr and this container owns the memory.
  std::shared_ptr<TensorContainer> origin_;
};

// handle type that uses the container.
class Tensor {
 public:
  // can pass to an argument that takes CTensor
  operator CTensor() const {
    const CTensor& dat = *data_;
    return dat;
  }
  // can pass to an argument that takes CTensor*
  operator CTensor*() const {
    return data_.get();
  }

 private:
  std::shared_ptr<TensorContainer> data_;
};
@edgarriba I think there are two concepts here: the CTensor, which serves as a memory handle (like a pointer) and may not have memory management.
And a container type that does memory management, hopefully via c++1x, which wraps the C API. On that end, the requirements on dynamic size etc. belong to the container type.
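A short usage sketch of that handle/container split, calling the BroadCastAdd signature from the strawman above through the implicit conversion (error handling via exceptions is purely an example):

#include <stdexcept>

// framework code can pass the C++ handle straight into the shared C API
void AddInto(const Tensor& x, const Tensor& y, Tensor* out) {
  // Tensor -> CTensor* via the conversion operator defined above; no manual packing
  if (BroadCastAdd(x, y, *out) != 0) {
    throw std::runtime_error(GetLastError());  // message set by the op via SetLastError
  }
}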
So the op will consume or write data only on the device type it finds in device_id, right?
I think we are not yet very clear on the operator specification. Especially, where does the context get set (e.g. cudaSetDevice), and how do we specify context resources (e.g. the cuDNN handle)?
This has been inactive for a few days. I would like to ping back and see if the people involved in this thread are interested in any form of moving forward.
We can start with a neutral repo with the involved project leads as owners (I can do it under dmlc, my own repo, or we can start an org); eventually we can move the repo to a neutral organization.
The first things to be finalized are
I do not have a very good idea of how we should name the project. One possible name could be ctensor (c as in common). Please suggest better name candidates you have in mind.
@tqchen It is OK for us (tiny-dnn).. I don't know if the naming should be limited to Tensor, because we are already talking about ops interfaces.
I think we are not restricted to tensor for naming; I just proposed one because I am kind of running out of ideas for naming, so please suggest names.
CTensor sounds good to me.
Common Dnn Api/Core Dnn Api/Common Kernel Api etc..?
@tqchen I'd be happy to wrap my LibDNN standalone library with the new interface once a reasonable collection of headers is available, to see how well that would work :)
And I guess @edgarriba could test out the host-side code in tiny-dnn?
@bhack What do you mean by that? Probably nothing beyond the interface for operators should be shared in this project?
@naibaf7 I think the boundary is an operators API that ingests this common tensor design.. I don't think that executors/scheduling will come to the table. What is your vision?
How about libop? I prefer not to use tensor, because mathematically a tensor has a rich set of properties, while most operators we are using are just element-wise, so n-dimensional array is a better name for the data structure.
Following the idea of BLAS we can probably call it BDAS (basic deep-learning algebra subprograms) - sounds like "badass".
LOL. In that spirit how about BAsic Neural Artificial Network Algebra Subroutines (BANANAS)
Or Deep Learning PACKage, DLPACK, motivated from LAPACK. We are providing more than basic subprograms.
DLPACK is not so bad.
TensorFlow & Keras combined have the largest user base and are growing most rapidly. You should bring those guys on board for this proposal to make the biggest impact.
http://www.timqian.com/star-history/#tensorflow/tensorflow&fchollet/keras&dmlc/mxnet&BVLC/caffe&Microsoft/CNTK&torch/torch7&Theano/Theano
/cc @fchollet
Just two points:
As a side topic, I personally think that how to allow MXNet to scale out over micro-kernel multi-server OSes and scale down to limited-battery devices is also important.
created a repo here https://github.com/dmlc/dlpack
Let us move the discussion to https://github.com/dmlc/dlpack/issues.