See this link for the discussion repo.
This discussion started from https://github.com/dmlc/minpy/issues/129 with @soumith. THC is the tensor library that backs Torch. I opened this issue in the MXNet repo so more developers can see it.
First of all, it is possible to reuse operator libraries between frameworks, for example
It is always interesting to see interchangeability happen: for example, scheduling pytorch operations in mxnet's async engine, or running mxnet's declarative API to directly share data with pytorch's array.
However, there are some engineering obstacles to doing so. I would like to explain what these obstacles are, in the hope that this motivates the community to move forward and make this easier.
An operator can mean many things; here are some basic components of what an operator is:
Why does such coupling prevent reuse? There are two reasons.
To resolve this problem, an operator library design should enable operators that accept user-managed memory resources: when possible, do not introduce an allocator or resource management, but give hints to the user (CuDNN's workspace-requirement query eliminates the need for an internal memory allocator).
From this point of view, CuDNN and cuBLAS are good examples. THC is nice, but still encapsulates a memory allocator (which is sometimes needed for dynamic operators).
The second obstacle is mainly the lack of a common operator interface. This is a problem of CuDNN and THC that prevents reuse. Take CuDNN for example: each CuDNN API is a C function with its own interface, so to adopt an operator there needs to be one (or more) adapting function per operator.
Consider instead a unified operator interface (the following is a mock design), where each TBlob is a reference to the data fields and shape, and every function gets registered to the registry with its name:
using FCompute = std::function<void (
    array_view<TBlob> ins, array_view<TBlob> outs, map kwargs, stream stream)>;
Then it takes only one function to extract and reuse all operators and automatically expose them to the front end. In MXNet, this even directly generates the symbolic counterpart from the same imperative operator, if a gradient is provided.
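To make the mock concrete, here is a minimal sketch of how a single operator could be written against such a signature and picked up generically. All the types and the registry below are hypothetical stand-ins (kwargs and stream are omitted for brevity):

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// hypothetical stand-ins for the mock design above
struct TBlob { float* data; std::vector<size_t> shape; };
using FCompute = std::function<void(const std::vector<TBlob>& ins,
                                    const std::vector<TBlob>& outs)>;

// a name -> function registry; a framework imports every entry with one loop
std::map<std::string, FCompute>& OpRegistry() {
  static std::map<std::string, FCompute> reg;
  return reg;
}

// one elementwise operator written once against the unified signature
void ElementwiseAdd(const std::vector<TBlob>& ins, const std::vector<TBlob>& outs) {
  size_t n = 1;
  for (size_t d : ins[0].shape) n *= d;
  for (size_t i = 0; i < n; ++i)
    outs[0].data[i] = ins[0].data[i] + ins[1].data[i];
}

// registration on the library side; any framework can enumerate OpRegistry()
bool reg_add = (OpRegistry()["elementwise_add"] = ElementwiseAdd, true);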
There is always a flip side of the coin. Assume that we go with a unified operator interface; as a matter of fact, that is what MXNet, TensorFlow and Caffe have done. The problem now becomes: what should the interface look like? One trap that framework designers always fall into is thinking we need one interface that rules them all.
Since one interface rules them all, we want to support all possible operators. What about the ones that need runtime memory allocation? Maybe add a memory allocator to the interface. What about the ones that are asynchronous? In the end, the interface has to include the memory allocator and the scheduling module in some way,
and that brings back the "coupled operator data structure components" problem. The operator interface becomes deeply coupled with the rest of the framework and is not reusable.
Can we get the best of both worlds: have as few data structures and interfaces as possible, while still not coupling to the allocator and scheduler? I think the answer is yes, and we need to step away from the ideal of one interface that rules all operators.
I can categorize the operators roughly into three categories:
If we design for a general operator interface, the answer will usually look like type 3. However, type 1 and type 2 dominate 90%+ of the major operators we are using.
If we design one operator interface for each type, this problem is solved, so that frameworks can pull in and interact with each type in their own way.
It is much easier to do things like static memory planning if type 1 and type 2 are explicitly introduced. This is the additional layer of wrapping on top of THC and CuDNN that is lacking so far.
A registry system like NNVM's could come in very handy to easily register this information and have it pulled out by the libraries.
I have always hoped that there would be a minimum set of operator interface standards in C++ that can be shared across libraries. I think we have a good idea of what the solution looks like. While most systems tend to become opaque and coupled, I think this kind of transparency can help the community evolve in a healthy way. That being said, it always takes effort to make these things happen: an open discussion on what the interfaces should be, and commitment from framework builders. I would really love to see this happen, and that is why I spent more than an hour writing this.
Unfortunately, most frameworks already have a "good enough" collection of operators, so a unified operator interface will contribute little to each framework in terms of usability in the short term. Naturally this would be given lower priority. That is why commitment is needed to bring this out for the longer-term benefit.
I also had similar discussion with @Yangqing and @ajtulloch before.
Great initiative! I think a lot of components can be shared if we refactor them in simple APIs. Would love to work together on this front.
The fundamental issue with having a unified interface is that it needs full buy-in. Anything short will make it a partial or full failure.
For this reason, I think what the CuDNN team did is actually correct.
For this reason, I think focusing on simplicity, reducing the friction of buy-in, and allowing a way to have partial buy-in will make more folks participate.
So, I think we should define:
Keeping it stupid and simple like this is the path of least resistance that will get us forward.
I don't feel confident that defining and maintaining a common registry will practically break ground, especially because it has a huge initial overhead for each of the framework writers (who are all busy with their own problems).
What do you guys think?
What I proposed only works for stateless operations initially, but I think that's where we should start. Defining statefulness right now will lead to disagreements and complications.
I think stateless is a good starting point (essentially the type 1 operator). I would, however, like to have a small set of unified interfaces in some way, and a registry that is decentralized.
So the scenario I hope for looks like:
#include <common_nn_op.h>

void InitMXNetOps() {
  for (auto reg : Registry::ListBinaryOps()) {
    // register is a reserved word in C++, so the framework-side hook needs another name
    RegisterOp(reg->name, reg->function);
  }
}
This enables one function to import all the operators that are provided in the operator library. It would indeed require a bit of registry code on the operator library side, for example a wrapper around THC or the library @soumith suggested.
This reduces the effort of importing and adapting new operators. However, the interface indeed needs to be simple enough, like one that contains only a few tensor data structures.
As I mentioned earlier, I do not agree on one unified operator interface, but I would like to see whether there are a few candidates we can agree on. For example, binary operators:
void BinaryOp(const Tensor& lhs, const Tensor& rhs, Tensor* out);
void BinaryOpShape(const Shape& lhs, const Shape& rhs, Shape* out);
The idea is to reduce the overhead of adaptation code, which would otherwise be needed for each operator and makes it harder for framework builders to opt in.
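As a sketch of what the library-side registry entry for such binary operators might look like (every name below is hypothetical, meant only to show the shape of the adapter; a framework would then import everything with one loop, as in the InitMXNetOps example above):

#include <cstddef>
#include <string>
#include <vector>

// hypothetical stand-ins for the shared data structures
struct Shape { std::vector<size_t> dims; };
struct Tensor { float* data; Shape shape; };

using BinaryOpFn    = void (*)(const Tensor& lhs, const Tensor& rhs, Tensor* out);
using BinaryShapeFn = void (*)(const Shape& lhs, const Shape& rhs, Shape* out);

struct BinaryOpEntry { std::string name; BinaryOpFn run; BinaryShapeFn infer_shape; };

std::vector<BinaryOpEntry>& BinaryOpRegistry() {
  static std::vector<BinaryOpEntry> reg;
  return reg;
}

// the operator library wraps each of its kernels once and registers it
void AddOp(const Tensor& lhs, const Tensor& rhs, Tensor* out) {
  size_t n = 1;
  for (size_t d : lhs.shape.dims) n *= d;
  for (size_t i = 0; i < n; ++i) out->data[i] = lhs.data[i] + rhs.data[i];
}
void AddShape(const Shape& lhs, const Shape& /*rhs*/, Shape* out) { *out = lhs; }

void RegisterBinaryOps() {
  BinaryOpRegistry().push_back({"add", AddOp, AddShape});
}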
The easiest way of doing this is copy-pasting kernels, which we have been doing for a while.
A BLAS-like interface is a good idea. But this is only worthwhile if the operator is complicated enough (i.e. longer than the code required to call it...). Sharing elementwise add probably isn't necessary.
Having a TensorDescriptor-like data structure further complicates this, since you need to spend 20+ lines constructing these descriptors.
Unified operator interface is in theory the right way to do things, but obviously we all think our own interface is the best interface. So not sure if this will go anywhere anytime soon.
Examples of operators that are worth sharing: broadcast-reduce ops, embedding.
One thing we can do without having to agree to anything is some "principled copy pasta" wiki page where we share operator implementations without necessarily using the same interface.
An easy pairwise testing framework for verifying correctness on top of that would also be good.
Also, instead of sharing compiled code, a header-only library where all data structure access and array indexing go through macros that can be redefined for each framework is much easier.
For example, mxnet doesn't support strides while torch does, so indexing works differently. This can be solved by defining different macros, as sketched below.
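A minimal sketch of what such framework-redefinable macros could look like (illustrative only):

// strided framework (e.g. Torch-style): indexing goes through explicit strides
#define TENSOR_AT_2D(t, i, j) ((t).data[(i) * (t).strides[0] + (j) * (t).strides[1]])

// a compact (non-strided, MXNet-style) framework would instead define:
// #define TENSOR_AT_2D(t, i, j) ((t).data[(i) * (t).shape[1] + (j)])

// shared header-only kernels are written only against TENSOR_AT_2D and are
// recompiled inside each framework with its own definition of the macro.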
I see two ends to the current discussion.
*Build one library that everybody will use*
This can use a simple common data structure, with each framework calling into the same functions. @soumith 's proposal is a better solution for this end: as long as the data structure is agreed on, there is no problem calling into the functions.
The problem is that it is hard to convince developers to fully commit to a shared core library.
*Being able to import operators from other libraries*
My major concern with doing this is the overhead of importing. That is why some simple common interface, along with the data structure, might be desirable: the cost of importing is then not an effort per operator, but one effort for importing all the operators that all frameworks currently define.
*What is the set of interfaces*
To be clear, I do not think MXNet's interface (nor the interface of any existing framework) is the best way to do operator sharing. But I do think there is a set of cleaner, minimal interfaces that we might agree on, just like we can agree on the data structures.
Interface-wise, if I may: the cudnn-type interface is a good start, and this is what I have been telling other vendors too. If there are implementations for e.g. OpenCL, OpenGL, Vulkan etc., this will make the frameworks' life much better.
I think the interface includes two parts: (1) the function routines, (2) the tensor data structure. For example, I remember THC has support for stride and offset, which is lacking in MXNet. If we use cudnn's way, then we need to include all this information in function arguments, which may be a problem for future extension.
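For concreteness, here is a small sketch of the addressing rule that stride and offset support implies (assuming a THC-style strided layout; the function name is illustrative):

#include <cstddef>

// element position of index (idx[0], ..., idx[ndim-1]) in a strided tensor with an offset;
// a compact tensor is the special case where strides are the row-major products of the
// trailing shape dimensions and offset is 0.
size_t ElementPosition(const size_t* strides, const size_t* idx, size_t ndim, size_t offset) {
  size_t pos = offset;
  for (size_t d = 0; d < ndim; ++d) pos += idx[d] * strides[d];
  return pos;
}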
It makes sense to start with a non-strided version, I think - Caffe/2 does not use stride either and assumes a 256-byte alignment for storage.
RE cudnn - I actually think that having a pure C interface is good for extension, since it would make cross-language integration much easier. For example, Python C extensions don't have very good C++ support.
(Oh by the way, pybind11 is awesome.)
I am all for a C interface for a stable ABI, on the other hand. As a matter of fact, almost all dmlc projects interface through a C API.
It is always possible to have an auxiliary c++ registry if we can categorize the functions, and return the function handles.
Is anybody interested in a Tensor API?
You can find the source code of an initial AMD implementation of the standard.
My concern with the Tensor API (and there are a few of its kind) is that they are opaque: not only is this a standardization of the Tensor, it is also a standardization of a graph-based DL framework.
Personally, I think what we really need is a separation of these things (Tensor data structure, the computation, memory management, scheduling): adopt the unix philosophy, do one module transparently and have it interact with the others, specifically:
I think what we are discussing here is not really how to support XYZ features (that is the job of deep learning frameworks), but how to come up with a minimum module that can be shared across frameworks.
As I may quote
"Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away"
It seems there is enough interest in this issue. What I may suggest is for some of us to post strawman designs of the Tensor structure and possible operators, and we can comment from there; this will help things move toward a concrete direction.
I largely agree with @soumith and @Yangqing on a C-language structured minimum tensor object (maybe design a preferred compact layout, with optional stride support).
How many hardware vendors are involved in the OpenVX neural network extension? Nvidia is in the team list, as you see, and had a strategy to implement an OpenVX computing graph over CUDA (see VisionWorks). Also Samsung, Intel, AMD and ARM are on the committee.
In tiny-dnn, which is strongly oriented toward c++1x features, we have also evaluated the array_ref proposal, but I think many here are not interested in the different C++ standardization efforts.
I also like the idea. Not sure how much time we can put into this. One thing that could help all teams find the time would be to identify one feature we don't have and others have. If we can tell ourselves "we need this, there is a relatively easy way to get it, and it is a long-term solution", then it will probably get done.
Otherwise I'm pretty sure this will get too low in most people's priority stack in a few weeks.
What do you think of trying to do that now?
About the registry of function pointers: it isn't enough. A big part of the time spent wrapping a lib is in error handling and other stuff like this. A registry would need to tell the signature of the function, etc. So it will get complicated and not used by many. So I think a blas/cudnn-like interface is best.
Ping @lamblin @abergeron @bartvm so they know about this.
@nouiz What is your opinion of the openvx kernel/node api?
I don't think it is that interesting since it doesn't have any support for looping or branching.
Also it seems heavily oriented towards image processing.
I think anything that tries to manage memory in an opaque way won't get much adoption.
The challenge of a blas like interface is how to minimize wrapping code. Currently you need to write hundreds of lines of code to call convolution with cudnn. For small ops it's completely not worth it.
A C++ interface using templates can still have a blas like philosophy. That sounds more promising.
This is the graph formalism and this is the neural network extension overview. It is still provisional and we could work upstream if we want.
/cc @naibaf7
If anybody wants to take a look, we have an internal header-only tensor and tensor storage under construction. Any feedback is appreciated.
Hi! I like this initiative. Simple C signatures seem fair for everybody. Having this as a header-only project could make it more portable and easy to plug into whatever framework. For the design, maybe we can start by sharing some UML prototypes.
@bhack Had a quick look. One thing it's missing is the ability to wrap around external memory.
Also, host data and device data should be in separate tensors. Not every tensor has a host mirror.
@piiswrong yes, both are on the roadmap.. we can already cover this with the upstream CLCudaAPI header using a start and end iterator of a c++ container, and we have distinct BufferHost and Buffer concepts.
What would be a minimal MVP for a Tensor in the scope of this issue? Is a Tensor interface the first step of a plan? Will each framework need to handle conversion from/to this "rosetta stone" Tensor?
Here is what I would recommend:
- A minimum C style Tensor object, which most functions wrap into, for example
typedef struct {
  void* data;
  size_t ndim;
  size_t* shape;
  size_t* strides;
} CTensor;
- Optionally, a header only C++ Tensor object that provides automatic conversion to/from the C tensor, which might provide some utilities like shape management (maybe not memory management)
class Tensor {
 public:
  operator CTensor() const;
};
As long as the operator CTensor() is provided, you can likely call a C API with the same signature without doing manual conversion.
I think a minimum tensor object should have support for strides.
Is there any reason for a pure C interface? I think most people will be happy with C++.
The problem with C is you have to encode things like data type with a flag field instead of template arguments.
So how will ops access the tensor's associated memory (i.e. device, framework, context)?
I think the MVP for each framework, at least for Theano for now, is to have this interface plus one operation we could reuse.
For example, CTC is in an external repo, not in Theano. If the source of CTC offered this interface, then using it while moving it into Theano would be a minimal MVP. But as it is already available, another operation would be better.
I vote for C++ and strides
Another thing we need to decide is whether ndim and dtype are template arguments or fields.
It depends on whether you want to switch over the type outside or inside the API.
The main advantage of C API is ABI stability. There is no standard C++ ABI, which means the compiled library can depend on compiler version even on the same platform.
For example, it is quite common to compile CUDA code with MSVC on Windows, while linking that library from MinGW if you are building an R binding (because R's Windows build is on MinGW). This is impossible if you use C++. If the code goes C++, essentially only source can be distributed instead of binaries, which might make it a bit vendor-unfriendly if they want to distribute binaries (like CuDNN).
On the other hand, c++1x is great and I think it would be great to have a header only library that wraps the C API, which allows simpler syntax.
In terms of context, device and resources, there are usually two ways.
The second way might be cleaner, but does have the overhead of fetching a TLS entry on each function call, which is negligible (at the microsecond level). As a matter of fact, most runtime APIs like CUDA use TLS to make calls thread-safe.
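A minimal sketch of that thread-local-context variant (the names here are hypothetical, not an existing API):

// hypothetical per-thread context the framework sets before dispatching an operator
struct RuntimeContext {
  int device_id = 0;
  void* cudnn_handle = nullptr;  // framework-owned resource handles
};

inline RuntimeContext& ThreadLocalContext() {
  thread_local RuntimeContext ctx;  // one TLS lookup per call
  return ctx;
}

// framework side: set once per dispatch; operator side: read inside the call
inline void SetCurrentDevice(int device_id) { ThreadLocalContext().device_id = device_id; }
inline int  CurrentDevice()                 { return ThreadLocalContext().device_id; }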
I think having strides is good, though many functions may only support the non-strided version, with a failure flag asking the framework to call MakeContiguous first.
The danger in a potential MakeContiguous() is that an allocator gets involved (or the library has its own private workspace), which needs a bit of careful consideration.
I agree with @edgarriba.
But I think in general operators should only be shared like a BLAS, leaving everything else up to the DNN framework. Stride and format of tensors should be kept open, but the DNN operators should specify which formats they support, like BLAS libraries do.
For the memory interface of complicated operators, I think many of them could be simpler: without an allocator, instead having two functions, which is actually what CuDNN does:
- A workspace requirement function that has the same signature as the execution function, but allows data fields to be nullptr, and returns the workspace requirement.
- An execution function that takes a workspace pointer as an additional argument.
This cannot cover all the complicated operators (there are some that depend on the content of the data) but already includes most cases. This removes the need for a memory allocator or lambda function.
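A sketch of what that two-function pattern could look like against the shared tensor struct (illustrative signatures, not CuDNN's actual API):

// 1) query phase: same arguments as execution, data pointers may be nullptr;
//    returns how many workspace bytes the execution will need at most
size_t ConvForwardWorkspaceBytes(const CTensor* in, const CTensor* filter, const CTensor* out);

// 2) execution phase: the caller passes workspace it allocated (or planned statically)
int ConvForward(const CTensor* in, const CTensor* filter, CTensor* out,
                void* workspace, size_t workspace_bytes);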
@tqchen It gets messy very fast with the multi-device operators that are coming up, or if workspace memory needs to be consolidated between multiple operators to save memory. I think it will be hard to get around the memory allocator duality and the additional life-cycle functions for stateful operators.
@naibaf7 I cannot speak for the multi-device operators. But the workspace consolidation problem can be handled easily from the framework side. As a matter of fact, it can even be done statically when a computational graph is available, without relying on dynamic memory allocation
@tqchen If the workspace memory for an operator is fixed maybe, but it's often not; also reshaping a network or operators that switch and autotune algorithms can have dynamic memory requirements.
I wouldn't want to take this possibility away for future operators that might come up.
The assumption is that the workspace for an operator is fixed for a fixed input tensor shape, while the requirement can be re-calculated when the shape changes. The requirement can be a rough estimate of the maximum space needed, as CuDNN does.
This can always fall back to the dynamic memory approach on the caller side, but that leaves the decision to the user of the library.
What's the argument against a lambda allocator?
It's cleaner and more flexible.
It somewhat prevents the chance of static allocation of the workspace. The workspace requirement interface is more restrictive, and enables the two-phase strategy (allocation then execution).
I think the argument is not against using an allocator when necessary - there are some cases where it is unavoidable - but instead to provide three categorizations.
The former ones can always be relaxed to the latter ones. In general, putting an operator into the most restrictive type it fits leaves the user the chance to decide what to do with it.
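Read that way, the three signature families might look roughly like this (a sketch of my reading of the thread, not an agreed interface):

// (1) fixed-resource ops: inputs and outputs only, no extra memory needed
int UnaryOp(const CTensor* in, CTensor* out);

// (2) workspace ops: requirement queried up front, workspace passed in by the caller
size_t OpWorkspaceBytes(const CTensor* in, const CTensor* out);
int OpWithWorkspace(const CTensor* in, CTensor* out, void* workspace, size_t bytes);

// (3) fully dynamic ops: only these need an allocator callback from the framework
typedef void* (*AllocFn)(size_t bytes, void* framework_state);
int DynamicOp(const CTensor* in, CTensor* out, AllocFn alloc, void* framework_state);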
@tqchen Yup, I agree with that last statement of yours. Those that do require an allocator can also be kept simple: they can use their internal allocator and destructor for device memory IF the framework does not care about having full memory-management control of the devices.
For more restricted operators, the two-phase workspace configuration and execution can be incorporated in the life-cycle functions.
@tqchen nice! go ahead and create a repo with simple C structures so that we can start to iterate.
I can start with the c++ wrapper once the baby starts to walk
I think a c++ wrapper on top of a pure C interface is overkill.
This library is intended to be used through DL frameworks, not standalone. You wouldn't create a frontend binding just for this library, which would essentially be reinventing Torch. The top priority for it should be easy compiling/linking. If possible it should be a header-only library with a c++11 interface from the ground up. Although if we want to support OpenCL it's hard to make it header-only.
This library is intended to be used through DL frameworks, not standalone.
@piiswrong It is? Apart from the name of the issue :D
A tensor should have the following attributes:
ndim,
dtype,
device (cpu/gpu/ocl),
data_ptr,
shape,
stride
We need to decide which ones go into fields and which go into template arguments. If ndim is a template argument, we enjoy the benefit of on-stack allocation of shape/strides. The downside is that callers need to switch over ndim.
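A small sketch to make that trade-off concrete (hypothetical, just contrasting the two choices):

#include <cstddef>

// ndim (and dtype) as template arguments: shape/strides live on the stack,
// but callers must switch over ndim to pick the right instantiation
template <typename DType, int ndim>
struct StaticTensor {
  DType* data;
  size_t shape[ndim];
  size_t strides[ndim];
};

// ndim/dtype as fields (as in the C struct proposals): a single runtime type,
// but shape/strides are pointers and every operator checks ndim/dtype at runtime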
@edgarriba I hope so. Otherwise it's reinventing Torch, which doesn't make much sense. I'm hoping we can solve, or at least mitigate, the problem that there is too much redundant work across DL frameworks, not add to it ;)
Honestly speaking, I would vote for a C interface, not because I personally like C, but because over the years, I have seen a lot of opinions around C++. Tianqi's argument about ABI compatibility is one prominent reason for using C as a compatibility layer. Also, languages such as Python and Torch would need a C FFI in any case, and having C++ on top of C is definitely better than having C on top of C++.
And the heavy use of templates, especially template metaprogramming (I am sure we are not talking about this here, but FWIW) is also pretty controversial. One can find a lot of similar cases in the argument for and against Eigen actually.
It depends on the purpose of the library.
If you want to use it standalone, then having a C interface is better.
But if you want to use it through frameworks, C++ is much more convenient, because you can pass the Tensor object around in your framework and eventually replace your own data blob with a standard Tensor. Using a C interface means you have to convert to/from your data blob and Tensor repeatedly in each operator. It's slow and needs a ton of wrapper code. For small operators it's not worth doing.
Take cub (https://github.com/NVlabs/cub) as an example: it's a header-only library with a c++ interface.
Torch's internal THC also uses C++. It's only the interface that's pure C.
Anyway, just my 2 cents :)
I also don't like template metaprogramming, because it's a pain to maintain. Very few people can understand it.
But simple template arguments like template <typename Device, int ndim>
are OK.
I'm against C++ for the ABI. Let's stick to C and build a C++ wrapper on top if needed. Hourglass interfaces are great.
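A minimal sketch of such an hourglass layering, reusing the CTensor struct proposed above (all function and type names here are hypothetical):

// C layer: the stable ABI that the shared library actually exports
extern "C" int CTensorBinaryAdd(const CTensor* lhs, const CTensor* rhs, CTensor* out);

// C++ layer: header-only sugar compiled inside each framework; it adds nothing
// to the ABI, it only converts and forwards
struct TensorRef {
  CTensor handle;
  operator const CTensor*() const { return &handle; }
  operator CTensor*() { return &handle; }
};

inline int Add(const TensorRef& lhs, const TensorRef& rhs, TensorRef* out) {
  return CTensorBinaryAdd(lhs, rhs, *out);  // implicit conversions do the plumbing
}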
Just to add more info on the ABI compatibility issue.
A resource for Hourglass
Just to make sure every one is on the same page:
I agree with @piiswrong. Probably an agreement on these kinds of general boundaries is required.
This reminds me of my first boss, but this is great.
Here is my opinion on @piiswrong 's points. To summarize, I think the advantage of such a shared belief is not only sharing between projects, but also encouraging others to help.
My guess is that wherever there is sharing, there are interface and compatibility issues, and we always need to consider what the minimum interface out there is.
I agree with @piiswrong that it might be helpful to have everyone agree on certain things before we proceed. One way to do so is for each of us to post a code snippet of what the data structure and interface should look like, then we summarize the points that need to be taken.
Once we have enough agreement on what the initial data structure and function signatures look like, we can open a repo and iterate from there.
Here is my version of a strawman design, based on my summary of the majority opinion here:
typedef struct {
  int device_id;
  int device_type;
} CContext;

typedef struct {
  // alignment can be checked directly from the address
  void* data;
  size_t ndim;
  size_t* shape;
  // can be nullptr, indicating the compact version.
  size_t* strides;
  // device
  CContext ctx;
  // data type
  int dtype;
} CTensor;

// utilize thread local storage to get the last error encountered.
const char* GetLastError();
// used by the implementer to set the error message.
void SetLastError(const char* msg);

// C API returns 0 on success, -1 on failure,
// -2 indicating contiguous input is needed, asking the user to convert before calling again
int BroadCastAdd(CTensor* x, CTensor* y, CTensor* out);
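A caller-side sketch of how a framework might honor that error convention (MakeContiguous and ReportError stand for whatever the framework provides internally; both are hypothetical helpers):

int CallBroadCastAdd(CTensor* x, CTensor* y, CTensor* out) {
  int ret = BroadCastAdd(x, y, out);
  if (ret == -2) {                    // op only handles contiguous input
    CTensor xc = MakeContiguous(x);   // hypothetical framework-side copy helper
    CTensor yc = MakeContiguous(y);
    ret = BroadCastAdd(&xc, &yc, out);
  }
  if (ret != 0) {
    ReportError(GetLastError());      // message set by the op via SetLastError
  }
  return ret;
}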
Do we need to handle non-contiguous cases?
@bhack I would rather not in most cases, and ask the user to get a contiguous version before calling (I guess @Yangqing shares the same opinion), but I think the majority opinion here wants stride as an argument.
Suppose you have a c++ interface and a source release from vendors, and now you need a new simple unary op: you can easily do it by defining a new inline function and passing it as a template argument.
But if you have a C interface and a binary from the vendor, there is no way to do it yourself.
I think we can push for source releases from vendors. Nvidia and Intel are already doing this with mkldnn and cub.
On the c++ side I was also thinking about what kind of limits we could have with a plugin-like cross-platform solution in c++.
Yeah, I agree with TQ that we prefer the non-strided version, but if strides are better for performance or other reasons, it is not a hard constraint - we will need to stride stuff sometimes anyway.
Per Eric's question about C++, I suspect we will never have the vendors give us all the source code for the high performance part :p
@Yangqing
But it should be easy in most of the cases to write the compatible wrapper around the high performance code... nice to see you here btw. :)
We can have both versions: strided and optionally non-strided for each function.
Let's see how much source they are willing to give out @ap-hynninen @glingyan @zhenlinluo
What is the device type and id in cases like https://github.com/CNugteren/CLCudaAPI/issues/11 and generally with unified memory devices?
@bhack In my understanding, UMA does not prevent you from having a primary device id; it just permits operators to take input from different device_ids and launch on a certain device.
The same thing goes for OpenCL kernels. Although OpenCL creates the illusion that a buffer can belong to any device, having a primary device id per memory is still helpful. The fact is that the underlying driver will create mirror copies on certain devices, so having a tensor primarily operate on one device will certainly have benefits (in terms of scheduling etc.).
@tqchen How should offsets inside memory objects/tensors be provided? What if multiple tensors are packed inside one OpenCL buffer object?
I like @tqchen 's structs and APIs.
@naibaf7 *data can be an offset pointer?
Yeah, an offset pointer would be great, and if I may, can we enforce things to be aligned to like 256 or 512? That might make things maximally compatible with multiple optimization mechanisms.
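Checking such an alignment requirement is cheap on the framework side (for raw CPU/CUDA pointers; opaque cl_mem handles would need a different path). A minimal sketch:

#include <cstdint>

// true if the data pointer satisfies a 256-byte alignment requirement
inline bool IsAligned256(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 256 == 0;
}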
To address @naibaf7 's concerns, and to take into account cleaner Tensor views, here's what I propose (we do this in Torch already):
typedef struct {
  int device_id;
  int device_type;
} CContext;

typedef struct {
  void* data;   // data pointer
  CContext ctx; // device
  int dtype;    // data type
  int refcount;
} Buffer;

typedef struct {
  Buffer data;
  int offset;      // offset in buffer
  size_t ndim;
  size_t* shape;
  size_t* strides; // can be nullptr, indicating compact version.
  int refcount;
} CTensor;
This way two problems are solved:
In Torch, we call them Storage and Tensor, but whatever naming is more standardized...
I'm not super particular about this change, just proposing it based on feedback.
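A sketch of how the Buffer/offset split covers views without copying, assuming a compact (non-strided) 2-D row-major tensor and the struct above (SliceRows and the caller-owned view_shape are purely illustrative):

// view rows [begin, end) of src: both tensors share the same Buffer,
// only the offset and shape metadata differ
CTensor SliceRows(const CTensor& src, size_t begin, size_t end, size_t view_shape[2]) {
  CTensor view = src;                                      // same Buffer, dtype, device
  view.offset = src.offset + (int)(begin * src.shape[1]);  // element offset into the buffer
  view_shape[0] = end - begin;                             // fewer rows, same row length
  view_shape[1] = src.shape[1];
  view.shape = view_shape;                                 // shape storage owned by the caller
  return view;
}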
@soumith Okay, however that's not going so nicely with OpenCL right now. Maybe a separate offset parameter should also be considered. EDIT: Too late, I like your updated suggestion :).
@Yangqing Should such alignment requirements be enforced or should there be metadata for operators that can specify optimal and necessary workspace conditions?
A tensor shouldn't own memory since it's only a wrapper around memory managed by the framework. So you won't need an offset.
You would need an offset even if the Tensor doesn't own memory. The Tensor can only be constructed on a slice of a larger Buffer. This is what @naibaf7 is saying we should cover, and I know from Torch's experience that this comes up very often.
@piiswrong I think that concept is also maintained by what @soumith says. Basically the CTensor and eventually CMatrix or CVector structs carry metadata of a tensor/matrix/vector, one of which is a CBuffer struct that specifies what memory from the host framework is being used.
CMatrix and CVector could also be used to specify dimension reduced areas of a tensor in order to simplify the use of BLAS libraries alongside the new DNN operators.
The operators can then internally combine the information, such as checking memory requirements and passing buffer+offset into the kernel (OpenCL) or compute the unified memory pointer (CUDA, CPU) and execute the compute kernel.
refcount is not needed if memory container is managed by the framework. Note that you can always add it back and maintain compatibility by sub-classing Tensor. If memory is not managed, maybe Buffer is also not needed (only have a Tensor with possible offset).
@piiswrong offset is useful for things like OpenCL, where the cl_mem object is opaque and you cannot do explicit addressing.
I think that we are basically converging on a solution. But with clSVMAlloc, what device id do we set?
@bhack I think this doesn't need to be of concern for the interface. That kind of memory can be associated with multiple CBuffer/CTensor objects of multiple devices, if the host framework knows that this is safe to do.
@naibaf7 OK, but then isn't the device id information more oriented toward the "executor"?
And what interfaces do you propose for constructors? Which are required and which optional? Besides, do we want to support tensors with dynamic size or just static?
FYI, in case you want to start to play I've setup this https://godbolt.org/g/YMnxHB
I think the data offset @soumith proposed looks good. Here is one more modification I propose (also incorporating @Yangqing's comment about the requirement of data alignment):
typedef struct {
  int device_id;
  int device_type;
} CContext;

typedef struct {
  void* data;      // data pointer (or cl_mem handle in OpenCL), aligned to 256
  CContext ctx;    // device
  int ndim;
  int dtype;       // data type
  size_t offset;   // offset in buffer
  size_t* shape;
  size_t* strides; // can be nullptr, indicating compact version.
} CTensor;
I propose to merge the buffer into the Tensor structure and remove ref counting, since CTensor is not a container type (one which manages memory).
As a matter of fact, if we really want the data structure to be able to do memory management, Buffer should not be a member of Tensor; instead we need a Buffer* member (so ref counting on the unique Buffer object can be done correctly). This is a bit of an overkill when we don't want memory management.
Here is what we can do on the framework side to bring memory management back. I use C++ style with shared_ptr for simplicity, allowing thread-safe destruction (one can also use a refcount).
// memory container
class TensorContainer : public CTensor {
 public:
  // This is debatable:
  // ideally this should be small_vector<4, size_t> shape_
  // so small shapes sit on the stack.
  // The dimension also duplicates the ndim field.
  std::vector<size_t> shape_;
  void Reshape(const std::vector<size_t>& shape) {
    shape_ = shape;
    this->shape = BeginPtr(shape_);  // point at the member copy, not the argument
    this->ndim = shape_.size();
    // reset memory
  }
  ~TensorContainer() {
    if (origin_ == nullptr) {
      // not a view: this container owns the memory, release it via the framework allocator
      allocator.Free(data);
    }
  }

 private:
  // If it is a view, origin_ points to the original container.
  // Otherwise origin_ is nullptr and this container owns the memory.
  std::shared_ptr<TensorContainer> origin_;
};

// handle type that uses the container.
class Tensor {
 public:
  // can pass to an argument that takes CTensor
  operator CTensor() const {
    const CTensor& dat = *data_;
    return dat;
  }
  // can pass to an argument that takes CTensor*
  operator CTensor*() const {
    return data_.get();
  }

 private:
  std::shared_ptr<TensorContainer> data_;
};
@edgarriba I think there are two concepts here: the CTensor, which serves as a memory handle (like a pointer) and may not have memory management.
And a container type that does memory management, hopefully via c++1x, which wraps the C API. On that end, the requirements on dynamic size etc. belong to the container type.
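A short usage sketch of that handle/container split, calling the BroadCastAdd signature from the strawman above through the implicit conversion (error handling via exceptions is purely an example):

#include <stdexcept>

// framework code can pass the C++ handle straight into the shared C API
void AddInto(const Tensor& x, const Tensor& y, Tensor* out) {
  // Tensor -> CTensor* via the conversion operator defined above; no manual packing
  if (BroadCastAdd(x, y, *out) != 0) {
    throw std::runtime_error(GetLastError());  // message set by the op via SetLastError
  }
}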
So the op will consume or write data only on the device type it finds in device_id, right?
I think we are not yet very clear on the operator specification. Especially, where does the context get set (e.g. cudaSetDevice), and how do we specify context resources (e.g. the cuDNN handle)?
This has been inactive for a few days. I would like to ping back and see if the people involved in this thread are interested in any form of moving forward.
We can start with a neutral repo with the involved project leads as owners (I can do it under dmlc, my own repo, or we can start an org); eventually we can move the repo to a neutral organization.
The first things to be finalized are
I do not have a very good idea of how we should name the project. One possible name could be ctensor (c as in common). Please suggest better name candidates you have in mind.
@tqchen It is OK for us (tiny-dnn).. I don't know if the naming should be limited to Tensor, because we are already talking about ops interfaces.
I think we are not restricted to tensor for naming; I just proposed one because I am kind of running out of ideas for naming, so please suggest names.
CTensor sounds good to me.
Common Dnn Api/Core Dnn Api/Common Kernel Api etc..?
@tqchen I'd be happy to wrap my LibDNN standalone library with the new interface once a reasonable collection of headers is available, to see how well that would work :)
And I guess @edgarriba could test out the host-side code in tiny-dnn?
@bhack What do you mean by that? Probably nothing beyond the interface for operators should be shared in this project?
@naibaf7 I think the boundary is an operators API that ingests this common tensor design.. I don't think that executors/scheduling will come to the table. What is your vision?
How about libop? I prefer not to use tensor, because mathematically a tensor has a rich set of properties, while most operators we are using are just element-wise, so n-dimensional array is a better name for the data structure.
Following the idea of BLAS we can probably call it BDAS (basic deep-learning algebra subprograms) - sounds like "badass".
LOL. In that spirit how about BAsic Neural Artificial Network Algebra Subroutines (BANANAS)
Or Deep Learning PACKage, DLPACK, motivated from LAPACK. We are providing more than basic subprograms.
DLPACK is not so bad.
TensorFlow & Keras combined have the largest user base and are growing most rapidly. You should bring those guys on board for this proposal to make the biggest impact.
http://www.timqian.com/star-history/#tensorflow/tensorflow&fchollet/keras&dmlc/mxnet&BVLC/caffe&Microsoft/CNTK&torch/torch7&Theano/Theano
/cc @fchollet
Just two points:
As a side topic, I personally think that how to allow MXNet to scale out over micro-kernel multi-server OSes and scale down to limited-battery devices is also important.
created a repo here https://github.com/dmlc/dlpack
Let us move the discussion to https://github.com/dmlc/dlpack/issues.