Relay is a new high level intermediate representation (IR) intended to act as v2.0 of NNVM.
Computation graphs are a powerful program representation as demonstrated by the first generation of DL frameworks. Most popular frameworks have employed computation graphs as their input, intermediate representation, and execution data structure.
However, as workloads continue to evolve, the design of our high level IRs needs to evolve to better support the needs of developers and users
Graph-level challenges such as control flow and sub-graphs have become necessary features to natively support and optimize.
The tight coupling between runtime representation and compile-time representation has limited flexibility and frustrated developers; Relay will decouple the representations.
Finally we believe the high level must be designed in tandem with the low level IR, allowing for the two layers to communicate during compilation to achieve optimal performance.
The first version of NNVM set out to solve some of these challenges, and we view Relay as second generation IR designed specifically for integration into the TVM stack as the input layer. Our goal is to focus on TVM as our primary backend, easing development and maintenance for both TVM developers and current NNVM users, as well as enabling new features.
In order to address the challenges presented above we designed Relay to build on the things computation graphs are good at (pure, dataflow, compositional), and improve on the things they struggle with (control flow, subgraph, runtime/compilation distinction).
Relay is a typed pure functional IR, with a few basic features such as functions, if-then-else control flow, recursion, operator and function calls, and variable binding.
We have iterated on Relay's design over the past 8 months. This versions represents the culmination of our experiments. This PR does not contain all the pieces of the previous version, instead we focus on introducing the core IR, its associated data structures, and a few integral passes.
The core IR is defined in just a few files:
include/tvm/relay/base.h (the base classes and common data)include/tvm/relay/type.h (the type system and all relevant nodes)include/tvm/relay/expr.h (the expression language)All Relay programs are typed, similar to more conventional languages such as C++.
A type system allows us to statically (i.e at compile time) distinguish between different sorts of values. This means we know whether an expression will evaluate to a tensor, a function (i.e (float32, float32) -> float32) or a tuple (float32, int32). Furthermore, our type system has the ability to be shape generic (i.e polymorphism, templating).
Type inference and checking take the place of shape inference in traditional computation graphs style IRs.
This PR implements type inference and checking for Relay, the code can be found in src/tvm/relay/pass/type_infer.cc, and relevant helper utilities in src/tvm/relay/pass.
Relay adds a notion of control flow to the IR, in the form of simple if (cond) { true_branch } else { false_branch}. Relay requires that the condition variable computes a single boolean value controlling which branch is taken. if is an expression in Relay, meaning the result of the entire
expression is the result of the branch taken.
We introduce this to add a formal way to distinguish between data flow and control flow without having to conflate the two in the representation. Because we separate the control signal, we can easily batch a program without affecting control flow.
The definition of control flow can be found in include/tvm/relay/expr.h.
Relay supports the definition of functions which can be used to represent "sub-graphs" (i.e chunks of reusable computation).
Relay functions are like traditional functions: they represent some set of parameters (i.e placeholders) and a body which is a chunk of computation involving the the parameters (i.e sub-graph). We can build a full network/model by composing together functions.
The Relay IR is designed as a compile time representation of models. The new features are exposed only in Relay's abstract syntax tree, and used for compile time program manipulation. We do not intend to use Relay's IR as a data structure for serious interpretation or execution.
These new features increase the expressivity of the current computation model, and one may ask how to execute programs using these features with the existing runtime. Our goal is to introduce Relay as the compiler representation in this PR, and reuse the existing runtime maintaining compatibility on both the frontend and backend. We anticipate a new version of the runtime having native support for Relay's new constructs in the future.
We made an effort to model Relay's implementation after TVM and reuse much of the existing infrastructure in order to provide better compatibility between TOPI operators and Relay programs. One big design decision is reusing the TVM node system to expose the Relay language to Python in the style of TVM. Users who are familiar with TVM's expression language should feel comfortable working with the Relay AST's definition in C++, and Python. We also share representations for many data structures. For example tensor containers (i.e tvm::runtime::NDArray), and generic attributes which can be shared between Relay and TVM are two such shared structures.
We plan on adding a guide for transitioning programs from NNVM to Relay. This is one of the remaining work items before releasing the Relay Alpha. The goal is users can use the Relay operators and builder API to construct Relay programs, and we will follow-up with a compatibility layer to make transitioning from NNVM smooth.
For an implementation see #1672 which implements this bit.
hi @jroesch, looks cool. I am hoping that this is a TVM's answer to Tensorflow's somewhat awkward tf.while_loop or other control flow constructs.
Regarding this sentence,
Finally we believe the high level must be designed in tandem with the low level IR, allowing for the two layers to communicate during compilation to achieve optimal performance.,
does HalideIR correspond to the "low level IR"? This sounds very interesting.
We know that the goal of designing a new IR is to benefit potential optimization, so could you be more specific that what kinds of optimization Relay is planning to support?
@junrushao1994 can you also elaborate on your use-cases and how easy/hard it is to bring this to the current proposal?
This is one of the major design chance and we would love to have participation from the community to review and improve the proposal @dmlc/tvm-team
Thanks @jroesch for the proposal, I am going to elaborate some of my take on this proposal. Note that no design is perfect and that is why we need help of everyone to work together the evolve the IR.
These are things that pops up from my head, feel free to add more.
I would say the design itself is perfect, which addresses almost all problems that this design targets. Yes, this is the first systematic approach to address the lack of Turing-completeness in deep learning frameworks, rather than quick hacks like TensorFlow's while_loop. Also, The implementation is elegant and I love it.
So please allow me to take the liberty to talk about some concerns that might be out of the target of the current design. Briefly, I will comment in the following aspects.
Sometimes I prefer to think that worse is better, and guess that it might be kind of restrictive to ask deep learning practitioners to do "the right thing". After all, we could not assume each user has a PhD degree in PL. Here are several things we might need to consider in the future.
I would love to discuss the necessity of having containers like List<T>, Dict<K, V>. This concern raises from some deep learning model in NLP. For example, self-attention, which is used in sequence generation tasks.
def self-attentive-generator():
initialize states
initialize outputs = []
while True:
prev_outputs = concat(outputs)
context_vector = self-attention(prev_outputs, inputs, state)
step_output, states = DecoderStep(states, prev_outputs)
outputs.append(step_output)
if some_condition:
break
outputs = concat(outputs)
return outputs, states
We already have Tuple<T_1, ...> in Relay, which is great. We could definitely ask users to convert it to a FP style so that everything has no side effect, and while performance is not quite affected, but are we able to reduce the memory footprint in this case?
Supporting Dict<K, V> is somewhat weird requirement, but it seems to me that states in RNN are often represented using a dict (need confirmation from @szha). I guess we could probably replace this with something like namedtuple.
Other weird use cases may include Fast WaveNet, Pixel CNN++, beam search in decoding, many of while requires users to write a normal program to manipulate containers, which is easier for common users to write in an imperative style, rather than FP. I am a big fan of FPs like Haskell, but yet a little bit worried
1) whether the market likes this style of programming.
2) whether this IR could support such optimizations.
For example, in dropout, we should definitely introduce something with randomness.
This is just a remainder that these are stuff we should take into consideration.
I am also wondering if ctx could be put into the type system. This is kind of co-design with TVM.
This is kind of off the topic, but I am personally interested in seeing brilliant ideas about how we could handle these situations.
A. SyncBN: This introduces allreduce in the forward pass. Of course, such ugly thing like forward/backward will never exist in Relay, which is great, but it seems that we should introduce synchronization primitives in the design.
B. Timeout: It is common practice on an edge device to store a smaller DL model offline, which is used to produce a coarse result; in the meantime, an online model on remote servers computes some part of and send it back. In case the online model failed to send the fine-grind result in time, the edge device would use the offline result. This is another kind of side-effect, should we handle this in a deep learning framework, or leave it to others?
+1 on making things accessible(worse is better), this is exactly what we should push for.
Making most part functional makes differentiation easy and allows build the things around. It is important to be able to support mutation in some of the outer loops, where diff is not needed. Have such clear distinction is important.
The List Dict programming style is more like multi staging program, where the graph is staged in the data structure before we call backward. While it is definitely possible to do so via imperative autodiff, it is an interesting question to ask if we can desugar this into some form of functions. Note that this is different from mutation(because differentiation is necessary)
Distributed sync prImitives and timeout can likely be implmement via special core operator(like add and sub) and there should be no problem handling this: making things more sideeffect free will make distributed parts easier
@tqchen and @junrushao1994 in some of other experimentation we have worked on rewriting imperative Python code into the IR directly. This should allow users to write imperative looking code which is translated into a form that is easy to optimize and perform automatic differentiation on. In general we could expose an API which looks destructive to the user but is actually pure under the covers.
I would like to track effects as an operator attribute so we can use this information during optimization. Roughly we will respect effects when transforming programs, for example no reordering over I/O or stateful computations.
@tqchen, as you mentioned transformer, I'm curious how Relay is going to handle this kind of workloads where shapes changes at runtime and can't be statically inferred. Is there a plan to support it in TVM and Relay?
@zheng-da in the standard way.
@tqchen, as you mentioned transformer, I'm curious how Relay is going to handle this kind of workloads where shapes changes at runtime and can't be statically inferred. Is there a plan to support it in TVM and Relay?
TVM support variable length input via symbolic variable, so in theory we could build op that takes in input shape (n, 128) where n is a symbolic variable. Relay also adopt this in type system that allows handle cases of fixed dimension but symbolic shape. How to do generic code gen is another question that we can followup, but the IR itself can handle shape inference of this kind
@jroesch I see. That looks cool.
Frontend
@tqchen and @junrushao1994 in some of other experimentation we have worked on rewriting imperative Python code into the IR directly. This should allow users to write imperative looking code which is translated into a form that is easy to optimize and perform automatic differentiation on. In general we could expose an API which looks destructive to the user but is actually pure under the covers.
Side effects, e.g. I/O and random monad
I would like to track effects as an operator attribute so we can use this information during optimization. Roughly we will respect effects when transforming programs, for example no reordering over I/O or stateful computations.
Do I understand it correct?
Suppose I have an operator that increases the size of all dimensions of the output by 1 compared with the input, i.e., if the input shape is (x, y), the output shape will be (x+1, y+1).
Now if one of the input shape is unknown, e.g., (n, 128), as you mentioned, the output shape will be inferred as (n+1, 129). In other words, the system can infer the size of all known dimensions and capture the relation of the unknown dimensions.
Now, I have this operator op described as above (it increases the size of all dimensions by 1) and use it in a while loop as follows:
i = 0
data = mx.nd.random.uniform((10, 10))
out = mx.nd.zeros((10, 10))
while (sum(data[i]) > 1):
out = op(out)
i = i + 1
Can the shape of out still be inferred, maybe expressed in a symbolic way?
+1 for incorporating context or target into the type system, so that it can directly support heterogeneous runtime.
Shall we provide an approach to convert RelayIR to graph representation (if it can)? I'm thinking about passing subset to accelerators like TensorRT.
@zheng-da If we convert your code to a functional one, yes, this is called shape generic?
Do I understand it correct?
Suppose I have an operator that increases the size of all dimensions of the output by 1 compared with the input, i.e., if the input shape is (x, y), the output shape will be (x+1, y+1).
Now if one of the input shape is unknown, e.g., (n, 128), as you mentioned, the output shape will be inferred as (n+1, 129). In other words, the system can infer the size of all known dimensions and capture the relation of the unknown dimensions.Now, I have this operator op described as above (it increases the size of all dimensions by 1) and use it in a while loop as follows:
i = 0 data = mx.nd.random.uniform((10, 10)) out = mx.nd.zeros((10, 10)) while (sum(data[i]) > 1): out = op(out) i = i + 1Can the shape of out still be inferred, maybe expressed in a symbolic way?
@junrushao1994 I don't know what it means. Could you show me how the shape of out is expressed? Also, how is the shape used after it's inferred?
@zheng-da Sorry for making you confused. There are two steps, the first step is to convert the code to a purely functional one, which means you use pattern matching + recursion to substitute the loop. The second step is to look at the function, then you will see the function that represents the loop body is generic.
@junrushao1994 I don't know what it means. Could you show me how the shape of out is expressed? Also, how is the shape used after it's inferred?
@zheng-da It is called MGU (@jroesch correct me if it is not). The shape of out is called IncompleteType somewhere in the code. (I briefly glanced through the code, but didn't remember the exact name) A simple union-find set perfectly solve this problem.
@junrushao1994 I don't know what it means. Could you show me how the shape of out is expressed? Also, how is the shape used after it's inferred?
In my unstanding, type inference was designed to understand things in compile time, so in the case of random and dimension expansion,it is impossible to decide the final dimension and inference will likely return things like incomplete type, or use shape of node that defers things to runtime. The fixed dimension symbolic shape case was the most common one that we can still take benefit from such static info
Agreed. The fixed-dimension symbolic shape is very useful. I think mxnet can greatly benefit from it. Could you point me to the code in TVM that does it?
I think my original question was whether there is a plan of supporting the case that the shape really can't be inferred in TVM and relay? For example, in mxnet, I'm thinking of doing something like this: https://github.com/apache/incubator-mxnet/pull/12400. Of course, my solution is hacky.
Supporting runtime dynamic shape is really a problem of the runtime, since in terms of high level ir, we can always introduce a type(incomplete or dynamic) that indicate it can be anything. And we will need to add JIT in runtime to build and run the stages when we have more information
@tqchen Don't think it is a big deal for the runtime if we support only PackedFunc wrapping libs like cuDNN. Many passes could be represented using a sparsely conditional constant prop, or other very mature compiler techniques.
However, if we want auto-tuning in such scenario, it could be cutting edge research.
Supporting runtime dynamic shape is really a problem of the runtime, since in terms of high level ir, we can always introduce a type(incomplete or dynamic) that indicate it can be anything. And we will need to add JIT in runtime to build and run the stages when we have more information
To follow up on @yzhliu 's comment on whether ctx should be part of type system.
We don't have to enforce everything as part of type in order to do such optimization. Context assignments(or machine assignments) in distributed setting can also be presented in column meta data(like in NNVM). We have quite a lot of cases like this: alterative data layout, distributed machine assignements etc.
The possible pros/cons of the type system vs the additional metadata are
So in the case of shape vs context
Because of this reason, we can argue that context is preferred not as part of the type, but more like a metadata of say function or a call.
@masahi I think in here low-level IR refers to the tensor expression part of TVM, including autoTVM, topi, compute primitives.
@tqchen thanks, makes sense. Those are in turn based on HalideIR, so in some sense HalideIR is the foundation for everything.
@tqchen Another question is how to integrate with some backward libraries in Relay. Maybe this isn't really a Relay question, but it's something we need to consider after TVM moves to Relay. I suppose Relay is good at pattern matching. Is it easy to take out the matched pattern and put it somewhere (maybe in an operator) to invoke TensorRT? How do you think about supporting stateful operators, both from the perspective of Relay and TVM? Having a stateful operator may be easier for us to integrate with TensorRT.
@zheng-da Can you be more specific about âgood at pattern matchingâ?
@tqchen Another question is how to integrate with some backward libraries in Relay. Maybe this isn't really a Relay question, but it's something we need to consider after TVM moves to Relay. I suppose Relay is good at pattern matching. Is it easy to take out the matched pattern and put it somewhere (maybe in an operator) to invoke TensorRT? How do you think about supporting stateful operators, both from the perspective of Relay and TVM? Having a stateful operator may be easier for us to integrate with TensorRT.
@junrushao1994 actually, I don't know. I guess Relay should be able to do pattern matching. One example of pattern matching is to find a set of operators that can be fused in TVM.
@zheng-da NNVM can already do operator fusion. TVM supports cuDNN offload out of the box. This tutorial maybe helpful.
I think Da's asking similar question as mine, since TensorRT eats a graph, if we extract a subset of Relay IR and pass it to TensorRT for accelerating, then an intermediate graph representation is required.
@zheng-da I don't think it is a problem to find out operators meet a specific requirement - we can still traverse the Relay AST.
@zheng-da I got what you mean. I was confused because âpattern matchingâ typically refers to conditional statements in FP.
I think @zheng-da is right. This is not a Relay question. I would say it is pretty trivial to do by traversing the IR and extract whatever TensorRT supports,as @yzhliu suggests. However, there are two things we should keep an eye on:
1) I am not a big fan of a stateful operator (and I believe nobody is). We should try to separate their state out as the arguments, and return the new states back after computation.
2) We should be careful with infinite recursion.
@junrushao1994 actually, I don't know. I guess Relay should be able to do pattern matching. One example of pattern matching is to find a set of operators that can be fused in TVM.
@yzhliu I think what you are trying to say is âwe need an IR converterâ. Yes, that makes a lot of sense. Ideally, there should be a bridge IR to which every DL framework converts themselves, and from which the low-level lib provider writes a pass to convert to their own IR. But such thing does not seem to exist yet (or ONNX?).
We could definitely make it more systematic though, but it does not seem that necessary for now, because only a very small number of low-level lib consumes a graph.
As for integrating operator-level libraries like cuDNN, this is never a problem in TVM...
I think Da's asking similar question as mine, since TensorRT eats a graph, if we extract a subset of Relay IR and pass it to TensorRT for accelerating, then an intermediate graph representation is required.
@zheng-da I don't think it is a problem to find out operators meet a specific requirement - we can still traverse the Relay AST.
@tqchen @jroesch Let's talk about introducing data structures like lists, dict. (I re-structured the stuff I wrote yesterday)
Solution A: use confluently persistent data structure. (immutable)
There is theoretically no difficulty in implementing functional data structures like functional List or functional Map. But we may have two concerns:
1) overhead for being functional. For example, I often write the functional Map using merge/split Treap, which maintains an expected O(log n) time & space complexity, but my C++ implementation is likely to be 4x - 10x slower than non-functional ones. If we have a good runtime, the latency should somewhat be hidden, but I am not sure.
2) Another thing is if we introduce immutable lists / dicts, the semantics of list and map would be changed. But of course, we could force users use things like tvm.immutable_list, and declare "we don't want you guys to use Python list/map".
Solution B: let's leave it dirty. (mutable)
There are also two concerns.
1) In my opinion, we should try to avoid any tracing-based autodiff, otherwise it loses both the elegancy of our design and part of the meaning of being functional.
2) It leaks side effect everywhere. To avoid this, it is possible to slice the graph. But it is relatively bad idea especially when it is inside a loop (see the self-attention example), because it discourages a lot of global optimizations.
I feel that this part is tightly entangled with runtime though...
Optimization for deep learning kernels (vectorized) seems totally different from that for scalar operations (e.g. +, *, /, ->, etc). I am not an expert, but please allow me to define two terms to distinguish these two kind of optimizations:
Note that kernel optimization is a superset of scalar optimization by definition, but here let's assume kernel optimization refers to the part that is not scalar.
As far as I could tell,
However, by prioritize instructions, optimization in speed is also possible, when control flow exists or in the distributed setting.
One good thing for a purely FP is liveness analysis is pretty trivial. I would prefer just to add a tag DEAD(some-memory-chunk) to inform the runtime.
There are several situations we need to consider:
1) An ordinary function, and no other functions are its arguments, no infinite recursion, no pattern matching: Memory preallocation is directly doable once all shapes are known.
2) An ordinary function, and no other functions are its arguments, no infinite recursion, but has pattern matching: This creates exponential number of combinations of memory footprint. We could build a bin estimator statistically analyzing the memory footprint.
3) A higher-order function, or a function could trigger self/mutable recursion: it is hard, let's not do such optimization for this kind of function.
Also, memory sharing is trivial across functions.
This is inspired by @tqchen's paper [1], but in a less elegant way. I also prefer to leave this to the runtime.
From the NLP community, it has raised lots of concern to detect contiguous memory allocation. Again, I would love to take self-attention as the example, which is widely used in generating sequences of better quality in machine translation, question answering, etc.
def self-attentive-generator():
initialize states
initialize outputs = []
while True:
prev_outputs = concat(outputs)
context_vector = self-attention(prev_outputs, inputs, state)
step_output, states = DecoderStep(states, prev_outputs)
outputs.append(step_output)
if some_condition:
break
outputs = concat(outputs)
return outputs, states
At each step, we have a concat, which produces a new chunk of memory. Unfortunately, this memory could not be released, because we probably are going to back-propagate through it. This causes quadratic memory consumption.
I think it is worth mentioning that it is not a corner case, but seems to me a trend that many practitioners now knows it works, and they wants to add this seemingly free lunch to their sentence generation model.
It is possible to optimize this, as long as we could detect this memory allocate pattern. Fortunately, yes, it is doable in this IR.
This optimization is pretty useful, but not easy to implement under the current IR.
For example, a rank-n Tensor is np.swapaxes, and subsequently a reshape is called. Numpy will check if it is possible not to do zero-copy in reshape. However, it seems that we did not expose such thing in current level of IR because it requires ndarray.strides.
@tqchen As I suggested in my proposal, we should expose everything in some level of IR.
Scalars are currently represented using rank-0 tensors, which is unified under current IR. However, it remains a question to me whether we want to do this by like launching a kernel to some GPU stream, or MKLDNN stream, then waiting for callback, merely for computing scalars like a + b.
I would propose to do program slicing in the lower-level IR to distinguish scalar and kernel operations, then grant scalar operations to LLVM RTC for optimization.
This is similar to the previous section, that in a lower-level runtime, we would decouple these three things into 3 program counters, do speculative execution over some PCs, in order to let kernel operations launch continuously, in this case, launching gap will be fully eliminated.
I don't think it is doable in current version of TVM, but it could bring significant speed improvement when viable. For example, fixing weights of a RNN cell into some registers, make sure they aren't spilled.
This optimization is also related to pipelining instructions layer by layer in inference. Let's leave it to f future work...
[1] Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016). Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
@junrushao1994 I am the main designer/implementer of relay's automatic differentiation system.
In general, doing non-tracing based reverse mode automatic differentiation on arbitrary lambda is extremely hard. There is only one paper (Reverse Mode AD in a functional framework) that does it, which work by traversing the reflected program, is complicated, and is untyped.
We might be able to type it, but it will bring a huge source of complexity, and optimizing on reflection will not be easier then optimizing trace. So, we actually use a tracing based approach, which is very similar to (Demystifying Differentiable Programming), except we do not use continuation, only Mutation.
IMO as there is already effect everywhere (random, IO in reinforcement learning, mutation in NLP, distributed training) etc, the problem is less of 'whether there should be effect or not', and is more 'how should we capture effect? Monad or Eff-like effect system or doesnt at all (as in OCaml/SML), only in static analysis?' I do agree that it is a problem in it's own right, but I think some notion of effect is inevitable.
Back to your particular problem, I think there is a 'best of both world solution'.
introduce a type Ref a. It mean a pointer pointing to a, which can change it's content. the pointer cannot change what it point to though (albeit it can be achieved with Ref(Ref a)).
There is 3 function on Ref.
MkRef : a -> Ref a
GetRef : Ref a -> a
SetRef : Ref a -> a -> ()
and possibly, updateRef : Ref a -> (a -> a) -> (), which is atomic.
introduce effectless list/dict.
translate python list into Ref(List a).
in the compiler, add special hook for Ref(List a), and use custom mutable datastructure.
we can also change list a to mutable one(in compiler) if it is not being shared.
I think this address the (1, 2) in solution A, and the previous paragraph address (1) in solution B.
Let's talk about (2) B.
I do agree that reference hinder optimization. However, so does reflection - which is the only other way for higher order reverse mode differentiation on higher order function. I also postulate that with constant folding, the reference can be optimized away when the control flow is known. It will only exists at the boundary of unknown function call. If some variable are only used locally, never leaked outside, and their usage does not vary to the control flow, they should not generate Ref.
Of course, it is only a postulation at this point, but we also has a first order reverse mode automatic differentiation algorithm implemented, with no wengert list at runtime. The down side is that it does not work with control flow. We can always add special case to make sure no Ref is generated here, to achieve better speed.
Also IMHO we are pondering into the future too far ahead. AFAIK No one know how will reference, data-structure, tensor, ad, play together, when we try to compile efficient code on GPU. I think we should hold such design decision until much later phase, when we have a clearer picture.
Does I address your question?
@tqchen @jroesch
I guess I have done the optimization part. I believe most of the optimization techniques for deep learning workloads that I could up with could be somehow covered by using in RelayIR, thanks to the purity of FP. Could you guys give some feedback about this?
I am really interested in how you guys could implement a low-level runtime environment, because I didn't see the design in the Relay paper. Could you guys kindly share some information with me?
WRt to optimizations, we should try to push most optimization to compiler and leave runtime lightweight.
Scalar slicing and fusion. This is likely can be done already in relay, by slice out 0 rank tensor and generated a fused function to compute them, only one memory store is necessary and they act similarly as infer shape.
Reshape opt is not as important, usually inplace type memory optimization is not as important as long as there is memory reuse. Because the memory before reshape can get reused in the next stage. Compact memory is much better for speed optimizations
Concat, directly add to slice
This has something to do with custom calling convention, when building a relay function, there are multiple ways on how to handle calls. The tensor space can be caller allocated or callee allocated. And we can specify if there is a fused op for return value. For example we can support customized calling convention, like invoke this function and add the result to a preallocated array. Combining compile time analysis with customized calling convention likely can solve this problem
@MarisaKirisame Thank you so much for your very helpful comments! It does address most of my questions.
@junrushao1994 I am the main designer/implementer of relay's automatic differentiation system.
In general, doing non-tracing based reverse mode automatic differentiation on arbitrary lambda is extremely hard. There is only one paper (Reverse Mode AD in a functional framework) that does it, which work by traversing the reflected program, is complicated, and is untyped.
We might be able to type it, but it will bring a huge source of complexity, and optimizing on reflection will not be easier then optimizing trace. So, we actually use a tracing based approach, which is very similar to , except we do not use continuation, only Mutation.
IMO as there is already effect everywhere (random, IO in reinforcement learning, mutation in NLP, distributed training) etc, the problem is less of 'whether there should be effect or not', and is more 'how should we capture effect? Monad or Eff-like effect system or doesnt at all (as in OCaml/SML), only in static analysis?' I do agree that it is a problem in it's own right, but I think some notion of effect is inevitable.
Back to your particular problem, I think there is a 'best of both world solution'.
introduce a type Ref. It mean a pointer pointing to a, which can change it's content. the pointer cannot change what it point to though (albeit it can be achieved with Ref).
introduce effectless list/dict.
translate python list into Ref.
in the compiler, add special hook for Ref, and use custom mutable datastructure.
we can also change list a to mutable one(in compiler) if it is not being shared.I think this address the (1, 2) in solution A, and the previous paragraph address (1) in solution B.
Let's talk about (2) B.
I do agree that reference hinder optimization. However, so does reflection - which is the only other way for higher order reverse mode differentiation on higher order function. I also postulate that with constant folding, the reference can be optimized away when the control flow is known. It will only exists at the boundary of unknown function call. If some variable are only used locally, never leaked outside, and their usage does not vary to the control flow, they should not generate Ref.Of course, it is only a postulation at this point, but we also has a first order reverse mode automatic differentiation algorithm implemented, with no wengert list at runtime. The down side is that it does not work with control flow. We can always add special case to make sure no Ref is generated here, to achieve better speed.
Also IMHO we are pondering into the future too far ahead. AFAIK No one know how will reference, data-structure, tensor, ad, play together, when we try to compile efficient code on GPU. I think we should hold such design decision until much later phase, when we have a clearer picture.
Does I address your question?
One thing to keep in mind is that we need to codesign things, instead of simply think in terms of high level ir, for example, the function slicing likely have things to do with what low level code generator and hw has to offer. So most low level info need to be registered in high level and we need embed meta data to reflect certain info
@tqchen I also totally agree that we should give most of optimization work to the compiler and keep runtime light. But I actually also have a concern about the optimization passes which is probably more related to NNVM. There are already many passes and the number is expected to continue growing. As the number passes grows, I think it would be beneficial to have a more systematic way to manager them.
For example, I am thinking if it makes sense to introduce something like a "PassManager" (as in llvm) to maintain the passes. PassManager may provide the following things.
@[email protected] junrushao1994@gmail.com I don't know many
external libraries for deep learning, but I can name a few: TensorRT,
nGraph, which requires graph-level integration. As far as I know, both Intel and Nvidia are developing more
graph-level libraries. I think it's pretty common. I think we should think
about this problem.
As for stateful operators, I wonder what is the other option. If we
separate the state and pass it as an input argument, the data structure (it
can contain any arbitrary data required by the external libraries) might be
pretty complex. It doesn't seem to me that Relay can handle this kind of
data structure.
On Thu, Aug 30, 2018 at 3:39 AM Junru Shao notifications@github.com wrote:
@yzhliu https://github.com/yzhliu I think what you are saying is âwe
need an IR converterâ. Yes, that makes a lot of sense. Ideally, there
should be a bridge IR that every DL framework converts their IR to the
bridge IR, and the low-level lib provider writes a pass from the bridge IR
to their own IR. But such thing does not seem to exist.We could definitely make it more systematic though, but it does not seem
that necessary for now, because only a very few number of low-level lib
consumes a graph.â
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dmlc/tvm/issues/1673#issuecomment-417274015, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAETUVTEugZ4-2II2ws5iodl-qukmWsgks5uV8DigaJpZM4WRCFP
.
@zheng-da these are just trivial engineering choices, and could you name a specific thing that relay could not handle? @tqchen could you comment on this?
@[email protected] junrushao1994@gmail.com I don't know many
external libraries for deep learning, but I can name a few: TensorRT,
nGraph, which requires graph-level integration. As far as I know, both Intel and Nvidia are developing more
graph-level libraries. I think it's pretty common. I think we should think
about this problem.
As for stateful operators, I wonder what is the other option. If we
separate the state and pass it as an input argument, the data structure (it
can contain any arbitrary data required by the external libraries) might be
pretty complex. It doesn't seem to me that Relay can handle this kind of
data structure.
This is a good chance to look at data layout system. I think @yzhliu is currently working on refactoring layout in TVM: https://discuss.tvm.ai/t/datalayout-structure/80
To enable graph level optimization, every operator will require layout information. Maybe we can considering adding it Relay type system.
@junrushao1994 when i look at the type system (in the Relay paper), it supports Base type, shape, Tensor, function, type, reference, tuple. Do you suggest representing the data structure for any arbitrary external library with the Relay type system? For example, MKLDNN requires some data structure like mkldnn::memory::primitive_desc. It's a class that contains std shared_ptr. It's probably doable to store this data structure in Relay, but it might be more convenient to support something like OpaqueType for arbitrary operator states.
The other problem is that these external libraries may change the state after each invocation. However, we don't know if they really change or how they change the states. Therefore, the operator can't be pure functional. Does Relay need to deal with it?
@zheng-da Relay has some notion to track effects, so why not you guys put these arbitrary stuff inside something like a PackedFunc?
Update: as @MarisaKirisame mentioned, I am wrong. Please just ignore this reply.
@junrushao1994 when i look at the type system (in the Relay paper), it supports Base type, shape, Tensor, function, type, reference, tuple. Do you suggest representing the data structure for any arbitrary external library with the Relay type system? For example, MKLDNN requires some data structure like mkldnn::memory::primitive_desc. It's a class that contains std shared_ptr. It's probably doable to store this data structure in Relay, but it might be more convenient to support something like OpaqueType for arbitrary operator states.
The other problem is that these external libraries may change the state after each invocation. However, we don't know if they really change or how they change the states. Therefore, the operator can't be pure functional. Does Relay need to deal with it?
@kevinthesun Hey Yao, could you kindly share more thoughts about what information you think must be put into the type system? It will be very helpful!
This is a good chance to look at data layout system. I think @yzhliu is currently working on refactoring layout in TVM: https://discuss.tvm.ai/t/datalayout-structure/80
To enable graph level optimization, every operator will require layout information. Maybe we can considering adding it Relay type system.
@junrushao1994 we hasnt settle on any effect system yet. We would love to know what actual use case is there!
@MarisaKirisame This is the concern raised from colleagues working on external library integration, as mentioned by @zheng-da.
@junrushao1994 we hasnt settle on any effect system yet. We would love to know what actual use case is there!
@junrushao1994 For NNVM/Relay, the layout information is mainly used to insert layout transformation op when necessary. Currently this is achieved by FCorrectLayout attribute.It's like an "InferLayout" attr. We might want to preserve the latest valid layout of each op, so that we can easily fall back to last valid layout when the new layout pass is illegal for some ops. The logic should be similar to current NNVM implementation, but we might be able to better manage it in Relay.
@kevinthesun IMO the concern is valid, @tqchen what do you think?
@junrushao1994 For NNVM/Relay, the layout information is mainly used to insert layout transformation op when necessary. Currently this is achieved by FCorrectLayout attribute.It's like an "InferLayout" attr. We might want to preserve the latest valid layout of each op, so that we can easily fall back to last valid layout when the new layout pass is illegal for some ops. The logic should be similar to current NNVM implementation, but we might be able to better manage it in Relay.
Let us hear the opinion and possible proposals from everyone in the community :) at this moment for things that are in flux we donât necessarily need decisions. Instead we would love to see clear actionable options
Hi. Very interesting discussion we have. I have some questions related maybe more to the Relay paper than to this exact RFC. AFAIK they are closely connected, but please let me know if we have a better place to discuss the article.
I'm glad that you are going to employ Python typing facilities! In the article, type specifications look like def lenet(x: Tensor[Float, (1, 28, 28)]) -> Tensor[Float, 10]. Since python doesn't specify the exact typechecking algorithm, I assume we are going to extract type information from Python and pass it to C++ core to do shape math during graph compilation passes. Am I correct with that?
Next, If we use other typing tools (probably mypy) in the same project, then two things are going to happen: 1) Mypy will see Relay types while performing before-runtime typechecking. 2) Relay will see mypy types at runtime.
AFAIK, mypy will not accept Relay types as they are, it needs some stubs/definitions or maybe it needs to know which places in code to avoid looking at. Do you plan to maintain some degree of toolset compatibility?
Another thing I am concerning is autobatch (https://arxiv.org/abs/1705.07860), which is an important feature that could potentially speed up irregular workloads. This involves some sort of scalar arithmetics, indexing, concatenation stuff.
Relay could definitely represent these stuff easily. Just curious, if you guys want to support auto-batching, would you prefer to write it in frontend language, parse it out, and represent it with Relay, or write a low-level kernel or a PackedFunc?
Autobatching is probably best implemented as a transformation on Relay IR? Thatâs pretty much whatâs working for PyTorch.
On the other hand, dynet seems happy with putting it in the scheduler and batch whatever is available to run. I think whether IR transformation is better depends on the granularity of the graph, as well as the ability to discover reusable sub-graphs.
Oh I forgot, also depends on whether you need to keep the dependency graph for AD or not.
DyNetâs original approach (essentially a runtime scheduling pass) doesnât scale well with accelerators. But the next version of DyNet will likely use Cavs, which solves that by dramatically reducing the complexity of the scheduling problem at the cost of a new user-visible level of abstraction (an annotated static subgraph they call a vertex). In the long run a combination of IR transformation for SPMD-style batching and Cavs-like scheduling for more highly divergent cases seems like it might be the right approach.
Indeed, a combination of these two approaches will likely cover more cases.
A quick question, do you guys want to strictly enforce homo type of a list, or allow some generics. For example, will all tensors inside a list be of the same shape?
Update: seems like a silly question...
Very Probably. The generic across shape part will probably be done using existential type.
@grwlf I'm one of the people working on the Python frontend. Youâre correct. Type information is passed directly from the frontend to the C++ compiler. We can handle mypy type annotations in Relay code, since those annotations are encoded in the Python AST that we compile to Relay. Iâm not sure that Relay will ever encounter mypy types at _runtime_, but I may be missing something.
The Relay IR nodes (including types) are mirrored in Python via the TVM FFI, allowing us to construct Relay programs in Python. As a side effect, mypy can correctly handle some Relay types; however, it wonât be able to handle more complicated types like tensors. PEP 484 included @no_type_check_decorator, which allows us to disable mypy type checking for code annotated with the relay decorators. In fact we check the frontend with mypy, so you should expect good compatibility between Relay and mypy.
Great feedback everyone. Sorry for being absent the last couple of days, right after I shipped the initial version of the PR I had to move apartments and have been hastily moving things from one apartment to the next and trying to get my desk set back up so I could work đ.
I want to address a bunch of the feedback in order.
This is referring to @zheng-da, @tqchen and @junrushao1994's line of questions and discussions.
The type system design is general enough to support writing the functions you. For example we can write down a functions like: (n, s, k) -> (s + 1, k * n) (where n, s and k are symbolic) the problem is if we want to compile that function ahead of time without knowing n, s, and k. We need to generate code which works over arbitrary dimensions. In the current system we can still write such code we just rely on knowing n, s, and k at compile time.
Fortunately in deep learning many input dimensions are known at compile time allowing us to efficiently compile that code today.
The bigger issue is typing a loop which applies that function an unknown number of times. If you can convert the code to functional form like @junrushao1994 suggested then its possible to assign a symbolic relationship between the loop condition and the type. Realistically many of these need to be dynamic and require us to introduce a concept of an unknown dimension (which we do not currently support) and makes code generation significantly more complex if we want efficient kernels.
It is straight forward enough to extract a graph from Relay, I'm working on porting the Relay to TVM compiler to the newest version of Relay on the currently open PR. For example in the pure data flow case you would only accept Let and Call nodes, a program containing anything else would be rejected (this is roughly a basic block). I think extraction of graphs would be relatively easy to setup, and ideally the next generation runtime would allow us to generate and schedule code like this.
There was some discussion about ONNX and a unified interchange format but the problem is that ONNX is insufficiently expressive. For example PyTorch can't export dynamic graphs of any form due to ONNX limitations. There are also issues like full support for broadcasting operations in the IR.
I think data structures are interesting if we could assign them semantics ourselves I think optimization of them will be much easier, for example if we use data structures in a linear
fashion it is possible to heavily fuse/elide allocations. I think we should follow up on this
after the initial IR lands.
In terms of doing lower level optimizations my vision has always been that most users will interact with the IR before the "suffix" optimizations are applied. That is the user's perception will be of a high level functional graph, mostly free of effects, and can be reasoned about algebraically like most computation graphs. The "suffix" phases will do things like concretize the operators, do memory planning, convert from functional operations to destructive ones, etc. and then we will generate low level VM instructions/LLVM and a set of customized TVM operators. This is the place where we could break the uniform constant representation and schedule scalar code differently, etc.
@kevinthesun We have talked about extending typing with data layout but it wasn't clear how to insert it into the type system. Last time @tqchen and I talked we were planning on storing it as an attribute. My thought was to have a layout pass which inserts explicit layout change operators everywhere and then tries to minimize the number of layout transforms, could imagine employing some search here, for example maybe its good to keep some data around in both layouts? would be interested in hearing more.
@zhiics The older version of Relay had a basic version of the pass manager. I think one really interesting use case for it is learning pass ordering. Pass ordering traditionally in compilers has been something that is relatively fixed, can have substantial impact on performance, and the optimal ordering is often application specific. I think having the ability to apply auto-tvm style techniques here would be great, especially as the number of passes grow.
We have discussed implementing auto-batching as a compiler transform, but in general it sounds like a mixture of scheduling and program transform makes sense. It seems like there are many examples of this static/dynamic split in CS and especially compilers. One simple example is stack-frame/stack, we can simply compute the amount of storage for a single frame but in order to support recursion we allow some of the work to happen dynamically. We have discussed a similar strategy for the runtime, and I could imagine trying to heavily optimize static sub-graphs where possible and then bucket then dynamically during runtime. I'm not an expert in auto-batching so I would love to hear more ideas from the community.
@joshpoll is right about the type annotations we have setup them up to not interfere with MyPy, if MyPy gained support to type numpy computations we could probably design annotations which work both in MyPy and Relay. The previous version of Relay (not the one in the PR) had the required stubs to make everything work.
@jroesch Thanks for the summary. The phase ordering problem is also one of my major concerns, but I haven't looked into the auto-tvm style technique. I will read it.
Can't stop myself from adding 2 cents about pass managers. TL;DR IMHO maybe passes of one kind do not need pass managers, but passes of other kind should have one.
I think (an obvious thing) that absence of pass manager may lead to hardly-detectable bugs during compiler development. I also think it is correct to separate passes which do actually change IR but do preserve the algorithm (typechecking of all sorts, may be loop unrolling) and passes which keep IR the same but change the program written in it (the ones which removes or inserts function calls, like fusing).
The passes which change IR but preserve the algorithm may really benefit from some kind of pass manager, better static than dynamic. I imagine, that In C++ this may be done by template metaprogramming on AST classes. In Haskell there are solutions employing fixed-point datatypes able attach or remove pass-specific details to AST datatype while keeping it statically-typed. Dynamic pass managers are simpler but should already make life easier, so it may be a compromise solution.
The passes which change algorithm but preserve IR should be the playground of autotuning. I don't have experience here, need to better learn this area. Maybe one should think of something like first-class passes, if it is ever possible. Generally, I believe that one should avoid playing with optimizations which can't be represented with a language because of troubles with formal verification in future. Thank you.
@jroesch Thank you for summary. Adding layout attribute and a layout pass should be able to deal with current use cases.
@jroesch As to the ONNX limitations, can you share more background or details why it cannot satisfy the requirements of NNVM/TVM? I'm really interested on this. Thanks in advance!
@jroesch @MarisaKirisame What will the interface for defining gradients for operations look like? Currently in nnvm there is the FGradient attribute which expresses the gradient in terms of other nnvm operations. Will it be something like this for Relay too, or may be the gradients will be expressed directly in terms of topi and tvm operations?
Currently we are working on AD on the TVM level, so we are thinking about the possibility of integrating it into Relay. In our solution we calculate gradients for tensors produced by tvm.compute primitive. Are you going to support elementwise construction of tensors like this in Relay? And if so, are you going to implement automatic differentiation for tensors defined this way?
@sgrechanik-h
Somewhat like FGradient, Relay gradient will be expressed as other Relay gradient for closure property (we can differentiate on differentiated expression for arbitrary times).
Suppose an operator is of type x -> y, we are planning to expose an api that let you define it's gradient as an expression (probably) of type (x -> (y, y -> x)). It might be changed to some other similar type for efficiency/composability reason, but there is no plan to release an api base on compute, as we have no idea of it's consequence on performance.
If you can show good performance number we are very happy to change our mind, as it is certainly more principled to define gradient once and for all!
@MarisaKirisame Thanks for clarification. It seems like there will always be operations for which automatic differentiation does poor job, and gradients should be defined manually for them, so I think this is the right approach anyway.
@jroesch could you please clarify the place of scheduling operations in Relay? Do we plan to allow users to specify schedulers manually or we mean automatic scheduling during some phases of Relay->TVM transformation?
I'm asking this because I am working on performance measurements for a large model, exported from TensorFlow to NNVM. I am facing the case where automatic schedulers were able to accelerate some parts of the model (~2x faster than TF on CPU), but for other parts they showed poor results. In order to figure out the cause I had to "reverse engineer" the NNVM graph back to sources. (Here is a draft code for GraphDef staging). I am not finished yet but it seems I will need to schedule some parts manually and AFAIK it is not that simple in NNVM. What about Relay? What would be the workflow for cases where manual scheduling is required?
Thanks for very helpful discussions, relay IR has been mainlined. Let us open new issues for discussing specific aspect of the development
@jroesch As to the ONNX limitations, can you share more background or details why it cannot satisfy the requirements of NNVM/TVM? I'm really interested on this. Thanks in advance!
Hello Jammy,
Have you figured out why onnx cannot satisfy the requirement of TVM? I also interested on this.
Thanks a lot!
@calvin886 ONNX was very different then it is today back in 2017 when we began work on Relay. Furthermore ONNX is designed as an interchange format and has very little of the utilities that come with compiler IRs, these days Relay has many more extensions then ONNX including closures, data structures, traditional control flow (not pseudo ops), JSON serialization, a textual format, a human readable and parseable interchange, deep TVM support, strong Python support and frameworks for doing program analysis and pattern matching. We also have made several different design decisions such as unifying functions and graphs, unifying parameters and inputs, and so on.
You can read the various Relay papers to learn more about the specific advantages.
Most helpful comment
Thanks @jroesch for the proposal, I am going to elaborate some of my take on this proposal. Note that no design is perfect and that is why we need help of everyone to work together the evolve the IR.
Specific Technical Points
Some Possible Point to Discuss
These are things that pops up from my head, feel free to add more.