This was mentioned in https://github.com/dmlc/tvm/issues/2122#issuecomment-439581161; I also asked about it here a while back.
Basically I'd like to implement a compiler pass that replaces concat operations with simple slicing operations, where:
out[0:1] = concat(f1(), f2())
is replaced with
out[0] = f1()
out[1] = f2()
This can provide large improvements in some cases, e.g. DenseNet, which goes from having O(n^2) memory copies to having O(n) memory copies. It also reduces memory requirements in low-memory deployment scenarios. It can be implemented by partially specializing compute kernels over data size.
I'd like to implement this pass (partly just to get more used to working inside TVM), but I'm not exactly sure where in the stack this should go. It's somewhat backend-dependent, and AFAICT Relay can't really represent slicing in this way (although I might be wrong). Maybe a pass can be added that annotates data-movement operations, and backends can consume those annotations?
A place to start might be just implementing this in the graph runtime. If someone could provide some guidance that would be helpful ( @tqchen ?)
@kazimuth Probably you could write a Relay pass. Should be easy if it is intraprocedural.
AFAICT Relay can't really represent slicing in this way (although I might be wrong).
It's true. But I guess the more principled approach would be "let relay support such thing"...
Does it help to add a pass to analyze “buffers that should be allocated continuously”, and then hint the runtime to do this?
Does it help to add a pass to analyze “buffers that should be allocated continuously”, and then hint the runtime to do this?
I think this makes sense. How do I hint the runtime though? Should I just add a new Relay node / a field to one of the node types, or is there a more principled way to annotate a Relay AST?
Should be easy if it is intraprocedural.
It will definitely be to start. It's the sort of analysis that's easy to start small with and expand later.
We will likely need changes in both compiler and runtime.
What possibly need to happen is to have a special annotation in the relay IR and do special codegen in the runtime. It also requires us to enhance the runtime semantics, to allow specification of slice.
@tqchen Didn't get it why we need runtime codegen. In the case of intraprocedural opt, it would be fine to just hint the runtime to alloc contiguous memory and then skip concat operator
What I mean is that runtime need to be aware of the memory layout and provide out[slice] = f(inputs). Another possible "obstacle" is that TVM's compute kernel requires the buffer to be somewhat aligned, and we need to generate a special kernel for out[slice] = f(inputs), with a known offset(so we still benefit from good alignment). This is necessary for OpenCL
Shouldn't title be "Slice on LHS"?
Also @tqchen what's a good way to add that annotation?
I have not yet put very deep thoughts how can we generate annotation, this is something that worth some thoughts and discussion
Once you have liveness analysis, you could detect when some vars die on an instruction, while another var is created. Let call it substitution. To optimize concat, we could detect substitution on concat instruction, and extend the liveness of vars who previously die on this instruction.
Most helpful comment
What I mean is that runtime need to be aware of the memory layout and provide out[slice] = f(inputs). Another possible "obstacle" is that TVM's compute kernel requires the buffer to be somewhat aligned, and we need to generate a special kernel for
out[slice] = f(inputs), with a known offset(so we still benefit from good alignment). This is necessary for OpenCL