Dali: [Question on Design of DALI] Meaning of Tensor/TensorList and Different types of Workspace

Created on 6 Jul 2018 · 15 Comments · Source: NVIDIA/DALI

Hello!

I've noticed the comments in the headers of the Workspace classes (SampleWorkspace, DeviceWorkspace, etc.) and the corresponding OpType (CPU, GPU, etc.).

My question is:
a) Does a TensorList always represent a "Batch", while a Tensor represents a "Sample"?
b) Is it on purpose that an Op running on the CPU processes ONLY one Sample per "Run", while an Op running on the GPU processes a whole Batch per "Run"?
And what about "Mixed" Ops?
c) Is it true for the pipeline that:
*at the beginning of the pipeline only one Sample is processed at a time;
*at some point in the pipeline Samples are assembled into a Batch (so this is the purpose of Mixed Ops);
*after assembly, a Batch can no longer be split into isolated Samples?
d) Finally, could you please provide a brief overview of DALI's architecture and explain some important classes in more detail?

gzygzy9211

Most helpful comment

To add to Simon's answer:

Tensor does not necessarily have to be a single sample - for example Support ops (which currently are only used for random number generation) output a Tensor with batch_size number of elements. The more fundamental difference between Tensor and TensorList is that Tensor has a single well defined shape, whereas TensorList is a collection of Tensors (contiguous in memory), where each can have its own shape (so TensorList is a "jagged" tensor).
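
A minimal sketch of the "jagged" layout described above, written in plain Python rather than with DALI's own classes. `JaggedTensorList`, `sample` and the field names are hypothetical and only illustrate the idea of one contiguous buffer plus a separate shape per sample:

```python
from math import prod


class JaggedTensorList:
    """Toy stand-in for a TensorList: one flat buffer plus per-sample shapes."""

    def __init__(self, samples):
        # samples: list of (shape, flat_values) pairs; shapes may differ.
        self.shapes = []
        self.offsets = []
        self.data = []  # single contiguous buffer shared by all samples
        for shape, values in samples:
            if len(values) != prod(shape):
                raise ValueError("values do not match shape")
            self.shapes.append(tuple(shape))
            self.offsets.append(len(self.data))
            self.data.extend(values)

    def sample(self, i):
        """Return (shape, flat_values) for sample i."""
        start = self.offsets[i]
        size = prod(self.shapes[i])
        return self.shapes[i], self.data[start:start + size]


tl = JaggedTensorList([
    ((2, 3), [1, 2, 3, 4, 5, 6]),  # a 2x3 sample
    ((1, 2), [7, 8]),              # a 1x2 sample with a different shape
])
shape, values = tl.sample(1)
```

A plain Tensor, by contrast, would need only a single shape for the whole buffer.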

On the general architecture of DALI: we gave a talk during this year's GTC about DALI, and it contains answers to at least some of those questions: recording, slides - this reminds me that we should add those links to the README.

Also on general documentation - I'm currently working on enabling better API docs and building documentation with Sphinx. At first it will be mostly about the Python API, but it is a step towards more developer-focused documentation as well.

ptrendx on 6 Jul 2018

👍2

All 15 comments

a) TensorList is a collection of samples stored contiguously in memory. Tensor is a single sample.

b) A CPU op only processes one sample at a time, but we run multiple CPU ops in parallel using a thread pool. Mixed ops are designed for the case where the input is single samples on the CPU, but the output is batched, either on the CPU or (more generally) on the GPU. The nvJPEG decoder matches this pattern, as does MakeContiguous, which takes in multiple Tensors and converts them into a contiguous TensorList on either the CPU or the GPU.

c) You're correct on all 3 points. The beginning of the pipeline is always on the CPU (for now at least), and samples are processed individually using a thread pool for parallelisation. Then the batch is assembled (currently by either MakeContiguous or the nvJPEG decoder) and generally transferred to the GPU, where it is processed in batches from then on. After this point the data isn't split into individual samples again. You could process samples individually in GPU ops if you wanted to; it's just likely to be less efficient.
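
The three stages above can be sketched in plain Python. This is only an illustration under stated assumptions: `cpu_op`, `make_batch` and `batched_op` are hypothetical stand-ins, and the "batched" stage runs on the host here instead of on a GPU:

```python
from concurrent.futures import ThreadPoolExecutor


def cpu_op(sample):
    """Hypothetical per-sample CPU operator: doubles every value."""
    return [2 * v for v in sample]


def make_batch(samples):
    """Hypothetical batch assembly: one flat buffer plus per-sample sizes."""
    flat = [v for s in samples for v in s]
    sizes = [len(s) for s in samples]
    return flat, sizes


def batched_op(batch):
    """Hypothetical batched operator: touches the whole buffer at once."""
    flat, sizes = batch
    return [v + 1 for v in flat], sizes


samples = [[1, 2, 3], [4], [5, 6]]
with ThreadPoolExecutor(max_workers=2) as pool:
    # map() preserves input order, so sample i stays at position i.
    processed = list(pool.map(cpu_op, samples))

# From here on the data is handled as one batch, never as single samples.
batch = batched_op(make_batch(processed))
```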

For a 10k ft description of the architecture:

A Pipeline has some number of Operators added to its graph. These operators can work on different combinations of devices (CPU, GPU) and different layouts (single sample, batched). We've explicitly restricted the dataflow to be CPU -> GPU (so you can't do work on the GPU, pass it back to the host and do more work there). Doing this makes the execution of the graph (more on this in a minute) much simpler, and moving data between CPU and GPU is slow, so we don't want it to be done more than absolutely necessary.
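
To make the CPU -> GPU restriction concrete, here is a small sketch of a graph check in plain Python. It is not DALI's actual validation code; `check_dataflow`, the op names and the graph representation are all assumptions made for illustration:

```python
def check_dataflow(devices, edges):
    """Raise if any edge passes data from a GPU op back to a CPU op.

    devices: dict mapping a hypothetical op name to "cpu" or "gpu"
    edges:   list of (producer, consumer) op-name pairs
    """
    for producer, consumer in edges:
        if devices[producer] == "gpu" and devices[consumer] == "cpu":
            raise ValueError(
                f"{producer} (gpu) -> {consumer} (cpu) moves data back to the host"
            )


devices = {"reader": "cpu", "crop": "cpu", "resize": "gpu", "host_op": "cpu"}

# Allowed: data only ever moves from CPU ops towards GPU ops.
check_dataflow(devices, [("reader", "crop"), ("crop", "resize")])

# Not allowed: a GPU result fed into a CPU op.
try:
    check_dataflow(devices, [("resize", "host_op")])
    rejected = False
except ValueError:
    rejected = True
```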

Pipelines are then run by an Executor. This handles data moving through the graph and the execution of all operators in the graph. There are several Executors with slightly different behaviours, but in general all CPU operators are run in a thread pool, then all GPU operations are run on a CUDA stream.

slayton58 on 6 Jul 2018

👍2

Thanks for answers from @ptrendx and @slayton58.

Maybe I should close this issue for now and reopen it if I have any further questions?

gzygzy9211 on 6 Jul 2018

Or we could leave this open until we update the README with the links @ptrendx mentioned at least?

cliffwoolley on 6 Jul 2018

So I will leave this issue open until then.
Feel free to discuss here @ everyone

gzygzy9211 on 6 Jul 2018

Another question:
How should I understand AllowMultipleInputSets in the definition of an Op?

Is it true that AllowMultipleInputSets means the Op can only take ONE input and produce ONE output, and that otherwise the Op can take multiple inputs and produce multiple outputs?

gzygzy9211 on 7 Jul 2018

AllowMultipleInputSets allows identical transforms to be applied to multiple inputs.

As an example, say I have an image and its segmentation mask. We may want to randomly crop the image, and an identical crop needs to be applied to the mask. This means that the random number(s) generated for the crop must be the same within each (image_i, mask_i) pair in the batch (but different from those used for any other (image_j, mask_j) pair). AllowMultipleInputSets() in an operator definition means that it supports this kind of operation.
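
A sketch of that behaviour in plain Python, assuming 2D lists as images; `random_crop_sets` is a hypothetical helper, not a DALI operator. The crop window is drawn once per sample index and reused for every input set:

```python
import random


def random_crop_sets(input_sets, crop, rng):
    """Apply one random crop window per sample to every input set.

    input_sets: list of batches (e.g. [images, masks]); each batch is a list
                of 2D lists, aligned by sample index and equal in size.
    crop:       (height, width) of the crop window.
    """
    crop_h, crop_w = crop
    outputs = [[] for _ in input_sets]
    for i in range(len(input_sets[0])):
        h, w = len(input_sets[0][i]), len(input_sets[0][i][0])
        # Draw the window once for sample index i ...
        top = rng.randint(0, h - crop_h)
        left = rng.randint(0, w - crop_w)
        # ... and reuse it for every input set (image_i, mask_i, ...).
        for out, batch in zip(outputs, input_sets):
            rows = batch[i][top:top + crop_h]
            out.append([row[left:left + crop_w] for row in rows])
    return outputs


# Each mask is the negated image, so matching crops are easy to verify.
images = [[[r * 10 + c for c in range(4)] for r in range(4)] for _ in range(2)]
masks = [[[-v for v in row] for row in img] for img in images]
cropped_images, cropped_masks = random_crop_sets(
    [images, masks], (2, 2), random.Random(0)
)
```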

slayton58 on 7 Jul 2018

For an Op with AllowMultipleInputSets, it actually implements an operation

y=f(x)

where x and y are tensors. That is what I really meant by a single input and output. We can still provide multiple inputs to the Op with

y1, y2, y3 = Op(x1, x2, x3)

and what it really does is sequentially execute y1=f(x1); y2=f(x2); y3=f(x3). Here x1, x2, x3 may be the tuple (image_j, mask_1j, mask_2j), following your example, @slayton58.
However, some operations like

x, y, z=g(a, b, c)

in which x depends on all of the inputs a, b, c, and so do y and z, will never support AllowMultipleInputSets if implemented as an Op in DALI.

Did I get it right?

gzygzy9211 on 7 Jul 2018

About Op definitions.
Actually, both "Input" and "Argument" can be treated as inputs to the Op.
The difference is that an "Argument" can be provided when the Op is created, or by the Workspace at runtime. In the former case it is constant, while in the latter case it is variable.
An "Input", on the other hand, can only be provided by the Workspace.
So an "Input" tends to provide the data to be processed, while an "Argument" tends to provide configuration for how the data will be processed.
And in the Python interface, runtime-provided "Inputs" and "Arguments" are distinguished as positional arguments and keyword arguments in the Python call.
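
The distinction can be pictured with a toy class in plain Python. `ToyOp` and its `scale` argument are hypothetical and not part of DALI; letting a runtime argument override the construction-time value is also an assumption of this sketch:

```python
class ToyOp:
    """Hypothetical operator, not DALI's real class."""

    def __init__(self, **arguments):
        # Arguments given at construction time behave like constants.
        self.arguments = dict(arguments)

    def __call__(self, *inputs, **runtime_arguments):
        # Positional values are Inputs (the data to process).
        # Keyword values are Arguments supplied at run time; in this sketch
        # they take precedence over construction-time values.
        arguments = {**self.arguments, **runtime_arguments}
        scale = arguments.get("scale", 1)
        return [[scale * v for v in data] for data in inputs]


op = ToyOp(scale=2)
constant_result = op([1, 2, 3])          # uses the construction-time scale
runtime_result = op([1, 2, 3], scale=5)  # scale provided at run time
```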

Did I get it right?

gzygzy9211 on 7 Jul 2018

About arguments and inputs - basically yes.
About multiple input sets - not quite: a function with multiple inputs and outputs can also allow multiple input sets. In the Python API, y = f(x) would look like this:

y1, y2, y3 = f([x1, x2, x3])

(notice the list of arguments to f)
For a function that takes multiple inputs you would do something like this:

y1, y2, y3 = f([x1,x2,x3], [z1,z2,z3])

which would translate to

y1 = f(x1, z1)
y2 = f(x2, z2)
y3 = f(x3, z3)
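
The translation above can be mimicked in a few lines of plain Python. `call_with_input_sets` and `add` are hypothetical names used only to show the pairing, not how DALI implements it:

```python
def call_with_input_sets(f, *inputs):
    """Call f once per input set.

    Each element of `inputs` is a list with one entry per input set, so
    call_with_input_sets(f, [x1, x2, x3], [z1, z2, z3]) returns
    [f(x1, z1), f(x2, z2), f(x3, z3)].
    """
    if len({len(group) for group in inputs}) != 1:
        raise ValueError("every input needs the same number of input sets")
    return [f(*input_set) for input_set in zip(*inputs)]


def add(x, z):
    """Hypothetical two-input operator."""
    return x + z


y1, y2, y3 = call_with_input_sets(add, [1, 2, 3], [10, 20, 30])
```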

For functions with a higher number of outputs, your OpSchema would need to specify either NumOutputs or the OutputFn function, which, while the pipeline is being built (based on the argument values, for example), tells how many outputs each input set will produce.

ptrendx on 9 Jul 2018

Then in the case

y1, y2, y3 = f([x1, x2, x3], [z1, z2, z3])

does ws->Input(0) refer to [x1, x2, x3] here and ws->Input(1) refer to [z1, z2, z3]? If so, how do I extract x1 from it? TensorList only has indexing for the "Batch", and ws->Input accepts an index for the input position, which leaves no room for indexing input sets.
I did notice that DisplacementFilter uses ws->Input to index input sets, so I am confused about that.

gzygzy9211 on 9 Jul 2018

Inputs are interleaved: ws->Input(0) references x1, ws->Input(1) references z1, ws->Input(2) references x2 and so on.
The RunImpl method gets the input set index as a parameter.
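
The interleaved order implies a simple index calculation, sketched here in plain Python. The formula is inferred from the ordering described above rather than taken from DALI's source, and `interleave` and `flat_index` are hypothetical helpers:

```python
def interleave(*inputs):
    """Flatten per-input lists into the interleaved order.

    interleave([x1, x2, x3], [z1, z2, z3]) -> [x1, z1, x2, z2, x3, z3]
    """
    return [item for input_set in zip(*inputs) for item in input_set]


def flat_index(input_set, input_idx, num_inputs):
    """Position of input `input_idx` of set `input_set` in the flat list."""
    return input_set * num_inputs + input_idx


flat = interleave(["x1", "x2", "x3"], ["z1", "z2", "z3"])
x2 = flat[flat_index(1, 0, num_inputs=2)]
z3 = flat[flat_index(2, 1, num_inputs=2)]
```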

ptrendx on 9 Jul 2018

@ptrendx :

On the general architecture of DALI: we gave a talk during this year's GTC about DALI, and it contains answers to at least some of those questions: recording, slides - this reminds me that we should add those links to the README.

Are we still planning to do this?

cliffwoolley on 19 Jul 2018

@cliffwoolley - yes, we plan to, as soon as we make DALI more feature complete.