DALI: very slow even with prefetching

Created on 8 Jun 2019 · 4 Comments · Source: NVIDIA/DALI

I was testing NVIDIA DALI with a simple linear regression, and I noticed that it is still very slow, even with prefetching.

On a K80 GPU the utilization is around 1%.

Here's how I build my numpy pipeline and run it. Am I doing something incorrectly?

import numpy as np
import nvidia.dali as dali
from nvidia.dali.pipeline import Pipeline
# Assumption: the PyTorch plugin, since the thread compares against torch's .cuda().
from nvidia.dali.plugin.pytorch import DALIGenericIterator


class TestNvidiaDaliPipeline(Pipeline):
    def __init__(self, x, y, lambda_batch_index, kwargs_dali):
        super().__init__(**kwargs_dali)

        self.x = x.astype(np.float32)
        self.y = y.astype(np.float32)

        self.lambda_batch_index = lambda_batch_index

        self.input_x = dali.ops.ExternalSource()  # device="gpu")
        self.input_y = dali.ops.ExternalSource()  # device="gpu")

    def iter_setup(self):
        # Runs in Python before every iteration: sample indices and feed the batch.
        batch = self.lambda_batch_index()
        x_batch = self.x[batch, :]
        y_batch = self.y[batch, :]
        self.feed_input(self.x_batch, x_batch)
        self.feed_input(self.y_batch, y_batch)

    def define_graph(self):
        self.x_batch = self.input_x().gpu()
        self.y_batch = self.input_y().gpu()
        # Pipeline outputs, matched by position to the names given to the iterator.
        return [self.x_batch, self.y_batch]


# x, y, n, size_batch and n_iter come from the rest of the script (not shown).
test_pipeline = TestNvidiaDaliPipeline(
    x=x, y=y.reshape((-1, 1)),
    lambda_batch_index=lambda: np.random.choice(n, size_batch, replace=False),
    kwargs_dali={
        "batch_size": size_batch,
        "num_threads": 16,
        "device_id": 0,
        "prefetch_queue_depth": 4,  # int(5e2),
        "set_affinity": True,
        "exec_pipelined": True,
        "exec_async": True,
    },
)

test_dali_iter = DALIGenericIterator(test_pipeline, ["x_batch", "y_batch"], n_iter)

for i_iter, d in enumerate(test_dali_iter):
    x_batch, y_batch = d[0]["x_batch"], d[0]["y_batch"].squeeze()

tianyang-li

Most helpful comment

Under the hood, DALI does as much as possible asynchronously. So when you use DALIGenericIterator, it returns a batch of data while computation continues in the background.

ExternalSource is less efficient than the other readers, which don't need to interact with the Python side. When the pipeline runs, the Python code (iter_setup) is executed first, and then the pipeline runs in the background to consume and process the data. .cuda(non_blocking=True) transfers data to the GPU, whereas DALI does not always transfer data to the GPU. In your code the transfer happens only when you call .gpu(): the data is first read from Python, copied into DALI's internal CPU buffer, and then transferred to the GPU asynchronously. When you ask DALIGenericIterator for a batch, it synchronizes to make sure your data has already been processed and transferred (DALI uses its own streams to overlap with training, so we need to make sure the data is really ready). So if you write something like:

(prepare data in the same way as in the iter_setup)
batch = self.lambda_batch_index()
x_batch = self.x[batch, :]
y_batch = self.y[batch, :]
(copy to the GPU)
  .cuda(non_blocking=True)
(do stuff)
(sync to use data)

then it would be equivalent to what DALI does.

JanuszL on 11 Jun 2019

👍2
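The overlap described in this comment (prepare a batch in the background, synchronize only when the batch is consumed) can be illustrated with a plain Python thread and a bounded queue. This is only an analogy under stated assumptions, not how DALI is implemented: `make_batches` and `prefetch` are hypothetical helper names, the queue depth plays the role of `prefetch_queue_depth`, and no GPU copy is involved.

```python
import queue
import threading

import numpy as np


def make_batches(x, y, batch_size, n_iter, seed=0):
    """Yield (x_batch, y_batch) pairs, sampled like iter_setup in the question."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    for _ in range(n_iter):
        idx = rng.choice(n, batch_size, replace=False)
        yield x[idx, :], y[idx, :]


def prefetch(batches, depth=4):
    """Produce `batches` in a background thread, keeping up to `depth` queued."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        for item in batches:
            q.put(item)  # blocks while the queue is full
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # the "sync" point: waits until a batch is ready
        if item is done:
            return
        yield item


# Toy data standing in for the asker's x and y.
x = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.float32).reshape(10, 1)
shapes = [(xb.shape, yb.shape) for xb, yb in prefetch(make_batches(x, y, 4, 3))]
```

If the consumer does very little work per batch, as in the question, the queue is drained as fast as the producer can fill it, so prefetching cannot hide the cost of the Python-side preparation.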

All 4 comments

Hi,
The main idea behind DALI is to offload processing to the GPU so that the CPU is no longer the bottleneck in your training performance. In the example you provided, you are using the ExternalSource operator, which is more flexible than the other readers but not the best in terms of performance. Besides reading the input and returning it, your pipeline does nothing. If you add some augmentations to it, you should see better performance compared to pure CPU processing. In your case, disk I/O could also be a bottleneck.
You can also check CPU utilization to see whether the CPU is the bottleneck.

JanuszL on 10 Jun 2019
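One rough way to follow the suggestion above is to time the host-side work on its own, outside DALI. The sketch below repeats the same steps as iter_setup in the question (index sampling plus slicing) and reports wall-clock and CPU time per batch. The function name `time_host_side` and the array sizes are made up for illustration and are not taken from the original script.

```python
import time

import numpy as np


def time_host_side(x, y, batch_size, n_iter=50):
    """Return (wall_seconds, cpu_seconds) per batch for sampling plus slicing."""
    n = x.shape[0]
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(n_iter):
        idx = np.random.choice(n, batch_size, replace=False)
        x_batch = x[idx, :]
        y_batch = y[idx, :]
    wall = (time.perf_counter() - wall_start) / n_iter
    cpu = (time.process_time() - cpu_start) / n_iter
    return wall, cpu


# Hypothetical data sizes; substitute the real x, y and batch size.
x = np.random.rand(100_000, 8).astype(np.float32)
y = np.random.rand(100_000, 1).astype(np.float32)
wall, cpu = time_host_side(x, y, batch_size=256)
print(f"host-side prep: {wall * 1e3:.3f} ms wall, {cpu * 1e3:.3f} ms CPU per batch")
```

If the per-batch time measured here is comparable to or larger than the time of one training step, the Python side is a likely candidate for the bottleneck, and the GPU will sit idle while it waits for data.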

Thanks!

Does DALI have support for asynchronous copy from host memory to GPU memory? Similar to torch's .cuda(non_blocking=True)?

tianyang-li on 10 Jun 2019

Thanks!

tianyang-li on 11 Jun 2019
