Hi,
In caffe/src/caffe/layers/cudnn_conv_layer.cu, line 89, you have the following code to synchronize the streams running cuDNN computations:
// Synchronize the work across groups, each of which went into its own
// stream, by launching an empty kernel into the default (null) stream.
// NOLINT_NEXT_LINE(whitespace/operators)
sync_conv_groups<<<1, 1>>>();
If I understand correctly, this code tries to synchronize the streams used for the cuDNN computations. However, I think it relies on the assumption that the default stream forces an implicit synchronization on the whole device, which is no longer true since CUDA 7, where the default stream is just a normal stream. Am I understanding this correctly?
Thanks,
Cui
Hi Cui,
In CUDA 7.0, streams still synchronize implicitly with the default stream. The behavior hasn't changed since the first release of CUDA.
However, you are right that there's a new option to make a "per-thread" default stream. In that case, the streams don't synchronize anymore with the NULL stream (but do with the legacy default stream, called cudaStreamLegacy). That special case is enabled using a compiler option (or by setting a preprocessor constant before including the CUDA headers).
For more details, see http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/.
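For reference, opting into the per-thread default stream looks roughly like this (just a sketch; the exact build setup depends on the project):
// Option 1: pass the flag to nvcc when compiling:
//   nvcc --default-stream per-thread ...
// Option 2: define the macro before including any CUDA header:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>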
In the Caffe code quoted above, I would recommend issuing the kernel into cudaStreamLegacy to be safe, or simply calling cudaDeviceSynchronize. Note also that if the streams are created with the cudaStreamNonBlocking flag, there's an issue here, since such streams never synchronize implicitly with the default stream.
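Concretely, either of these should be safe (a sketch against the snippet quoted above; sync_conv_groups is the existing empty kernel):
// Option A: launch the empty kernel into the legacy default stream, which
// synchronizes with the other (blocking) streams even when the per-thread
// default stream is enabled.
sync_conv_groups<<<1, 1, 0, cudaStreamLegacy>>>();
// Option B: the big hammer: block the host until all device work has finished.
cudaDeviceSynchronize();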
Thanks,
Julien
But why not just explicitly synchronize those streams using cudaStreamSynchronize()?
Also, still in this source file, should we first synchronize with the "default" stream (I mean the per-thread stream that comes by default, not the legacy default stream) before running the cuDNN computations in the new streams?
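For example, something roughly like this (just a sketch; the temporary event is a hypothetical name):
// Make each cuDNN stream wait for work already queued on the per-thread
// default stream before the convolution kernels are launched.
cudaEvent_t prior_work;
cudaEventCreateWithFlags(&prior_work, cudaEventDisableTiming);
cudaEventRecord(prior_work, cudaStreamPerThread);
for (int g = 0; g < this->group_ * CUDNN_STREAMS_PER_GROUP; g++) {
  cudaStreamWaitEvent(stream_[g], prior_work, 0);
}
cudaEventDestroy(prior_work);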
Thanks,
Cui
You mean by calling cudaStreamSynchronize(cudaStreamLegacy)?
My opinion is that we should not enable the per-thread default stream behaviour. I honestly see no reason to do it in that context. We might want to use the big hammer and call cudaDeviceSynchronize.
I actually mean they can just do:
for (int g = 0; g < this->group_ * CUDNN_STREAMS_PER_GROUP; g++) {
  cudaStreamSynchronize(stream_[g]);
}
to explicitly synchronize the streams used for cuDNN computation. Are there
any particular reasons for not doing that?
Thanks,
Cui
See also #2077.
I've also been puzzled by this line and think it would be better to explicitly cudaStreamSynchronize the streams we actually use, and I have done so in other contexts. You're welcome to PR that change.
I'm updating the title since this is not a bug.
The advantage of launching a kernel like sync<<<1, 1>>>() is that it synchronizes all the streams without being CPU synchronous. The cudaStreamSynchronize loop you propose would be: it would block the CPU until all the CUDA kernels are done, which may cause launch-latency performance issues.
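If one wanted a device-side synchronization that does not rely on the legacy default-stream semantics, an event-based variant would also stay CPU-asynchronous (just a sketch, not what the current Caffe code does; event_ would be a hypothetical per-stream event array):
for (int g = 0; g < this->group_ * CUDNN_STREAMS_PER_GROUP; g++) {
  // Record completion of the cuDNN work on each stream ...
  cudaEventRecord(event_[g], stream_[g]);
  // ... and make the default stream wait for it. The wait is enqueued on the
  // GPU; the CPU is not blocked.
  cudaStreamWaitEvent(0, event_[g], 0);
}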
I see. That's a good point. Thank you!
Cui
Thanks for the explanation @jdemouth. We may have to revisit this in the future when parallelizing across branches, but I'll consider this fine as is for now.