Hi,
In caffe/src/caffe/layers/cudnn_conv_layer.cu, line 89, you have the following code to synchronize the streams running cuDNN computations:
// Synchronize the work across groups, each of which went into its own
// stream, by launching an empty kernel into the default (null) stream.
// NOLINT_NEXT_LINE(whitespace/operators)
sync_conv_groups<<<1, 1>>>();
If I understand correctly, this code tries to synchronize the streams used for the cuDNN computations. However, I think it relies on the assumption that the default stream forces an implicit synchronization on the whole device, which is no longer true since CUDA 7, where the default stream is just a normal stream. Am I understanding this correctly?
Thanks,
Cui
Hi Cui,
In CUDA 7.0, streams still synchronize implicitly with the default stream. The behavior hasn't changed since the first release of CUDA.
However, you are right that there's a new option to make a "per-thread" default stream. In that case, the streams don't synchronize anymore with the NULL stream (but do with the legacy default stream, called cudaStreamLegacy). That special case is enabled using a compiler option (or by setting a preprocessor constant before including the CUDA headers).
For more details, see http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/.
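For reference, opting into the per-thread default stream looks roughly like this (just a sketch; the exact build setup depends on the project):
// Option 1: pass the flag to nvcc when compiling:
//   nvcc --default-stream per-thread ...
// Option 2: define the macro before including any CUDA header:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>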
In the Caffe code quoted above, I would recommend issuing the kernel into cudaStreamLegacy to be safe, or simply calling cudaDeviceSynchronize. Note also that if the streams are created with the cudaStreamNonBlocking flag, there's an issue here, since such streams never synchronize implicitly with the default stream.
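Concretely, either of these should be safe (a sketch against the snippet quoted above; sync_conv_groups is the existing empty kernel):
// Option A: launch the empty kernel into the legacy default stream, which
// synchronizes with the other (blocking) streams even when the per-thread
// default stream is enabled.
sync_conv_groups<<<1, 1, 0, cudaStreamLegacy>>>();
// Option B: the big hammer: block the host until all device work has finished.
cudaDeviceSynchronize();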
Thanks,
Julien
But why not just explicitly synchronize those streams using cudaStreamSynchronize()?
Also, still in this source file, should we first synchronize with the "default" stream (I mean the per-thread stream that comes by default, not the legacy default stream) before running the cuDNN computations in the new streams?
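For example, something roughly like this (just a sketch; the temporary event is a hypothetical name):
// Make each cuDNN stream wait for work already queued on the per-thread
// default stream before the convolution kernels are launched.
cudaEvent_t prior_work;
cudaEventCreateWithFlags(&prior_work, cudaEventDisableTiming);
cudaEventRecord(prior_work, cudaStreamPerThread);
for (int g = 0; g < this->group_ * CUDNN_STREAMS_PER_GROUP; g++) {
  cudaStreamWaitEvent(stream_[g], prior_work, 0);
}
cudaEventDestroy(prior_work);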
Thanks,
Cui
You mean by calling cudaStreamSynchronize(cudaStreamLegacy)?
My opinion is that we should not enable the per-thread default stream behaviour. I honestly see no reason to do it in that context. We might want to use the big hammer and call cudaDeviceSynchronize.
I actually mean they can just do:
for (int g = 0; g < this->group_ * CUDNN_STREAMS_PER_GROUP; g++) {
  cudaStreamSynchronize(stream_[g]);
}
to explicitly synchronize the streams used for cuDNN computation. Are there
any particular reasons for not doing that?
Thanks,
Cui
See also #2077.
I've also been puzzled by this line and think it would be better to explicitly cudaStreamSynchronize the streams we actually use, and I have done so in other contexts. You're welcome to PR that change.
I'm updating the title since this is not a bug.
The advantage of launching a kernel like sync<<<1, 1>>>() is that it synchronizes all the streams without being CPU synchronous. The cudaStreamSynchronize loop you propose would be: it would block the CPU until all the CUDA kernels are done, which may cause launch-latency performance issues.
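If one wanted a device-side synchronization that does not rely on the legacy default-stream semantics, an event-based variant would also stay CPU-asynchronous (just a sketch, not what the current Caffe code does; event_ would be a hypothetical per-stream event array):
for (int g = 0; g < this->group_ * CUDNN_STREAMS_PER_GROUP; g++) {
  // Record completion of the cuDNN work on each stream ...
  cudaEventRecord(event_[g], stream_[g]);
  // ... and make the default stream wait for it. The wait is enqueued on the
  // GPU; the CPU is not blocked.
  cudaStreamWaitEvent(0, event_[g], 0);
}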
I see. That's a good point. Thank you!
Cui
Thanks for the explanation @jdemouth. We may have to revisit this in the future when parallelizing across branches, but I'll consider this fine as is for now.