Onnxruntime: Wrong requested shape after a few thousand inference steps when using CUDA

Created on 17 Aug 2020 · 15 Comments · Source: microsoft/onnxruntime

Describe the bug
After many (typically many thousand) successful inference steps with the same data, the ONNX Runtime with CUDA suddenly stops with an error. The following error message suggests that a value set by an initializer has changed its value which yields an invalid requested shape in a Reshape node:

[E:onnxruntime:, sequential_executor.cc:309 Execute] Non-zero status code returned while running Reshape node. Name:'Reshape_15' Status Message: /code/onnxruntime/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:43 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape&, std::vector<long int>&) gsl::narrow_cast<int64_t>(input_shape.Size()) == size was false. The input tensor cannot be reshaped to the requested shape. Input shape:{4}, requested shape:{4,2,2}

This problem does not occur with the CPU or TensorRT version of ONNX Runtime.

This is what the graph of the ONNX model looks like (I suspect %19 is the issue here):

graph torch-jit-export (
  %shape[INT64, 2]
) initializers (
  %19[INT64, 1]
) {
  %1 = Constant[value = <Scalar Tensor []>]()
  %2 = Gather[axis = 0](%shape, %1)
  %3 = Constant[value = <Scalar Tensor []>]()
  %4 = Gather[axis = 0](%shape, %3)
  %5 = Mul(%2, %4)
  %6 = Unsqueeze[axes = [0]](%5)
  %7 = Concat[axis = 0](%6)
  %8 = ConstantOfShape[value = <Tensor>](%7)
  %9 = Constant[value = <Scalar Tensor []>]()
  %10 = Gather[axis = 0](%shape, %9)
  %11 = Constant[value = <Scalar Tensor []>]()
  %12 = Gather[axis = 0](%shape, %11)
  %15 = Unsqueeze[axes = [0]](%10)
  %16 = Unsqueeze[axes = [0]](%12)
  %17 = Concat[axis = 0](%19, %15, %16)
  %output = Reshape(%8, %17)
  return %output
}

The input is always a one-dimensional array with value [2,2].

The ONNX model was created from the following PyTorch code (using PyTorch 1.6.0):

    def forward(self, shape):
        r = torch.zeros(shape[0] * shape[1])
        return r.view(1, shape[0], shape[1])
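
As a sanity check on the error message, the shape arithmetic of `forward` can be mirrored in plain Python. This is only a sketch: the helper names `expected_reshape` and `reshape_is_valid` are made up here and are not part of the issue or of ONNX Runtime. For the input [2, 2], the Reshape node should receive a 4-element tensor and the target shape [1, 2, 2]; the shape {4,2,2} from the error message describes 16 elements, so the check fails.

```python
from math import prod


def expected_reshape(shape):
    """Return (element count of the zeros tensor, target shape) for forward().

    Hypothetical helper that mirrors the graph: %8 holds shape[0] * shape[1]
    elements and %17 = Concat(%19, %15, %16) should be [1, shape[0], shape[1]].
    """
    rows, cols = shape
    return rows * cols, [1, rows, cols]


def reshape_is_valid(element_count, target):
    """A reshape is only valid when the element counts agree."""
    return element_count == prod(target)


elements, target = expected_reshape([2, 2])
print(elements, target)                        # 4 [1, 2, 2]
print(reshape_is_valid(elements, target))      # True
print(reshape_is_valid(elements, [4, 2, 2]))   # False, as in the error message
```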

System information

  • OS Platform and Distribution: Container created from Dockerfile.cuda (problem also occurs in containers based on Ubuntu 18.04)
  • ONNX Runtime installed from (source or binary): from source (GitHub master branch)
  • ONNX Runtime version: 1.4.0
  • Python version: 3.7.0
  • GCC/Compiler version: 7.4.0
  • CUDA/cuDNN version: CUDA 10.1 / cuDNN 7
  • GPU model and memory: Nvidia GeForce GTX 1060 (problem also occurs on 1080Ti with 11GB)

To Reproduce
Run the following code in an environment (or container) with the ONNX Runtime with CUDA (not TensorRT) from a directory that contains the file issue.onnx that is attached to this issue:

#!/usr/bin/env python3
"""Minimal example to reproduce the issue with ONNX Runtime on GPU"""

import numpy as np
import onnxruntime


def create_inputs():
    """create input for the model as a numpy array"""
    # explicit int64 to match the model's INT64 input; numpy's default
    # integer type can differ between platforms (e.g. on Windows)
    return np.array([2, 2], dtype=np.int64)


def run_onnx(file_name):
    """run ONNX model until it fails (when run on GPU)

    on my computer this tends to fail within the first 100000 iterations
    """
    options = onnxruntime.SessionOptions()
    session = onnxruntime.InferenceSession(file_name, options)

    shape = create_inputs()
    result = None  # stays None if the very first iteration fails
    for iteration in range(int(1e6)):
        try:
            result = session.run(output_names=['output'],
                                 input_feed={
                                     'shape': shape,
                                 })
        except Exception as e:
            print(f"\nerror occurred during iteration {iteration}: {e}")
            break
    return result


def main():
    """try to run inference"""
    filename = 'issue.onnx'
    result = run_onnx(filename)
    if result is not None:
        print(f"result: {result[0].shape}")


if __name__ == '__main__':
    main()

After a few seconds and a few thousand (sometimes tens of thousands of) iterations, the loop in run_onnx() should abort with the error message above. While I have managed to reproduce this issue on multiple systems in various configurations, the number of iterations before the error occurs has varied dramatically. Any sort of additional GPU load (from watching YouTube videos to multiplying large matrices) seems to make the issue easier to reproduce.

Expected behavior
Inference should work deterministically.

Additional context
The ONNX/PyTorch model provided here may not be very useful in itself. However, the same issue also appears in more complex models that contain similar steps. Larger models seem to require fewer iterations until the error occurs.

CUDA

Most helpful comment

Sorry for joining the discussion late, and thanks @maherzog for this repro. As @HectorSVC pointed out, the bug is indeed caused by memory reuse. The issue is that the copy and compute streams in CUDA have a race condition in the BFC arena. The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that reduces the cost of synchronizing CPU and GPU on alloc/free.

To let the CPU and GPU run asynchronously, buffers freed on the CPU may still be in use on the GPU. This is OK if there is only one stream, because the execution order on CPU and GPU is then consistent. For example, if we have two kernels A and B and the CPU issues them in the order allocA->computeA->freeA->allocB->computeB->freeB, then even when A and B share the same memory, computeA and computeB cannot race, because they run in the same GPU compute stream. However, if the CPU order is allocA->copyA->freeA->allocB->computeB->freeB, copyA may execute after computeB on the GPU when copy and compute run in different GPU streams.
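
The ordering argument above can be replayed with a small toy model. This is only an illustrative sketch and not ONNX Runtime or CUDA code: the function `replay`, the stream names, and the buffer values are assumptions chosen for the example (the values [1, 2, 2] and 4 are borrowed from this issue's shapes). Each stream is a FIFO queue; commands within one stream keep their CPU order, but nothing orders commands across streams.

```python
from collections import deque


def replay(commands, schedule):
    """Replay CPU-issued commands on a toy GPU.

    commands: list of (stream, op, value) in the order the CPU issued them.
    schedule: sequence of stream names; each entry lets that stream run its
              next queued command (FIFO within a stream).
    Returns what the host received from the copy.
    """
    queues = {}
    for stream, op, value in commands:
        queues.setdefault(stream, deque()).append((op, value))

    buffer = [0, 0, 0]  # one device buffer, shared by A and B after reuse
    host = None
    for stream in schedule:
        op, value = queues[stream].popleft()
        if op == "write":
            buffer[:len(value)] = value  # a kernel writes its output
        elif op == "copy_to_host":
            host = list(buffer)          # the copy reads the buffer
    return host


# CPU order: write A's data, copy A to host, (free A, alloc B), write B's data.
two_streams = [("compute", "write", [1, 2, 2]),
               ("copy", "copy_to_host", None),
               ("compute", "write", [4])]
one_stream = [("compute", op, value) for _, op, value in two_streams]

print(replay(two_streams, ["compute", "copy", "compute"]))  # [1, 2, 2]
print(replay(two_streams, ["compute", "compute", "copy"]))  # [4, 2, 2]
print(replay(one_stream, ["compute"] * 3))                  # [1, 2, 2]
```

With a single stream the copy always sees A's data. With two streams, a late copy reads the buffer after B has overwritten its first element, which would be consistent with the requested shape {4,2,2} in the error message.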

For this particular case, the execution plan in CPU is:

Allocation Plan:
(ort_value_idx) output_name : <allocation plan>
(18) 17 : Allocate, OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]], use fence when async
(17) 17_CUDAExecutionProvider : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]], use fence when async
(4) 4 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(12) 11 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(19) output : AllocateOutput, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(3) 3 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(2) 2 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(10) 9 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(0) shape : PreExisting, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(5) 5 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(9) 8 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(6) 6 : Reuse 5, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(1) 1 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(13) 12 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(7) 7_CUDAExecutionProvider : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]], use fence when async
(8) 7 : Allocate, OrtMemoryInfo:[name:CudaPinned id:0 OrtMemType:-1 OrtAllocatorType:1 Device:[DeviceType:0 MemoryType:1 DeviceId:0]], use fence when async
(11) 10 : Allocate, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(16) 19 : AllocateStatically, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(14) 15 : Reuse 11, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
(15) 16 : Reuse 13, OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]

Execution Plan:
[0] Gather (Gather_11)
[1] Unsqueeze (Unsqueeze_13)
[2] Gather (Gather_9)
[3] Unsqueeze (Unsqueeze_12)
[4] Concat (Concat_14)
Free ml-values: (11) 10, (13) 12
[5] MemcpyToHost (Memcpy)
Free ml-values: (17) 17_CUDAExecutionProvider
[6] Gather (Gather_3)
[7] Gather (Gather_1)
[8] Mul (Mul_4)
Free ml-values: (2) 2, (4) 4
[9] Unsqueeze (Unsqueeze_5)
[10] Concat (Concat_6)
Free ml-values: (5) 5
[11] MemcpyToHost (Memcpy_token_0)
Free ml-values: (7) 7_CUDAExecutionProvider
[12] ConstantOfShape (ConstantOfShape_7)
Free ml-values: (8) 7
[13] Reshape (Reshape_15)
Free ml-values: (9) 8, (18) 17

In step [5], the input buffer of MemcpyToHost is freed to the BFC arena and is then handed out again, under the static allocation plan, for the output of step [8]. Because the compute stream and the copy stream run concurrently on the GPU, the GPU execution order may not match the CPU's plan, so the memory can be overwritten before the copy has read it.
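
Why a freed buffer comes back so quickly can be seen with a toy arena. This is a deliberately simplified first-fit free list, not the real BFC arena (which uses size bins and coalesces neighbouring chunks); the class `ToyArena` and the sizes used below are assumptions made for illustration only.

```python
class ToyArena:
    """First-fit free-list allocator over one pre-reserved region (toy model)."""

    def __init__(self, size):
        self.free_chunks = [(0, size)]  # (offset, length), kept sorted by offset
        self.allocated = {}             # offset -> length

    def alloc(self, length):
        for i, (offset, chunk_len) in enumerate(self.free_chunks):
            if chunk_len >= length:
                remainder = chunk_len - length
                if remainder:
                    self.free_chunks[i] = (offset + length, remainder)
                else:
                    del self.free_chunks[i]
                self.allocated[offset] = length
                return offset
        raise MemoryError("arena exhausted")

    def free(self, offset):
        # Returning a chunk is pure CPU bookkeeping; the GPU is not consulted.
        length = self.allocated.pop(offset)
        self.free_chunks.append((offset, length))
        self.free_chunks.sort()


arena = ToyArena(256)
copy_source = arena.alloc(24)  # e.g. three int64 values, input of MemcpyToHost
other = arena.alloc(8)         # some unrelated live tensor
arena.free(copy_source)        # freed right after the copy was *issued*
mul_output = arena.alloc(8)    # a later kernel output lands on the same offset

print(copy_source, mul_output)  # 0 0
```

Because `free` only updates host-side bookkeeping, the same offset is handed out again even if a copy that reads it is still pending on another stream.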

In this repro, adding one line of code to disable the memory pattern makes the problem go away:

    options = onnxruntime.SessionOptions()
    options.enable_mem_pattern = False
    session = onnxruntime.InferenceSession(file_name, options)

However, disabling the memory pattern is not a full solution to the stream race in the BFC arena. As a short-term fix, we might force the copy streams to be the same as the compute stream, by changing here to:

  streams_[kCudaStreamCopyIn] = nullptr;
  streams_[kCudaStreamCopyOut] = nullptr;

And also remove the cudaStreamDestroy in the dtor. This approach might cause some performance degradation for certain models, though. A thorough fix that lets the BFC arena support multiple streams is being looked at; once that is in, we can go back to concurrent copy and compute streams.

All 15 comments

I am experiencing the same issue. I have had to switch to CPU because the GPU provider is unreliable.

Thank you. I'll take a look.

Thank you. I can repro the bug.

Is there any update on this issue? Thanks

I'm also experiencing a similar issue, where my network running on GPU becomes unreliable, usually after around 1000 or so inferences, but on CPU it works without issue. I can provide more information (graph/env) if it'll help.

Working on this.

Hi @HectorSVC, I think the problem is in your Concat CUDA implementation. In this case, one of the Concat op's inputs is on the GPU. I think you didn't handle such cases.

There's nothing wrong with the CUDA Concat implementation; it should be related to memory reuse. Still debugging.

Could it be somehow related to this: https://github.com/microsoft/onnxruntime/pull/5245 (or something similar) ?

So can we assume the issue has been fixed?

I have built the current master branch (on commit fec890a09aa58cc7d7260ee4fb9ed8a9eb24579a, which is a few commits after the commit 14786432157806392c75e9564d18c448d2cf6092 from the pull request referenced above) and tested it again. Unfortunately, it seems like these changes have not solved this issue. I still get the same error message.

PS: Thank you to the people that have already looked into this issue. I can imagine that trying to understand non-deterministic behavior on the GPU can be a bit tedious.

Thanks for the info :thumbsup:

Is there an estimate of when the issue could be fixed? The CPU backend works great, but it is not suitable for processing a video stream.

Thank you for your extensive analysis and explanation (and obviously for the actual fix). And sorry for the integer type issue, I had not tested my code on Windows.
