Cntk: Is anyone trying to build CNTK with CUDA 11.1?

Created on 17 Dec 2020 · 17 Comments · Source: microsoft/CNTK

I know MS announced that they won't support CNTK anymore.
However, I would like to know whether anyone else is trying to build CNTK with CUDA 11.1, like me.
If someone is trying and has some tips, I hope we can discuss them here.

So far I have changed:

  • cudnnGetConvolutionForwardAlgorithm -> cudnnGetConvolutionForwardAlgorithm_v7
  • cudnnGetConvolutionBackwardDataAlgorithm -> cudnnGetConvolutionBackwardDataAlgorithm_v7
  • cudnnGetConvolutionBackwardFilterAlgorithm -> cudnnGetConvolutionBackwardFilterAlgorithm_v7
  • cudnnSetRNNDescriptor_v5 -> cudnnSetRNNDescriptor_v8
  • cusparseScsr2csc -> cusparseCsr2cscEx2_bufferSize
  • cusparseDcsr2csc -> cusparseCsr2cscEx2
  • .. etc.

Now I can build everything from Common through ReaderLib.
When I build CNTKv2LibraryDll, I get the errors below.

LNK2005: "unsigned __int64 __cdecl Microsoft::MSR::CNTK::GetCUDNNVersion(void)" (?GetCUDNNVersion@CNTK@MSR@Microsoft@@YA_KXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll)
LNK2005: "protected: float * __cdecl Microsoft::MSR::CNTK::BaseMatrix<float>::Buffer(void)const " (?Buffer@?$BaseMatrix@M@CNTK@MSR@Microsoft@@IEBAPEAMXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll) 
LNK1169: multiply defined symbols

Note that I had already built everything from Common through CNTK with CUDA 10.1.

All 17 comments

My co-worker and I are working on this at https://github.com/haryngod/CNTK/tree/2.7-cuda-11.1
It may look like a mess right now, but our goal is to build the code without any errors.
If someone wants to build CNTK, we could share our experiences with each other.

If you manage to build it successfully, I'll definitely be using it! I'm still stuck using GTX 1000 series cards and would love to upgrade. Unfortunately, I have zero experience in compiling CNTK, so I can't help you with this.

Interesting! Hope you succeed in building CNTK with CUDA 11, and maybe a newer Python version too.

LNK2005: "unsigned __int64 __cdecl Microsoft::MSR::CNTK::GetCUDNNVersion(void)" (?GetCUDNNVersion@CNTK@MSR@Microsoft@@YA_KXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll)
LNK2005: "protected: float * __cdecl Microsoft::MSR::CNTK::BaseMatrix<float>::Buffer(void)const " (?Buffer@?$BaseMatrix@M@CNTK@MSR@Microsoft@@IEBAPEAMXZ) already defined in Cntk.Math-2.7.lib(Cntk.Math-2.7.dll)
LNK1169: multiply defined symbols

How is it going? I have run into the same error.

@kassinvin I added /FORCE:MULTIPLE under CNTKv2LibraryDll > Properties > Linker > Command Line. That gets past the error.

I'm also trying to get this working. The thing I'm stuck on is GPUTensor.cu. It gives a heap error. If you comment out some of the template instantiations (I tried the

Also, I tried the repo linked by @haryngod above, but it still seems to be set up to use CUDA 10. I'm not sure if it's supposed to be working yet?

Thanks!

In answer to my own question: I added the /FORCE:MULTIPLE option (suggested by @haryngod) to the MathsCuda and Maths projects too, and I seem to have a working cntk.exe! (The 01_OneHidden.cntk example in the Image\GettingStarted folder seems to run, anyway.) I achieved this by a) commenting out various cudnn calls in the SparseMatrix and RNN classes that I suspected I wasn't using (I only use CNNs) and b) copying cublasLt64_11.dll over manually from the CUDA install. I also updated cublas calls to _v7 where it was a simple replacement. This may be of help to some people.

The change to CUDA 11.1 was made by modifying various lines in CNTK.Cpp.props.

D.

Ok, more advice from my experiments. It turned out I was using an older version of cudnn (cudnn-10.0-v7.3.1), which isn't really designed to work with CUDA 11.x, and I do suspect that while cntk.exe ran, it wasn't learning properly. I've now replaced this with cudnn-11.1-v8.0.5.39 (I needed to change the CUDNN_PATH env variable to point to this). The build then fails with new errors because the following functions no longer exist:

cudnnGetConvolutionForwardAlgorithm
cudnnGetConvolutionBackwardFilterAlgorithm
cudnnGetConvolutionBackwardDataAlgorithm

These are all used in CuDnnConvolutionEngine.cu

I got past this by adding the following near the top of that file (after the includes):

#include <iostream> // for std::cerr, if not already included

// cuDNN 8 removed the preference enums and the non-_v7 query functions.
// The enumerators are not preprocessor macros, so test the version macro instead.
#if CUDNN_MAJOR >= 8
typedef enum
{
    CUDNN_CONVOLUTION_FWD_NO_WORKSPACE = 0,
    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST = 1,
    CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT = 2,
} cudnnConvolutionFwdPreference_t;

// Compatibility wrapper: forwards to the _v7 heuristic and returns its top result.
// preference and memoryLimitInBytes are kept for source compatibility but ignored.
cudnnStatus_t CUDNNWINAPI
cudnnGetConvolutionForwardAlgorithm(cudnnHandle_t handle,
                                    const cudnnTensorDescriptor_t xDesc,
                                    const cudnnFilterDescriptor_t wDesc,
                                    const cudnnConvolutionDescriptor_t convDesc,
                                    const cudnnTensorDescriptor_t yDesc,
                                    cudnnConvolutionFwdPreference_t preference,
                                    size_t memoryLimitInBytes,
                                    cudnnConvolutionFwdAlgo_t* algo)
{
    cudnnConvolutionFwdAlgoPerf_t perfResults;
    int returnedAlgoCount = 0;

    cudnnStatus_t rv = cudnnGetConvolutionForwardAlgorithm_v7(
        handle,
        xDesc,
        wDesc,
        convDesc,
        yDesc,
        1,
        &returnedAlgoCount,
        &perfResults);

    if (rv != CUDNN_STATUS_SUCCESS)
    {
        std::cerr << "cudnnGetConvolutionForwardAlgorithm_v7 failed: " << cudnnGetErrorString(rv) << std::endl;
        return rv; // perfResults is not valid on failure
    }

    *algo = perfResults.algo;

    std::cerr << "Using ConvolutionForwardAlgorithm: " << perfResults.algo << std::endl;

    return rv;
}
#endif

// The backward preference enums and non-_v7 query functions were removed in cuDNN 8 as well.
#if CUDNN_MAJOR >= 8
typedef enum
{
    CUDNN_CONVOLUTION_BWD_DATA_NO_WORKSPACE = 0,
    CUDNN_CONVOLUTION_BWD_DATA_PREFER_FASTEST = 1,
    CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT = 2,
} cudnnConvolutionBwdDataPreference_t;

typedef enum
{
    CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE = 0,
    CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST = 1,
    CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT = 2,
} cudnnConvolutionBwdFilterPreference_t;

// Compatibility wrapper around the _v7 heuristic; preference and memoryLimitInBytes are ignored.
cudnnStatus_t CUDNNWINAPI
cudnnGetConvolutionBackwardFilterAlgorithm(cudnnHandle_t handle,
                                           const cudnnTensorDescriptor_t xDesc,
                                           const cudnnTensorDescriptor_t dyDesc,
                                           const cudnnConvolutionDescriptor_t convDesc,
                                           const cudnnFilterDescriptor_t dwDesc,
                                           cudnnConvolutionBwdFilterPreference_t preference,
                                           size_t memoryLimitInBytes,
                                           cudnnConvolutionBwdFilterAlgo_t* algo)
{
    cudnnConvolutionBwdFilterAlgoPerf_t perfResults;
    int returnedAlgoCount = 0;

    cudnnStatus_t rv = cudnnGetConvolutionBackwardFilterAlgorithm_v7(
        handle,
        xDesc,
        dyDesc,
        convDesc,
        dwDesc,
        1,
        &returnedAlgoCount,
        &perfResults);

    // Only read perfResults when the query succeeded.
    if (rv == CUDNN_STATUS_SUCCESS)
        *algo = perfResults.algo;

    return rv;
}

// Compatibility wrapper around the _v7 heuristic; preference and memoryLimitInBytes are ignored.
cudnnStatus_t CUDNNWINAPI
cudnnGetConvolutionBackwardDataAlgorithm(cudnnHandle_t handle,
                                         const cudnnFilterDescriptor_t wDesc,
                                         const cudnnTensorDescriptor_t dyDesc,
                                         const cudnnConvolutionDescriptor_t convDesc,
                                         const cudnnTensorDescriptor_t dxDesc,
                                         cudnnConvolutionBwdDataPreference_t preference,
                                         size_t memoryLimitInBytes,
                                         cudnnConvolutionBwdDataAlgo_t* algo)
{
    cudnnConvolutionBwdDataAlgoPerf_t perfResults;
    int returnedAlgoCount = 0;

    cudnnStatus_t rv = cudnnGetConvolutionBackwardDataAlgorithm_v7(
        handle,
        wDesc,
        dyDesc,
        convDesc,
        dxDesc,
        1,
        &returnedAlgoCount,
        &perfResults);

    // Only read perfResults when the query succeeded.
    if (rv == CUDNN_STATUS_SUCCESS)
        *algo = perfResults.algo;

    return rv;
}
#endif
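
The wrappers above ask the _v7 call for a single result and ignore preference and memoryLimitInBytes. If a workspace cap matters, one option might be to request several results and take the first successful one that fits. A minimal sketch of just the selection step, assuming the results come back ordered fastest-first as the _v7 heuristics are documented to do. AlgoPerf and PickAlgoWithinLimit are hypothetical names, and AlgoPerf is only a stand-in for the cuDNN perf structs so the snippet compiles without cuDNN:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cudnnConvolutionFwdAlgoPerf_t, reduced to the
// fields this sketch needs (the real struct carries more).
struct AlgoPerf
{
    int algo;           // algorithm id
    int status;         // 0 corresponds to CUDNN_STATUS_SUCCESS
    std::size_t memory; // workspace bytes this algorithm needs
};

// Returns the index of the first usable entry whose workspace fits within
// memoryLimitInBytes, or -1 if none does. Entries are assumed to be ordered
// fastest-first, so the first match is the fastest one that fits.
int PickAlgoWithinLimit(const std::vector<AlgoPerf>& results, std::size_t memoryLimitInBytes)
{
    for (std::size_t i = 0; i < results.size(); ++i)
    {
        if (results[i].status == 0 && results[i].memory <= memoryLimitInBytes)
            return static_cast<int>(i);
    }
    return -1;
}
```

In the real wrappers this would mean passing a requested count larger than 1 and an array of perf results to the _v7 call, then applying the same loop to the first returnedAlgoCount entries.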

Sorry for spamming everyone, but now with cudnn-11.1-v8.0.5.39 I'm getting an exception thrown on the cudnnConvolutionForward call in CuDnnConvolutionEngine.cu. The output is:

...
Starting minibatch loop.

About to throw exception 'cuDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=0 ; hostname=LAPTOP-RM6KJERA ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))'
cuDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=0 ; hostname=LAPTOP-RM6KJERA ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))


[CALL STACK]
    > vcomp_reduction_r4
    - Microsoft::MSR::CNTK::CudaTimer::  Stop
    - Microsoft::MSR::CNTK::CuDnnConvolutionEngine<float>::  ForwardCore
    - Microsoft::MSR::CNTK::ConvolutionNode<float>::  ForwardProp
    - Microsoft::MSR::CNTK::ComputationNetwork::PARTraversalFlowControlNode::  ForwardProp
    - std::_Func_impl_no_alloc<<lambda_258018e62e82ba6c7f6055b001fc29b8>,void,std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase> const &>::  _Do_call
    - Microsoft::MSR::CNTK::ComputationNetwork::TravserseInSortedGlobalEvalOrder<std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>>>
    - Microsoft::MSR::CNTK::ComputationNetwork::ForwardProp<std::vector<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::CNTK::ComputationNodeBase>>>>
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOneEpoch
    - Microsoft::MSR::CNTK::SGD<float>::  TrainOrAdaptModel
    - Microsoft::MSR::CNTK::SGD<float>::  Train
    - DoTrain<Microsoft::MSR::CNTK::ConfigParameters,float>
    - DispatchThisAction<float>
    - DoCommands<float>
    - wmainOldCNTKConfig
    - wmain1

EXCEPTION occurred: cuDNN failure 3: CUDNN_STATUS_BAD_PARAM ; GPU=0 ; hostname=LAPTOP-RM6KJERA ; expr=cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out))

(This is calling cntk.exe configFile=02_OneConv.cntk in Examples\Image\GettingStarted)

I checked that the algorithm being used (m_fwdAlgo.selectedAlgo) is #1, but workspace.BufferSize() is zero. Any ideas on how to fix this would be gratefully received!

D.

No idea if I'm talking to myself, but the exception I reported above is due to the fact that the workspace size calculation in CNTK seems broken (too small) in 3 places in CuDnnConvolutionEngine.cu: the forward, backward-data and backward-filter calls. It's slightly hacky, but replacing the CNTK workspace object with an inline allocation at each of them seems to have my C++ code training!

#if 1
        // TEMPORARY FIX: Try allocating workspace here, rather than using workspace object
        size_t ws_size;
        CUDNN_CALL(cudnnGetConvolutionForwardWorkspaceSize(
            *m_cudnn, m_inT, *m_kernelT, *m_conv, m_outT, m_fwdAlgo.selectedAlgo, &ws_size));

        float* ws_data;
        CUDA_CALL(cudaMalloc(&ws_data, ws_size));

        CUDNN_CALL(cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ws_data, ws_size, &C::Zero, m_outT, ptr(out)));

        CUDA_CALL(cudaFree(ws_data));
#else
        CUDNN_CALL(cudnnConvolutionForward(*m_cudnn, &C::One, m_inT, ptr(in), *m_kernelT, ptr(kernel), *m_conv, m_fwdAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), &C::Zero, m_outT, ptr(out)));
#endif

#if 1
        // TEMPORARY FIX: Try allocating workspace here, rather than using workspace object
        size_t ws_size;
        CUDNN_CALL(cudnnGetConvolutionBackwardDataWorkspaceSize(
            *m_cudnn, *m_kernelT, m_outT, *m_conv, m_inT, m_backDataAlgo.selectedAlgo, &ws_size));

        float* ws_data;
        CUDA_CALL(cudaMalloc(&ws_data, ws_size));

        CUDNN_CALL(cudnnConvolutionBackwardData(*m_cudnn, &C::One, *m_kernelT, ptr(kernel), m_outT, ptr(srcGrad), *m_conv, m_backDataAlgo.selectedAlgo, ws_data, ws_size, accumulateGradient ? &C::One : &C::Zero, m_inT, ptr(grad)));

        CUDA_CALL(cudaFree(ws_data));

#else
        CUDNN_CALL(cudnnConvolutionBackwardData(*m_cudnn, &C::One, *m_kernelT, ptr(kernel), m_outT, ptr(srcGrad), *m_conv, m_backDataAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), accumulateGradient ? &C::One : &C::Zero, m_inT, ptr(grad)));
#endif

#if 1
        // TEMPORARY FIX: Try allocating workspace here, rather than using workspace object
        size_t ws_size;
        CUDNN_CALL(cudnnGetConvolutionBackwardFilterWorkspaceSize(
            *m_cudnn, m_inT, m_outT, *m_conv, *m_kernelT, m_backFiltAlgo.selectedAlgo, &ws_size));

        float* ws_data;
        CUDA_CALL(cudaMalloc(&ws_data, ws_size));

        cerr << "Calling BackwardFilter:" << ws_size << " vs " << workspace.BufferSize() << endl;
        CUDNN_CALL(cudnnConvolutionBackwardFilter(*m_cudnn, &C::One, m_inT, ptr(in), m_outT, ptr(srcGrad), *m_conv, m_backFiltAlgo.selectedAlgo, ws_data, ws_size, accumulateGradient ? &C::One : &C::Zero, *m_kernelT, ptr(kernelGrad)));

        CUDA_CALL(cudaFree(ws_data));
#else
        CUDNN_CALL(cudnnConvolutionBackwardFilter(*m_cudnn, &C::One, m_inT, ptr(in), m_outT, ptr(srcGrad), *m_conv, m_backFiltAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), accumulateGradient ? &C::One : &C::Zero, *m_kernelT, ptr(kernelGrad)));
#endif
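
Calling cudaMalloc and cudaFree inside every forward and backward call works, but it adds allocation overhead to each minibatch. One possible refinement is to keep a single buffer and grow it only when a larger workspace is requested. A rough sketch of that idea, not what CNTK does: ScratchBuffer is a hypothetical class, and the allocator is injected so the snippet runs on the host. In the CUDA build the two callbacks would presumably wrap cudaMalloc and cudaFree, which is an assumption and untested here:

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Grow-only scratch buffer: reallocates only when a request exceeds the
// current capacity, otherwise hands back the existing allocation.
class ScratchBuffer
{
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn = std::function<void(void*)>;

    ScratchBuffer(AllocFn alloc, FreeFn release)
        : m_alloc(std::move(alloc)), m_free(std::move(release)) {}

    ~ScratchBuffer()
    {
        if (m_data != nullptr)
            m_free(m_data);
    }

    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;

    // Returns a buffer of at least `bytes` bytes, or nullptr when bytes == 0
    // or when the injected allocator fails.
    void* Reserve(std::size_t bytes)
    {
        if (bytes > m_capacity)
        {
            if (m_data != nullptr)
                m_free(m_data);
            m_data = m_alloc(bytes);
            m_capacity = (m_data != nullptr) ? bytes : 0;
        }
        return bytes == 0 ? nullptr : m_data;
    }

    std::size_t Capacity() const { return m_capacity; }

private:
    AllocFn m_alloc;
    FreeFn m_free;
    void* m_data = nullptr;
    std::size_t m_capacity = 0;
};
```

With something like this held as a member of the engine, each of the three call sites would query the workspace size as above, call Reserve(ws_size), and pass the returned pointer to cuDNN without freeing it afterwards.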

Maybe that helps someone!

Thanks for sharing the info here @dmagee! Sounds like trying to set up CNTK with the latest CUDA is a non-trivial task.

No worries. You're absolutely right. The Nvidia cuDNN library it is based on has changed its API, so lots of things need updating. I've just fixed the bits needed for training CNNs with CNTK.exe or the C++ interface. I've only commented out various other bits (to do with RNNs and sparse matrices), and not touched any Python (I don't use the Python API). I'm afraid I don't really have time to package all this up, but hopefully posting what I've done here can help someone who does.

Thanks, @dmagee, for sharing so much of your experience.

I'm not sure, but I think we ran into the same problem.
I changed the function 'cudnnGetConvolutionBackwardFilterAlgorithm' to 'cudnnGetConvolutionBackwardFilterAlgorithm_v7', since the cuDNN API was updated.
You can check the cuDNN documentation and compare it with the documentation for the previous version.

Another issue I found was in cudnnCommon.cpp on the line:

auto err = cudnnDestroy(*src);

This causes a crash somewhere in the Nvidia cuDNN library. Essentially, a single instance of cudnnHandle_t is allocated when doing prediction and assigned as a shared_ptr within an instance of the CNTK CuDnn class. It is destroyed on program exit as the destructors are called. I've no idea why this causes a crash (seemingly the same pointer that was allocated is destroyed, and there are no other relevant calls to cudnnDestroy), or why it only happens when doing prediction and not learning (in C++ anyway), but my solution was to comment out everything in this tidy-up code:

    static std::shared_ptr<cudnnHandle_t> m_instance = std::shared_ptr<cudnnHandle_t>(createNew(), [](cudnnHandle_t* src)
    {
#ifndef ORIGINAL_CODE
        // For some reason the call to cudnnDestroy causes a crash.
        // As the handle is only allocated/destroyed once, skipping the cleanup
        // at program exit should be harmless.
        UNUSED(src);
#else
        assert(*src != nullptr);
        auto err = cudnnDestroy(*src);
        assert(err == CUDNN_STATUS_SUCCESS);
#ifdef NDEBUG
        UNUSED(err);
#endif
        delete src;
#endif
    });

Again (like my other solution above), this is a horrible hack, as it doesn't actually fix the bug, but it does mean my programs don't crash right at the end. If lots of instances of CuDnn were created, it would obviously be a memory leak, but in my code at least it only seems to be created once and tidied up right at the end.
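
For anyone unfamiliar with the mechanism being bypassed here: a std::shared_ptr can be given a custom deleter, which runs once when the last copy of the pointer goes away, and replacing it with a no-op means the resource is simply never released. A small illustration, where FakeHandle, CreateHandle and DestroyHandle are hypothetical stand-ins for cudnnHandle_t, cudnnCreate and cudnnDestroy so it compiles without cuDNN:

```cpp
#include <memory>

// Hypothetical stand-in for a library handle and its create/destroy pair.
struct FakeHandle
{
    int id;
};

static int g_destroyCalls = 0; // counts how often the real cleanup ran

FakeHandle* CreateHandle() { return new FakeHandle{42}; }

void DestroyHandle(FakeHandle* h)
{
    ++g_destroyCalls;
    delete h;
}

// Normal ownership: the deleter runs once, when the last shared_ptr copy is gone.
std::shared_ptr<FakeHandle> MakeOwned()
{
    return std::shared_ptr<FakeHandle>(CreateHandle(), [](FakeHandle* h) { DestroyHandle(h); });
}

// The workaround: a no-op deleter, so the handle is never destroyed.
// This is a deliberate leak and only tolerable for a single process-lifetime object.
std::shared_ptr<FakeHandle> MakeLeaked()
{
    return std::shared_ptr<FakeHandle>(CreateHandle(), [](FakeHandle*) {});
}
```

This only shows why the hack avoids the crash (the destroy call never happens); it says nothing about why cudnnDestroy itself fails.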

Hopefully this helps someone!

I also get this error on RTX 3060 if I have more than 512 neurons on one layer. With RTX 2060 it works without any error with the same files and nvidia drivers.

Loading data...
Using device: GPU[0] GeForce RTX 3060

About to throw exception 'CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)'
CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

Unhandled Exception: System.ApplicationException: CUBLAS failure 13: CUBLAS_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=PC1; expr=cublasgemmHelper(cuHandle, transA, transB, m, n, k, &alpha, a.Data(), (int) a.m_numRows, b.Data(), (int) b.m_numRows, &beta, c.Data(), (int) c.m_numRows)

[CALL STACK]

> Microsoft::MSR::CNTK::TensorView:: Reshaped
- Microsoft::MSR::CNTK::CudaTimer:: Stop
- Microsoft::MSR::CNTK::GPUMatrix:: MultiplyAndWeightedAdd
- Microsoft::MSR::CNTK::Matrix:: MultiplyAndWeightedAdd
- Microsoft::MSR::CNTK::TensorView:: DoMatrixProductOf
- Microsoft::MSR::CNTK::TensorView:: AssignMatrixProductOf
- std::enable_shared_from_this:: shared_from_this (x3)
- CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
- CNTK:: CreateTrainer
- CNTK::Trainer:: TotalNumberOfUnitsSeen
- CNTK::Trainer:: TrainMinibatch (x2)
- CSharp_CNTK_Trainer__TrainMinibatch__SWIG_2
- 00007FFF157C5E45 (SymFromAddr() error: The specified module could not be found.)

@dmagee I've faced the same issue. I think this issue has occurred in PyTorch as well (issue link), and there is a PyTorch PR for it. Even after reading these, I have no idea how to fix this yet.

I'm also trying to get CNTK working with the latest CUDA 11 on Windows. I was wondering why I can't find any Azure Pipelines yml files, which would let me use a custom pipeline agent for testing instead of local dev. Does anyone know of a link to the Azure DevOps pipelines?

I'm also very interested in whatever changes are needed for CUDA 11 to work.
