Cntk: GPU Memory leak in CNTK .NET assembly v2.4.0

Created on 12 Feb 2018 · 6 Comments · Source: microsoft/CNTK

Here is a fully self-contained reproduction of the issue, which was also reported in #2386. When the program is run, the GPU memory usage increases linearly until it is exhausted. When the line containing GC.Collect(); is uncommented, the memory leak goes away.

using System;
using System.Collections.Generic;

namespace CNTK.GPUMemoryLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            var device = DeviceDescriptor.GPUDevice(0);
            var dim = 1 << 10;

            /* Set up a dummy model to train */
            var inputShape = NDShape.CreateNDShape(new int[] { dim });
            var inputVar = Variable.InputVariable(inputShape, DataType.Double, "input");

            var labelShape = NDShape.CreateNDShape(new int[] { 1 });
            var labelVar = Variable.InputVariable(labelShape, DataType.Double, "label");

            var weightsShape = NDShape.CreateNDShape(new int[] { dim, 1 });
            var weights = new Parameter(weightsShape, DataType.Double, CNTKLib.GlorotUniformInitializer(), device, "weights");

            var model = CNTKLib.Sigmoid(CNTKLib.TransposeTimes(weights, inputVar).Output, "model");
            var cost = CNTKLib.BinaryCrossEntropy(model.Output, labelVar, "cost");
            var error = CNTKLib.ClassificationError(model.Output, labelVar, "error");

            /* Create random training data */
            var rng = new Random();
            var inputData = new double[1 << 22];
            for (var i = 0; i < inputData.Length; i++)
                inputData[i] = rng.NextDouble();

            var labelData = new double[inputData.Length / dim];
            for (var i = 0; i < labelData.Length; i++)
                labelData[i] = (rng.Next(2) == 0) ? 0.0 : 1.0;

            /* Set up training objects */
            var learner = Learner.SGDLearner(cost.Parameters(), new TrainingParameterScheduleDouble(0.01));
            var trainer = Trainer.CreateTrainer(model.Output, cost, error, new Learner[] { learner });

            var arguments = new Dictionary<Variable, Value>();

            /* Run training */
            var count = 0;
            while (count < 100)
            {
                using (var inputBatch = Value.CreateBatch(inputShape, inputData, 0, inputData.Length, device))
                {
                    using (var labelBatch = Value.CreateBatch(labelShape, labelData, 0, labelData.Length, device))
                    {
                        arguments[inputVar] = inputBatch;
                        arguments[labelVar] = labelBatch;
                        trainer.TrainMinibatch(arguments, false, device);
                        Console.WriteLine("{0}", ++count);
                    }
                }
                arguments.Clear();
                // !!! Uncomment the line below, and the GPU memory leak goes away !!!!
                //GC.Collect();
            }
            Console.WriteLine("done");
        }
    }
}

All 6 comments

Each call to Value.CreateBatch allocates a buffer on the GPU, and unless the garbage collector runs, these buffers accumulate until GPU memory is exhausted. If you would rather not call GC.Collect(), please use MinibatchSource instead.

But shouldn't the Dispose method on the Value object free the corresponding bit of GPU memory?

It seems not, but you could try Value.Erase(). Also note that GPU computation runs asynchronously with respect to the CPU, whereas GPU memory allocation and freeing are synchronous. Freeing a Value object too eagerly might lead to GPU bad-address errors.

Thanks, calling Value.Erase() does indeed work. I think most developers would expect a call to IDisposable.Dispose to release the unmanaged resources. Would you consider updating the Dispose implementation to call Erase?
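
Until Dispose does this itself, the same effect can probably be approximated on the caller's side with a small wrapper whose Dispose calls Erase first. The following is only a minimal sketch of that idea: FakeValue is a hypothetical stand-in for CNTK's Value so the example runs without a GPU, and it merely imitates the behaviour reported in this thread (Dispose alone keeps the buffer, Erase releases it). ErasingScope and Demo are likewise illustrative names, not part of CNTK. With the real library, the warning above about erasing a Value too eagerly while the GPU is still working would still apply.

```csharp
using System;

// Hypothetical stand-in for CNTK's Value, so the sketch runs without a GPU.
// It imitates the behaviour reported in this thread: Dispose() alone does
// not release the device buffer, Erase() does.
sealed class FakeValue : IDisposable
{
    // Simulated amount of GPU memory currently in use.
    public static long AllocatedBytes;

    private long _bytes;

    public FakeValue(long bytes)
    {
        _bytes = bytes;
        AllocatedBytes += bytes;
    }

    // Releases the simulated device buffer; safe to call more than once.
    public void Erase()
    {
        AllocatedBytes -= _bytes;
        _bytes = 0;
    }

    // Deliberately leaves the buffer allocated, like the reported behaviour.
    public void Dispose() { }
}

// Caller-side wrapper: Dispose() erases the wrapped value, then disposes it.
sealed class ErasingScope : IDisposable
{
    public FakeValue Value { get; }

    public ErasingScope(FakeValue value) { Value = value; }

    public void Dispose()
    {
        Value.Erase();
        Value.Dispose();
    }
}

static class Demo
{
    public static void Run()
    {
        // Plain using: the simulated buffer is still counted afterwards.
        using (var plain = new FakeValue(1024)) { }
        Console.WriteLine(FakeValue.AllocatedBytes);

        // Wrapped using: this buffer is released when the scope ends.
        using (var scope = new ErasingScope(new FakeValue(2048))) { }
        Console.WriteLine(FakeValue.AllocatedBytes);
    }
}
```

Calling Demo.Run() from a Main method prints the simulated usage after each using block; the second number does not grow, because the wrapped value was erased on scope exit. In the repro above, the equivalent change would be to call Erase() on inputBatch and labelBatch after TrainMinibatch returns.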

Good suggestion, we will look into that on the SWIG side.

@mjmckp Thank you so much for this helpful repro. It was fixed with this commit:
https://github.com/Microsoft/CNTK/commit/3e83c56b8fc4d0e2878a710b09c8d61f3a9e76d3
