DALI: CPU memory keeps growing when random_shuffle is enabled

Created on 25 Dec 2018 · 6 Comments · Source: NVIDIA/DALI

I used a very simple pipeline to read the ImageNet dataset, but found that CPU memory keeps growing every epoch when random_shuffle is enabled.

The following code reproduces the issue:

#include <cuda.h>
#include <iostream>
#include <string>
#include <vector>
#include "dali/common.h"
#include "dali/pipeline/data/allocator.h"
#include "dali/pipeline/init.h"
#include "dali/pipeline/operators/op_spec.h"
#include "dali/pipeline/pipeline.h"
#include "dali/util/image.h"

#include "stdio.h"
#include "stdlib.h"
#include "string.h"

using namespace dali;
using namespace std;

// https://stackoverflow.com/questions/63166/how-to-determine-cpu-and-memory-consumption-from-inside-a-process
int parseLine(char* line) {
  // This assumes that a digit will be found and the line ends in " kB".
  int i = strlen(line);
  const char* p = line;
  while (*p < '0' || *p > '9') p++;
  line[i - 3] = '\0';
  i = atoi(p);
  return i;
}

int getValue() {  // Note: this value is in KB!
  // Reads VmSize (virtual memory size) from /proc/self/status; Linux only.
  FILE* file = fopen("/proc/self/status", "r");
  if (file == NULL) return -1;  // /proc is not available
  int result = -1;
  char line[128];

  while (fgets(line, 128, file) != NULL) {
    if (strncmp(line, "VmSize:", 7) == 0) {
      result = parseLine(line);
      break;
    }
  }
  fclose(file);
  return result;
}

int main() {
  DALIInit(OpSpec("CPUAllocator"), OpSpec("PinnedCPUAllocator"), OpSpec("GPUAllocator"));
  int batch_size = 32;
  int num_threads = 4;
  Pipeline pipe(batch_size, num_threads, 0);
  DALIImageType img_type = DALI_RGB;

  dali::string list_root("/share4/public/classification_data/imagenet1k/val");
  dali::string list_file("/share4/public/classification_data/imagenet1k/meta/val.txt");
  pipe.AddOperator(OpSpec("FileReader")
                       .AddArg("device", "cpu")
                       .AddArg("file_root", list_root)
                       .AddArg("file_list", list_file)
                       .AddArg("random_shuffle", true)
                       .AddArg("output_type", DALI_FLOAT)
                       .AddOutput("raw_jpegs", "cpu")
                       .AddOutput("labels", "cpu"));

  vector<std::pair<string, string>> outputs = {{"raw_jpegs", "cpu"}, {"labels", "cpu"}};

  pipe.Build(outputs);

  DeviceWorkspace ws;
  pipe.RunCPU();
  pipe.RunGPU();
  pipe.Outputs(&ws);

  for (int e = 1; e < 500; ++e) {
    for (int i = 0; i < 50000 / batch_size; ++i) {
      pipe.RunCPU();
      pipe.RunGPU();
      pipe.Outputs(&ws);
    }
    cout << "epoch: " << e << " mem:" << getValue() << endl;
  }

  return 0;
}

All 6 comments

Hi @ay27

When enabling random_shuffle, DALI's internal data loader starts by pre-allocating a pool of initial_fill buffers, to add extra randomness at runtime. Each image read will be randomly sampled from this pool of loaded images (initial_fill being 1024 by default, and you can set it in the Reader parameters [1]).

Each of these buffers is initialized to a size of tensor_init_bytes (which can also be set [1]), and will be resized during runtime. Therefore, it is expected to see an increase in memory consumption as new, bigger ImageNet images are encountered, until eventually every buffer has seen the biggest image.

[1] https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#nvidia.dali.ops.FileReader
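To make the expected plateau concrete, here is a rough model of the behaviour described above. It is only a sketch under simplifying assumptions: each pool slot is reduced to a single number (the largest sample it has ever held), each incoming sample lands in a uniformly random slot, and `TotalCapacityAfter` is a hypothetical helper, not a DALI function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Simplified model of a shuffle pool (hypothetical, not DALI code):
// `pool_size` slots, each remembering the largest sample it has ever held.
// Returns the summed capacity, in bytes, after feeding all sample sizes.
std::size_t TotalCapacityAfter(const std::vector<std::size_t>& sample_sizes,
                               std::size_t pool_size, unsigned seed) {
  assert(pool_size > 0);
  std::vector<std::size_t> capacity(pool_size, 0);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick(0, pool_size - 1);
  for (std::size_t bytes : sample_sizes) {
    std::size_t slot = pick(rng);  // a random slot receives the next sample
    // Grow-only policy: the slot keeps its largest size and never shrinks.
    capacity[slot] = std::max(capacity[slot], bytes);
  }
  std::size_t total = 0;
  for (std::size_t c : capacity) total += c;
  return total;
}
```

In this model the total can never exceed pool_size times the largest sample size, which is the upper bound the explanation above predicts. With initial_fill = 1024 that bound is roughly a thousand times the largest encoded image in the dataset.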

Maybe I didn't express it clearly. According to your explanation, the memory usage should level off once an epoch is finished. But in my code, you can see that I print the memory usage after every epoch, and it keeps growing from epoch to epoch. When I disable random_shuffle, the memory usage stabilizes after several epochs.

I first hit this problem when I used DALI in my project to read the ImageNet dataset: after several epochs, the process was killed because it ran out of memory. So I simplified the graph step by step and finally found that disabling random_shuffle makes the problem go away. I know it is very weird, but I can't find any other way to solve the OOM problem.

OK. I solved the problem by rewriting the resize logic in the Buffer class (the superclass of Tensor).

Every time a new image is read into the buffer list, the corresponding tensor is resized to fit the actual size of the image, but memory is reallocated only when new_size > current_size, much like the reserve() function in std::vector. With this logic, the memory held by a tensor can only grow and never has a chance to shrink.
The problem is more serious when random_shuffle is enabled with the default initial_fill (=1024), because the sample_buffer grows very quickly.

I simply removed the new_size > current_size condition so that the resize always reallocates, like calling reserve() followed by shrink_to_fit() on a std::vector.
See: https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/data/buffer.h#L286

It's not a bug or error in the current design, but I think this condition would be better implemented in the Allocator class rather than the Buffer class. For example, a pre-allocated memory pool could apply this condition without loss of performance.
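For readers who want to see the difference between the two policies, the sketch below contrasts them. It is an illustration only: `SketchBuffer` and its method names are hypothetical and do not mirror the real Buffer class in buffer.h, and the old contents are simply discarded on reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a buffer class; not DALI's actual API.
class SketchBuffer {
 public:
  // Grow-only policy: reallocate only when the request exceeds the capacity.
  void ResizeGrowOnly(std::size_t new_size) {
    if (new_size > capacity_) Reallocate(new_size);
    size_ = new_size;
    assert(size_ <= capacity_);
  }

  // Exact-fit policy: reallocate whenever the request differs from the
  // capacity, so a small sample releases the memory of a large one.
  void ResizeExact(std::size_t new_size) {
    if (new_size != capacity_) Reallocate(new_size);
    size_ = new_size;
    assert(size_ <= capacity_);
  }

  std::size_t size() const { return size_; }
  std::size_t capacity() const { return capacity_; }

 private:
  // Drops the old allocation (contents are not preserved) and makes a new one.
  void Reallocate(std::size_t bytes) {
    data_.reset(bytes ? new std::uint8_t[bytes] : nullptr);
    capacity_ = bytes;
  }

  std::unique_ptr<std::uint8_t[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

Resizing to 100 bytes and then to 10 leaves a capacity of 100 under the grow-only policy and 10 under the exact-fit policy. The trade-off is that exact-fit pays for a reallocation on almost every sample, which is why handling this at the allocator level is attractive.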

I am glad that you were able to find a solution for your use case here. :)
Indeed, we are aware that we need better memory management here and that this logic should be moved to the Allocator level. That is part of a bigger refactor we are currently working on.
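As a rough idea of what allocator-level handling could look like, here is a minimal pooling sketch. It is not the refactor mentioned above and not DALI code: `SketchPoolAllocator`, the power-of-two bucketing, and the 64-byte minimum are all assumptions made for illustration, and the cache of freed blocks is unbounded for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical size-bucketed pool: freed blocks are cached per bucket and
// handed back on the next request of a similar size, so a buffer can release
// its memory on shrink without a trip to malloc for every sample.
class SketchPoolAllocator {
 public:
  SketchPoolAllocator() = default;
  SketchPoolAllocator(const SketchPoolAllocator&) = delete;
  SketchPoolAllocator& operator=(const SketchPoolAllocator&) = delete;

  ~SketchPoolAllocator() {
    for (auto& bucket : free_)
      for (void* p : bucket.second) std::free(p);
  }

  // Rounds a request up to the next power of two, with a 64-byte minimum.
  // Overflow for extremely large requests is not handled in this sketch.
  static std::size_t Bucket(std::size_t bytes) {
    std::size_t b = 64;
    while (b < bytes) b *= 2;
    return b;
  }

  void* Allocate(std::size_t bytes) {
    std::size_t bucket = Bucket(bytes);
    assert(bucket >= bytes);
    auto it = free_.find(bucket);
    if (it != free_.end() && !it->second.empty()) {
      void* p = it->second.back();  // reuse a cached block of this bucket
      it->second.pop_back();
      ++reuse_count_;
      return p;
    }
    return std::malloc(bucket);
  }

  // Returns a block to the pool; `bytes` must be the size it was requested with.
  void Release(void* p, std::size_t bytes) {
    if (p != nullptr) free_[Bucket(bytes)].push_back(p);
  }

  std::size_t reuse_count() const { return reuse_count_; }

 private:
  std::map<std::size_t, std::vector<void*>> free_;
  std::size_t reuse_count_ = 0;
};
```

A buffer built on such a pool could release its block and request a smaller one on every resize. A production version would also need a cap on cached bytes and thread safety, both omitted here.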
