Caffe: MNIST Autoencoder Example hangs on data layer lock

Created on 6 Sep 2015  ·  11 Comments  ·  Source: BVLC/caffe

Hi all,

I'm running commit 5367a1af5dc8a56a284b7f1c67efce097871955a, and the MNIST autoencoder example just hangs. I have a very vanilla setup compiled on Ubuntu 14.04 using CUDA 7.0.

bug


All 11 comments

The problem is not present in 8181870b9ac330a094ab0f8d53f54a0202f697a0; I'm using that version currently.

Same here. Curious: which CUDA Toolkit version are you using? I'm using 5.5, for which this older commit works out of the box (as opposed to having to strip out some multi-GPU code blocks in a C++ file with subsequent commits).

I also can confirm this issue.

Additionally, the MNIST autoencoder does not work with LevelDB, because the TRAIN data layer locks train_leveldb.

A temporary workaround in this case would be to rename the source in the "test-on-train" data layer to:

source: "./examples/mnist/mnist_train_lmdb"

vs.

source: "examples/mnist/mnist_train_lmdb"

The problem was introduced by the MultiGPU extension, particularly in bcc8f50a95ecad954d1887f3fb273eaa298e2274 partly as a result of lines 23-29 in data_reader.cpp:

string key = source_key(param);
weak_ptr<Body>& weak = bodies_[key];
body_ = weak.lock();
if (!body_) {
  body_.reset(new Body(param));
  bodies_[key] = weak_ptr<Body>(body_);
}

Only one Body object is created per data source (the "key" being the path to the source LMDB, hence the workaround), and for some reason this prevents the "test-on-train" data layer from completing Datum& datum = *(reader_.full().peek()) in DataLayer<Dtype>::DataLayerSetUp() during setup.

I have a rough patch that I could PR in, unless the people who were involved with that branch (@cypof, @ronghanghu) would prefer to take a crack at addressing it. Basically, it uses an additional layer parameter to override the "one Body per data source" restriction, but there's probably a more principled way to address this.

@rdipietro I was using CUDA 7.0, which was the newest at the time. 8181870b9ac330a094ab0f8d53f54a0202f697a0 compiles out of the box for me with CUDA 7.0.

I can confirm this bug as well. Adding "./" to the source resolves the issue.

If using the path as a key, then perhaps the path should be made canonical.

Same bug. And the "./" trick works.

Closing in favor of #3108, which more directly explains the issue here.

Adding the "./" in "test-on-train" data layer works well. @mohomran Thank you very much.
