Caffe: MNIST Autoencoder Example hangs on data layer lock

Created on 6 Sep 2015  ·  11 Comments  ·  Source: BVLC/caffe

Hi all,

I'm running commit 5367a1af5dc8a56a284b7f1c67efce097871955a, and the MNIST autoencoder example just hangs. I have a very vanilla setup compiled on Ubuntu 14.04 using CUDA 7.0.

bug


All 11 comments

The problem is not present in 8181870b9ac330a094ab0f8d53f54a0202f697a0; I'm using that version currently.

Same here. Curious: which CUDA Toolkit version are you using? I'm using 5.5, for which this older commit works out of the box (as opposed to having to strip out some multi-GPU code blocks in a C++ file with subsequent commits).

I also can confirm this issue.

Additionally, the MNIST autoencoder does not work with LevelDB, because the TRAIN data layer locks train_leveldb.

A temporary workaround in this case would be to rename the source in the "test-on-train" data layer to:

source: "./examples/mnist/mnist_train_lmdb"

vs.

source: "examples/mnist/mnist_train_lmdb"

The problem was introduced by the MultiGPU extension, particularly in bcc8f50a95ecad954d1887f3fb273eaa298e2274 partly as a result of lines 23-29 in data_reader.cpp:

string key = source_key(param);
weak_ptr<Body>& weak = bodies_[key];
body_ = weak.lock();
if (!body_) {
  body_.reset(new Body(param));
  bodies_[key] = weak_ptr<Body>(body_);
}

Only one Body object is created per data source (the "key" being the path to the source LMDB, hence the workaround), and for some reason this prevents the "test-on-train" data layer from completing Datum& datum = *(reader_.full().peek()) in DataLayer<Dtype>::DataLayerSetUp() during setup.

I have a rough patch that I could PR in, unless the people who were involved with that branch (@cypof, @ronghanghu) would prefer to take a crack at addressing it. Basically, it uses an additional layer parameter to override the "one Body per data source" restriction, but there's probably a more principled way to address this.

@rdipietro I was using CUDA 7.0, which was the newest at the time. 8181870b9ac330a094ab0f8d53f54a0202f697a0 compiles out of the box for me with CUDA 7.0.

I can confirm this bug as well. Adding "./" to the source resolves the issue.

If using the path as a key, then perhaps the path should be made canonical.

Same bug. And the "./" trick works.

Closing in favor of #3108, which more directly explains the issue here.

Adding the "./" in "test-on-train" data layer works well. @mohomran Thank you very much.
