I'm using caffe-rc3 on Ubuntu. The Caffe tests pass and the MNIST sample runs perfectly. I have a trained net, with net definition and weight files. Everything works perfectly in CPU mode, but GPU mode crashes. I've spent a few hours with gdb, and the crash happens when caffe_rng_uniform() calls caffe_rng() and rng_stream returns 0x1, a bad pointer.
inline rng_t* caffe_rng() {
  return static_cast<caffe::rng_t*>(Caffe::rng_stream().generator());
}
The random_generator_ pointer is 0x1, which causes the crash when it is dereferenced:
(gdb) p *caffe::thread_instance_.get()
$49 = {cublas_handle_ = 0x4df9160, curand_generator_ = 0x4dfab10, random_generator_ = {px = 0x1, pn = {pi_ = 0x0}},
mode_ = caffe::Caffe::CPU, solver_count_ = 1, root_solver_ = true}
However, Caffe::Get() has a good pointer. It seems like the thread-specific data and the singleton data are different. I can't figure out why.
(gdb) p *caffe::Caffe::Get().random_generator_
$46 = (caffe::Caffe::RNG &) @0x4df9160: {generator_ = {px = 0x7fffffff00000200, pn = {pi_ = 0xffff0000ffff}}}
backtrace:
(gdb) bt
at src/caffe/util/math_functions.cpp:252
bottom=std::vector of length 1, capacity 1 = {...}, top=std::vector of length 1, capacity 1 = {...})
at src/caffe/layers/base_conv_layer.cpp:170
bottom=std::vector of length 1, capacity 1 = {...}, top=std::vector of length 1, capacity 1 = {...})
at src/caffe/layers/cudnn_conv_layer.cpp:20
top=std::vector of length 1, capacity 1 = {...}) at ./include/caffe/layer.hpp:71
phase=caffe::TEST, root_net=0x0) at src/caffe/net.cpp:36
at ../src/uct.c:465
at ../src/G2init.c:112
My code invoking Caffe (use_gpu is true):
int caffe_init(const char *path, int use_gpu) {
  int argc = 2;
  char *fake_args[] = { "gtpmfgo", "ManyFaces" };
  char **argv = fake_args;
  GlobalInit(&argc, &argv);
  if (use_gpu) {
    Caffe::set_mode(Caffe::GPU);
    Caffe::SetDevice(0);
    Caffe::DeviceQuery();
  }
  else {
    Caffe::set_mode(Caffe::CPU);
  }
  if (caffe_test_net != NULL) delete caffe_test_net;
  string file_path = path;
  file_path += "/";
  caffe_test_net = new Net<float>(file_path + filename_net, TEST);
  caffe_test_net->CopyTrainedLayersFrom(file_path + filename_parameters);
I'm using CUDA 7.5.
It appears that during Caffe::set_mode, the compiler is writing mode_ into random_generator_. gdb output follows (I have gdb 4.8.4):
(gdb) bt
at /usr/include/boost/smart_ptr/detail/shared_count.hpp:371
__in_chrg=<optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:328
this=0x7ffff7bb9db0 caffe::thread_instance_, new_value=0x1173990) at /usr/include/boost/thread/tss.hpp:105
at /home/ubuntu/linux/caffe-rc3/include/caffe/common.hpp:148
at ../src/caffecnn.cpp:54
max_threads=64, use_gpu=1) at ../src/uct.c:465
use_gpu=1) at ../src/G2init.c:112
(gdb) n
375 }
(gdb) s
boost::thread_specific_ptr<caffe::Caffe>::reset (this=0x7ffff7bb9db0 caffe::thread_instance_, new_value=0x1173990)
at /usr/include/boost/thread/tss.hpp:107
107 }
(gdb) s
caffe::Caffe::Get () at src/caffe/common.cpp:19
19 return *(thread_instance_.get());
(gdb) p thread_instance_.get()
$34 = (caffe::Caffe *) 0x1173990
(gdb) d
(gdb) p thread_instance_.get()->random_generator_
$35 = {px = 0x0, pn = {pi_ = 0x0}}
(gdb) s
boost::thread_specific_ptr<caffe::Caffe>::get (this=0x7ffff7bb9db0 caffe::thread_instance_)
at /usr/include/boost/thread/tss.hpp:84
84 return static_cast
(gdb) p thread_instance_.get()->random_generator_
No symbol "thread_instance_" in current context.
(gdb) p caffe::thread_instance_.get()->random_generator_
$36 = {px = 0x0, pn = {pi_ = 0x0}}
(gdb) s
85 }
(gdb) s
caffe::Caffe::Get () at src/caffe/common.cpp:20
20 }
(gdb) s
caffe_init (path=0x7fffffffd340 "/home/ubuntu/linux/gtpmfgo/", use_gpu=1) at ../src/caffecnn.cpp:60
60 string file_path = path;
(gdb) p caffe::thread_instance_.get()->random_generator_
$37 = {px = 0x1, pn = {pi_ = 0x0}}
(gdb) p *caffe::thread_instance_.get()
$38 = {cublas_handle_ = 0x4840a30, curand_generator_ = 0x53064e0, random_generator_ = {px = 0x1, pn = {pi_ = 0x0}},
mode_ = caffe::Caffe::CPU, solver_count_ = 1, root_solver_ = true}
(gdb) s
61 file_path += "/";
(gdb) p *caffe::thread_instance_.get()
$39 = {cublas_handle_ = 0x4840a30, curand_generator_ = 0x53064e0, random_generator_ = {px = 0x1, pn = {pi_ = 0x0}},
mode_ = caffe::Caffe::CPU, solver_count_ = 1, root_solver_ = true}
(gdb) info threads
Id Target Id Frame
3 Thread 0x7fffcf3ff700 (LWP 1757) "gtpmfgo" pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
2 Thread 0x7fffd490d700 (LWP 1756) "gtpmfgo" 0x00007ffff613a12d in poll ()
at ../sysdeps/unix/syscall-template.S:81
Found the problem. I had CPU_ONLY defined in my application header, so my application and the library had different definitions of the Caffe class.
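For anyone hitting the same symptom, here is a minimal sketch of why the mismatch produces px = 0x1. It does not use the real Caffe headers. The two structs are hypothetical stand-ins whose member order is taken from the gdb dump above, assuming the two GPU handle members are the ones that disappear when CPU_ONLY is defined, that Caffe::GPU has the value 1, and that the machine is little-endian. FakeSharedPtr only mimics the two-pointer shape (px, pi_) that gdb prints for the shared_ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the two-pointer shared_ptr that gdb prints as {px, pn = {pi_}}.
struct FakeSharedPtr {
  void* px;
  void* pi_;
};

// Layout as the library sees it (built without CPU_ONLY).
namespace library_view {
struct Caffe {
  void* cublas_handle_;     // assumed to exist only in GPU builds
  void* curand_generator_;  // assumed to exist only in GPU builds
  FakeSharedPtr random_generator_;
  int mode_;
  int solver_count_;
  bool root_solver_;
};
}  // namespace library_view

// Layout as the application sees it (header included with CPU_ONLY defined).
namespace app_view {
struct Caffe {
  FakeSharedPtr random_generator_;
  int mode_;
  int solver_count_;
  bool root_solver_;
};
}  // namespace app_view

// Simulates inlined application code storing `mode` at the offset it believes
// mode_ has, into an object that was actually laid out by the library.
// Returns the resulting value of random_generator_.px as an integer.
inline std::uintptr_t px_after_app_sets_mode(int mode) {
  library_view::Caffe real = {};  // all members zero-initialised
  std::memcpy(reinterpret_cast<char*>(&real) + offsetof(app_view::Caffe, mode_),
              &mode, sizeof(mode));
  return reinterpret_cast<std::uintptr_t>(real.random_generator_.px);
}

// Checks that the application's idea of mode_ lands on the library's px field.
inline void check_layouts_collide() {
  assert(offsetof(app_view::Caffe, mode_) ==
         offsetof(library_view::Caffe, random_generator_));
  assert(px_after_app_sets_mode(1) == 1u);  // assumed GPU == 1, little-endian
}
```

Calling check_layouts_collide() from a main() passes on a typical x86-64 build, which matches the dump: mode_ still reads CPU in the library's view while px has become 0x1. The practical fix is to compile the application with exactly the same CPU_ONLY setting as the Caffe library it links against.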