Caffe: Need to call share_with in pycaffe after restoring snapshot, before testing

Created on 15 Nov 2015 · 4 comments · Source: BVLC/caffe

I spent a week on this pycaffe issue. It isn't a problem if you start a fresh solver (solver = caffe.SGDSolver(...) followed by solver.net.copy_from(...)), but if I restore a solver from a snapshot (solver = caffe.SGDSolver(...) followed by solver.restore(...)), I have to call solver.test_nets[0].share_with(solver.net) (which calls ShareTrainedLayersWith) before I can get the validation accuracy/loss from solver.test_nets[0].forward(), even if I call solver.step(...) right after restoring. I wasn't aware you needed to do this.
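For reference, here is a minimal sketch of the workaround I mean (the prototxt/snapshot paths and the 'accuracy' blob name are placeholders for my setup):

```python
import caffe

solver = caffe.SGDSolver('solver.prototxt')            # placeholder solver definition
solver.restore('snapshot_iter_10000.solverstate')      # placeholder snapshot path

# Without this line the test net keeps its freshly initialized weights,
# so forward() reports near-random accuracy/loss.
solver.test_nets[0].share_with(solver.net)

solver.test_nets[0].forward()
print(solver.test_nets[0].blobs['accuracy'].data)      # assumes the test net has an 'accuracy' blob
```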

First off, is this correct?

If so, I feel like others might run into this issue in the future, so some suggestions:

1) Make sure the solver properly syncs up the test net during restoration (I'm not too sure of what's going on in solver.cpp),
2) Add the share_with line to any of the pycaffe tutorials before calling solver.test_nets[0].forward(),
and/or
3) Expose the solver's test method in pycaffe so it can be used instead of calling solver.test_nets[0].forward() directly (a rough sketch of such a helper is shown below this list)
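For 3), something like the following could serve as a stopgap in user code until a real API exists; test_iters and the 'accuracy'/'loss' blob names are assumptions about the test net definition:

```python
def run_test(solver, test_iters=100):
    """Stopgap for a missing solver.test(): evaluate the test net with the trained weights.

    test_iters and the 'accuracy'/'loss' blob names depend on your net definition.
    """
    test_net = solver.test_nets[0]
    test_net.share_with(solver.net)      # make the test net use the training net's weights
    acc = loss = 0.0
    for _ in range(test_iters):
        test_net.forward()
        acc += float(test_net.blobs['accuracy'].data)
        loss += float(test_net.blobs['loss'].data)
    return acc / test_iters, loss / test_iters
```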

All 4 comments

I just ran into the same issue. I was also restoring a snapshot with solver.restore() and was wondering why the model didn't seem any better than random.

I think the root issue is that the ShareTrainedLayersWith function is called in the solver.cpp Test function. When initializing the solver from scratch, the solver executes a Test after the first training iteration no matter what. However, this isn't always the case when restoring from a snapshot.

So, the issue won't show if you're using the solver in the standard way. But if you're iterating using solver.step() and testing using solver.test_nets[0].forward(), Test is never called and the weights are never shared. I think the test nets should be forced to share the weights _on initialization_, shouldn't they?
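To illustrate, this is roughly the loop pattern in question; given the behavior described above, the share_with call is what keeps the test results from looking random after a restore (paths and intervals are placeholders):

```python
solver = caffe.SGDSolver('solver.prototxt')
solver.restore('snapshot_iter_10000.solverstate')    # placeholder snapshot path

for it in range(1000):
    solver.step(1)                                   # manual training iteration
    if it % 100 == 0:
        # Test() is never invoked in this loop, so the weights have to be
        # shared explicitly before evaluating the test net.
        solver.test_nets[0].share_with(solver.net)
        solver.test_nets[0].forward()
```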

I spent almost 4 hours and had the same issue as well. I loaded a snapshot, then began iterating with solver.step() and computing accuracy on a validation dataset. I noticed that the test_net weights were different right before I saved the snapshot and right after I loaded it back in. Hours of searching the Internet led me here, and I'm thankful to have found this.

In any case, I agree that the weights should be shared by default; it just naturally makes sense.

I found that even when I used solver.net.copy_from(...) and then directly called solver.test_nets[0].forward() and read solver.test_nets[0].blobs['loss'].data as my test loss, the accuracy was still as if the weights were random.

Thanks to @bchu: I added solver.test_nets[0].share_with(solver.net) before solver.test_nets[0].forward(), and the issue disappeared and the accuracy looks normal.

I also tried adding solver.step(1) before forward() and removing share_with(); the test weights suddenly come back as the normally trained weights and the test loss is normal, so I think some operation in step() refreshes test_nets[0]'s weights to the trained weights.

Fortunately, I have fine-tuned several times without seeing a sudden jump in test loss at the beginning of training, because my training step solver.step(1) is the first statement inside the iteration loop.

I hope pycaffe will expose a test() API to directly get the test network's accuracy with the trained weights. Thanks for finding this issue and saving me 4 hours!

I found that as long as you restore from the solverstate, the parameters of solver.test_nets[0] and solver.net are exactly the same even without the share_with method. That is to say, the test net and the training net hold the same network weights.
However, when you call solver.test_nets[0].forward(), it doesn't operate on the test net you loaded from the solverstate. It is really strange.

To get the correct accuracy from solver.test_nets[0].forward(), I still have to call the share_with method.
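For what it's worth, this is the kind of comparison I mean; a rough sketch, assuming the solver has already been restored and that layer names match between the two nets:

```python
import numpy as np

# Compare the first parameter blob of each layer between the train and test nets.
for name, params in solver.net.params.items():
    if name in solver.test_nets[0].params:
        same = np.array_equal(params[0].data,
                              solver.test_nets[0].params[name][0].data)
        print(name, 'weights identical:', same)
```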
