I spent a week on this pycaffe issue. It isn't a problem if you start a fresh solver (solver = caffe.SGDSolver(...) followed by solver.net.copy_from(...)), but if I restore a solver from a snapshot (solver = caffe.SGDSolver(...) followed by solver.restore(...)), I have to call solver.test_nets[0].share_with(solver.net) (which calls ShareTrainedLayersWith) before I can get the validation accuracy/loss with solver.test_nets[0].forward(), even if I call solver.step(...) right after restoring. I wasn't aware you needed to do this.
First off, is this correct?
If so, I feel like others might run into this issue in the future, so some suggestions:
1) Make sure the solver properly syncs up the test net during restoration (I'm not too sure what's going on in solver.cpp),
2) Add the share_with line to any of the pycaffe tutorials before calling solver.test_nets[0].forward(),
and/or
3) Expose the solver's test method in pycaffe so it can be used instead of calling solver.test_nets[0].forward() directly.
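For reference, here's a minimal sketch of the workaround I described above (the paths and the 'accuracy' blob name are placeholders for your own setup):

```python
import caffe

caffe.set_mode_cpu()  # or caffe.set_mode_gpu()

# Placeholder paths: adapt to your own solver prototxt and snapshot.
solver = caffe.SGDSolver('solver.prototxt')
solver.restore('snapshot_iter_10000.solverstate')

# Workaround: make the test net share the train net's (restored) weights.
# Without this, solver.test_nets[0].forward() runs on the test net's own,
# still randomly initialized, parameters.
solver.test_nets[0].share_with(solver.net)

solver.test_nets[0].forward()
# 'accuracy' is an assumption; use whatever output blob your test net defines.
print(solver.test_nets[0].blobs['accuracy'].data)
```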
I just found the same issue. I was also restoring a snapshot with solver.restore() and wondering why the model didn't seem any better than random.
I think the root issue is that the ShareTrainedLayersWith function is called in the solver.cpp Test function. When initializing the solver from scratch, the solver executes a Test after the first training iteration no matter what. However, this isn't always the case when restoring from a snapshot.
So, the issue won't show if you're using the solver in the standard way. But if you're iterating using solver.step() and testing using solver.test_nets[0].forward(), Test is never called and the weights are never shared. I think the test nets should be forced to share the weights _on initialization_, shouldn't they?
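To make it concrete, this is the kind of manual loop where it bites, with the explicit share added up front (paths, iteration counts, and the 'loss' blob name are placeholders):

```python
import caffe

solver = caffe.SGDSolver('solver.prototxt')            # placeholder path
solver.restore('snapshot_iter_10000.solverstate')      # placeholder path

# Solver::Test() would call ShareTrainedLayersWith() for us, but it is
# never reached in a manual step()/forward() loop, so share explicitly:
solver.test_nets[0].share_with(solver.net)

for it in range(1000):
    solver.step(1)                                      # one training iteration
    if it % 100 == 0:
        solver.test_nets[0].forward()                   # now uses the trained weights
        # 'loss' is an assumption; use your test net's actual output blob.
        print(it, solver.test_nets[0].blobs['loss'].data)
```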
I spent almost 4 hours on the same issue. I loaded a snapshot, then began iterating with solver.step() and computing accuracy on a validation dataset. I noticed that the test_net weights were different right before I saved the snapshot and right after I loaded it back in. Hours of searching the Internet put me here, and I'm thankful to have found this.
In any case, I agree that the weights should be shared by default... it just naturally makes sense.
I found that even when I use solver.net.copy_from(...) and then directly call solver.test_nets[0].forward() and read solver.test_nets[0].blobs['loss'].data as my test loss, the accuracy also looks like that of random weights.
Thanks to @bchu: I added solver.test_nets[0].share_with(solver.net) before solver.test_nets[0].forward(), and the issue disappeared; the accuracy looks normal.
I also tried adding solver.step(1) before forward() and removing share_with(), and the test weights suddenly become the trained weights and the test loss is normal, so there must be some operation in step() that refreshes test_nets[0]'s weights to the trained weights.
Fortunately, across several fine-tuning runs I never saw the suddenly increased test loss at the beginning of training, because my training step solver.step(1) is the first statement inside the iteration loop.
I hope pycaffe will expose a test() API to directly get the test network's accuracy with the trained weights. Thanks for filing this issue and saving me 4 hours!
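Until pycaffe has that, something like the following hypothetical helper works on the Python side (test_iters and the 'accuracy' blob name are assumptions to adapt to your net):

```python
def test(solver, test_iters=100, acc_blob='accuracy'):
    """Hypothetical helper: average a test-net output blob over test_iters batches."""
    # The point of this whole issue: make sure the test net sees the
    # trained weights before running any forward passes.
    solver.test_nets[0].share_with(solver.net)
    total = 0.0
    for _ in range(test_iters):
        solver.test_nets[0].forward()
        total += float(solver.test_nets[0].blobs[acc_blob].data)
    return total / test_iters
```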
I found that as long as you restore from the solverstate, the parameters of solver.test_nets[0] and solver.net are exactly the same, even without the share_with method. That is to say, the test net and training net already hold the same network weights.
However, when you call solver.test_nets[0].forward(), it does not operate on the test-net weights you loaded from the solverstate. It is really strange.
To get the correct accuracy from solver.test_nets[0].forward(), I had to call the share_with method.
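For what it's worth, this is roughly how I compared the two nets' parameters (a sketch, assuming the layer names match between the train and test nets):

```python
import numpy as np

# Compare the first parameter blob (the weights) of every layer the two nets share.
for name in solver.net.params:
    if name in solver.test_nets[0].params:
        same = np.array_equal(solver.net.params[name][0].data,
                              solver.test_nets[0].params[name][0].data)
        print(name, 'identical:', same)
```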