Leela-zero: net2net question

Created on 10 Mar 2018 · 4 comments · Source: leela-zero/leela-zero

The net2net approach is apparently great when you increase the number of blocks. My question is whether the end result differs if you do, for example, 30 additional net2nets (adding blocks one by one to reach 40 blocks) or a single net2net to 40 blocks in one step. To generalize the question: what is the optimal number of net2net steps to reach 40 blocks?

Based on the success of the 6->10 net2net, I'm wondering whether we should add 1, 2, 4, 5 or more blocks per net2net step...
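To make concrete what a deepening step involves, here is a minimal sketch (plain numpy, hypothetical function names, not the project's actual net2net script) of how new residual blocks can be appended so that the enlarged network initially computes exactly the same function as the old one:

```python
import numpy as np

def make_identity_block(channels, kernel=3, rng=None):
    """Weights for one new residual block that initially acts as the identity."""
    rng = np.random.default_rng() if rng is None else rng
    fan_in = channels * kernel * kernel
    # First conv: ordinary He initialisation.
    w1 = rng.standard_normal((channels, channels, kernel, kernel)) * np.sqrt(2.0 / fan_in)
    # Second conv: all zeros, so the residual branch outputs nothing and
    # block(x) = x + 0 = x at the moment of insertion.
    w2 = np.zeros((channels, channels, kernel, kernel))
    return w1, w2

def deepen(blocks, new_depth, channels=128):
    """Grow a list of residual-block weights from len(blocks) to new_depth."""
    blocks = list(blocks)
    while len(blocks) < new_depth:
        blocks.append(make_identity_block(channels))
    return blocks
```

A single call with `new_depth=40` corresponds to the one-big-jump option; calling it repeatedly, adding one block and retraining each time, corresponds to the incremental option asked about above.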

All 4 comments

I think that, in general, somewhat larger steps are better. Every time net2net is used, it incurs a degree of inefficiency: the previously optimised network weights are scrambled by the increased learning rate needed to find weights that work for the enlarged net. I don't think incremental steps would in any way be a good solution for network expansion.

@jkiliani Yeah I agree, it's probably better to not do it too often. Net2net makes the supervised bootstrapping to an appropriate playing strength much faster than starting from random weights, but may have an impact on ultimate strength.

I was wondering if depth and width can be scaled at the same time for something like

-> 192x15 (2 times slower with fused Winograd and GPU scaling?)
-> 256x20 (1.7 times slower, 3.4 times overall?)
-> 256x40 (1.8 times slower, 6.1 times overall?)

I know @gcp is, or has been, working on a 128x20 and a 256x10 with net2net; that's why I'm asking.

Increasing the filters is less theoretically appealing, but has this nice GPU scaling property.
For 20 blocks, one could argue that it is kind of a magic number on a 19x19 board, possibly enabling the convolutional layers to see across the board once in every situation. While that doesn't solve every crazy ladder (with mirrors, switches, etc.), there is reason to believe it will be much stronger in normal situations and rely less on the tree search in these cases.
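A rough way to make the receptive-field arithmetic behind this concrete (my reading, assuming the usual layout of one input 3x3 convolution followed by residual blocks with two 3x3 convolutions each):

```python
def receptive_radius(blocks):
    # Each 3x3 convolution lets information move one intersection outward,
    # so the radius grows by 2 per residual block, plus 1 for the input conv.
    return 1 + 2 * blocks

for blocks in (10, 15, 20, 40):
    print(f"{blocks:2d} blocks: radius {receptive_radius(blocks)} "
          f"(corner-to-corner distance on 19x19 is 18)")
```

On paper the field already covers the whole board from any point at around 9 blocks; 20 blocks gives roughly double that margin, which is one way to read the "see across the board once in every situation" claim, given that the effective receptive field of deep nets tends to be much smaller than the theoretical one.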

As for the "repeated net2net", I agree here. It would be interesting to compare a new network trained from scratch with one trained via net2net, but I have run out of GPUs already.

I was wondering if depth and width can be scaled at the same time for something like

Yes, this is possible.
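A minimal sketch of what such a widening step could look like (Net2WiderNet-style, plain numpy, hypothetical helper, not the project's actual script): the new filters copy existing ones, and the next layer's incoming weights are rescaled so the widened network starts out computing the same function.

```python
import numpy as np

def widen_pair(w1, b1, w2, new_width, rng=None):
    """Widen a conv layer (w1, b1) from C to new_width output filters and
    adjust the next layer's weights w2 so the function is preserved.

    Shapes (OIHW): w1 (C, C_in, k, k), b1 (C,), w2 (D, C, k, k).
    """
    rng = np.random.default_rng() if rng is None else rng
    old_width = w1.shape[0]
    # Each new filter copies one old filter; the first old_width map to themselves.
    mapping = np.concatenate([np.arange(old_width),
                              rng.integers(0, old_width, new_width - old_width)])
    counts = np.bincount(mapping, minlength=old_width)

    new_w1 = w1[mapping]     # duplicated filters (small noise is usually added
    new_b1 = b1[mapping]     # here to break the symmetry between copies)
    # Divide the next layer's incoming weights by the replication count so the
    # summed contribution of the duplicated channels matches the original.
    new_w2 = w2[:, mapping] / counts[mapping][None, :, None, None]
    return new_w1, new_b1, new_w2
```

In a residual tower the skip connections force every block (and the batch-norm parameters) to be widened with the same mapping, so a combined jump like 128x10 -> 256x20 would apply a step like this to the whole trunk and then append identity-initialised blocks for the extra depth.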
