How about training the LZ network on game scores as well as game outcomes? The idea is to give the network more information about the games it's trained on. Currently, the network is only shown positions that are either "won" or "lost". It never sees positions that are "equal" or "slightly better" for one side or the other, which may be creating all kinds of issues such as difficulties in network training and biases during play.
This extra network output would not need to be evaluated at all during matches. The only change to the current system might be playing all training games out to scoring.
Has this kind of thing been tried?
I am a bit lost. Do you mean we use the score only for training purposes, and then throw it away when actually playing the game?
The goal of the game is to win. Score differences are irrelevant. Every time this has been tried the result was a significantly weaker program.
I’ll try to illustrate the point of my suggestion. The idea is not to replace winrate prediction with something else, but to give the network more complete information about the games it is trained on.
It has been noticed before (e.g. https://arxiv.org/pdf/1707.03300.pdf ) that training a network to perform multiple related tasks can make it learn faster and improve its performance on each individual task. It can also make it learn things it couldn't learn before. This doesn't always hold, but it did for AlphaGo Zero: it joined move prediction and winrate prediction into one network, which increased its strength. However, each training game contains much more information than just the win/loss outcome attached to its positions.
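For reference, the AlphaGo Zero paper trains its combined network with the loss l = (z − v)² − πᵀ log p + c‖θ‖², where z is the game outcome, v the value-head output, π the search probabilities, p the policy-head output and c a regularization constant. The proposal here would, roughly, just add one more term for a hypothetical score head, something like l′ = l + λ(s − ŝ)², where s is the (suitably scaled) final score margin, ŝ the score head's prediction, and λ a weighting constant that would have to be tuned; the extra term only matters during training.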
The network consists of residual blocks terminated by several output “heads”. Training an extra score-prediction “head” would affect the weights inside the blocks and thereby influence the other outputs. Whether anything useful can be done with the score prediction itself is a separate matter. The new “head” is of no use for tree search and could simply be removed from the network that is used for actual play. The network would still be used to predict the winrate; the hope is just that its prediction would be better.
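To make the architecture concrete, here is a rough PyTorch sketch (not the actual Leela Zero code, which is C++ with a TensorFlow training pipeline; the layer sizes, the name `score_head` and the 19×19 assumptions are all mine). The point is only that the proposed head hangs off the same residual tower as the existing policy and value heads, so training it pushes gradients into the shared blocks, while at play time it can simply never be called.

```python
import torch
import torch.nn as nn

# Rough sketch only -- not the real LZ network; sizes and names are illustrative.
class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return torch.relu(x + y)

class MultiHeadNet(nn.Module):
    def __init__(self, in_planes=18, ch=64, blocks=6):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_planes, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        # Existing heads: move probabilities and win/loss value.
        self.policy_head = nn.Sequential(nn.Conv2d(ch, 2, 1), nn.Flatten(),
                                         nn.Linear(2 * 19 * 19, 19 * 19 + 1))
        self.value_head = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Flatten(),
                                        nn.Linear(19 * 19, 1), nn.Tanh())
        # Proposed extra head: predicts the (scaled) score margin.
        # Trained alongside the others, but never queried during search,
        # so it could be dropped from the network that is shipped for play.
        self.score_head = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Flatten(),
                                        nn.Linear(19 * 19, 1), nn.Tanh())

    def forward(self, x, with_score=False):
        h = self.tower(self.stem(x))
        p, v = self.policy_head(h), self.value_head(h)
        return (p, v, self.score_head(h)) if with_score else (p, v)
```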
AlphaGo Zero was the first time multiple outputs were joined into a single network for the game of Go, and it resulted in a strength improvement. The idea here is to give this single network more information about the training games. Say there are two lost games: one is a peaceful 0.5-point game, the other a big fight involving multiple groups, some of which died, ending in a 30-point loss. Currently, positions from both games are shown to the network as simply “lost”, and it has to essentially guess how that came about: whether anything died, and on what scale. If anything, this guessing makes network training harder. At least intuitively, including the game margin should help the network understand positions better and increase training efficiency.
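As a minimal sketch of what the extra training target and loss term might look like (the per-point scaling, the 0.1 loss weight and the function names are placeholders, not tuned or official values):

```python
import torch.nn.functional as F

def make_targets(score_margin: float, board_points: int = 361):
    """score_margin: final score in points from the side to move's point of view."""
    z = 1.0 if score_margin > 0 else -1.0                 # existing win/loss target
    s = max(-1.0, min(1.0, score_margin / board_points))  # proposed scaled-margin target
    return z, s

def training_loss(policy_logits, value, score, pi_target, z_target, s_target,
                  score_weight=0.1):
    # pi_target is the MCTS visit distribution over moves (a probability vector).
    policy_loss = -(pi_target * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    value_loss = F.mse_loss(value.squeeze(-1), z_target)
    score_loss = F.mse_loss(score.squeeze(-1), s_target)  # the only new term
    return policy_loss + value_loss + score_weight * score_loss
```

How exactly to scale the margin (divide by the board size, squash it with tanh, or something else) is precisely the kind of thing that would need experimentation.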
I think previous attempts resulted in weaker programs because a 'simple' deep-learning approach trained on players' games would invite the network to learn overplay moves, and, without enough games to learn how to overcome those overplays, the network fell into the trap of playing overplays itself.
But I also think that with reinforcement learning the network could get out of that trap and keep playing a clean game without overplays. I hate it when leela abandons a group that could easily live, without compromising anything else, for no other reason than that it still wins even without that group :/ Taking the score into account during the learning phase could 'force' leela not to leave these groups to an unnecessary death.