Leela-zero: Tune first play urgency?

Created on 20 Jan 2018  ·  369 Comments  ·  Source: leela-zero/leela-zero

I tried a lot of modifications to first play urgency (fpu) after learning that AlphaGo Zero probably used the equivalent of 0.5, reading #238 and the relevant part of the AlphaGo Zero paper. A lot of my modifications were bad or didn't make an improvement (I am not an expert in game AI or neural networks, so that was expected; also I don't have a very good computer, so the sample size of my test games is low). But then I tried simply lowering the fpu by 0.2 in the function UCTNode::get_eval. My reasoning, from my other failed experiments, was that it seemed bad for exploitation to be hurt by an fpu that is too high, and I simply intended to confirm that hypothesis. I got a very surprising result: 5/0 against the standard version (that was on 0.10.1). My tests were run in Sabaki with 1000 playouts, using the c83e network; the parameters used were
--gtp -w leelaz-model-128000.txt --playouts 1000 --noponder
I also switched black/white side every game.
I then tried giving 1600 playouts to the standard version; the results were 1/1. Then version 0.11 arrived, so I applied my modification to that version and ran tests again, and got 3/0 against the normal version.

At that point I was wondering if these results were simply because the same network is used on both sides. The deeper search caused by the lowered fpu would then be more efficient because it knows the moves the other side is going to make. So I tried matching the lowered-fpu version with the c83e net against the normal version with the latest 7fde net. The results are currently 2/0 for the lowered-fpu version with the worse net.

I also tried using weaker networks, because it seems to make sense that the lowered fpu causes less exploration, and thus a weaker network would not be as strong using it. So maybe my change is useful only once the network is sufficiently strong.
Using the a1151640 network, one of the earliest, almost playing randomly, the results are indeed 0/2, showing that maybe it is worse at that level of play.
Using the 4e4d09 network, a slightly stronger one, made after the bug corrections, the results are however 2/0.

In total that makes a record of 13/3 for the modification. I know that is not enough to draw complete conclusions, especially with the use of 1000 playouts instead of 1600 and the mess of different networks and versions used in my tests, but I hope by posting this that someone with a stronger computer than me can run extensive tests to confirm or refute this change.

The only change to make is in UCTNode.cpp, in the function get_eval: change
return eval to return eval - 0.2f
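As a sketch (the real UCTNode::get_eval also handles visit counts and color to move; m_init_eval here just stands in for the network eval stored in the node), the change amounts to:

```cpp
#include <cmath>
#include <cassert>

// Illustrative only: isolates the one-line change, not the full function.
float fpu_original(float m_init_eval) {
    return m_init_eval;           // current: unvisited child inherits its stored eval
}

float fpu_reduced(float m_init_eval) {
    return m_init_eval - 0.2f;    // proposed: lower the first play urgency by 0.2
}
```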

I'm not very familiar with GitHub workflows so I'm not sure if this warrants a pull request before confirming whether it works or not.

enhancement

Most helpful comment

I will merge this, after adding the old policy as a tunable (for weak networks).

leelaz v leelaz-tuned (433/1000 games)
board size: 19   komi: 7.5
               wins              black          white        avg cpu
leelaz          106 24.48%       49  22.58%     57  26.39%    520.75
leelaz-tuned    327 75.52%       159 73.61%     168 77.42%    487.62
                                 208 48.04%     225 51.96%

leelaz-m2_2 v leelaz-tuned (432/1000 games)
board size: 19   komi: 7.5
               wins              black          white        avg cpu
leelaz-m2_2     189 43.75%       102 47.22%     87  40.28%    511.17
leelaz-tuned    243 56.25%       129 59.72%     114 52.78%    508.18
                                 231 53.47%     201 46.53%

PUCT=0.8, reduction=0.22

Network was 5773f44c

All 369 comments

@killerducky You might find this of interest.

Nice experiment. Another way to encourage deeper search instead of wider is to modify cfg_puct. It's currently 0.85f, maybe it should be smaller. This will change the balance for all moves, not just those that have 0 visits. Many code changes have been made since cfg_puct was tuned. I think especially tree reuse could have a large impact on the balance between deep/wide search.

Of course there aren't enough results yet. In addition, you said you ran many experiments; that makes the problem of low sample size even worse, because it's a form of p-hacking.

Yeah, I know that it may have just been me being lucky with the matches. I'm running more games but it is slow. I ran 10 more games, using 1000 playouts for the reduced fpu and 1600 for the normal fpu, so at a handicap, and I got 8/2 for now.
I thought about cfg_puct too, but I thought fpu was more important because, for example, if one side is on a bad sequence which is becoming worse and worse as it explores deeper, the fpu will be higher than the explored nodes, hurting exploitation. If a 0.5 net_eval move has all its explored children at 0.1, but one at 0.4, instead of exploring the 0.4 move, which is the best explored move yet, it will continue exploring unexplored nodes which have fpu 0.5, which seems bad to me. This preference will be the same no matter the cfg_puct.

I'm starting to like your idea more. Suppose the parent position has value 0.5 (it's even). The top candidate moves should also have close to 0.5. But as you move down the candidate list, those moves will tend to be worse, and will have smaller and smaller evals when finally evaluated. So it might be interesting to try to model that behavior. Or just use a simple parameter like you have subtracting 0.2.

Edit: Another obvious candidate is to actually test FPU=0.5, since we now think that's what Google did. You didn't explicitly say so, did you test FPU=0.5?

Yes, that was one of the things I tried. In particular I tried using get_eval of the parent position as a maximum for the fpu. The results were interesting but not spectacular enough given the sample size, especially compared to just subtracting 0.2, which I quickly realised was winning almost every game.

Edit : I also tried 0.5. I didn't do that many tests (3/4? I don't remember, I should have kept my notes for that xD) but it didn't seem very different from the current version. I abandoned the idea when it got caught in a ladder by the normal version.
Edit 2 : I saw this reddit post https://www.reddit.com/r/cbaduk/comments/7rqj3k/leelazero_passed_my_ladder_test_finally/ and decided to try the reduced-fpu version with c83e on this position, and it doesn't get caught in the ladder while the normal version does.

If a 0.5 net_eval move has all its explored children at 0.1, but one at 0.4, instead of exploring the 0.4 move, which is the best explored move yet, it will continue exploring unexplored nodes which have fpu 0.5, which seems bad to me. This preference will be the same no matter the cfg_puct.

This is not correct. Take a look at the complete formula: https://github.com/gcp/leela-zero/blob/47ddf38327b80ec5f2303344d1d9c65e25f5a3fb/src/UCTNode.cpp#L340

cfg_puct controls the influence of the policy net prior. If this is strong, the other moves will not be explored even if their FPU is higher than the evaluated move.
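For concreteness, here is a rough sketch of that selection rule (a simplified PUCT-style score with illustrative names, not the exact UCTNode.cpp code): with a strong prior on the evaluated move, it keeps winning the argmax even against a higher-FPU unvisited sibling, while with equal priors the unvisited sibling wins, which is the scenario described above.

```cpp
#include <cmath>
#include <cassert>

// Simplified PUCT-style selection score: q is the value (FPU for an
// unvisited node), prior is the policy net output for the move.
float puct_score(float q, float prior, float cfg_puct,
                 int parent_visits, int child_visits) {
    const float num = std::sqrt(static_cast<float>(parent_visits));
    return q + cfg_puct * prior * num / (1.0f + static_cast<float>(child_visits));
}
```

With cfg_puct = 0.85 and 16 parent visits, an evaluated move (q = 0.4, prior = 0.3, 1 visit) scores above an unvisited one (FPU q = 0.5, prior = 0.05), but if both priors are equal at 0.1 the unvisited move scores higher.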

the sample size of my test games is low

A strong word of advice: this is a problem that you must fix first, or else you are completely wasting your time.

Yes, the policy prior has an influence, but in the case where it has the same value for 2 moves, what I said holds true.

As for playing more games, I know, and I am doing more games, but since my computer is slow I also think it is a better use of time if someone with a strong computer can do 100 games in one day, while it would take me a week.

in the case where it has the same value for 2 moves, what I said holds true.

If the moves have the same policy prior, then why would the behavior you describe (explore them both) not be exactly what you'd want? The alternative (explore the move we just discovered is bad) makes no more sense, for sure.

Many code changes have been made since cfg_puct was tuned. I think especially tree reuse could have a large impact on the balance between deep/wide search.

Fair point. Seems worth re-tuning now. (I do think tree reuse is the only change that has an effect here.)

If the moves have the same policy prior, then why would the behavior you describe (explore them both) not be exactly what you'd want? The alternative (explore the move we just discovered is bad) makes no more sense, for sure.

The point is that the move we just discovered is actually not bad. It is worse than we expected given the parent's net_eval, but all the other explored moves are worse. In fact, at that point the parent's get_eval() would be lower than the move we just discovered. And yet, because of the higher fpu (the parent's net_eval), the search will prefer to explore unexplored nodes. To illustrate that point, if we imagine a worst case of a flat policy, the search will actually explore ALL unexplored nodes before looking at the best explored node. In fact I wonder if that is not what happens in failed ladders.

Essentially because of that, I'm quite sure there is a problem with the current fpu. My -0.2 is probably not even the best solution (even beyond simply optimizing the value to subtract); it's just something that mitigates the problem.

How can we know that all the unexplored nodes will be even worse without exploring them?


We don't know, but we can't know if they will be better either. There should be a balance, and it's probably not with an fpu that is higher than the best explored node that we get this balance.

I follow now, but it surely depends how many nodes you have explored. If it is just 1, and that happens to be worse than the parent, that does not mean much; if it is several, then that may suggest that the parent node was too optimistic.


So as a starting point to test, an idea that comes to me is that the fpu for unknown children could be a weighted average of the parent and the children seen so far. So when we start fresh, it's the parent's. But the more we explore and see bad children, the more the unknown children will also drop toward the average. By the time we have 1 unknown child left, it'll be the weighted average of 1 (for the parent) + # of children - 1 (the explored children), and the parent alone is having very little effect.
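A minimal sketch of that idea, with hypothetical names (not LZ code): the parent's eval gets weight 1 and each explored child gets weight 1, so the parent's influence fades as more children are resolved.

```cpp
#include <vector>
#include <numeric>
#include <cmath>
#include <cassert>

// FPU for an unvisited child = average of the parent eval (weight 1)
// and the evals of all children explored so far (weight 1 each).
float weighted_fpu(float parent_eval, const std::vector<float>& explored) {
    const float sum = parent_eval +
        std::accumulate(explored.begin(), explored.end(), 0.0f);
    return sum / (1.0f + static_cast<float>(explored.size()));
}
```

With no children explored this is just the parent's eval; after three bad children at 0.1 under a 0.5 parent, it has dropped to 0.2.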

@roy7 I like that idea.

It looks very interesting. Perhaps someone with strong machine will help with testing?

So as a starting point to test, an idea that comes to me is that the fpu for unknown children could be a weighted average of the parent and the children seen so far. So when we start fresh, it's the parent's. But the more we explore and see bad children, the more the unknown children will also drop toward the average. By the time we have 1 unknown child left, it'll be the weighted average of 1 (for the parent) + # of children - 1 (the explored children), and the parent alone is having very little effect.

Yeah, that is the idea I had at the very start. It didn't seem to work though, and I think it may be because it causes problems on the opposite side: when we have a parent node at 0.5, and explored child nodes at 0.5, and then we find an excellent node at 0.9, the average value goes up, thus the fpu goes up, and because of that the excellent node is exploited less...

But to be precise, the tests I did used get_eval of the parent for the fpu of the children, because from what I understand it is already an average of the children and the parent. But it is not weighted, right? So maybe using a weighted average would actually be better. Good idea!

I did 10 more tests of the reduced-fpu option, and the record thus far is 13/7. Keep in mind that I do these tests with only 1000 playouts for the reduced-fpu option and 1600 for the normal version, just to go faster while removing the possible handicap of using a playout number the normal version wasn't tuned for.

I've not learned this area of the code, but I've read all of the discussions and issues and pull requests. :) get_eval might average; I think what I meant was that when a new node is made, unknown children are assumed to be the same as the parent, which would be the nn_eval from the parent's node being fed to the GPU? So that's what I meant when I suggested a weighted average. If the parent get_eval is already weighted then I don't want to confuse the discussion further. Although I'm guessing that as nodes are explored, the unknown children's values are never changed.

Instead of looping through all of the children each time we explore a new child to update their values, some sort of "unknown child value" could be stored and updated and referenced for any unknown child. So if the get_eval of the parent is an average, it'd sum actual children and # of unknown * unknown val. And unknown_val would be adjusted accordingly every time a new child is resolved?

Simple example of my thinking. Say we open a new node. It has a nn_eval of .5. We have 4 children with unknown values. So:

Parent nn_eval .5. Child 1-4 is each set to .5 based on parent.

We explore child 1. It has a nn_eval of .99. So we update unknown children to be 1 * .99/4 + 3 * .5/4.

So now Child 1 = .99, Child 2-4 are .6225.

As opposed to leaving Child 2-4 to the original .5 which was just a guess anyway. Say we now explore child 2 and it really is a nn_eval of .5.

Without weight: C1 .99, C2-4 still .5
With weight: C1 .99, C2 .5, C3-4 = 1 * .99/4 + 1 * .5/4 + 2 * .5/4 = .6225 still.

In English, if we assume all moves are 50/50 and explore a couple and see one is great and one is 50/50, we'll assume unknown ones might be better than 50/50 since we know a better move exists. If the one move we looked at was terrible, the average will go down and .5 would be our best move. As we explore more unknown children, maybe we discover .5 is the best move. Or maybe we find other better moves to raise the average, and so unknown moves look better than .5 again.
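The arithmetic above can be checked with a small helper (illustrative names, not LZ code): the unknown-child value is the mean over all children, counting resolved ones at their real nn_eval and the rest at the original 0.5 guess.

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Mean over all children: resolved children at their real nn_eval, the
// remaining unknown ones at the original guess (the parent's 0.5 here).
float unknown_val(float guess, const std::vector<float>& resolved,
                  int total_children) {
    float sum = guess * static_cast<float>(total_children - resolved.size());
    for (float v : resolved) sum += v;
    return sum / static_cast<float>(total_children);
}
```

Both unknown_val(0.5f, {0.99f}, 4) and unknown_val(0.5f, {0.99f, 0.5f}, 4) come out to .6225, matching the worked example.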

Anyway, just some brainstorming. Hopefully someone with more C++ skills and LZ knowledge might try tweaking the unknown child fpu a bit and run ringmaster on the ideas.

Hmm, I thought about it more, and the weighted average idea may not actually work as is, if what I think is right, that is: in the situation with the parent net_eval at 0.5 and an excellent child at 0.9, you want to go exploit the 0.9 first, so raising the fpu might be a bad idea, since the search will read the 0.9 less deeply before going to unexplored children. It is similar thinking that brought me to just dumbly lower the fpu by 0.2; I wanted to verify if it was better to reduce the fpu in general. Still need more test matches, as gcp said, but for now it does seem that we want to lower the fpu rather than raise it.

Edit : I realized there is a problem with the -0.2: if m_init_eval is below 0.2, then the fpu is less than 0, and the unexplored nodes will stay unexplored forever, since the explored nodes will never have an eval below 0 and the puct term cannot go negative either. I changed the return of get_eval in the case of visits == 0 to std::max(0.05f, m_init_eval - 0.2f) and am seeing better results at 1600 playouts. Will do more test games.
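In isolation, the fix described in the edit looks like this (again a sketch around the one return, not the full function):

```cpp
#include <algorithm>
#include <cmath>
#include <cassert>

// Clamp the reduced FPU to a small positive floor so a node with
// m_init_eval < 0.2 cannot end up with a negative urgency and be
// starved of visits forever.
float fpu_clamped(float m_init_eval) {
    return std::max(0.05f, m_init_eval - 0.2f);
}
```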

@gcp When testing things, is an SPRT pass good enough? So can e.g. 11-0 in favor of a change stop there, or should we do more tests than just what SPRT wants?

@Eddh I'm running a test now. I'll let it run overnight, but let me just say... Get your pull request ready!

I think we can do even better, I'm going to try to code up something soon.

Is the idea to best predict the value of the node once explored? If so, then it seems reasonable to expect that those nodes that have a lower policy % will also have a lower value, i.e. should they default to a function of the parent value and the policy percentage?

Returning eval - 0.2 just seems to unfairly favour the first node explored, with no basis for believing all the other nodes are likely to evaluate much worse than the parent. This causes the search to be deeper than might otherwise be desired, good for ladders but maybe not good overall. In any case, how wide the search should be is decided elsewhere.


Might be. I have no clue. But it seems redundant since the policy influencing which move to pick among explored and unexplored both is already there and working.

Apart from that, I changed it again to eval - 0.2, but then I look for the worst child, and if the fpu is lower than the worst child, I raise it to the worst child. So it is a slight tune back toward more exploration, and it also takes care of the negative fpu problem. We'll need to sort all these solutions out before doing a pull request I think x)
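A sketch of that variant (hypothetical names, not the actual branch code): subtract 0.2 from the eval, but never drop below the worst eval among the already-explored children, which also keeps the fpu non-negative once any child has been explored.

```cpp
#include <vector>
#include <algorithm>
#include <cmath>
#include <cassert>

// FPU = eval - 0.2, floored at the worst already-explored child eval.
float fpu_worst_child_floor(float parent_eval,
                            const std::vector<float>& explored) {
    float fpu = parent_eval - 0.2f;
    if (!explored.empty()) {
        const float worst = *std::min_element(explored.begin(), explored.end());
        fpu = std::max(fpu, worst);  // tune back toward more exploration
    }
    return fpu;
}
```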

@Eddh even if you don't want to do a formal pull request yet, can you push a branch to your fork so we can see the various experiments you try? For example, I get the idea of the seek-the-worst-child thing, but I'd like to see the implementation.

Using the worst child makes a bit more sense; in a way this does what I suggested. Those with a lower policy score will be evaluated later, so they will likely be given a lower unexplored value when the time comes to consider them, as it is more likely a worse child will be found in the meantime. I like this, though the -0.2 cap still seems arbitrary.

If we want to avoid one bad child setting the FPU too low, how about setting it to the first quartile? (median value of the lowest half of all child nodes)

If we check each child for a NNCache hit to use real nn_eval starting value, there's a chance we'll find a value we already really know. If it's a bad value, that gives you a bad child floor to start with before even needing to explore a random child.

Here is my fork https://github.com/Eddh/leela-zero/tree/tune_fpu_worst_child
I have got 7/1 with 1600p vs 1600p c83e against normal version for now
Thanks for your help with testing :)

@evanroberts85 Yes I think that might be a good idea. It's slightly more complicated to implement though. I was too lazy. And now I am sleep deprived from testing.

@Eddh did you check the performance of that latest idea? It looks to me like it will be very expensive.

My results for Eddh's first suggestion return eval - 0.2f. This is with latest 6x128 7fde net.

lzbase-p-1600 v lzfpum0p2-p-1600 (68/1000 games)
board size: 19   komi: 7.5
                   wins              black         white       avg cpu
lzbase-p-1600         8 11.76%       3   8.82%     5  14.71%    435.14
lzfpum0p2-p-1600     60 88.24%       29 85.29%     31 91.18%    429.74
                                     32 47.06%     36 52.94%

I'm testing a version that does a moving average of sibling node net evals. I wrote some code to analyze a few different methods and my goal was a simple algorithm that minimizes the MSE between get_eval when visits=0 (FPU) and when visits=1.

The idea behind using parent's eval as the FPU is to use a better approximation than a constant. Doing a moving average as we get more information is a refinement of that idea.

I'll post more results and some code tomorrow.
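My guess at what such a moving average looks like (an exponential smoothing sketch with illustrative names; the 0.1 weight is an assumption taken from the later discussion of MOVING_AVG_WEIGHT, and the actual branch code may differ):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Assumed weight; the branch may use another value.
constexpr float MOVING_AVG_WEIGHT = 0.1f;

// Exponentially smoothed FPU: start at the parent's eval and shift a
// fixed fraction toward each newly evaluated sibling.
float smoothed_fpu(float parent_eval, const std::vector<float>& sibling_evals) {
    float fpu = parent_eval;
    for (float v : sibling_evals) {
        fpu += MOVING_AVG_WEIGHT * (v - fpu);
    }
    return fpu;
}
```

So a 0.5 parent with one bad sibling at 0.1 only drops the FPU to 0.46, which is why the smoothing factor matters.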

How does the FPU=0 version perform? That was arguably the choice for AGZ, and IIRC it performed quite well for the supervised 6x128 net. That could at least serve as a second baseline.

I wrote some code to analyze a few different methods and my goal was a simple algorithm that minimizes the MSE between get_eval when visits=0 (FPU) and when visits=1.

I wondered if anyone would make this step, it makes perfect sense of course!

A moving average is likely to give too high an evaluation, as unexplored nodes are unexplored precisely because we can expect them to be slightly worse than our first options, hence their lower policy score. I would suggest an average of only the weaker half of all sibling nodes would give a lower MSE.

@killerducky wow, I didn't even expect that much because of the negative fpu problem. Again, thanks for running the tests.
For the performance, I admit I didn't really look at it. I don't think it has a strong impact tho because I use an already existing loop for seeking the worst child. So at worst it would be 1 comparison, 1 value assignment per child, and then 1 comparison/1 value assignement for the parent. (If you look at my code I actually have some code for finding the max child too but I don't use it xD)
Edit : It doesn't seem to affect performance noticeably.

The moving average idea was also something I thought of. But looking at the AlphaGo Zero paper and the code, get_eval already returns the average of the child nodes and the parent, so in that sense I already tried that. It didn't seem to work. It was a very low sample size, so maybe it could turn out differently with more tests, but I think it isn't that simple because, as evan said, explored nodes have been explored because they were recommended by the policy, and thus the unexplored ones are likely to be worse than the explored ones. I still think this idea is a good starting point but not usable directly.

The idea of roy7 to look for previous evaluations of the move is very good also, I didn't think of that.

@isty2e It seems to me that AGZ's network returned values in the range [-1, 1] and not [0, 1], thus 0 for them should be the equivalent of 0.5 for us. Apart from that, I think using an fpu of 0 is the wrong balance, but on the "too deep" side. Imagine the worst case of a flat policy: the search will choose a move at random, which is likely to be bad, so let's say this move has a net_eval of 0.2. The next time the parent node is visited, the search will have to choose between this node and all the unexplored nodes, with fpu=0. Since the puct term cannot go negative, the only explored node will always have a value > 0 (unless the network is absolutely sure to lose, which happens, but very rarely in my experience), and thus the search will never visit the other unexplored nodes, despite how bad the first pick is.

Obviously, since we have a strong policy and not a flat one, the effect will be mitigated. If your policy is godlike, maybe fpu=0 is even optimal. In general though, I think even though we have a strong policy, the search should be sensible even with a flat policy, because the original paper which invented the very concept of fpu didn't seem to have a policy (correct me if I'm wrong tho), so it was actually flat. In addition, if we do a fresh start, we will not have a strong policy.
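For reference, the range conversion mentioned above: mapping AGZ's [-1, 1] value to leela-zero's [0, 1] sends 0 to 0.5, which is why "FPU=0" in AGZ terms corresponds to FPU=0.5 in this codebase.

```cpp
#include <cassert>

// Affine map from AGZ's value range [-1, 1] to leela-zero's [0, 1].
float agz_to_lz(float v) {
    return (v + 1.0f) / 2.0f;
}
```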

@Eddh Oh I forgot that again, yes the range was [-1, 1] for them. For the rest of the analysis, I agree, and I wondered if such a "too deep" search can be any beneficial at the first place.

@killerducky's tests of @Eddh's ideas certainly sound statistically significant to me, and it's surprising just how much improvement potential there is to be derived from UCT parameter tuning only! I hope this is researched thoroughly and implemented soon. This would be a really big strength gain, or speed boost if invested in lowering playouts.

I have 12/3 for now with the fpu lowered then with worst child as minimum. With c83e and 1600p. So that seems like another solution to compare to the other.

Ok I pushed a branch. I implemented the moving average (I think technically it's an exponential decay average or something like that). Preliminary tests show it is also better than base, but worse than just subtracting 0.2. Next I am going to tune cfg_puct.

See below my python script to calculate MSE for a few different methods. Procedure to do this is in the comments of the script.

https://github.com/killerducky/leela-zero/tree/aolsen_fpu2
analyze_first_v.py.txt
test.log
test.sgf.txt

So "moving average" seems to be a shift towards the value of the latest explored node, with inertia. Am I right in understanding that in your code MOVING_AVG_WEIGHT is a constant set to 0.1?


@killerducky I think your function is called exponential smoothing. I would guess you will get better results with a higher smoothing factor than 0.1, which seems very low. In a sense this does a similar thing to Eddh using the lowest child, as the latest explored nodes tend to be the weakest ones, of course you have applied smoothing too so there are no big jumps, which makes some sense.

Last time we had this discussion we found out that fpu=0 worked well for the supervised network, but not for the self-play ones. The fpu=parent thing was ok for both. So maybe check with the best 5x64 as well.

This is what I am getting for now (7fde):

Parent-eval v FPU-0 (14/100 games)
board size: 19   komi: 7.5
              wins              black        white      avg cpu
Parent-eval      3 21.43%       1 14.29%     2 28.57%    717.37
FPU-0           11 78.57%       5 71.43%     6 85.71%    717.03
                                6 42.86%     8 57.14%

Unfortunately it might take a few days on my hardware, and the eval - 0.2 version seems better for now (though not statistically significant yet).

Yeah, that's why I made that remark, see: https://github.com/gcp/leela-zero/pull/238

The behavior was very different for the supervised network and the self-play networks. That's why it's important to test with 57481001 or c83e1b6e as well.

I thought it was related more to the quality of the policy. Anyway, when it reaches 100 games or becomes statistically significant I might test this for the 5x64 ones, but hopefully someone with a better graphics card can finish it faster.

I thought it was related more to the quality of the policy.

I thought this too but I'm not so sure anymore now. I am testing it overnight.

This would have some interesting consequences. I tuned PUCT, and I used the supervised network to do so (because it was strong and we didn't have any strong self-training ones). If the supervised network is very different from the self-play ones in the optimal PUCT value, then right now the matches are being played with a massive advantage to the supervised network.

This could be a potential reason why after the bootstrap all matches are going 0:50.

(After having said that, the supervised networks we are using now had massive problems beating the 5x64, and passed at like 55-60%, so maybe it's not a very good explanation for the 0:50's)

@isty2e Interesting. So it's no surprise that lowering the current version's fpu increases the winrate. I guess there must be a sweet spot for the fpu. For now it seems eval - 0.2 is closer to it. Maybe killerducky's solution can beat it. There's potentially a few hundred Elo to gain, and then more because of higher self-play quality.

Edit : I have made another variation that seems to work: using get_eval of the parent instead of net_eval, so it is a moving fpu too, except I still subtract a value from it, which is 0.13. Very arbitrary. I tried 0.2 for 2/3 games against the eval - 0.2 version with 1000 playouts; they were lost and I didn't like the way in which they were lost, so I tried 0.15, which seemed a bit better to my eyes, then 0.13, and it got 4/1 against eval - 0.2. I am trying it at 1600 playouts against the normal version now; for now it is 4/1 in that matchup too.

So yeah that is not very hard science I just went by feeling to tune the value to subtract, need confirmation with lots more game if someone would be kind enough to run it.

There was a reason for subtracting a smaller value though, which is that I think in the game of go most moves are "bad" moves, so get_eval should generally be slightly lower than the parent's net_eval, and thus there is no need to subtract a value as large as 0.2 because it is already lower.

Oh, and I also added a lower and an upper bound: 0.01f for the minimum fpu value, and the maximum value will never exceed the best child, for reasons I already explained.

https://github.com/Eddh/leela-zero/tree/moving_fpu_m0.13_bounded

For the aforementioned match between the current one and FPU=0:

Parent-eval v FPU-0 (38/100 games)
board size: 19   komi: 7.5
              wins              black         white       avg cpu
Parent-eval      8 21.05%       5  26.32%     3  15.79%    644.61
FPU-0           30 78.95%       16 84.21%     14 73.68%    646.79
                                21 55.26%     17 44.74%

Due to occasional kernel panics in OS X when running leelaz, it was terminated here. It is not statistically significant in terms of SPRT(-13, 13), but I would like to start a benchmark with c83e, though I personally expect nothing different. After all, all our networks are trained with supervised learning in terms of the training method.

leelaz-5748 v leelaz-fpu0 (155/1000 games)
board size: 19   komi: 7.5
              wins              black         white       avg cpu
leelaz-5748     55 35.48%       31 39.74%     24 31.17%    198.14
leelaz-fpu0    100 64.52%       53 68.83%     47 60.26%    196.77
                                84 54.19%     71 45.81%

leelaz-5748 v leelaz-fpu05 (155/1000 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
leelaz-5748     115 74.19%       57 73.08%     58 75.32%    184.63
leelaz-fpu05     40 25.81%       19 24.68%     21 26.92%    184.96
                                 76 49.03%     79 50.97%

leelaz-5748 v leelaz-m2 (155/1000 games)
board size: 19   komi: 7.5
              wins              black         white       avg cpu
leelaz-5748     33 21.29%       15 19.23%     18 23.38%    209.17
leelaz-m2      122 78.71%       59 76.62%     63 80.77%    208.13
                                74 47.74%     81 52.26%

leelaz-5748 is the current method, FPU=parent. FPU=0 and FPU=parent-0.2 are fairly close; I am not sure if they are fundamentally different (how often does the score drop by 20% or more?). But they are both better than FPU=parent.

My main concern is that we tested FPU=0 vs FPU=parent before and reached the opposite conclusion (https://github.com/gcp/leela-zero/pull/238#issuecomment-349557312). What makes the difference now? Did I have a bug? Is it because the networks are stronger?

PUCT tuning indicates a maximum between 0.5 and 1.5, and a current best guess of 1.05. This is basically the same as what we have, so this isn't the issue.

The results are good enough that I will make the "minus" a tunable value and do a joint FPU+PUCT tuning run.

The two-proportion z-test says that FPU=parent-0.2 is better than FPU=0 with a p-value of 0.00558. Also, doesn't the quality of the policy prior explain the difference? A better policy allows us to decrease the FPU, where the limiting case is FPU=0 for a perfect policy, and there is a rating difference of 5269 between 6f27 and 5748. Of course the tree reuse might have affected this, but my guess is that it is much more related to the strength of the policy.

By the way, how many methods do we have for the FPU?

  • Constant (0, 0.5, etc.)
  • Parent
  • Parent-constant(>0)
  • Parent*constant(<1)
  • MSE-based one

If someone has a beefy machine, would it be worthwhile to test all of these?

If this now becomes a tuning with many variable parameters, would it now make sense to implement some way to run distributed tuning tests? For example by implementing different tuning settings and making them callable with an option to leelaz?

Something similar to this http://tests.stockfishchess.org/tests

A Fishtest analogue has already been rejected, because of the problems associated with auto-compiling different binaries on clients. But what if the tests could be run with the same binary, just with different parameters?

For example by implementing different tuning settings and making them callable with an option to leelaz?

This is already implemented.

But as I said several times before, tuning isn't a problem, it's something that only has to happen very occasionally so I can just rent a bunch of boxes on AWS if I run out of local machines.

Also, doesn't the quality of policy prior explain the difference?

Probably, but that does mean these kind of things might be problematic for "fresh starts".

Of course it would, but I don't think we are in a position to consider a new run from scratch (especially when #704 is done). Maybe relevant for forked projects though. If that really becomes a thing, the FPU could be set by an external argument, and letting autoGTP receive it from the server would be fine, I suppose?

After all, all our networks are trained with supervised learning in terms of the training method.

Not really. The supervised ones all had lower weighting on MSE, presumably required because the datasets are more static. That's why I wasn't sure the behavior was the same. (And it clearly isn't if you look at what's happening with the bootstrap)

FPU can be set by an external argument

Not so much FPU itself, but the algorithm, I guess yes.

Ugh, I did not care much about the bootstrapping, but was it trained from a single dataset, with all post-0.9 games? Anyway I will run the FPU=0 test for 5x64 and compare.

Ugh, I did not care much about the bootstrapping, but was it trained from a single dataset, with all post-0.9 games?

See #591.

Anyway I will run the FPU=0 test for 5x64 and compare.

You mean with a weak 5x64? I already ran it with the best one, see above.

I already ran it with the best one, see above.

Oh, I am halting it then. Unfortunately 30-8 is not enough to run a z-test or similar (30-8 is better than 100-55 with a p-value of 0.09).

https://github.com/Eddh/leela-zero/tree/moving_fpu_m0.13_bounded

if (worstChildVal = 1000.0f || bestChildVal == -1000.0f) {

I'm pretty sure this is buggy. (Repeat rant about doing proper testing with large enough sample size here)

As far as fresh starts go, I think the FPU = 0 (Q(a) initialized at 0) described in the AGZ paper would work. It wouldn't be the optimal value, but it wouldn't prevent the network from improving, as opposed to the FPU = 1.2 that was initially in LZ, which made the winrate more and more lopsided.

My main concern is that we tested FPU=0 vs FPU=parent before and reached the opposite conclusion (#238 (comment)). What makes the difference now? Did I have a bug? Is it because the networks are stronger?

Wasn't there a bug with the white side at that point?

if (worstChildVal = 1000.0f || bestChildVal == -1000.0f) {

I'm pretty sure this is buggy. (Repeat rant about doing proper testing with large enough sample size here)

Hmm, could be. I did it very fast just to see the result. It seemed to work anyway, but I completely agree that it is ugly; I will change it this evening. Also, I realized I made a mistake when pushing this branch: it seems to be with a constant of -0.17 and not -0.13. Sorry ><

As for the sample size: I've been considering playing a lot more games, but with around 500 playouts to have a reasonable speed. Would it be valid to extrapolate to 1600?

As far as fresh starts go, I think the FPU = 0 (Q(a) initialized at 0) described in the AGZ paper would work

Do you mean FPU=0 or FPU=0.5 (in Leela terms)?

which made the winrate more and more lopsided.

That was a bug with virtual loss, not FPU.

Do you mean FPU=0 or FPU=0.5 (in Leela terms)?

Huuuuh, Leela's FPU isn't equal to the Q(a) initialization? What I meant was initializing Q(a) at 0 as described in the paper; if that means a value other than 0 for Leelaz's FPU, then I've missed something...

That was a bug with virtual loss, not FPU.

Ah right, but didn't the FPU at 1.2 also cause extremely bad exploration which was flattening the distribution (and thus prevented the policy head from learning)?

Huuuuh, Leela's FPU isn't equal to the Q(a) initialization? What I meant was initializing Q(a) at 0 as described in the paper; if that means a value other than 0 for Leelaz's FPU, then I've missed something...

Leela scores internally from 0 to 1. AGZ scores from -1 to 1. When the FPU discussion was had originally, the latter was missed and people interpreted the paper as saying FPU=0. This did not work so well as FPU=parent. Much later in another thread someone pointed out that FPU=0 in the paper is FPU=0.5 in Leela, which coincidentally is closer to FPU=parent for most nodes.

But as you can see above, FPU=0.5 (or FPU=parent) isn't as good as FPU=0 for strong networks.

Interesting, I did miss that part of the discussion. Does that mean that in Leelaz, Q(s,a) should be multiplied by a factor of 2 to keep the same "importance" relative to U(s,a), or was that already tuned by reducing cfg_puct (I'm assuming the latter through the tuning process)?

Anyway I meant FPU = 0.5 in Leelaz terms for fresh starts, yes. But it's good to know it can be retuned along the way to maximize performance (I guess if someone has as much computing power as Google, then retuning it after each network promotion would be ideal).

was that already tuned by reducing cfg_puct (I'm assuming the latter through the tuning process)?

AGZ's PUCT isn't specified, so indeed this would have been compensated by the tuner.

leelza-5748 is the current method, FPU=parent. FPU=0 and FPU=parent-0.2 are fairly close, not sure if they are fundamentally different (how often does the score drop by 20% or more?)

The score should drop by 20% or more quite often in the search tree. In capturing races, ladders, suicide moves...

Yeah but most moves are not that.

These moves decide the fate of the game though.

Oh and I also added a lower and upper bound. 0.01f for the minimum fpu value

I'm not sure the lower bound, i.e. the cap at 0.01f, is necessary FWIW. The eval gets pushed up by the UCT part of the formula anyway, so negative values will "just work".

Yeah, I'm not sure either. With a strong policy the FPU can be negative and the policy can compensate. But consider the case where 2 moves have identical policy, but one is explored and the other isn't. The unexplored one will never be visited, right?

The unexplored one will never be visited right ?

It will get visited at some point. The UCT formula will force exploration if the other move has had a ton of visits, and that even works if it has a negative value.

This: cfg_puct * psa * (numerator / denom) can have a negative value?

@Eddh The full formula is: value = winrate + cfg_puct * psa * sqrt(parentvisits) / (1+childvisits)

The part your code modifies is winrate, and even if it is negative, the other term will overwhelm it as parentvisits get large.

OK I see I misunderstood that. So yeah lower bound probably not necessary.

Edit: I think from now on I will do a lot more games, but with 600 playouts. It seems from this thread https://github.com/gcp/leela-zero/issues/667 that for networks there is some correlation between this range of playouts and 1600 playouts. I'm hoping there is also a correlation when it comes to FPU.
Edit2: I have to say my branch moving_fpu_m0.15 is showing promise, at 12/8 against eval-0.2f at 600 playouts for now. I took out the dumb shenanigans with the lower bound and the maximum at the best child, and just use the parent's get_eval(color) - 0.15f, so if you want to test a moving-average-like FPU at 1600 playouts you can try that. https://github.com/Eddh/leela-zero/tree/moving_fpu_m0.15

I am at 17/11 for moving_fpu_m0.15 at 600 playouts against net_eval-0.2f. It seems more and more promising to me. For fun I tried 5 matches at 600 playouts against normal Leela Zero at 1600 playouts; the result was 3/2.

Wow, the 600 playouts one beat the 1600 playouts 3 times out of 5? That's some serious potential ELO gain there even though not statistically significant.

If you can make that result statistically significant, that would be a good reason to lower self-play playouts to something like 800-1000... this would compensate the slowdown we're going to experience with 10 blocks, and provide very fast games with 6 blocks.

I played 4 x 200 games with the variants (FPU=parent-0.2, parent-0.1, your branch, and FPU=max(0, parent-0.2)), and aside from parent-0.1 being weaker than parent-0.2 they were all very close, so no statistically significant decisions. Based on this I will just take the simplest one.

I'm doing a combined tuning run now for PUCT factor + FPU reduction constant.

The earlier PUCT tuning run ended with an optimum between 0.8 and 0.9. We have 0.85 now, so that wasn't a problem.

Tuner is trending towards PUCT=0.8 and FPU-reduction=0.22, so not very far from what we already have.

Are you doing the tuning on a single net, or are you using multiple nets for reference? How confident are you that optimal tuning values for one network are the same as for another?

Nice. I've been very lucky to hit 0.2 on the first try.

Apart from that, I wonder if it is possible to approximate how different fpu methods would perform for a new run from scratch by testing performance on a flat/random policy, using only the network evaluation for guiding the search.

Are you doing the tuning on a single net, or are you using multiple nets for reference? How confident are you that optimal tuning values for one network are the same as for another?

I'm using 5748 (best 5x64). The "optimum" in the strict sense may differ between networks but I don't even see any reason to care whatsoever! We need a "good enough" value. It's not like I'm going to let the tuner run till infinity either.

It's more likely that the strength of the policy network influences the optimum value. We observed earlier that FPU=0 ain't so good for near random networks.

Apart from that, I wonder if it is possible to approximate how different fpu methods would perform for a new run from scratch by testing performance on a flat/random policy, using only the network evaluation for guiding the search.

You can take one of the very early networks (say a11516) and compare there. (Just take it early enough that it has no white bias yet - we had a bug with that early on. Winrate should be close to 50% in the beginning).

For Odin I have a similar FPU implementation to current Leela's, discovered independently (although I probably have the same in Valkyria, which is about 10 years old, so I forgot). I started out with FPU = 0.5, then I used the parent value (but forgot to adjust for the color change), and when I fixed the bug I got a really huge boost in playing strength. And this boost is there also for pure MCTS without neural network priors, so I would agree with gcp that this implementation detail is not dependent on the network. It is simply a matter of search efficiency.

A rationalization for why this is better: the PUCT formula assumes that each move returns a value from some sort of distribution (normal? I don't know). So each additional search of the same move only gets a more accurate measure of that move's average value. But in a Go game each additional search goes deeper and can potentially provide much more information.

Suppose you have sixteen moves with the same psa, and you have sixteen playouts left. The current code would search each of those moves once (ignoring a large positive change in winrate found in one). But it might be a better strategy to randomly pick one of those sixteen moves and search it twice, then pick another move and search it twice, and so on until you have searched eight moves twice and ignored the other eight.

So the parent-0.2 serves as a sort of progressive widening. Until a move with zero playouts overcomes the -0.2 penalty you don't search it. But once it is deemed time to search it once, the penalty is removed, and the search focuses on only that move for a while. As the penalty is removed from the top candidates, the search converges to the classic UCT formula. But the parent-0.2 effect is still going on deeper in the tree.

Also once the penalty is removed and it starts to search, suppose the move turns out to be bad after all and gets lower winrates. The search will start to avoid that move again. This can explain why my idea of moving average is not needed.

tldr, over simplified: For every level in the tree, search each candidate N moves deep before moving to another candidate. But for candidates with N moves searched, spread the searches evenly.

@odinAuthor Do you also use an FPU reduction constant?

@killerducky I agree, this effect is probably a big part of the reason it works so well. But the idea of a moving average, even though not needed, is at least competitive with the FPU reduction constant, judging from the testing gcp did with my moving_fpu branch. I think more effort on that front could still yield even better results. But for now I agree with keeping things simple, since there is no reason to go for something else.

@Eddh No, not yet, but I will most certainly consider testing it. The simplest reason for it is probably that most moves in go are very bad and easy to predict, and postponing searching those moves a little bit longer has a huge impact even if the heuristic for doing it is very simple. For Odin it might even be more important, because the priors from a network evaluation are only added after the node has been visited perhaps 50-100 times or so.

@odinAuthor That will be very interesting.

@gcp @killerducky I have tried a few games with a115, but I don't think I have the right setup to test it with more games. From the few games I've tried, there seems to be a huge and obvious problem with eval-0.2 when the policy is almost flat, as with a115. In that case, even though the policy is flat, all playouts go into very few moves. Which I realize now is logical from the PUCT formula, since when psa is <1% for every move, parentvisits needs to get huge to compensate the FPU reduction. This seems very bad to me, since it means a flatter policy causes a deeper search, which is contradictory. It also has implications for strong networks, which can sometimes hesitate between a few moves and output a lot of <10% options. Logically this also causes a deeper search to compensate for the FPU reduction, which is contradictory with the goals of the policy.

To correct this, I'm wondering if it would be possible to use something like the variance of the policy distribution to adjust the FPU reduction for each move. The goal being that a flatter policy, with lower variance, implies a lower FPU reduction value. In a way, the more confident the policy is, the bigger the FPU reduction would be, and vice versa. Any thoughts?

@Eddh For you or others interested, I was looking over the UCT formula on Wikipedia and in the Improvements section it mentions

One such method assigns nonzero priors to the number of won and played simulations when creating each child node, leading to artificially raised or lowered average win rates that cause the node to be chosen more or less frequently, respectively, in the selection step.[18] A related method, called progressive bias, consists in adding to the UCB1 formula a b_i/n_i element, where b_i is a heuristic score of the i-th move.[5]

The linked references are Combining Online and Offline Knowledge in UCT (2007) for using non-zero priors, and Progressive Strategies for Monte-Carlo Tree Search (2008) for progressive bias.

What we currently do with FPU=parent or parent-0.2 basically sounds like the non-zero prior idea, whereas your comment on using variance or something else to make the priors more dynamic is similar to the progressive bias idea. Perhaps those references (or other newer papers that reference them) can lead to ideas for Leela Zero. The latter paper also has an idea for "progressive unpruning", where they reduce the branching factor up front, then increase it as time allows.

It seems a bit different; those seem to be about using heuristic knowledge on top of Monte-Carlo evaluations. I've been reading Modifications of UCT and sequence-like simulations for Monte-Carlo Go and Exploration exploitation in Go as well, to better understand this. It seems there are many different ways to do this. I don't even think we are limited to using the winrate estimation directly. I made another buggy branch a few days ago which is interesting, but I am not sure yet if it is better or worse than eval-0.2. It tweaks the winrate estimation so that the child nodes' action values are mapped to a [0;x] range, and then we can choose the FPU to be a constant within that range. Its playstyle is... different http://eidogo.com/#1F0JL95t7 (I called it normalwinrate), but it depends on the parameters. It would be completely different from AlphaGo Zero at that point though.

I found a paper on using MCTS to play Hearthstone. In it they mentioned using non-zero priors as "search seeding"; I'd not heard it called that before. Googling for "MCTS search seeding" led me to A Survey of Monte Carlo Tree Search Methods (2012), which is a bit more recent and talks about both progressive bias and search seeding, although the main reference in the search seeding section goes back to the 2007 paper above.

From the second paper: "In plain UCT, every node is initialised with zero win and visits. Seeding or 'warming up' the search tree involves initialising the statistics at each node according to some heuristic knowledge." So that involves human knowledge, or reusing moves seen elsewhere in the tree.

Rather than

value = winrate + cfg_puct * psa * sqrt(parentvisits) / (1+childvisits)

Would something like the following make more sense?

value = (winrate + cfg_puct * psa) * sqrt(parentvisits) / (1+childvisits)

or:

value = (winrate +  psa) * cfg_puct  * sqrt(parentvisits) / (1+childvisits)

Hmm I'll try the second one. In the third one you could just take out cfg_puct and it would make no difference.

There then needs to be some exploration parameter, so this is better:

value = winrate +  psa +  cfg_puct  * sqrt(parentvisits) / (1+childvisits)

The second one seems bad. Only one game, but looking at the spread of the search, it sometimes searches only one move, sometimes all moves. It lost quite fast.
Edit: might also be a parameter thing, who knows.

Yeah, there needs to be an exploration parameter, see my updated formula. Also, cfg_puct would need to be retuned.

value = winrate + psa + cfg_puct * sqrt(parentvisits) / (1+childvisits)
This seems bad to me because as parentvisits gets large the last term will overwhelm both winrate and psa, which will almost not matter. It will end up searching all moves almost evenly.
Edit: I'm not even so sure, it's tricky to imagine the behavior of these formulas xD

But as childvisits increase the third part gets smaller, so winrate and psa becomes more relevant.

So it will first search nodes with high psa+winrate, but then the unvisited nodes will have higher and higher parentvisits while their childvisits stay the same. It seems like with that rise it will force exploration of the unvisited node every 1 or 2 visits of the parent?

Since I have nothing to lose I will probably still try anyway, but tomorrow. I need to sleep now.

Well, cfg_puct would need to be retuned, probably lowered quite a bit. But the basic idea of the number of children explored increasing the more the parent is explored is part of the original formula.

I also wonder: the balance between the influence of the policy score and the winrate currently depends on whether the AI is winning or losing. Would it not be better if they were both scored out of 1? So instead of using winrate you use winrate/best-winrate, where best-winrate is the best child of the root node?

Edit: actually instead of best-winrate, just the root winrate may make more sense

So it will first search nodes with high psa+winrate, but then the unvisited nodes will have higher and higher parentvisits while their childvisits stay the same. It seems like with that rise it will force exploration of the unvisited node every 1 or 2 visits of the parent?

Ok, thinking about it more I can see what you mean, and why the original formula is what it is. Nodes with a very low or 0 policy score should not be selected, so the score needs to be a multiplier, not a constant.

How about:

value = winrate + cfg_puct * psa/best-sibling-policy-score * sqrt(parentvisits) / (1+childvisits)

That way, a flatter distribution would behave similar to a strongly biased one as 1%/2% is the same as 30%/60% etc.

With this, if you have moves with psa 0.5; 0.10; 0.10 etc., once divided by the best sibling they turn into multipliers 1; 0.2; 0.2 etc. It doesn't seem right either, because once the 0.5 has been explored and the first 0.10 has caught up with the FPU reduction, the second 0.10 move will have to do just as much catching up as the first one in order to be selected, even though it is being compared to a much lower policy value. With the thought experiment @killerducky did, it seems OK to have the second one wait for a few exploitations of the first one, but it shouldn't be too much, and here it seems too much, because the difficulty of catching up has been raised just because the 0.5 policy move is raising the bar. But I think I'll test it this evening, if I have time.

The solution I'm thinking of is a bit more complicated, using the standard deviation of the unvisited nodes. That way the balance isn't messed up by a strong policy move. When there is one unvisited node at 0.4 and the others at 0.01 or lower, the standard deviation will be high, and because of that the FPU reduction will be high too (I'm thinking it should have a maximum of around 0.4), but the 0.4 policy move should be able to catch up after a few parentvisits. Then there are only 0.01 or lower policy moves, meaning a very low standard deviation, which should drag down the FPU reduction, and we're back at the normal formula, though we might want to still have a minimum for the FPU reduction, seeing the results it has had.

I also wonder, the balance between the influence of policy score and winrate currently depends on if the AI is winning or losing. Would it not be better if they were both scored out of 1? So instead of using winrate you use winrate/best-winrate, where best-winrate is the best child of the root node?

Edit: actually instead of best-winrate, just the root winrate may make more sense

This is actually kind of what I do in my normalwinrate branch; you might want to try it, it is fun. But at that point it becomes so different from AlphaGo Zero that we are deviating from the goals of the project. (Also my normalwinrate branch is buggy, I haven't pushed a correction for a division by 0 when the NN eval is an absolute 0 xD)

It doesn't seem right either because once the 0.5 has been explored and the first 0.10 has caught up with the fpu reduction, the second 0.10 move will have to do just as much catching up as the first one in order to be selected, even though it is being compared to a much lower policy value. With the thought experiment

I do not quite understand what you mean here. If both policy scores are 0.1, then as soon as the first node with a 0.1 policy score gets an extra child node its overall value would be lower, meaning that the other 0.1 node would be explored, so they should end up with more or less an equal number of child_visits if the win_rates also stay equal.

The denominator is equal for all siblings, so the behaviour is the same as the original formula.

If you do test it, I would suggest to set cfg_puct to a third of its current value, on the basis that best-child policy score probably averages around 30% currently and acts as a divisor.

Hmm maybe you're right on that part. I will try when I can.

I looked at the old code in Valkyria and the more recent code in Odin. It turned out that the Valkyria code is so riddled with weird experiments and features that it is impossible to say what it is doing... In Odin I saw that I did not use the parent node for FPU; instead it uses the win rate of the most visited sibling so far. Both measures are correlated, but the dynamic behavior of my current implementation is probably more unstable/complicated. The nice thing is that I already have a command line parameter for it, so maybe I could easily do some quick tests over the weekend. I would like to compare what I have with A) using the parent winrate, B) parent winrate - constant, and C) the current system - constant.

killerducky tried the most recent sibling but with exponential smoothing. Lowest sibling though seems to make more sense to me, which is what Eddh proposed I believe.

Lowest sibling feels dangerous for Odin at least. In Odin the following could happen: the hardcoded prior (basic local patterns) dominates search and the highest prior is searched 10 times in a row. Sometimes all the simulations are lost, for a score of 0%. (Leela is different because the WR comes from the NN value output and does not suffer from statistical random noise, just bad bias.) In this case no new moves might be searched for a long time, compared to using the WR of the parent.

I could see that being a problem then; exponential smoothing should sort that out though, and should be used anyway for this kind of thing. At best though this seems to be only a small improvement on just parent - constant.

killerducky tried the most recent sibling but with exponential smoothing. Lowest sibling though seems to make more sense to me, which is what Eddh proposed I believe.

What I tried was using the lowest sibling as a minimum for the FPU (eval-0.2f), so it didn't cause that problem. It didn't seem to make much of a difference either. To this day I don't know if it is better or worse.

I'm doing some math on the current situation. Right now, with a flat policy at 0.01 everywhere, you need around 600 visits to compensate the FPU reduction of 0.2f (assuming the winrate stays around the eval, which is likely). So that is exactly why, in my test of a115 at 500 playouts, only 1 move was searched despite the flat policy.

For moves with a policy of 0.1, which are more interesting and relevant for our current strong networks, you only need 5 parentvisits. So it might actually not be too much of a problem for strong networks...

Formula I use is N = (fpu_reduction/(psa*cfg_puct))^2

So almost everything is better than FPU = 0.5. This is my understanding now. Just having a constant FPU is bad, because when the best move is winning 0.75 or so, other moves are not explored, but there could be a good move with a low prior that simplifies the game with a high win rate. In this case we want to search wide and not deep. If we are losing, however, with a WR of 0.25, FPU = 0.5 will lead us to explore all moves in sequence, but when we are losing we would like to search complicated lines deep, looking for a tactical combination that actually works. By using the WR of the parent (or anything correlated) the exploration/exploitation tradeoff will be much more stable.

Also, the statistics are different in traditional MCTS and AGZ-style MCTS. The value output of the NN is close to the value returned by a deeper search, but for Odin it is a coin flip. If I remember correctly, in Odin WR = (Wins + WRofMostVisited)/(Visits + 1), so that WRofMostVisited is still stabilizing the WR after the first time it was used. I think that is good to counter random noise, but it could perhaps also have a tiny effect on Leela. Using the WR of the most visited sibling means exploration is encouraged even when a strong move has been found (very important for Odin, since at the leaves no NN policy priors are available). But I am still curious to see if making the current search slightly more selective could be good.

Yeah, I'm more and more certain that 0.5 is bad now. If they used something like that for AlphaGo Lee, it would explain a lot about why it plays bonkers moves when it is far ahead or behind. It would not be because it actually thinks those moves are good for the winrate, but because the search is messed up, though more when behind than when ahead.

@odinAuthor I think that's a fantastic observation that FPU=0.5 means we go deep when winning and wide when losing. I wonder if that factors into why LZ often loses with a large dead dragon. Watching some of those games live on OGS, her win rate will peak over 99% for a long time, eventually crashing to resignation within a few moves once she is finally forced to see the capture moves on the dragon. The moves where the opponent will capture the dragon (and thus shift the win rate estimate drastically, making them the opponent's best moves as far as LZ is concerned once seen) never made it into a PV until the very, very end.

This also suggests that in self-play, games would end by resignation long before the dragon dies (win chance for the opponent < 1%), thus reinforcing the belief that the dragon board state is a winning one, not a losing one. LZ has to somehow see dragons dying in training to learn more about eyes...

For moves with a policy of 0.1, which are more interesting and relevant for our current strong networks, you only need 5 parent visits. So it might actually not be too much of a problem for strong networks...

I think you mean child visits.

psa * sqrt(parentvisits) / (1 + childvisits)

So if the first node is explored 5 times, we are looking at when (0.1 * sqrt(parent) / 1) - (0.1 * sqrt(parent) / 6) becomes greater than the FPU reduction. But parentvisits also depends on the other siblings, so I cannot just take it as 6.
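To make the comparison concrete, here is a small Python sketch of the two scores being compared; cfg_puct = 0.8 is taken from the tuning values mentioned later in the thread, and the winrate and policy numbers are made up:

```python
import math

CFG_PUCT = 0.8  # assumed value; LZ tunes this separately

def puct_term(psa, parent_visits, child_visits):
    """Exploration bonus of the PUCT formula for one child."""
    return CFG_PUCT * psa * math.sqrt(parent_visits) / (1 + child_visits)

net_eval, psa = 0.5, 0.1
for parent_visits in (4, 9, 16, 25):
    # first node: visited 5 times, winrate assumed stuck at net_eval
    first = net_eval + puct_term(psa, parent_visits, 5)
    # second node: unvisited, so it pays the 0.2 FPU reduction
    second = net_eval - 0.2 + puct_term(psa, parent_visits, 0)
    print(parent_visits, round(first, 3), round(second, 3), second > first)
```

At low parent visit counts the FPU-reduced node loses the comparison; as sqrt(parentvisits) grows, its larger exploration bonus eventually overcomes the 0.2 penalty.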

I don't know about LZ. LZ doesn't use FPU=0.5 right now, and it seems like the NN actually has problems understanding very large groups #708. But the behavior of the search will indeed be different when it is at an extreme range of winrate, because all winrates are the same, so it will probably be guided only by the policy? Do you know what kind of shape the policy has in the endgame?

@evanroberts85 I was just looking at the number of visits in the parent node required to make puct = 0.2

It is important because if the NN outputs a policy where the 3 best moves are at 0.1 and the rest very low, the first 0.1 move is going to be picked. Assuming its winrate is at net_eval and stays that way as it is exploited, the puct of the second 0.1 node will need to be higher than 0.2 for it to be picked, because the first node will have score = net_eval + puct(first node), while the second will have score = net_eval - 0.2 + puct(second node).

I think we'll have to repeat these experiments after the new nets settle. The bug fix gcp made to randomness of positions selected for training should improve the value net. That might have a big impact on these other parameters.

@Eddh I think I was following your thought process, but if puct(node) is going to make up the 0.2 difference, then the number of child visits is the dominant factor, though I suppose you could compare a node that has been explored exactly once to one that has not been explored at all and then look at the number of parent visits required to make up the difference.

@evanroberts85 The first node, with psa 0.1, has score = net_eval + puct(first node); no matter how much this node is visited, puct(first node) will always be positive.
The second node, also with psa 0.1, has score = net_eval - 0.2 + puct(second node), since it has not been visited yet.
When can score(second node) overcome score(first node)?

net_eval + puct(firstnode) < net_eval - 0.2 + puct(secondnode)
puct(firstnode) + 0.2 < puct(secondnode)

Thus, puct(secondnode) needs to be at least 0.2 more than puct(firstnode) for the score of the second node to overcome the score of the first one. Sorry, I explained my thought process badly.

Edit:

This led me to solving puct(secondnode) = 0.2, where puct = psa * cfg_puct * sqrt(parentvisits) / (1 + childvisits).

Since secondnode is unvisited, childvisits = 0
parentvisits is what we want to find

so we have 0.2 = psa * cfg_puct * sqrt(parentvisits)

sqrt(parentvisits) = 0.2 / (psa * cfg_puct)
parentvisits = (0.2 / (psa * cfg_puct))^2
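As a quick numeric check of this derivation (cfg_puct = 0.8 is an assumed value here; the threshold scales with 1/cfg_puct^2):

```python
def parent_visits_needed(psa, cfg_puct=0.8, reduction=0.2):
    """Parent visits at which an unvisited child's puct bonus alone
    equals the FPU reduction: (reduction / (psa * cfg_puct))^2."""
    return (reduction / (psa * cfg_puct)) ** 2

print(parent_visits_needed(0.1))   # policy 0.1: a handful of parent visits
print(parent_visits_needed(0.01))  # policy 0.01: hundreds of parent visits
```

So high-policy moves escape the reduction after only a few parent visits, while a 1% policy move stays locked out for hundreds.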

@Eddh I understood that, and that is reflected in my comments. There are two factors in the value of puct(node), though: child visits and parent visits.

Currently looking through your normalwinrate branch. I am not sure using worstChildValue to create a range is a good idea. Imagine only two sibling nodes where the difference in value between them is infinitesimally small; in that case the policy output should be dominant and the value difference should be all but ignored, but by normalising over the range, the value output would have a huge effect.

I think it is better to just use bestChildValue. So for the child node:
winrate = winrate / best_sibling_winrate

Child visits is 0 for the second node, which is unvisited.
I know my normalwinrate branch is not ideal; one of my goals was to make the search more sensitive to slight evaluation differences and see what it would do. It is actually not bad, at least from the few games I've tried. You can try more if you want.

Indeed, so the unexplored node will evaluate to psa * sqrt(parentvisits), instead of half that value if it had one visit, or a third of that value if it had 2 visits, etc.

I will try to make a few modifications to your branch then do some very limited tests (I have a very slow computer).

Well yeah, it will rise according to the square root. And there will be a point at which it surpasses 0.2, which is the minimum if the second node is to be picked (assuming the winrate for the first node stays at net_eval). This point is when parentvisits is more than 5. So in the current situation, if the 2 best child nodes have the same policy and the first one is stagnant or rising in terms of winrate, the second node will not be picked until the first one has been visited 5 times (minimum). I'm not sure if this is actually good or bad. It means nodes deep in the tree, or very low playout settings, will sometimes get an unbalanced search.

It is almost certainly bad for networks producing a flat policy, though, and for hypothetical cases where a strong network would be confused and output lots of policies below 5%. I'm not sure how often that happens.

@evanroberts85 Your formula scaling the policy seems not bad. I just played one match against normal Leela (raw parent net_eval), both using c83e and 600 playouts; your version won convincingly, reading out capturing races properly. More testing would be interesting, to know if it is better than eval-0.2f alone.
Here is my branch

That is a good sign. I have got my development tools and Sabaki set up now locally so started doing some testing myself, but I am having to learn c++ as I go along.

Your formula fared well against normal Leela but didn't seem as good against eval-0.2 alone: it got 2/6 (600 playouts, with c83e). Too few games to be sure, but I also tested a lot of my ideas and found one with a record of 21/11 against eval-0.2 at 600 playouts, so I find it more promising. It is this branch.

The basic idea is that the anomaly we are trying to take care of is a case where a 1% policy move which isn't particularly good is taking up all of the visits. We can detect that with simple math: adding up all of the visited moves' policies gives a total visited policy, and if it is very low, that means a lot of the policy's recommendations are being ignored.

So this solution simply multiplies an FPU reduction constant by the total visited policy. If this total is 0.01, the effective FPU reduction will be very low, which allows all the clogged-up policy to do something. If it is 0.01 for a good reason, like a very good move being found, the FPU reduction will be weak, but since the very good move has a winrate higher than net_eval, that will not be too much of a problem.

Edit: current results are, at 600 playouts:
27/15 for my branch against eval-0.2 on network c83e
3/1 for my branch against eval-0.2 on network 714a
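The scheme described above, scaling the FPU reduction by the total policy of the visited children, can be sketched like this (the function and constant names are illustrative, not the actual branch code):

```python
BASE_REDUCTION = 0.2  # assumed base FPU reduction constant

def fpu_eval(parent_eval, children):
    """children: list of (policy, visits) pairs for one node's children.
    The reduction shrinks when only low-policy moves have been visited."""
    total_visited_policy = sum(p for p, visits in children if visits > 0)
    return parent_eval - BASE_REDUCTION * total_visited_policy

# Search clogged on a 1% policy move: almost no reduction, siblings open up.
print(fpu_eval(0.5, [(0.01, 40), (0.10, 0), (0.12, 0)]))
# Search already on the high-policy moves: close to the full reduction.
print(fpu_eval(0.5, [(0.45, 30), (0.30, 10), (0.01, 0)]))
```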

I will merge this, after adding the old policy as a tunable (for weak networks).

leelaz v leelaz-tuned (433/1000 games)
board size: 19   komi: 7.5
               wins              black          white        avg cpu
leelaz          106 24.48%       49  22.58%     57  26.39%    520.75
leelaz-tuned    327 75.52%       159 73.61%     168 77.42%    487.62
                                 208 48.04%     225 51.96%

leelaz-m2_2 v leelaz-tuned (432/1000 games)
board size: 19   komi: 7.5
               wins              black          white        avg cpu
leelaz-m2_2     189 43.75%       102 47.22%     87  40.28%    511.17
leelaz-tuned    243 56.25%       129 59.72%     114 52.78%    508.18
                                 231 53.47%     201 46.53%

PUCT=0.8, reduction=0.22

Network was 5773f44c

Are we staying with 1600 playouts, or are you considering putting some of the tuning gain into speed up by lowering playouts? In general, are there any changes to the plan you posted 5 days ago, before discovering the training data shuffling issue?

@jkiliani There is something to be said for keeping the numerical value of 1600 playouts. It's what AlphaGo Zero used, and it's what people have been using, and it's what people know. Keeping it at that number will prevent a lot of questions about "hey this is different than AGZ".

@WhiteHalmos I'm not arguing that point, and I suppose it would prevent such questions... but at some point we will need to ensure our game rate doesn't drop too much, e.g. when going to 10 blocks. Lowering playouts is the easiest such method that comes to mind.

I have been running tests with Odin at 1000 simulations, with no network, against GNU Go on 9x9. At least 3000 games were played for each experiment.
It turned out that I still used FPU = 0.5 as the default, and in my tests it is also the best option. All the others are about 2%-4% worse.

FPU = MostVisitedWR - k was worst with k = 0; k = 0.1 or something near it could be almost as good as the default.

FPU = AverageWR - k was slightly less bad, and the effect of k was less clear.

I am still running some k = 0.06 experiments, but so far it looks like FPU = 0.5 is rock solid and simply the best for Odin's 9x9 MCTS.

Interesting. It would indicate that 0.5 is good for 9x9 but bad for 19x19, possibly because the branching factor is lower. Or it depends on the search algorithm.

@Eddh
I have started on a new approach to balancing Winrate, Policy Score and influence of number of visits (child and parent). It should fix all the issues you have mentioned before.

Winrate and policy_score are converted into points using logarithms. So if 50% is 0 points, 25% is -10 points and 12.5% is -20 points, etc. For scores above 50% I use the distance from 100%, so 75% is +10 points and 87.5% is +20 points, etc.

This means I can separate out the psa from the PUCT algorithm. Now the total value is = win_points + policy_points + puct_points. I just need to find a good balance between the three.

Edit: Realised the weighting of the policy needs to be related to the number of visits, otherwise a move with a poor policy score can never be selected, so I might add it back into puct.
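The winrate-to-points mapping described above (50% -> 0, 25% -> -10, 12.5% -> -20, mirrored above 50%) can be sketched as follows; winrates of exactly 0 or 1 are left unhandled in this sketch:

```python
import math

def to_points(winrate):
    """Log2-based points: each halving of the losing side's share is 10 points."""
    if winrate == 0.5:
        return 0.0
    if winrate < 0.5:
        return 10 * math.log2(winrate / 0.5)
    return -10 * math.log2((1 - winrate) / 0.5)

for wr in (0.125, 0.25, 0.5, 0.75, 0.875):
    print(wr, to_points(wr))
```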

Yeah, you need to be careful tweaking the puct formula itself. DeepMind obviously didn't design it that way mindlessly. In the AlphaGo Zero paper they say the goal of the formula is that it favors moves recommended by the policy early on, but as the visit count gets higher it is based more and more on pure winrate.

they say the goal of the formula is that it favors moves recommended by the policy early on, but as the visit count gets higher it is based more and more on pure winrate.

Well, that is just the entire idea of UCT itself. There's the original one (using log), the one in the AGZ paper (using sqrt), and there's also a variant (UCT with priors) described in another paper that the AGZ one refers to.

They are all fairly close in strength as far as I remember.

Is there going to be a release with mandatory update for this? There's a good chance some nets will scale better with FPU reduction than others, and not having a uniform UCT search code over all clients may bias network testing results.

I found this, which seems quite interesting: https://ewrl.files.wordpress.com/2011/08/ewrl2011_submission_29.pdf
From what I understand of this "double progressive widening", it seems to me like FPU would disappear. Instead, once the best-scored explored node is sufficiently exploited, an unvisited node is picked. We would no longer directly compare unvisited nodes to visited ones, which is what led us to these problems. They use similar reasoning to what @killerducky said:
"it is of little importance to build a very wide tree at the random node level (i.e.
to explore many different random occurrences of the events). In this setting, the
best strategy is to explore very few random occurrences, and to build a tall tree,
with many leaves being final nodes"

I will try to see if I can adapt this to Leela Zero. Potentially we could have a general solution that works for all networks.

Well, that is just the entire idea of UCT itself. There's the original one (using log), the one in the AGZ paper (using sqrt), and there's also a variant (UCT with priors) described in another paper that the AGZ one refers to.

If the original UCT algorithm didn't use priors, then going for high-prior moves before going for high-winrate ones is not the entire idea of UCT. It is the entire idea of the modified UCT used by DeepMind, though.
This is the original UCT paper.

There's more recent publications describing how to incorporate prior knowledge about arm likelihood into UCT, I don't remember the links offhand though. I just remember that for some reason AGZ did not use that algorithm but a variation, IIRC the difference is that AGZ applies the prior to the entire right-hand side of the equation, whereas in the original algorithm the prior is a separate term.

Going to be hard to retrieve information from playouts without having playouts.

Other papers were linked earlier in this thread, around this point: https://github.com/gcp/leela-zero/issues/696#issuecomment-360601056. Using the term "search seeding" when setting priors is involved.

Here's also one paper with very similar neural net evaluation as AGZ, but in Hex instead of Go: https://arxiv.org/abs/1705.08439. Chapter 4 and Appendix C are about MCTS.

Is there going to be a release with mandatory update for this?

This and tree reuse (which we didn't enforce yet as there were bugs with time controls etc)

There's a good chance some nets will scale better with FPU reduction than others

There's no evidence of this, aside from networks which have a near-uniform policy, which no longer applies. So what makes you say "there's a good chance"?

If this causes a strong bias to network strength we probably need to back it out again ASAP.

There's no evidence of this, aside from networks which have a near-uniform policy, which no longer applies. So what makes you say "there's a good chance"?

The tests I conducted at low playouts currently show pretty convincingly that scaling with playouts varies very widely between different nets. For recent nets, I'd have to consider the screening at low playouts pretty much a failure, mainly for this reason.

However, if MCTS playout scaling is very different depending on the net, is it really a stretch to think that scaling may also depend on UCT parameters, to more than a minor extent?

is it really a stretch to think that scaling may also depend on UCT parameters, to more than a minor extent?

It may or may not. I'd suspect likely not as long as the shape of the priors is roughly similar. But there's no evidence and there's no data, so obviously I don't think there's a "good chance". I would never have merged this otherwise.

Scaling with playouts is affected by policy accuracy vs value accuracy, but that doesn't really affect this change.

Scaling with playouts is affected by policy accuracy vs value accuracy, but that doesn't really affect this change.

To be clear, if your policy is really really bad, then FPU reductions will be bad, and you want FPU=parent or FPU=>1.0.

But I don't see that happening with current and future networks.

As for data, note that we did two UCT factor tuning runs, and they ended up with nearly the same value. So that's some actual data against that assertion.

As for data, note that we did two UCT factor tuning runs, and they ended up with nearly the same value. So that's some actual data against that assertion.

Hard to argue with that, so I guess the effect shouldn't be large even if present...

@Eddh if you have an improvement that you're confident is worth retesting, let us know!

Maybe @amj might also be interested in these ideas, and for them a tuning run is really easy, being blessed with massive company resources...

I have got this branch, which is an adjustment on this other branch which I talked about previously:

Edit: current results are, at 600 playouts:
27/15 for my branch against eval-0.2 on network c83e
3/1 for my branch against eval-0.2 on network 714a

That was against untuned eval-0.2, so I did more tests against reduction 0.22 and cfg_puct 0.8:

leelaz_totalvp v leelaz_tuned_red (36/50 games)
board size: 19   komi: 7.5
                   wins              black         white       avg cpu
leelaz_totalvp       17 47.22%       12 66.67%     5  27.78%    643.62
leelaz_tuned_red     19 52.78%       13 72.22%     6  33.33%    640.70
                                     25 69.44%     11 30.56%

This test was also at 800 playouts and on c83e. The results weren't as good as I hoped, but still interesting, since this branch should get its own tuning pass.

Anyway, I did some calculations and figured using the square root of totalvp should be OK, which led to my branch sqrt_totalvp. I ran some tests against the tuned leelaz:

leelaz_sqrt_totalvp v leelaz_tuned_red (50/50 games)
board size: 19   komi: 7.5
                      wins              black         white       avg cpu
leelaz_sqrt_totalvp     28 56.00%       16 64.00%     12 48.00%    580.58
leelaz_tuned_red        22 44.00%       13 52.00%     9  36.00%    578.78
                                        29 58.00%     21 42.00%

So that seems interesting, even though statistically insignificant. This branch should also be better for very weak flat-policy networks than a flat FPU reduction. I've just run 2 test games with a115, with a 1/1 result, but the important part is that it wasn't looking at only 1 move every time, unlike eval-0.2. So all in all, I think it might be interesting to test this more.

I am testing yet another modification on totalvp right now. Should have results tomorrow.

Btw, gomill ringmaster is quite neat. I have to run it inside a VM; I thought that would hurt performance too much, but it is actually decent.

Any thoughts on how the FPU reduction may affect the self-play learning? I have noticed that on the (slightly old) network I am testing with, many perfectly good openings never seem to get even 1 visit out of 1600.

The randomization should help with that? I don't know. I guess I'll do some tests with randomization.

@evanroberts85 That seems like a valid point. The new FPU will lead to a stronger policy, but with less exploration as a tradeoff (presumably), so I have no idea whether this is beneficial for training or not.

The new FPU certainly implies less exploration of low-policy moves. But from what I see, the Dirichlet noise is applied to psa, so it doesn't seem too bad? I am curious how the Dirichlet noise affects the strength of the new FPU against the old FPU, though.

Even with noise, if on average a move gets fewer visits than its policy score suggests, will there not be pressure in training to reduce the policy score even further, until eventually it drops to almost 0?

I continued working on my own branch, which is now a big departure from Leela 11, so tuning it is taking a while, currently it is slightly worse than Leela 11 with the FPU reduction. I use the following formula:

value = FA(winrate) + FB(policy_score)/FC(child_visits) - FD(child_visits)/FE(parent_visits)

Functions A,B,C and D all make use of logarithms. Function E just uses an exponent.

@Eddh Can you rebase to the "next" branch? Also, note that you can change the FPU in GTP.cpp (but you'll automatically get the new value if you rebase)

@evanroberts85 Hmm, I thought the policy was trained only on the final move output of the search and not the distribution. But I didn't look at that part very much, so I could be wrong.

As for your formula: does that mean 5 parameters to tune? Be careful with that; it will be even harder to tune and to reach conclusions about which parameters are affected by what. In my experience, simpler is better.

@gcp I will do that this evening.
Edit : @gcp sqrt_totalvp now rebased to your next

@jkiliani a tuning run is not actually that easy for us, unfortunately. Again -- this is not the official project and we don't have "run in 3 days" levels of resources. Sorry :( Happy to share our experiences on our side, though. Let me see if I understand the thread.

IIUC, "FPU" here is the Q-value that a child-node is initialized to prior to its expansion, as discussed here right?

Just to make sure I understand what we're talking about: the behavior currently is to use the parent's Q, instead of init-to-0 as described in the paper? And then this initial estimate is either replaced or averaged with the first real result from the value net, correct?

For us, init-to-0 seemed to result in weird behavior in the search, where one side would not bother exploring new moves, and the other side would frantically explore all moves depending on who was winning and who was losing. Whether that behavior is intended & important i don't know.

It seems as if this change (FPU-0.2) is similar in effect to turning up c_puct, right? Why not just do that?

Yes, it is the initial value of Q, for unvisited nodes. I learned what it is just a month ago by reading this old paper: https://hal.archives-ouvertes.fr/hal-00115330/document
My reasoning at the beginning for reducing FPU instead of turning up c_puct was that:

I thought about cfg_puct too, but I thought FPU was more important because, for example, if one side is on a bad sequence which becomes worse and worse as it is explored deeper, the FPU will be higher than the explored nodes, hurting exploitation. If a move with net_eval 0.5 has all its explored children at 0.1, but one at 0.4, then instead of exploring the 0.4 move, which is the best explored move yet, the search will continue exploring unexplored nodes which have an FPU of 0.5, which seems bad to me. This preference will be the same no matter what cfg_puct is.

But now I think there are additional effects that are probably bigger. For one, it places higher trust in the moves that are already explored, because these moves were recommended by the policy network, and so moves that were not recommended are likely to be worse. It seems redundant with the current puct formula, but this effect is more long-term, while the puct formula is a short burst toward recommended moves. Another big effect was described by @killerducky:

A rationalization for why this is better: the puct formula assumes that each move returns a value from some sort of distribution (normal? I don't know), so each additional search of the same move only gets a more accurate measure of that move's average value. But in a Go game, each additional search goes deeper and can potentially provide much more information.

Suppose you have sixteen moves with the same psa, and you have sixteen playouts left. The current code would search each of those moves once (ignoring a large positive change in winrate found in one). But it might be a better strategy to randomly pick one of those sixteen moves and search it twice, then pick another move and search it twice, etc., until you have searched eight moves twice and ignored the other eight.

So the parent-0.2 serves as a sort of progressive widening. Until a move with zero playouts overcomes the -0.2 penalty, you don't search it. But once it is deemed time to search it once, the penalty is removed, and the search focuses on only that move for a while. As the penalty is removed from the top candidates, the search converges to the classic UCT formula. But the parent-0.2 effect is still going on deeper in the tree.

Also, once the penalty is removed and the search starts on a move, suppose the move turns out to be bad after all and gets lower winrates. The search will start to avoid that move again. This can explain why my idea of a moving average is not needed.

tl;dr, oversimplified: for every level in the tree, search each candidate N moves deep before moving to another candidate. But for candidates with N moves searched, spread the searches evenly.

So basically it adds another layer of progressive widening. But it is very confusing. At this point I'd like to be able to see the entire search tree on screen to understand better what happens.

I have done more tests on sqrt_totalvp:

leelaz11 v leelaz_sqrt_totalvp (50/50 games)
board size: 19   komi: 7.5
                      wins              black         white       avg cpu
leelaz11                14 28.00%       7  28.00%     7  28.00%    606.13
leelaz_sqrt_totalvp     36 72.00%       18 72.00%     18 72.00%    605.50
                                        25 50.00%     25 50.00%

leelaz_tuned_red v leelaz_sqrt_totalvp (50/50 games)
board size: 19   komi: 7.5
                      wins              black         white       avg cpu
leelaz_tuned_red        24 48.00%       13 52.00%     11 44.00%    642.24
leelaz_sqrt_totalvp     26 52.00%       14 56.00%     12 48.00%    642.26
                                        27 54.00%     23 46.00%

still on c83e, 800 playouts. leelaz11 being parent net_eval.
So the total result is 54/46 for sqrt_totalvp against puct 0.8 and fpu reduction 0.22. I will do tests on weak networks next.

It seems as if this change (FPU-0.2) is similar in effect to turning up c_puct, right? Why not just do that?

I thought the exact same thing:
https://github.com/gcp/leela-zero/issues/696#issuecomment-359191303

But empirically, this seems not to be true.

My intuition is that c_puct strongly controls the overall shape of the final distribution of the entire search tree, while different FPU strategies control exploration near the leaves (which c_puct also does). So c_puct directly affects how FPU works, but FPU only indirectly influences the global effect of c_puct.

FPU reduction has a massive effect on the shape of the final distribution too. I've played dozens of games on Sabaki looking at the distribution, and it is wildly different from using parent net_eval. Parent net_eval very often covers my screen with the moves it explored, while eval-0.2 in almost all cases restricts the search to a few moves. From all my tests on Sabaki, I've come to the conclusion that it is quite bad if the algorithm covers your screen with explored moves. A restricted search is not a guarantee of quality, but if the search explores all moves despite the policy trying its best to steer it toward 3 moves, it is guaranteed to be bad. That's just my feeling from watching the games, though; I have no data to back this up.

Yes, this is true too. In Odin, MCTS explores moves at such a high rate that FPU reduction will not prevent searching very bad moves as well. Deeper in the tree, however, one can see that the moves with the lowest prior have not yet been searched. Since the exploration term of UCT grows to infinity (but very, very slowly), all moves will be searched at some point in time. Therefore I actually prune provably bad moves in Valkyria and Odin so that they are indeed never searched (but beware of this approach... there are no rules without exceptions in go <- except this rule).

But I think searching bad moves a little at some point is not a major factor. It is more important to decide when nodes near the leaves can exploit the "best moves" found so far, to search deeper and back up more accurate values in complicated situations where evaluation has high uncertainty.

I think it might be a bigger factor than it seems, though for sure I speak for Leela Zero with <= 1600 playouts, and it would not be a problem with a lot more playouts. But in Leela Zero, what good does searching zero or very low policy moves do? They can be good, because as you say there are exceptions to the rules, but they are bad 99.9% of the time. If the search considers them, it will be very late in terms of number of playouts, so even if they are good, the search will rarely have time to converge on them. And of course, if you consider these options early, it disrupts the exploitation of high-policy moves, which are almost always better.

I've actually seen a lot of cases where the final search distribution has a move with high value but low policy which wasn't selected because it didn't have time to receive enough visits. In that case, all the playouts spent exploring low-policy moves and all the playouts spent on the high-value move have been wasted, since there was no chance for it to be selected. It's very different from the bandits problem.

I realised last night why the FPU reduction works so well: the FPU reduction makes a huge difference when node visits are very low, in comparison to just tuning c_puct. This creates a tree with a weeping-willow-like shape, which gives more depth without becoming too narrow.

I am testing having the FPU reduction be a function of node visits; this could keep the advantage of the FPU reduction while having greater exploration nearer the root. Results seem good so far.

These are the current results of sqrt_totalvp at 800 playouts with net 1ac2638d:

leelaz11 v leelaz_sqrt_totalvp (43/50 games)
board size: 19   komi: 7.5
                      wins              black         white       avg cpu
leelaz11                20 46.51%       14 63.64%     6  28.57%    556.76
leelaz_sqrt_totalvp     23 53.49%       15 71.43%     8  36.36%    552.37
                                        29 67.44%     14 32.56%

leelaz_tuned_red v leelaz_sqrt_totalvp (44/50 games)
board size: 19   komi: 7.5
                      wins              black          white       avg cpu
leelaz_tuned_red         5 11.36%       5   22.73%     0   0.00%    299.90
leelaz_sqrt_totalvp     39 88.64%       22 100.00%     17 77.27%    329.51
                                        27  61.36%     17 38.64%

So my branch is a solution which seems OK for weak networks. I am running tests with m=30 and noise now to see how it would fare in self-play.

@Eddh remind me, what are leelaz11 and leelaz_tuned_red? Now that gcp has committed a version of FPU reduction to next, we should use that as the baseline for further tests.

I want to get a solution closer to theory behind the PUCT formula.

winrate + cfg_puct * psa * sqrt(Ntotal)/(Nchild+1).

The formula says: pick the move with the highest winrate, but also explore nodes with few visits (Nchild in the denominator) that we deem likely to be good (psa). cfg_puct controls the balance between exploration and exploitation.
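A minimal sketch of child selection under this formula; cfg_puct = 0.8 and all of the winrate/psa/visit numbers below are made up for illustration:

```python
import math

def select_child(children, parent_visits, cfg_puct=0.8):
    """children: (winrate, psa, visits) tuples; returns the highest-scoring one."""
    def score(child):
        winrate, psa, visits = child
        return winrate + cfg_puct * psa * math.sqrt(parent_visits) / (visits + 1)
    return max(children, key=score)

children = [
    (0.55, 0.40, 20),  # strong winrate but already heavily visited
    (0.50, 0.35, 5),   # good prior, few visits: large exploration bonus
    (0.48, 0.05, 0),   # unvisited, but low prior
]
print(select_child(children, parent_visits=25))
```

Note how the heavily visited child loses out despite its higher winrate: its exploration bonus has been divided down by its visit count.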

These improvements adjust the winrate for nodes with 0 visits, or more generally adjust the Bayesian prior of the winrate (which includes the 0 case, but also has an effect on cases where the number of visits is low).

The first change LZ made was to change winrate for nodes with 0 visits from 1.1 to the parent's winrate. We should find a good way to use information from visits to sibling nodes to adjust the winrate for the rest of the nodes. The sqrt_totalvp is one way to do this. Doing something that looks at sibling nodes, such as the exponential averaging I did, should be more correct, but maybe it's too complicated for the gain vs simpler heuristics.

Minigo has done something slightly different. They also use the winrate of the parent for children with 0 visits, but they treat it as a Bayesian prior: when visits = 1, the winrate is the average of the Bayesian prior (the parent's winrate) and the winrate the NN gives for that child. As visits accumulate, the impact of this Bayesian prior diminishes. This makes sense to me because there shouldn't be any special difference between samples of the winrate from the parent, the child, and the UCTSearch from the child down.
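The Minigo-style prior described above can be sketched as follows (an illustration of the averaging idea, not actual Minigo code):

```python
def child_winrate(parent_winrate, value_sum, visits):
    """Treat the parent's winrate as one pseudo-sample: it dominates at
    0 visits and its influence fades as real value samples accumulate."""
    return (parent_winrate + value_sum) / (visits + 1)

print(child_winrate(0.6, 0.0, 0))  # 0 visits: exactly the parent's winrate
print(child_winrate(0.6, 0.4, 1))  # 1 visit, NN value 0.4: average of the two
```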

The work in this thread can be viewed as adding information from siblings' winrate and psa to the Bayesian prior winrate of unexplored nodes.

Eddh's latest version, winrate = parent_winrate - 0.25 * sqrt(total_visited_policy), says the winrate of children tends to get worse as the psa gets lower. What about winrate = parent_winrate - scaling_const * squashing_func((max_psa - this_psa)/max_psa)? The point is the max_psa - this_psa term, where children with identical psa get identical lowerings. squashing_func can be whatever works best (sqrt? linear?).
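The proposed adjustment can be sketched as below; scaling_const = 0.25 and sqrt as the squashing function are placeholder choices, not tested values:

```python
import math

def prior_winrate(parent_winrate, this_psa, max_psa,
                  scaling_const=0.25, squash=math.sqrt):
    """Unvisited-node prior: lowered more the further this_psa is below max_psa.
    Children with identical psa get identical lowerings."""
    return parent_winrate - scaling_const * squash((max_psa - this_psa) / max_psa)

max_psa = 0.4
for psa in (0.4, 0.2, 0.05):
    print(psa, round(prior_winrate(0.5, psa, max_psa), 3))
```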

leelaz11 is parent net_eval, leelaz_tuned_red is with puct = 0.8 and fpu reduction = 0.22
Your idea seems interesting. I actually tried something close to it. Instead of this_psa I used the max unvisited psa, as an adjustment to sqrt_totalvp: sqrt(total_visited_policy - max_unvisited_psa). It didn't seem to work well, and after thinking about it more I ended up not liking it.
As for squashing_func(max_psa - this_psa), I think there is a problem: if you have one move with 0.9 psa and another at 0.1, you get an extremely high fpu reduction that stops the 0.1 move from ever being explored. If you tune the function down enough that this isn't a problem, the fpu reduction might then be too low.

But at least a high value move that gets search late will provide training data for the network to learn that the prior should have been higher. So in time, that move will be found earlier.

This reminds me, then, that early termination of the search "because the result can't change" does bias dump_training() against moves found later in the search.

I see. I thought that only the final move output was used for training, but rereading that part of the AGZ paper, you are right: it seems trained to match the visit distribution. Dirichlet noise should prevent moves from never being looked at though, so it's probably not a problem that fpu reduction makes the search ignore a lot of low-policy moves.

@Eddh since you only ran 50 games against "next", I ran some more test games overnight. For the first 100 games the lead kept going back and forth between the two versions by a couple of percent, but this morning, after 300 games, the next branch seems to have a small edge.

Leela_NXT_800p v Leela_SQRT_totalvp_800p (300/400 games)
board size: 19 komi: 7.5

Leela_NXT_800p 159 53.00%
Leela_SQRT_totalvp_800p 141 47.00%

NXT is the Next branch from February 1st (after the save games to disk update)
SQRT_totalvp is your github from February 2nd.
Network was the current best 257 network, 800 playouts, resign at 1%

EDIT: OK, how do I paste the ringmaster report as nicely as everyone else does?

OK, how do I paste the ringmaster report as nicely as everyone else does?

There is an "insert code" button in github.

Interesting results anyway. I don't know how statistically significant the difference is though.

Or like

```
code
code
```

for text you should use
```text
[whatever]
```

but if you insert actual code you can also hint at the language:
```js
[code]
```

Else it will just auto-detect what you used.

I think I might have made a big mistake with my pushed branch sqrt_totalvp:
https://github.com/Eddh/leela-zero/commit/10da6922129d8277b095d1ca8d74ec33ccd895f9
@maxMaki I am so sorry. I guess we did learn something: such a massively lowered fpu has surprisingly little effect on the winrate. The tests I did in my VM were probably without that mistake, since I think I just copy-pasted the line there (and it would be hard to explain the crushing result with weak networks otherwise).
I will be extra careful with these things in the future.

Edit : I am actually not even sure if your tests had that mistake or not. I think I made the mistake when rebasing to gcp's next branch.

@Eddh no worries. Yes, I had return eval - cfg_fpu_reduction; in my version as well.
Deleted, and redownloaded/rebuilt from your github now.

I can rerun next night to see how it compares to next.
I just queued some games now for a quick check if the difference is huge, and currently it's 12 wins, 12 losses.

So that change doesn't at least seem to have a huge difference.

Is tuning automatically active on the next branch or does it require some startup flags to be set? (puct 0.8 , reduction 0.22)

You can look in GTP.cpp what cfg_puct and cfg_fpu_reduction are set at. They are at 0.8 and 0.22 in the next branch.

My branch is 5-0 up against FPU-0.22 in 800 playout matches.

I have made several changes but the biggest gain seemed to be to set FPU to
highest child, then increasing puct values for the first 50 parent visits. [edit: this did not really work, I ended up using policy exponent instead]
This gives the gains of the -0.22 reduction but without having the
blindspots.

I'll clean up the code and upload tomorrow. What do people use for
automated testing? If I can leave it running overnight I can run a lot more
tests at higher playouts.

Edit: I meant to say decreasing not increasing puct values for the first 50 visits. i.e reduce exploration.

On Sat, 3 Feb 2018 at 19:21 Eddh notifications@github.com wrote:

You can look in GTP.cpp what cfg_puct and cfg_fpu_reduction are set at.
They are at 0.8 and 0.22 in the next branch.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/gcp/leela-zero/issues/696#issuecomment-362846754, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AgPB1pWiNc6X-9fYpqvOA493lzgew9Hxks5tRLHGgaJpZM4Rlcn0
.

@evanroberts85 Gomill-ringmaster is quite useful for testing GTP clients, including LZ of course.

You can also use the tool in the validation folder.

@Eddh, I ran some more games at 800p with an updated version of your sqrt_totalvp, but sadly it didn't look very good so I stopped at 50 games.

Then I decided to try whether it would scale differently at 1600 playouts, and the results were very different between 800p and 1600p. Does this make any sense? I'll leave the 1600-playout run going.

Leela_NXT_800p v Leela_SQRT_totalvp_800p (50/400 games)
board size: 19   komi: 7.5
                                               wins                 black              white             avg cpu
Leela_NXT_800p             27 54.00%       13 52.00%     14 56.00%    132.10
Leela_SQRT_totalvp_800p    23 46.00%       11 44.00%     12 48.00%    133.28
                                               24 48.00%     26 52.00%


Leela_NXT_1600p v Leela_SQRT_totalvp_1600p (50/400 games)
board size: 19   komi: 7.5
                                                    wins                black               white            avg cpu
Leela_NXT_1600p                17 34.00%       8  32.00%       9  36.00%    274.06
Leela_SQRT_totalvp_1600p       33 66.00%      16 64.00%      17 68.00%    276.65
                                                    24 48.00%      26 52.00%

I'm not sure 27/23 is statistically significant. My results, which were slightly better for sqrt_totalvp, weren't statistically significant either, so "about the same strength as nxt" seems valid to me. On the other hand I didn't do any tests at 1600, so your results are certainly surprising. It could make sense, since sqrt_totalvp applies a higher and higher fpu reduction as nodes get more visited. I would never have expected that to boost the winrate at higher playouts though, since sqrt_totalvp was actually designed more with low playouts and bad networks in mind.

@Eddh, sadly it seems that the first 50 games did not give an accurate idea of the strength at 1600 playouts. I let it run to 200, and now it looks like all the other results: more or less equal with next.

Leela_NXT_1600p v Leela_SQRT_totalvp_1600p (200/400 games)
board size: 19   komi: 7.5
                           wins              black         white        avg cpu
Leela_NXT_1600p             102 51.00%       44 44.00%     58  58.00%    240.41
Leela_SQRT_totalvp_1600p     98 49.00%       42 42.00%     56  56.00%    243.34
                                             86 43.00%     114 57.00%


I see. Well, it is good enough that it is equal, since it beats next with bad networks and can still be tuned. Thank you for all the tests!

I've uploaded my branch here:
https://github.com/evanroberts85/evan-leela/

I ended up switching to what I think is the most elegant solution. FPU is set to the best visited child minus 0.01, and then I apply an exponent to the psa score.

auto psa_exponent = 1.5;
auto puct = cfg_puct * pow(psa,psa_exponent) * (numerator / denom);

// TO TEST NEXT: auto psa_exponent = 1.0 + 0.5 * (500.0 - parentvisits) / 500.0; (applied only when parentvisits < 500)

I need to work out the best value for the exponent, 1.5 was my first guess but validation is taking ages on my computer. Currently winning 6-3 at 1600 playouts (vs /next)

Reasoning behind my branch: I looked at the effect of the FPU reduction. Basically, if the policy difference between two children is doubled, you need around 3x the number of visits before the weaker node gets explored (the exact multiple depends on a few factors). So I think it is better to get rid of the FPU reduction and use an exponent to achieve this 3x effect instead.
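A toy two-child simulation (fixed, equal winrates; illustrative constants puct = 0.8 and fpu reduction 0.25; not the actual search code) shows this kind of effect:

```python
import math

def visits_until_explored(psa_strong, psa_weak, q=0.5, fpu_red=0.25, cfg_puct=0.8):
    # Two children with equal true winrate q. The unexplored child is scored
    # at q - fpu_red until its first visit. Count parent visits until the
    # weak-prior child's PUCT score first overtakes the strong child's.
    n_strong = 1  # the strong child takes the first visit
    while n_strong < 10000:
        total = n_strong
        s_strong = q + cfg_puct * psa_strong * math.sqrt(total) / (n_strong + 1)
        s_weak = (q - fpu_red) + cfg_puct * psa_weak * math.sqrt(total)
        if s_weak > s_strong:
            return total + 1  # the visit on which the weak child is chosen
        n_strong += 1
    return None
```

With these constants, halving the weak child's prior from 0.2 to 0.1 more than doubles the visits needed before it is explored at all, in rough agreement with the claim above.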

@evanroberts85 Interesting. It does seem to have an effect similar to FPU reduction.

Out of curiosity I played a game with the following failed network against the best network both running on evanroberts85's modified version with 800 playouts.

image

Result is 1-0 for the failed network.

Here's the game.
http://eidogo.com/#2Um4VASaw

A single game holds no statistical significance, although this result definitely suggests getting more data on this. Can you add a couple of games to this matchup?

It is interesting. I'll add a couple more games tonight after work.

With a proper amount of games, the "improved" versions can be seen to be clearly worse than what is in "next" now. There is a "validation" tool in the codebase that tells you when you have an improvement and when you're tossing dice.

leelaz-257a v leelaz-eddh (656/1000 games)
board size: 19   komi: 7.5
              wins              black          white        avg cpu
leelaz-257a    359 54.73%       168 51.22%     191 58.23%    368.65
leelaz-eddh    297 45.27%       137 41.77%     160 48.78%    361.11
                                305 46.49%     351 53.51%

@gcp Hmm, I think I made a mistake when rebasing, forgetting about the cfg_fpu_reduction that the rebase added (https://github.com/Eddh/leela-zero/commit/10da6922129d8277b095d1ca8d74ec33ccd895f9), so the fpu reduction was applied twice. I hope that wasn't the version you downloaded; I'm very sorry if it was. Anyway, the goal of this branch is to be as strong with strong networks while staying OK with weak networks, so there is no need to switch between fpu strategies during a training run. For that I have these results, which seem significant to me:

leelaz_tuned_red v leelaz_sqrt_totalvp (44/50 games)
board size: 19   komi: 7.5
                      wins              black          white       avg cpu
leelaz_tuned_red         5 11.36%       5   22.73%     0   0.00%    299.90
leelaz_sqrt_totalvp     39 88.64%       22 100.00%     17 77.27%    329.51
                                        27  61.36%     17 38.64%

with 800 playouts on net 1ac2638d
Edit: I found an easy-to-use SPRT script (https://chessprogramming.wikispaces.com/Ratios%20/%20Operating%20Figures-SPRT), and funnily, 359/297 is one game short of being an SPRT pass:

>>> sprt.SPRT(359,1,297,0.0,35.0,0.05,0.05)
''
>>> sprt.SPRT(360,1,297,0.0,35.0,0.05,0.05)
'H1'
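For reference, the LLR behind such a script can be sketched with a plain binomial model (ignoring the single draw, so this is only an approximation of the linked script):

```python
import math

def elo_to_winprob(elo):
    # Logistic Elo model: expected score against an opponent `elo` points weaker.
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, elo0=0.0, elo1=35.0):
    # Binomial SPRT log-likelihood ratio between H1 (elo1) and H0 (elo0).
    p0, p1 = elo_to_winprob(elo0), elo_to_winprob(elo1)
    return wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))

# With alpha = beta = 0.05, H1 is accepted when the LLR crosses log(19) ~ 2.944.
upper = math.log((1 - 0.05) / 0.05)
```

In this approximation 359-297 lands just under log(19), while 360-297 crosses it, matching the "one game short" observation.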

Here is result from the match-up of d6c843c9 vs 257aeeb8 running on evanroberts85's version.

  | Black | White | Result | Playouts | View
-- | -- | -- | -- | -- | --
Game 1 | 257aeeb8 | d6c843c9 | W+Resign | 800 | http://eidogo.com/#2Um4VASaw
Game 2 | d6c843c9 | 257aeeb8 | W+Resign | 1600 | http://eidogo.com/#32mBcEjTP
Game 3 | d6c843c9 | 257aeeb8 | B+Resign | 1600 | http://eidogo.com/#2oip5cco6
Game 4 | 257aeeb8 | d6c843c9 | W+Resign | 1600 | http://eidogo.com/#ABm0Oc2A
Game 5 | d6c843c9 | 257aeeb8 | B+Resign | 1600 | http://eidogo.com/#Bqsr1StS

d6c843 wins 4 to 1.

Few notes:

  • The overall quality of the games seems on a different level than before. They are fascinating to look at.
  • Look at the fuseki played by black in Game 2. I have not seen Leelaz play it before.
  • There seems to be no ladder reading problems that I can see. Look at game 5.

800 playouts or less is nice to get some fast results, but 1600 playouts with full testing (preferably SPRT significant) is necessary to say anything concrete.

4:1 is not enough to say whether the 4-win net is stronger (>50% win), but it is enough to say it's stronger than the 0:54 it had before. Indeed, significantly more results are needed for an exact verdict.

funnily, 359/297 is one game short from being a SPRT pass

It's 297-359, not 359-297. It failed long ago but I didn't need the machine over the weekend.

I hope that wasn't the version you downloaded.

I used the sqrt_totalvp branch that you indicated.

I ran a validation test overnight on my version, and got a 14-18 result (or should I say non-result). I'm currently testing another version where, instead of a 0.01 reduction from the best child, it uses the policy score for a better guess. I've then reduced the exponent to compensate.

Ideally I am looking for a version that is as strong as the FPU-0.22 reduction but with much wider exploration at root, as I think that is key for training.

On Mon, 5 Feb 2018 at 14:33 Gian-Carlo Pascutto notifications@github.com
wrote:

funnily, 359/297 is one game short from being a SPRT pass

It's 297-359, not 359-297.



I used the sqrt_totalvp branch that you indicated.

That's the thing. I rebased twice (because there were new commits on next to apply), and I'm not sure whether it was on the first or second rebase, but I forgot that rebasing would change get_eval's return to
return net_eval - cfg_fpu_reduction;. And sqrt_totalvp makes fpu be child->get_eval(color) - 0.25*sqrt(total_visited_policy), so because of the mistake on my github branch, fpu reduction is applied twice with two different methods, leading to fpu = net_eval - cfg_fpu_reduction - 0.25*sqrt(total_visited_policy). But since I don't know the timing of my mistake, I don't know whether you downloaded the right version or not. If you still have the branch you downloaded, you can check in UCTNode.cpp whether get_eval returns eval or eval - cfg_fpu_reduction.
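In Python-flavored pseudocode the mistake amounts to stacking the two reductions (hypothetical function names):

```python
import math

CFG_FPU_REDUCTION = 0.22

def fpu_intended(net_eval, total_visited_policy):
    # sqrt_totalvp as intended: get_eval returns the raw net_eval.
    return net_eval - 0.25 * math.sqrt(total_visited_policy)

def fpu_buggy(net_eval, total_visited_policy):
    # After the bad rebase, get_eval already subtracted cfg_fpu_reduction,
    # so both reduction methods were applied to the same node.
    return (net_eval - CFG_FPU_REDUCTION) - 0.25 * math.sqrt(total_visited_policy)
```

Every unvisited node was thus scored an extra cfg_fpu_reduction lower than intended.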

@Eddh there's been a lot of commits in next since, would you mind rebasing and then rather than pushing to the same branch, pushing to a separate branch and pointing @gcp to the fixed branch instead?

Thanks for your good work! I've been following this issue closely and seems like we're on the verge of getting something relatively simple that works for all our networks rather than just ones with a good policy, very encouraging results!

I can run another test overnight but obviously I won't get 600 games in that time. But maybe enough to make a decision. But yes...provide a branch rebased to next that works, pretty please!

Yeah I will rebase (and reread the github branch 3 times) as soon as I can this evening.

@evanroberts85

Ideally I am looking for a version that is as strong as the FPU-0.22
reduction but with much wider exploration at root, as I think that is key
for training.

I wonder if you can't use the depth from root as some sort of parameter then. So at/near root you search wider, but as you get deeper you just use FPU-0.22.

@roy7 yea that is what I have been testing (though using a policy exponent rather than the FPU reduction), it seems to make it slightly worse, but needs more testing to be sure. In any case I think it is a necessary tradeoff.

I am using this formula for the FPU reduction:

best_child_or_parent_value + log10(psa) * 0.025;

So a 10% policy gives a 0.025 reduction
1% policy = 0.05 reduction
0.1% policy = 0.075

All much less than a simple 0.22
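The quoted numbers check out numerically (a sketch; best_child_or_parent_value stands in for whichever anchor value is used):

```python
import math

def fpu_log10(best_child_or_parent_value, psa):
    # FPU = anchor + log10(psa) * 0.025. Since psa < 1, log10(psa) is
    # negative, so lower-prior children get a larger reduction.
    return best_child_or_parent_value + math.log10(psa) * 0.025

def reduction(psa):
    # Size of the reduction for a given prior.
    return -math.log10(psa) * 0.025
```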

It should be OK this time: sqrt_totalvp_clean. I made it cleaner by changing cfg_fpu_reduction to 0.25f and using cfg_fpu_reduction instead of a hard-coded constant.
It should also be noted that the 0.25f multiplier I use is a guess; there is probably a better value somewhere in the [0.22, 0.3] range, which should gain a few winrate percent.

My new branch:
https://github.com/evanroberts85/evan-leela/tree/dev

Apart from using the policy score for the FPU reduction that I mentioned earlier, I also reduced the puct weighting to compensate for the exponent I applied to the policy score. Watching in Sabaki this seems to have made it stronger. I am hopeful but of course it needs testing. I am running validation now on my (very slow) laptop.

@evanroberts85 There's a lot of magic numbers there... 😲

Hmm, why reduce cfg_puct to compensate for the exponent? It should be the opposite. The exponent makes the psa part lower, because for x in (0,1) and any exponent above 1, x^exponent < x.

The exponent makes the difference between two policy scores greater. This means that even if the winrate was good once the node was expanded, it was still difficult for a child with a low policy score to pick up enough visits to be selected.

2 policies : one at 0.3 another at 0.2, difference is 0.1
through a ^1.5 exponent : they become 0.16431676725 and 0.0894427191, difference is 0.07487404815

By difference I really mean the ratio:
0.3/0.2 = 1.5
0.1643/0.08944 = 1.83
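Numerically (for any a > b > 0 and exponent e > 1, the ratio (a/b)^e exceeds a/b):

```python
# Exponentiating the priors amplifies their ratio, not their difference.
a, b, e = 0.3, 0.2, 1.5
raw_ratio = a / b                    # 1.5
boosted_ratio = (a ** e) / (b ** e)  # equals (a / b) ** e, about 1.837
```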

Exactly, so the ratio increases?

Yeah my bad.

I am doing some (non-automated) tuning. So far I've gone from 1 -> 0.7 -> 0.5 -> 0.4, with the lower value winning over the higher each time.

Opening Search (move 1):

/next with the FPU -0.22 reduction:

  D4 ->     414 (V: 47.60%) (N: 16.08%) PV: D4 Q4 Q16 D16 R3 R4 Q3 P3 P2 O3 O2 N3 N2 M3 C17 D17 
  Q4 ->     379 (V: 47.58%) (N: 14.91%) PV: Q4 D16 D4 Q16 C17 C16 D17 E17 E18 F17 F18 G17 G18 H17 
 D16 ->     373 (V: 47.59%) (N: 14.50%) PV: D16 D4 Q16 Q4 C3 D3 C4 C5 B5 C6 B6 C7 B7 C8 R3 R4 Q3
 Q16 ->     362 (V: 47.49%) (N: 15.11%) PV: Q16 D4 Q4 D16 C3 C4 D3 E3 E2 F3 F2 G3 G2 H3 C17 C16 
  D3 ->      27 (V: 47.88%) (N:  0.84%) PV: D3 Q4 Q16 D16 D5 R17
  R4 ->      18 (V: 47.20%) (N:  0.93%) PV: R4 D4 Q16 D16 P4 R17
 D17 ->      10 (V: 46.64%) (N:  0.75%) PV: D17 Q16 Q4 D4
  Q3 ->      10 (V: 46.59%) (N:  0.77%) PV: Q3 Q16 D4 D16
 R16 ->       8 (V: 46.82%) (N:  0.71%) PV: R16 D4 Q4 D16

My version:

  D4 ->     388 (V: 47.54%) (N: 14.50%) PV: D4 Q16 Q4 D16 D6 R3 Q3 R4 R5
 Q16 ->     379 (V: 47.52%) (N: 14.91%) PV: Q16 D16 D4 Q4 F4 R17 Q17 R16 R15
  Q4 ->     371 (V: 47.45%) (N: 16.08%) PV: Q4 D4 Q16 D16 O16 D6 C17 C16
 D16 ->     338 (V: 47.44%) (N: 15.11%) PV: D16 Q4 D4 Q16 D14 C3 C4 D3 E3
 C16 ->      21 (V: 47.61%) (N:  0.66%) PV: C16 Q4 Q16 D4 E16
 D17 ->      19 (V: 47.53%) (N:  0.71%) PV: D17 D4 Q4 Q16
 R16 ->      16 (V: 47.25%) (N:  0.77%) PV: R16 Q4 D4 D16
  C4 ->      15 (V: 47.31%) (N:  0.75%) PV: C4 Q16 Q4 D16
  R4 ->      15 (V: 47.26%) (N:  0.84%) PV: R4 Q16 D4 D16
  Q3 ->      14 (V: 47.34%) (N:  0.65%) PV: Q3 D16 Q16 D4
 Q17 ->      13 (V: 47.05%) (N:  0.93%) PV: Q17 D16 D4 Q4
  D3 ->      12 (V: 47.16%) (N:  0.66%) PV: D3 Q16 D16 Q4

With Leela Zero 11:

 Q16 ->     291 (V: 47.75%) (N: 14.91%) PV: Q16 D4 Q4 D16 C3 D3 C4 C5
  Q4 ->     287 (V: 47.58%) (N: 16.08%) PV: Q4 D4 Q16 D16 C3 C4 D3 E3 E2 F3
 D16 ->     265 (V: 47.56%) (N: 15.11%) PV: D16 Q4 Q16 D4 R3 R4 Q3 P3 P2 O3
  D4 ->     260 (V: 47.59%) (N: 14.50%) PV: D4 Q4 D16 Q16 R3 Q3 R4 R5 S5 R6
  C4 ->      16 (V: 47.73%) (N:  0.75%) PV: C4 Q4 Q16 D16
  D3 ->      15 (V: 47.97%) (N:  0.66%) PV: D3 D16 Q16 Q4
 D17 ->      15 (V: 47.84%) (N:  0.71%) PV: D17 Q4 D4
  R4 ->      15 (V: 47.60%) (N:  0.84%) PV: R4 D4 Q16 D16
 Q17 ->      15 (V: 47.27%) (N:  0.93%) PV: Q17 Q4 D4 D16
  Q3 ->      13 (V: 47.77%) (N:  0.65%) PV: Q3 D4 D16
 R16 ->      13 (V: 47.58%) (N:  0.77%) PV: R16 Q4 P16
 C16 ->      11 (V: 47.60%) (N:  0.66%) PV: C16 Q4 Q16
 R17 ->       6 (V: 47.09%) (N:  0.44%) PV: R17 Q4 D16
  R3 ->       5 (V: 46.75%) (N:  0.45%) PV: R3 D16 Q16
  C3 ->       4 (V: 46.46%) (N:  0.42%) PV: C3 Q16
 C17 ->       4 (V: 46.21%) (N:  0.48%) PV: C17 Q4
 P17 ->       3 (V: 47.68%) (N:  0.17%) PV: P17 Q4
 D15 ->       3 (V: 46.21%) (N:  0.26%) PV: D15 D4
  E3 ->       2 (V: 47.40%) (N:  0.12%) PV: E3 Q16
 E17 ->       2 (V: 47.16%) (N:  0.18%) PV: E17 D4
  Q6 ->       2 (V: 46.76%) (N:  0.07%) PV: Q6 Q4
 R15 ->       2 (V: 46.76%) (N:  0.13%) PV: R15 Q4
  P4 ->       2 (V: 46.59%) (N:  0.24%) PV: P4 D4
  R5 ->       2 (V: 46.56%) (N:  0.19%) PV: R5 D16
  P3 ->       2 (V: 46.20%) (N:  0.17%) PV: P3 D16
  Q5 ->       2 (V: 46.09%) (N:  0.24%) PV: Q5 D4
  E4 ->       2 (V: 45.82%) (N:  0.22%) PV: E4 Q16
 E16 ->       2 (V: 45.59%) (N:  0.23%) PV: E16 Q4
  D5 ->       2 (V: 45.48%) (N:  0.17%) PV: D5 Q4
 Q15 ->       2 (V: 45.48%) (N:  0.29%) PV: Q15 D16
  P2 ->       2 (V: 45.40%) (N:  0.09%) PV: P2 Q16
 F18 ->       2 (V: 45.33%) (N:  0.11%) PV: F18 Q16
 S15 ->       2 (V: 45.17%) (N:  0.09%) PV: S15 D16
 P16 ->       2 (V: 45.13%) (N:  0.22%) PV: P16 Q4
  C6 ->       1 (V: 48.08%) (N:  0.07%) PV: C6 
Plus another couple of hundred moves at 1 visit each.

All results 1600 playouts with the strong 5x64 network

My version with the puct reduced by half is at 6 wins to 2 losses against FPU-0.22. I'll leave it running overnight and see if it stays ahead.

6 wins, 2 losses
8 games played.
Status: 0 LLR 0.416153 Lower Bound -2.94444 Upper Bound 2.94444

Tuning so far suggests that I would be better reducing puct further. Currently oscillating around 0.4

Edit: now 7-2 up, looking promising.

Not sure why it took me so long to think of mentioning this, but since AQ is the strongest open source bot currently, someone could go look and see what it does. ;)

This morning:

20 wins, 12 losses
32 games played.
Status: 0 LLR 0.660221 Lower Bound -2.94444 Upper Bound 2.94444

Tagged this as version 1.0.
https://github.com/evanroberts85/evan-leela/tree/1.0/src

Top 8 searched moves, move 1 with 6400 playouts (5x64 network):

Leela 11.0

  Q4 ->    1288 (V: 47.48%) (N: 16.08%) PV: Q4 D4 Q16 D16 C3 C4 D3 E3 E2 F3 F2 G3 G2 H3 H2 J3
 D16 ->    1199 (V: 47.47%) (N: 15.11%) PV: D16 Q4 Q16 D4 R3 R4 Q3 P3 P2 O3 O2 N3 N2 M3 C3 D3
  D4 ->    1197 (V: 47.50%) (N: 14.50%) PV: D4 Q16 D16 Q4 R17 Q17 R16
 Q16 ->    1169 (V: 47.46%) (N: 14.91%) PV: Q16 D16 Q4 D4 C17 C16 D17 E17 E18 F17 F18 G17
 Q17 ->     123 (V: 47.81%) (N:  0.93%) PV: Q17 D4 Q4 D16 Q15 R3 R4
 D17 ->     109 (V: 47.88%) (N:  0.71%) PV: D17 D4 Q16 Q4 D15 R17 Q17
  R4 ->      86 (V: 47.66%) (N:  0.84%) PV: R4 D16 Q16 D4 P4 R17 R16
  Q3 ->      85 (V: 47.80%) (N:  0.65%) PV: Q3 D16 Q16 D4 Q5 R17 Q17

/next with the FPU-0.22:

 D16 ->    1135 (V: 47.38%) (N: 14.91%) PV: D16 Q16 D4 Q4 R17 R16 Q17 P17 P18 O17 O18 N17 N18 M17 R3
  D4 ->    1112 (V: 47.29%) (N: 16.08%) PV: D4 D16 Q4 Q16 C17 C16 D17 E17 E18 F17 F18 G17
 Q16 ->    1084 (V: 47.33%) (N: 15.11%) PV: Q16 D16 Q4 D4 C17 D17 C16 C15 B15 C14 B14 C13 B13 C12 C3 C4 D3 E3 E2 F3
  Q4 ->    1076 (V: 47.36%) (N: 14.50%) PV: Q4 D4 D16 Q16 C3 C4 D3 E3 E2
  C4 ->     297 (V: 48.03%) (N:  0.84%) PV: C4 Q16 Q4 D16 E4 R3 Q3 R4 R5 S5 R6 S6 R7 S7 R8
 D17 ->     270 (V: 47.99%) (N:  0.93%) PV: D17 Q4 D4 Q16 D15 C3 C4 D3 E3 E2 F3 F2 G3
 R16 ->     248 (V: 48.04%) (N:  0.66%) PV: R16 Q4 D16 D4 P16 C17 D17 C16 C15 B15 C14 B14 C13 B13
  D3 ->     243 (V: 48.04%) (N:  0.65%) PV: D3 Q4 D16 Q16 D5 C17 C16 D17 E17

My version 1.1:

 R16 ->    1474 (V: 47.87%) (N:  0.77%) PV: R16 D4 Q4 D16 P16 R3 R4 Q3 P3 P2 O3 O2 N3 N2 M3
 Q17 ->    1393 (V: 47.90%) (N:  0.93%) PV: Q17 D4 Q15 D16 Q4 D14 C3 C4 D3 E3 E2
 Q16 ->     686 (V: 47.33%) (N: 14.91%) PV: Q16 Q4 D16 D4 C3 C4 D3 E3 E2 F3 F2 G3 G2 H3 R3 Q3
 Q4 ->     682 (V: 47.28%) (N: 16.08%) PV: Q4 Q16 D4 D16 C17 C16 D17 E17 E18 F17 F18 G17 G18 H17    
 D4 ->     659 (V: 47.32%) (N: 14.50%) PV: D4 D16 Q4 Q16 C17 C16 D17 E17 E18 F17 F18 G17 G18 H17 
 D16 ->     653 (V: 47.29%) (N: 15.11%) PV: D16 Q16 D4 Q4 R17 Q17 R16 R15 S15 R14 S14 R13
  R4 ->     334 (V: 47.81%) (N:  0.84%) PV: R4 Q16 D16 D4 P4 C17 C16 D17 E17
  D3 ->     320 (V: 47.82%) (N:  0.66%) PV: D3 D16 Q16 Q4 D5 R17 Q17 R16 R15 S15 R14

My version is the only one to open with the 3-4 point! :-) For the others it is not even close; they both prefer 4-4. The advantages of having a lower puct value: more innovation!

As a fun experiment I created a version that would really seek out those rare moves. Puct is reduced to a tenth and the policy exponent becomes less than 1 after 800 parent visits.

I thought /next would be way too strong for it, so I played it against 11.0 and it won. So at the very least it is not hopeless. It opens with the 3-4, 5-4 and star-point enclosure fuseki as black (I cannot remember what this opening is called), even with just 1600 playouts.

Find it here:
https://github.com/evanroberts85/evan-leela/tree/1.1-Patagonia/src

Edit: meanwhile version 1.0:

30 wins, 23 losses
53 games played.
Status: 0 LLR 0.439956 Lower Bound -2.94444 Upper Bound 2.94444

FWIW, personally I would prefer a bot that plays varied opening moves (rather than almost solely 4-4 points), even if it should be marginally weaker

I agree. We also want to avoid a positive feedback loop in training where, even if a move evaluates well, low-policy moves get few visits, then training reacts by decreasing their policy score, then they get even fewer visits, and so on.

In self-play we are not trying to beat Lee Sedol, we are trying to create good training data. So expanding the playbook has got to be a good thing.
On Tue, 6 Feb 2018 at 15:04, Yakago notifications@github.com wrote:

FWIW, personally I would prefer a bot that plays varied opening moves
(rather than almost solely 4-4 points), even if it should be marginally
weaker



But a positive feedback loop on good moves is exactly what we want. If you want more varied self-play games, just turn on this temperature thing; if you want more variation in live games, similar randomness can be turned on (though preferably randomness that doesn't let it pick very bad moves).

Remember that one player is always losing, so even if you let it pick only the best move, the best move should keep changing for the losing player until it finds a winning variation; then the other player can deviate (in extreme cases). True stagnation is never possible.

Positive feedback in a positive direction, yes! But what I stated, positive feedback in a negative direction, is bad; that should only happen when a move has a poor evaluation.

The evidence for why we need to change should be clear. We are stuck with the 3-4 point having a very low policy score (not to mention the 3-3 or 3-5 openings) not because Leela struggles to play it (its tree evaluation result is better than 4-4) but because of the feedback loop I mentioned, which probably started very early in LZ's lifetime, and it will be even worse now due to people using /next with the fpu reduction.

My version will play the 3-4 point often, so its policy score will rise through successive training/self-play. But this is not just about the opening; it applies to every blind spot Leela Zero currently has.

Ah, now you said something that made sense to me :). So in the current situation, a better move with a low policy prior actually sits in a negative feedback loop. That's indeed something we can't have.

Negative feedback means resistance to change; what we have is positive feedback in a negative direction. Though there are limits, so it does not reduce quite to zero.

34 wins, 24 losses
58 games played.
Status: 0 LLR 0.721189 Lower Bound -2.94444 Upper Bound 2.94444

Looks like my version 1.0 is around even or stronger than fpu-0.22. I could do with someone with a decent computer confirming this as I can not keep running this much longer.
https://github.com/evanroberts85/evan-leela/tree/1.0

Feel free to experiment with reducing puct even more, or changing the exponent calculation. See my other branches for inspiration.

But wait, you said it was staying on the 4-4, and it already is on the 4-4. Isn't that "resistance to change"? The policy is already low at the 4-3 point, and with positive feedback it should move away from that quicker and quicker, but instead it is getting fixed on the 4-4 more firmly each iteration.

If you have two moves with very similar winrates, like the 4-4 at 52.9% and the 4-3 at 53.3%, but the policy net giving 90% to the 4-4 and only 10% to the 4-3, there is positive feedback for the new 4-3 move from the winning %, but it is very small (the difference is just 0.4%), while there is negative feedback from the policy net leading it to the 4-4 move 90% of the time. Because the difference for the policy net is big (80%) as opposed to the difference in winrate (0.4%), the positive feedback from the winrate does not overcome the negative feedback from the policy net.

Anyway, I think we mean the same thing, so this discussion isn't very interesting. All we care about is that it should be able to converge towards better moves once it finds them. If moves are very similar in win %, it would actually be preferable if the net liked them all equally.

but will be even worse now due to people using /next with the fpu reduction.

Actually, if you look at the top 8 searched moves that you posted, the 3-4 points are looked at more with fpu reduction than with the current version, though not enough to select them, probably because of the hundreds of playouts that the current version invests in exploring 0.01%-policy moves. So it doesn't seem to make this feedback loop worse.

@Dorus yeah, we sort of mean the same thing, just looking at it from different angles.

@Eddh I guess you are right that once a node is explored, /next ends up with similar or sometimes better values (I am not sure why they would be better; that could just be noise). It has a lot more unexplored nodes though.

Edit: I think cfg_puct was reduced slightly in /next, so that would explain the slightly higher visits for explored nodes with low policy scores.

As long as a move with a low policy prior gets selected at some point and can get a percentage of playouts that is more than its current prior, the feedback is positive, not negative.

Even for hopeless moves, the dirichlet noise is exactly there so even they can break through.

If it doesn't play 3-4 now, it's because it needs to search too deep to see that the moves are really good, not because it never selects them. It will learn better through an improved value network. This does not need search magic.

@gcp that is the problem: it is not getting more playouts than its current prior!

The 3-4 evaluates as good very early in the search; the problem is that puct is too large.

I get some where it's more, some where it's less, so it seems to be in equilibrium if anything. The ones that have less have got worse evaluations than 4-4, so see my point about the problem being the evaluation, not some feedback loop in the search.

The policy is learning the outcome from the 1600-node search, not from a hypothetical infinite search.

Being slightly worse (not the case in my network) should not mean that low-policy moves get stuck low.

But they aren't, that's what I just said!

  D4 ->     346 (V: 48.47%) (N: 19.15%) PV: D4 D16 Q16 Q4 C17 D17 C16 C15 B15 B14 C14 D15 B13 B16 A14 B17
 D16 ->     306 (V: 48.39%) (N: 17.67%) PV: D16 Q4 Q16 D4 R3 Q3 R4 R5 S5 S6 R6 Q5 S7 S4 T6 S3 C3 D3
  Q4 ->     300 (V: 48.43%) (N: 16.99%) PV: Q4 D16 D4 Q16 C17 D17 C16 C15 B15 B14 C14 D15 B13 B16 A14 B17 R17 Q17 R16
 Q16 ->     277 (V: 48.49%) (N: 15.11%) PV: Q16 D4 Q4 D16 C3 C4 D3 E3 E2 F2 F3 E4 G2 D2 F1 C2 C17 C16
  R4 ->      72 (V: 48.68%) (N:  3.46%) PV: R4 Q16 D4 D16 P4 C3 D3 C4 C5 B5 B6
  Q3 ->      49 (V: 48.51%) (N:  2.59%) PV: Q3 D16 D4 Q16 Q5 C3 D3 C4 C5
 R16 ->      47 (V: 48.12%) (N:  3.13%) PV: R16 D4 Q4 D16 P16 R3 R4 Q3 P3 P2
  C4 ->      46 (V: 48.14%) (N:  3.06%) PV: C4 D16 Q16 Q4 C17 D17 C16 C15
 Q17 ->      45 (V: 48.49%) (N:  2.47%) PV: Q17 D16 D4 Q4 R3 Q3 R4
 C16 ->      43 (V: 47.97%) (N:  3.05%) PV: C16 Q4 Q16 D4 E16 R17 Q17 R16
  D3 ->      35 (V: 47.90%) (N:  2.59%) PV: D3 Q16 Q4 D16 D5 R3 Q3 R4
 D17 ->      34 (V: 47.75%) (N:  2.63%) PV: D17 Q4 Q16 D4 D15 R17 Q17 R16

R4 3.46% -> 4.50%
Q3 2.58% -> 3.06%

Note that the others have inferior evaluations, so it's working exactly as it should.

As I stated, once a move starts evaluating as decent (it does not necessarily have to be better), a feedback loop should increase its policy score. This is not happening; instead, a feedback loop is trying to decrease its policy score, though the Dirichlet noise and the way the search works should stop it from falling to zero.

then a feedback loop should increase its policy score. This is not happening, instead a feedback loop is trying to decrease

This is totally dependent on what the PUCT parameter is set at, because that is what controls the shape of the search tree with relation to the move evaluation.

Low PUCT strongly favors moves with high evaluation. High PUCT favors moves that have not been played a lot or have strong priors.

@gcp Those are the policy scores, not the evaluations. The evaluation scores are high even early in the search.

@gcp exactly which is why I said the PUCT value is too low. I am not talking about any magic, just that.

Edit: sorry too high

Those are the policy scores, not the evaluations. The Value scores are high even early in the search.

They (the evaluations) are lower than the 4-4 moves for the moves that are ending up having their prior reduced, and they are higher for the moves that are ending up having their prior increased.

Again, that's exactly what should happen.

Ok, I see what you are getting at. But in my network the visits at 1600 playouts were less than the policy.

What I am saying is that if the search results show that a low-percentage policy score is actually a good move, its policy score needs to increase through training.

And I just showed you that that is indeed what happens.

exactly which is why I said the PUCT value is too low
it is very slowly due to PUCT being too high.

See my point above about the effect of PUCT. If you lower PUCT, you will reduce exploration because the search will only consider moves that have a high winrate from the very outset. This prevents the value network from learning or the search from discovering new things.

It's not a free lunch. The current value was tuned for strength on a thousand games or so, not 6.

Anyway, I'm not saying a better UCT+FPU formula or parametrisation can't be found. I'm objecting to the idea that there is a bad feedback loop. As you can clearly see above, there is not.

Yea sorry I did not instantly see what you meant by your figures and was too quick to reply.

Also see my branch: lower puct increases visits for lower policy scores that evaluate better, though I guess it would decrease them for ones that evaluate worse, now that I think about it. You are right that there is a balance here, but my branch showed there is a benefit to reducing it (in conjunction with other changes).

The feedback loop was bad on my network that I tested; obviously it depends on a few factors. It is possible that it has become good because the 3-4 point now starts evaluating very well early in the search, but it shouldn't require that. In any case this needs to be checked closely, as it is key.

Note that the others have inferior evaluations, so it's working exactly as it should.

I am not sure if that is good behaviour, if the evaluation is just slightly worse, the policy score should not become drastically worse. I realise my reduction of puct could make this even worse in my branch so I need to look at the data.

if the evaluation is just slightly worse, the policy score should not become drastically worse. I realise my reduction of puct could make this even worse in my branch so I need to look at the data.

It does, because the PUCT factor sits in front of the exploration term. PUCT=0 will only ever explore the best move and never the alternatives.

Yea, but other changes I made may make up for it, I am checking now..

I think another thing to keep in mind is that a lower cfg_puct decreases the effectiveness of the noise.

It does not seem to have made much difference to visit numbers for lower evaluations, as when visits are low the PUCT factor is still dominant.

For my experimental 1.1-Patagonia the visits are higher due to the less than 1 PSA exponent when parentvisits are greater than 800 or so. I need to make sure they are not too high if it were to be used for training games.

35 wins, 29 losses
64 games played.
Status: 0 LLR 0.279518 Lower Bound -2.94444 Upper Bound 2.94444

I am stopping this validation on my v.1.0 branch due to the discussion above. The reduction in PUCT seems to reduce visit numbers for slightly worse evaluations just a little, but enough that training could suffer; a closer check suggests 11.0 is quite well balanced in this respect. In any case it was using up all my CPU and I need to use my laptop for other things.

I'll continue working on my other branches where I can balance visits to policy/evaluation better.

I think https://github.com/Eddh/leela-zero/tree/sqrt_totalvp_clean is still our best contender for something that works for all networks. If that works, then anything more complex to improve the formula can come in a later release.

@Eddh Perhaps making a pull request on next and squashing all the commits into one will make it easier for others to test.

@gcp what do you think about the effect of the reduced amount of visited nodes in the /next branch? Noise can only help so much. Anything with a policy less than 0.7% does not seem to get any visits, compared to 11.0 which searches down to 0.05% (exact values depend on a few factors, but they mostly hold true)

It's good because the program became clearly stronger.

Noise can only help so much.

Dirichlet noise has very good behavior for helping to "find out" unknown things.

The effect on training data I meant.

The effect on training data I meant.

As I said: It's good because the program became clearly stronger, so the training data is better.

I think you must know what I mean, it will almost never visit/play super low policy moves so they will remain blindspots or gaps in its knowledge.

But it already never plays low policy moves! You just arbitrarily took 0.7% as a critical value compared to 0.05%.

Any amount of tuning you do will never get it to play policy moves that have been trained to be arbitrarily low. And they are blind spots because the program learned they were bad.

You are trying to reverse progress (force the program to look at moves it has already learned are bad!) because you think it is not playing or searching to what you in your human intuition feel is natural (4-4 vs 4-3) or what a longer search (which it can't do!) would reveal, and then even in a particular position...

If the training goes for an arbitrarily long time in one direction (eval of 4-4 looking better than 4-3 with 1600 playouts) then the network will train it that way.

That's what the noise is for: to make sure that if the evaluation gets better enough that it can see that 4-3 is really better than 4-4 enough with a small search, that it has a possibility to enter the feedback loop in the positive direction, and so that they will not remain blind spots.

It will play them occasionally because temperature is 1 in the opening, and it can use the difference between seeing if it gets 1 visit or 2-10 to help it train.

You say the programme learned they are bad, but moves that are less than 0.7% sometimes turn out to be very good moves, the network's predictions are far from perfect. Also they may be bad in an earlier version but the new version can make them work.

And yes, I am interested in the opening. The FPU reduction means it only considers 4-4 and 4-3 points. The 3-3 point, which we know AlphaGo thinks is a reasonably acceptable opening, is not even close to getting a single visit because its typical policy score is way less than 0.7%. I am not sure even noise will overcome this; even if it does occasionally, it will not be enough to make its average visit count percentage get close to its policy prior.

Also for what it is worth when watching my version it was not unusual for it to play sub 0.7% policy moves, because the tree search showed that they were very good moves. It is one of the reasons for its strength.

FPU reduction has not been used to produce any training games yet (to my knowledge), so blaming it for 4-4 points (which are clearly a preference the net learned from training games) is a bit silly. It could be because of parent net_eval fpu, it could be because of a 5 block network versus 20 for AlphaGo Zero, but you can't blame fpu reduction for looking at moves the policy network recommends; that is literally what the search algorithm is made for.
I completely agree the current fpu reduction and also my sqrt_totalvp branch have some points to improve, though. But in terms of opening search, I think looking at 0.7% moves is enough, because I'm pretty sure noise can make up for this and more (correct me if I'm wrong though). And there are potentially far more 0.05% policy moves than 0.7% ones, so the cost in playouts to search them is much higher.

My main worry in terms of using fpu reduction for training is that what the net will have to learn will be very different from what it learned. I'm not an expert but we can at least expect it to have problems when the training window will be half fpu reduction half net_eval fpu ?

I am not blaming it for the 4-4 point, it seems it headed down that route earlier on, I believe other Go AI have developed other strong preferences. I think the FPU reduction has been very interesting and has inspired my branch which I have enjoyed the challenge of creating, but presumably the intention is to use it for self-play and we should look at the consequences of that for training data.

@evanroberts85 Is this your code? https://github.com/evanroberts85/evan-leela Can you make it a fork of the main leela-zero code? It's hard for me to see what changes you are making. Also you have *.o and *.d files in there that shouldn't be.

@Eddh I think your main point is valid, but some people (including me) run the latest code from the next branch.

@killerducky Yes, that is my code. I am new to Git (and C++) so I was just guessing at how to use it and was in a rush. I will look into making it a proper fork, which I thought I did, but obviously not.

@evanroberts85 from what I understand you have to fork it on github, then clone it locally, merge your change on a new local branch, then push the branch to the origin remote.

@WhiteHalmos I squashed the commits: https://github.com/Eddh/leela-zero/commit/8ff67a1ec0b7f35c90d7215c74909d78a0f79a55. For the pull request I guess I have to wait for test results from gcp?

Earlier nets had pretty much a 0 on the heatmap for the 3-4 points; current ones have a growing number for them. Isn't this a sign that the 3-4 point is being considered more and more?

The 257aee network played a lot of 3-4 in its selfplay and the current one does as well (even though a bit less). I've also seen games of the current network on CGOS playing 3-4 points and winning.

Anyway, I've been running matches since this afternoon with @evanroberts85's network.
Sadly my results haven't been very good. I'll happily leave this running overnight, but please confirm that I'm using the correct version.

I copied https://github.com/evanroberts85/evan-leela/tree/1.0 today in the afternoon, together with leela-next.

Results so far (with current 54bfb network):

Evan_1.0_1600p v Leela_Next_1600p (68/400 games)
board size: 19   komi: 7.5
                   wins              black         white       avg cpu
Evan_1.0_1600p       24 35.29%       8  23.53%     16 47.06%    250.50
Leela_Next_1600p     44 64.71%       18 52.94%     26 76.47%    183.07
                                     26 38.24%     42 61.76%

Both Leela Next and Evan 1.0 are from GitHub as of this afternoon.

Am I using the correct version of your version, Evan? Did you test your games against current Next? Your results were quite different, and there's also a rather huge difference in used CPU time.

EDIT:

Here's the used command lines as well from Gomill ringmaster:
players = {
'Evan_1.0_1600p' : Player('/home/max/evan-leela-1.0/src/leelaz -g -w 54bfb --noponder -p1600 -r1'),
'Leela_Next_1600p' : Player('/home/max/leela-zero-next/src/leelaz -g -w 54bfb --noponder -p1600 -r1'),
}

Yep, the 3-4 point has grown a lot from the 5x64 network I test on compared to the current network, which shows that at least the training feedback is working reasonably well. I was wrong to doubt that.

Your results are indeed very different. I'll check whether the version of /next I based my version on is missing any important recent changes, whether the version I tested it against had somehow been weakened, or whether I uploaded the wrong version. I get the same nodes per second as my local FPU-0.22 version, but I guess I need to take a reading of the CPU usage directly? I was also running the validation on an old 5x64 network (35df).

Thanks for doing the test. No need to continue it.

@Eddh Might as well make it a pull request so people can post test results there. This thread is getting really long. Pull requests aren't final, they can always be closed and new ones made, just like issues.

Your branch seems to have a lot of numbers guessed and tuned manually. One of the reasons I don't like going for solutions with a lot of numbers like that is that you may end up with something tuned just for the net/playout number you are testing on. I think this can already happen with just the two parameters, puct and fpu reduction, that we have in next, but it probably gets worse as you add complexity. Basically, you are overfitting the algorithm to the net. I am already considering trying std::pow(total_visited_policy, exponent) instead of a simple sqrt, but if it adds another value to worry about, is it worth it, even though sqrt is just a particular case of std::pow(total_visited_policy, exponent)?

I personally think it is better to add extra variables than to use just one (a huge FPU reduction beyond what would seem reasonable) that has a number of side effects you cannot easily control or calculate. So far MaxMaki's tests show it has not quite worked, but I will try again at some point.

Most of the numbers used were based on observations using a spreadsheet; adjusting them a little one way or the other often makes little difference, so I spent almost no time tuning. I expected to go back at the end to clean up and optimise the code, simplifying things where possible.

Really I have only made three changes, the FPU reduction based on policy score and best child (using the parent init as an alternative did not really make much difference but I left it in), using an exponent on the policy score, and varying that exponent based on parentvisit count. I adjusted the Puct coefficient but that gets adjusted anyway, that one did seem to make quite a big difference.

As an example, in the 11.0 version visits drop off naturally down the list of children, usually to 1. With the FPU reduction, as soon as a node is explored its win rate will almost always shoot up, so it will then generally pick up several more visits at a minimum. This can't be optimal, but it does work quite well.

Just using one FPU variable can be enough for our purpose. We can theoretically modulate it with a lot of information, like the number of visited nodes, parentvisit, or other things, without needing 10 more variables to tune. Look at the formula from DeepMind: they could have had fun applying some function with parameters to the winrate (log? pow? exp?) or using an exponent on psa like you did. They could have tried using exponents on every single one of their variables. I would be willing to bet you could tune an exponent on every variable in the puct formula and find a value other than 1.0 that works better. But would it still be better with different nets? With human-trained nets? With another number of playouts?

I think it's the same as with neural networks, which can overfit to their dataset. From what I've read on the subject, the more complicated the NN, the more easily it can overfit.

Look at this old paper on UCT : https://hal.archives-ouvertes.fr/hal-00115330/document
they used a more complicated formula than what DeepMind uses. Their formula obviously wasn't perfect and could be made better. Does that mean you need to add something to it? DeepMind, a company that can hire dozens of people to manually tune dozens of potential variables and if-conditions, chose a formula that is a lot simpler.

My fpu reduction solution makes the algorithm a bit more complicated, which I don't really like. It feels to me like there might be an algorithm as simple as the DeepMind one that takes care of the exploration/exploitation dilemma better. But I'm at a loss. It seems like the papers on the subject just assume you will explore every solution first and start from there. This isn't viable for LeelaZero.

The exponent has a very similar effect to one of the side effects of the FPU reduction, so if the FPU reduction works for all networks so should the exponent.

The FPU reduction using the psa score is a rough guess at predicting the actual winrate based on some very crude debugging data I gathered. This seems quite a reasonable thing to do.

Changing the exponent depending on parentcount is based on the idea to search wider nearer the root. Again the exact value is less important than the idea. If the idea is sound it should work for all networks.

The rest of the variables are just basic maths functions to create fades.

AlphaGo has shown none of this is necessary to create a strong AI, but I do not think they tuned the FPU reduction either.

In the paper they say they used 0 for initializing nodes. But Aja Huang apparently wasn't allowed to say anything about it, which seems weird if the value was really 0. On top of that, given how the equivalent 0.5 is faring for us and the logical problems that come from this FPU (such as different behavior for the winning and losing player), I find it unlikely DeepMind actually used that. Or maybe they did for AlphaGo Zero but optimized it for AlphaZero?

Maybe you could just train a neural network to decide which node to expand next? :-P

Note that AlphaGo Zero used 8 threads with a virtual loss function to encourage exploration. Leelaz has this function implemented, but it is currently not used in self-play and evaluation matchups. It might be worth it to test this method.

@evanroberts85 Haha, that is almost what this project is doing. Proof being the performance of fpu = 0 (our 0) for strong networks.
@mkw1234 Interesting, that could indeed make quite a difference.

Btw I'm not sure I get how to use the validation tool. It seems to be there for comparing 2 networks, not 2 binaries?

You can use the -b parameter to pass two binaries. What I'm currently running is validation.exe -n eddh0eb8 -n next0eb8 -b ".\leelazeddh" -b ".\leelaz" -o "-g -p 1600 --noponder -t 1 -q -d -r 1 -w" -o "-g -p 1600 --noponder -t 1 -q -d -r 1 -w" -g3 -u1 Where next0eb8 and eddh0eb8 are the same network just named so that I can tell which binary gets which wins as the wins are reported per network name.

@evanroberts85 i stopped the match at 160+ games, the results seem quite unchanged from 60+ games.

Evan_1.0_1600p v Leela_Next_1600p (163/400 games)
board size: 19   komi: 7.5
                   wins              black         white       avg cpu
Evan_1.0_1600p       62 38.04%       24 29.27%     38 46.91%    263.99
Leela_Next_1600p    101 61.96%       43 53.09%     58 70.73%    199.89
                                     67 41.10%     96 58.90%

This is eddh's branch so far, match will continue tomorrow.

leelaz-54bf v leelaz-eddh (197/1000 games)
board size: 19   komi: 7.5
              wins              black         white        avg cpu
leelaz-54bf     89 45.18%       43 43.43%     46  46.94%    430.48
leelaz-eddh    108 54.82%       52 53.06%     56  56.57%    431.46
                                95 48.22%     102 51.78%

It might be worth it to test this method.

It's not suitable for our purposes because some of our clients won't work well with > 1 thread. And in any case, it's not about exploration, you could just tune PUCT for that. They need to force exploration (and thus make the search WORSE) because otherwise all 8 threads will block on the same mutex, or even worse, end up in the same node (they use batching). Without that, they would lose a huge amount of efficiency on their TPUs.

Using virtual loss is not a voluntary choice!

So, no, this is not even worth testing. AGZ did not do that because it was good. They did it because they have no other choice.

My main worry in terms of using fpu reduction for training is that what the net will have to learn will be very different from what it learned.

That's ok. If the improvement is smooth, it will make slow jumps up. If it is not, it'll make a huge jump as soon as 0.12 (or whatever) has filled most of the training window.

One thing about tuning the UCT formula: smoothing the network probabilities (or making them sharper, which most people won't dare suggest because it makes the search even less explorative, even though it may be better!) can be done by modifying cfg_softmax_temp, which will have similar effects as modifying the way psa is handled in the PUCT formula.

But I've said before I'd like to keep cfg_softmax_temp at 1.0, because this means the exploration probability that moves get (at the root, after they are played once!) maps directly to the probabilities that the network outputs (it's a direct factor in the formula). It's not clear the policy learning will work as well if this relation is severely adjusted.

This isn't a problem for FPU adjustments, because they don't break that relation. They mostly affect moves with very low priors that only receive 1 playout.

But it's something to keep in mind when replacing the PUCT formula entirely.

@gcp re: virtual loss. A side effect of virtual loss is that the search is made broader (and 'worse'), but it is at least possible that the additional breadth is something they took into account when they tuned c_puct. At any rate, I added it because it was one of the last remaining differences that we have with the paper (AFAICT), and I'd really kick myself if it turned out that that breadth was consequential. My hunch is that it is indeed mostly suboptimal; a paper I found on the matter suggested that it was a good compromise, though.

Prior to doing virtual loss, we had root parallelism (as described in that paper) -- each client played a batch of games, all using the best node each and using that to batch for the GPU. The net result was the same amount of games in total, more or less, but we'd play e.g. 8 games in 2 hours instead of 1 game in 15 minutes. This had some consequences for the latency between a model being trained and games from that model showing up in the window.

I'm aware of that tradeoff, but note that our c_puct is tuned for 1 thread. My point was that forcing exploration through virtual loss is rather suboptimal if you can do it via c_puct too.

Virtual loss is known to be suboptimal, but it's the best we know of if you have to use multiple threads. This is important for when the bot plays, well, real games! For self-play it's not required.

@MaxMaki
I updated the changes from /next of the last week or so into my local branches, but I think the difference in our results is due to you testing on the much stronger 6x128 network rather than the 5x64 network I tested on. The stronger the network, the more it pays to trust its policy output, I guess. If I continue working on my branch I'll make sure to test on the latest network in the future!

Edit: it looks like I also ran at threads=2, which would have affected my results due to the lower puct.

@gcp no disagreement that it is suboptimal; my concern is that the 'suboptimality' is somehow a necessary part of the exploration that c_puct doesn't necessarily capture equivalently

Forgive me if this is something you've already covered in another thread: when you say "our c_puct is tuned for 1 thread" -- do you mean it is tuned for strongest play? Or tuned in some other manner, e.g. to have some measured amount of exploration?

For strongest play. If you tune for strongest play with 8 threads, it's quite possible you'd get a c_puct that favors less exploration.

(That being said, the optimal range was like 0.5...1.5 or so, so it doesn't actually make that much of a difference)

Virtual loss lowers the winrate for the highest moves, so it's closely related to the PUCT factor.

got it, thanks :)

Well, it has been a while since I pulled the repo because this fpu reduction didn't seem like a good idea to me, but I decided to test it myself to make sure. Here are the provisional results (for 9x9 go) of a match between the current /next branch (leelaz-next) and the /next branch dating back to the 20th of january (leelaz, which I'm currently using for self-play and games against humans):

leelaz v leelaz-next (329/400 games)
board size: 9   komi: 5.5
              wins              black          white        avg cpu
leelaz         182 55.32%       101 61.21%     81  49.39%     37.37
leelaz-next    147 44.68%       83  50.61%     64  38.79%     24.86
                                184 55.93%     145 44.07%

For 9x9, it seems the fpu reduction is actually a bit harmful to the level of play. Since it seemed so much better than just parent-eval for 19x19, that kind of surprises me. Of course I wasn't expecting to get the same results, since the puct constant and the reduction were tuned for 19x19, but I was still expecting it to be at least as strong... I'll run a few more tests, but I think I'll put the fpu reduction to 0 in the end to just get back to parent-eval.

@Alderi-Tokori do you have an idea why the CPU load is so much lower on the latest next? Is that due to optimisations or are there differences in settings with the playouts/visits? Given the difference, 45% win rate is not that bad - but if the same optimisations are also possible for the other branch, maybe not.

It might be interactions with tree reuse? Since leelaz has explored more moves, it has a better chance of being able to reuse its tree, while the moves it plays might be unexpected for leelaz-next, which explores less. So leelaz makes use of tree reuse and ends up stronger and slower, while leelaz-next can't, because the moves the opponent plays were not explored. I don't know much about that part of the code, so I could be wrong.

@odeint the settings for both binaries were exactly the same, so I think this is due to code optimisations that might work well for 9x9. I think I'll try a few different fpu reduction values to try and tune it for 9x9, and if I don't find any good value, I'll just set it to 0. Here are the final results of the parent-eval - 0.22 :

leelaz v leelaz-next (400/400 games)
board size: 9   komi: 5.5
              wins              black          white        avg cpu
leelaz         222 55.50%       125 62.50%     97  48.50%     37.09
leelaz-next    178 44.50%       103 51.50%     75  37.50%     24.68
                                228 57.00%     172 43.00%

@Eddh I'm using -v and not -p though, so there's a fixed number of visits per move, and what you describe wouldn't happen. On the contrary, tree reuse with -v makes the program faster (but weaker than -p, since it means fewer playouts per move on average).

Ok that's not it then.

Currently testing an idea I had, rather than reducing fpu by a scalar, use the minimum of the best evals of all visited children: https://github.com/WhiteHalmos/leela-zero/commit/844ebe49ef244e78784e38dedf5f01354986dcad

@Alderi-Tokori How many visits are you using? /next is strong at 1600 playouts but gets relatively weaker the less playouts used, in my experience. My branch (2.0 coming soon) suffers the same.

I'm using -v 1000, but it's supposed to be irrelevant here since both versions (my leelaz and the leelaz-next) use exactly the same parameters, so they're on equal grounds in terms of tree expansion.

@Alderi-Tokori Theoretically you are right, they are on even grounds, but we've found through testing that results at 1600 playouts (~2200 visits IIRC) are very different from 800 playouts or fewer.

If you tried -v 2000 you would get better results, I think; the FPU reduction reduces exploration until the node is visited, which is bad when the tree is small but works well when the tree is bigger, as it can search deeper.

Also 9x9 is more tactical, so you might get slightly worse results as I feel a broader search would be better, but that is very much a guess.

It would be nice if the data dump actually showed max search depth for each move. I noticed in the Alpha Go movie they had this analysis, though naturally they had something much more fancy.

@evanroberts85 We'll see, I'm doing a first tuning run with -0.15, -0.10 and -0.5 to try and find the sweet spot, then I'll fine tune the value further. I've put the puct constant back to 0.85 so the fpu reduction is the only difference between both of the programs in the equation (basically I can consider the old version as having a fpu reduction of 0, and see if I can find a better value).

As far as number of visits is concerned, my bot got to a solid 6d by training with -v 1000 and shows no sign of slowing down yet, so I don't think it matters.

It's a bit sad that this FPU reduction business is so difficult. It delays the next release a lot even though there are quite a few good optimisations already that are independent of it. Us here on Github mostly compile our own versions of course, but it would be nice to upgrade the bulk of clients at some point.

Edit: Would it help to add some of the FPU & pUCT related settings as something the server can request via JSON? Then it only has to be distributed once and isn't hardcoded. Testing and tuning could be faster in that case as well. There could even be a meta-tuning-network at some point which optimises these sorts of parameters and for example time-control algorithms like mentioned in another thread.

@Alderi-Tokori You might compare /next vs /next with just the fpu changed back instead of the Jan 20th version. Since you're seeing a ~30% difference in CPU usage. That's a lot of computation time.

Or in other words, you could try 30% more visits or playouts using the latest version so they use equal CPU and see which is stronger given equal processing time.

@roy7 From experience, it seems more games is better for training than a little more playouts in less games, this is what made me change from -p 1000 to -v 1000 originally, which almost tripled my game output without any apparent slowdown in strength gain. So if anything I'll just stay at -v 1000 but generate more games.

@Alderi-Tokori I think you need to test same codebase vs itself, and vary the FPU tuning only. I don't know which versions you are comparing, but it probably includes also removal of TTable. That change is known to make it run faster but weaker for the same number of playouts/visits, but make up for it if you run on equal time.

@killerducky Ah I see, that could explain it. Regardless, I think I've found a good fpu reduction value, I still have to wait for the 400 games to be played in order to be sure, but I think I'll be able to have the same strength as my current leelaz. So no need to change the number of visits.

Given @Alderi-Tokori's experience for 9x9, namely more games with no slowdown in strength gain, maybe it is worth considering -v 1600 instead of -p 1600 to get the speed boost?

@WhiteHalmos I agree with the concept, but didn't see any positive response to it so far. Maybe it will be reconsidered with the change to 10 blocks.

Yeah, agreed. When speed goes to a crawl, then change happens. 😁

@WhiteHalmos Remember that what works for 9x9 might not work as well (or at all) for 19x19.

Tried alternating between best/worst child evals depending on whether current best eval is winning or losing. Didn't work. :)

leelaz-best-worst_66 v leelaz-next_66 (207/400 games)                                                            
board size: 19   komi: 7.5
                       wins              black          white        avg cpu
leelaz-best-worst_66     24 11.59%       14  13.46%     10   9.71%    197.46
leelaz-next_66          183 88.41%       93  90.29%     90  86.54%    161.98
                                         107 51.69%     100 48.31%

@eddh @WhiteHalmos I was thinking of people like you when I wrote https://github.com/gcp/leela-zero/issues/860. :)

This is a simple dynamic parent eval vs next (before @Eddh's PR was merged, congrats!):
cd is cde9c8d4, elo 7957

leelaz-parent-eval_cd v leelaz-next_cd (222/400 games)
board size: 19   komi: 7.5
                        wins              black         white        avg cpu
leelaz-parent-eval_cd    107 48.20%       47 42.34%     60  54.05%    236.13
leelaz-next_cd           115 51.80%       51 45.95%     64  57.66%    205.61
                                          98 44.14%     124 55.86%

Started testing dynamic parent eval against 5ac1d898 (elo 7998) and 1ac3 versus current next after rebasing (not yet pushed rebase to my repo but it's largely the same).

Advantages include:

  • Simpler to understand and reason about, without more tuning parameters.
  • Improves in FPU prediction as the search progresses rather than being a static value throughout.
  • Vastly better for flatter (weak) policy networks, indicating generalization is better across nets.
  • Allows memory reduction per node since we can remove m_init_eval

Disadvantages:

  • Not as fully tested out as current next
  • Nothing to tune
  • Only about even for current (strong) policy networks

Interesting. Some time ago I tried something like get_eval - 0.2; it was about even with parent's net_eval - 0.2 too. I didn't think you could get something similar with raw get_eval, but apparently I was wrong. The fpu reduction difference seems a lot smaller when using dynamic fpu solutions.
Tests for a branch based on average visited child evaluation are running for me too. Last time I looked it was about even at 56/54 against next, both on 6665 and 600 playouts.
Edit :
92 wins, 100 losses
192 games played.
Status: 0 LLR -1.79283 Lower Bound -2.94444 Upper Bound 2.94444

I stopped here because, taking another look at @WhiteHalmos's results with a simpler solution, I think using get_eval(color) as a starting point for fpu seems better. It doesn't even make much sense to me, since with get_eval(color), one uniquely good move in a bad position would raise fpu "unjustly", it being only one good move among many bad ones. But get_eval always seems to get decent results. Even back when I did something like get_eval(color) - 0.2 it was about equal with -0.2 according to @gcp. So get_eval somehow seems like the more solid option for dynamic fpu to me. I will try to do something with it.

Disadvantages:

  • ...
  • Nothing to tune
  • ...

Hahaha! :-D

Yeah it may not be a disadvantage xD
Though "nothing to tune" is often just an illusion. Nothing is stopping you from taking out 0.2 from any kind of fpu :-)

Btw, @evanroberts85 are you still trying new things? I'm thinking of using something based on the log of the number of visited nodes; how did that fare when used alone with net_eval for you?

I think a good litmus test for new search algorithm (including fpu) ideas is to run 100 games with the current best network and with a weak network. The new version must win at least 43/100 games against the best network and at least 58/100 games against the weak network.

>>> sprt.SPRT(42,1,58,-35,35,0.05,0.05)                                                 
'H0'
>>> sprt.SPRT(43,1,57,-35,35,0.05,0.05)                                                 
''
>>> sprt.SPRT(57,1,43,-35,35,0.05,0.05)
''               
>>> sprt.SPRT(58,1,42,-35,35,0.05,0.05)  
'H1'

Afterwards in a thorough test of 400 games it should win at least 193/400 games against the best network and at least 208/400 games against the weak network.

>>> sprt.SPRT(192,1,208,-35,35,0.05,0.05)
'H0'
>>> sprt.SPRT(193,1,207,-35,35,0.05,0.05)
''
>>> sprt.SPRT(207,1,193,-35,35,0.05,0.05)
''
>>> sprt.SPRT(208,1,192,-35,35,0.05,0.05)
'H1'
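For readers without the `sprt` module at hand, here is a minimal sketch of a symmetric SPRT on Elo bounds. The function name `sprt_elo` and the draw-free two-outcome model are my assumptions, so it won't necessarily reproduce the exact intermediate LLR values quoted in this thread, but it agrees with the decisive `'H0'`/`'H1'`/`''` calls above:

```python
import math

def sprt_elo(wins, losses, elo0=-35.0, elo1=35.0, alpha=0.05, beta=0.05):
    # Win probabilities under the two hypothesized Elo differences.
    p0 = 1.0 / (1.0 + 10.0 ** (-elo0 / 400.0))
    p1 = 1.0 / (1.0 + 10.0 ** (-elo1 / 400.0))
    # Log-likelihood ratio of H1 (new version >= elo1 stronger)
    # versus H0 (new version <= elo0 stronger).
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    lower = math.log(beta / (1 - alpha))   # ~ -2.944 for alpha = beta = 0.05
    upper = math.log((1 - beta) / alpha)   # ~ +2.944
    if llr >= upper:
        return 'H1'  # accept: new version is stronger
    if llr <= lower:
        return 'H0'  # accept: new version is not stronger
    return ''        # keep playing games
```

For example, `sprt_elo(58, 42)` returns `'H1'` and `sprt_elo(43, 57)` returns `''`, matching the 100-game thresholds above.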

A weak network that we've all been frequently testing against is 1ac2 (elo 543), so that would probably be a good one for comparable results. Current best changes over time.

I don't think we need to keep testing against weak networks with fpu_reduction = 0 since we're just looking for something that is:

  • As good (within 35 elo, 95% confidence) as current algorithm for best network
  • Definitively better (35+ elo, 95% confidence) than current algorithm for weak networks

I tested my average value branch https://github.com/Eddh/leela-zero/commit/a0c9ede5fecc1e400008df0a7a831434ac789003 again, on c83 (best 5-block net), with 800 playouts this time, and I got an SPRT pass:

The first net is better than the second
networks v networks ( 151 games)
              wins        black       white
networks   94 62.25%   49 61.25%   45 63.38%
networks   57 37.75%   31 38.75%   26 36.62%
                       80 52.98%   71 47.02%

(the second one being next before sqrt_totalvp)
So taking into account the result on 6665 at 600 playouts, which isn't bad:

92 wins, 100 losses
192 games played.
Status: 0 LLR -1.79283 Lower Bound -2.94444 Upper Bound 2.94444

I think this fpu method has potential. I will do tests with a decaying fpu version in the hope of getting something good for weak networks.

@Eddh nice experiments. You might want to merge in the fix to m_root here: https://github.com/gcp/leela-zero/pull/872/files/85b6dc91bc7018d1e402e43aa27ffaa9fed807a5#diff-70aab7683fe64f9428b8bc47d1d183c0R491

@WhiteHalmos I think it doesn't affect my branch? I'm not sure if it affects branches not using the parent's get_eval. Anyway, neither the next I'm testing against nor my branch has this fix.

I think this not only affects your branch, it affects all our branches since it effectively didn't propagate the very first visit's value network eval (assuming no tree reuse).

Edit: Thinking about it some more, you are right, since get_eval for the root is never considered nor used otherwise. :)

I feel like it will help the chances of getting a good parent-value-based solution though. Maybe it is also part of the reason parent value fpu seems to work better in minigo.

True, although even without this a parent value based solution already outperforms next: https://github.com/gcp/leela-zero/pull/866#issuecomment-365745035

Edit:
```
sprt.SPRT(61,1,46,-35,35,0.05,0.05)
'H1'
```

Yeah that's promising. (I have to stop accidentally clicking the close button)

Haha, it's okay, the button is literally next to the comment button. :)

I'm restarting the training from my net n°42 (the last one before I pulled fpu reduction, at 505k games) with FPU reduction set to 0.

When I'm back again at around 920k games, I'll set up a match between the best network trained with FPU reduction = 0 and my current best trained with dynamic FPU reduction = 0.15, and I'll see which one is stronger.

It will take over 2 weeks before I have the results, but at least we'll have a better idea of whether FPU reduction worsens training "speed" (i.e. rate of progression) or not.

It will be very interesting to finally have data like that. Thanks for the effort !
In the meantime I'm testing yet another fpu calculation method which shows promise and basically just tries to more accurately predict the expected winrate of an unexplored move. I should have some preliminary results on acf9 tomorrow. I have tried and failed so many times to make any improvement that I don't get my hopes up though xD

@Alderi-Tokori Does fpu reduction 0 mean back to parent init eval or parent dynamic eval? Dynamic eval had the problem of virtual loss being a pseudo reduction.

@WhiteHalmos Parent dynamic eval I guess? I just use the --fpu_reduction 0.0 argument, are you telling me there's a bug with that? I'm using -t 1 so as far as I know virtual loss shouldn't have any effect anyway, right?

Oh, that should be fine, that is parent init eval. I thought you changed the code for 9x9. Looking forward to a PR for different sizes so people can test with 5x5 or 7x7.

Well, the code is changed for 9x9 but it mainly affects the OpenCL and Network modules. And yeah, I definitely have to get around to that PR.

The results for this branch : https://github.com/Eddh/leela-zero/tree/fpu_estimation

130 wins, 117 losses
247 games played.
Status: 0 LLR 0.0535582 Lower Bound -2.94444 Upper Bound 2.94444

Using this command :
validation -n networks\af9c_true_fpu_estimation.txt -n networks\af9c_next_sqrttvp.txt -b "true_fpu_estimation\leelaz" -b "next_sqrttvp\leelaz" -o "-g -p 800 --noponder --fpu_reduction 0.3 -t 1 -q -d -r 3 -w" -o "-g -p 800 --noponder --fpu_reduction 0.25 -t 1 -q -d -r 3 -w" -g1 -k "sgf_true_fpu_est_m03_nextsqrttvp_p800_af9c"
All in all it seems like a very small improvement. Needs more tests on other nets though.

The idea behind this fpu method is that if you have an explored move with a certain value estimation, you can use it to better guess what an unexpanded move's value would be, based on the difference in policy between the expanded and the unexpanded move. So if you have a move with a current action value estimation of 0.5 and a policy of 0.2, and the top unexpanded move has a policy of 0.1, you can expect the unexpanded one to be slightly worse than the expanded one. I use a simple formula to try to model that: cfg_fpu_reduction*sqrt(policy difference). With the example values and a cfg_fpu_reduction of 0.3, that gives 0.3*sqrt(0.1) ≈ 0.095, meaning we would expect the 0.1-policy move to have a winrate around 0.4.
It gets a bit more complex because then, in order to use the information from all the explored move, I do a weighted average of this formula for all explored moves, and use the net_eval of the parent in this average too.

So it is a relatively complex solution, but it seems to behave quite logically and to have decent results for now.
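The scheme above can be sketched roughly as follows. The helper name `fpu_estimate`, the visit weighting, and the weight given to the parent's net_eval are my guesses, since the comment doesn't specify them:

```python
import math

def fpu_estimate(parent_net_eval, explored, unexpanded_policy, fpu_reduction=0.3):
    # explored: list of (action_value, policy, visits) for expanded children.
    # Each explored child predicts the unexpanded move's value from the
    # policy gap: value - fpu_reduction * sqrt(policy - unexpanded_policy).
    # Predictions are averaged, weighted by visits, with the parent's raw
    # net eval included as one extra prediction of weight 1 (an assumption).
    predictions = [(parent_net_eval, 1.0)]
    for value, policy, visits in explored:
        gap = max(policy - unexpanded_policy, 0.0)
        predictions.append((value - fpu_reduction * math.sqrt(gap), float(visits)))
    total_weight = sum(w for _, w in predictions)
    return sum(v * w for v, w in predictions) / total_weight
```

With no explored children this falls back to the parent's net eval, and a lower-policy unexpanded move always gets a lower estimate than a higher-policy one, matching the intuition in the comment.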

Looked over this old MoGo paper tonight, and one thing they tried, but didn't test deeply enough to publish results on, was using the value of a move from a parent or grandparent's position, with the thought that a good move one or two moves ago is likely still a good move now.

That is, not the parent's eval itself, but the eval of move X at the parent or grandparent's level as the fpu estimate for the move now. Which does make some intuitive sense to me. Imagine various opening moves, or areas to attack, where you do something locally in one area but the other area has barely (if at all) changed. So the moves you liked best in that area of the board should still be the same now as 1 move ago.

But I can also think of situations where this might totally backfire too.

I'm wary of ad hoc tweaks, even if they make intuitive sense at first glance. I'm thinking that there must be a more formal/theoretical solution. However, I still haven't done enough reading to have enough background knowledge to give any solid suggestions.

But my hunch is that choosing the FPU calculation is very much like choosing the 'alpha' and 'beta' priors of a Beta Distribution, namely that you can arbitrarily decide they should start at 1 and 1, because that intuitively makes sense, which is what Laplace did (essentially), or you can use a more theoretical approach to come up with a more theoretically justified initialization of 1/2 and 1/2 as Jeffreys did (see https://en.wikipedia.org/wiki/Jeffreys_prior#Bernoulli_trial). But then you can even go further and use the constraint alpha + beta = 1, and choose alpha and beta such that the prior probability given by the Beta distribution equals whatever prior makes the most sense (rather than being strictly equal to 1/2).
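As a concrete illustration of how the choice of Beta prior shifts the estimate (a generic Bayesian example, not code from leela-zero):

```python
def posterior_mean_winrate(wins, games, alpha=1.0, beta=1.0):
    # Mean of the Beta(alpha + wins, beta + losses) posterior after
    # observing `wins` wins in `games` Bernoulli trials.
    return (wins + alpha) / (games + alpha + beta)

# Laplace prior (1, 1):      3 wins in 10 games -> (3 + 1) / (10 + 2) = 1/3
# Jeffreys prior (1/2, 1/2): 3 wins in 10 games -> 3.5 / 11  ~ 0.318
# With no data at all, both priors give 0.5.
```

The point mirrors the FPU question: the "right" initialization is not obvious from intuition alone, and different theoretically motivated priors give measurably different early estimates.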

To really figure this out, what we need to do is identify, from the literature, what exactly the fpu number is supposed to represent (e.g. 'prior expected winrate'?, 'prior expected probability of exploring this move'?, etc.), and then choose a method of calculating it that is a theoretically 'best' or at least 'best we can manage' approximation of that quantity.

My current hunch (and it's only a hunch) is that parent->get_eval() is the most accurate prior, and any reduction from that is theoretically 'incorrect', even if it happens to give us an apparent 'boost' in the short term. But I can't justify that because I haven't done enough reading, and it's quite possible that I'm missing something important. But if FPU should be 'reduced' in some way, then it should have to do with what FPU is actually trying to measure, not just some arbitrary (albeit tuned) constant. IMHO.

I'm concerned about the results from 9x9 where, after 20+ generations of "improvements" following fpu reduction, there was actually no noticeable improvement over the base. ☹️

Edit: Perhaps fpu reduction should only be used in matches.

I would also be concerned about that, but I think it's equally possible that this problem observed by @Alderi-Tokori was caused by training new nets only from games from a single network. This may be much more susceptible to producing a rock-paper-scissors effect than training on a window, especially when the neural net is already close to its performance limit for the network size. So, the effect may have been linked to FPU reduction, but it may also have been a completely different reason.

@jkiliani That's assuming a rock-paper-scissors dynamic can actually exist in go, which we don't know, and which is unlikely in my opinion, especially at high level (high amateur dan). Regardless, we'll know more when I compare the results of both training methods. If they both end up equally strong, or the FPU reduction one is stronger, then what I'm observing might indeed come from something else (and at that point my choice of only using the latest net's games will be the most likely suspect).

Rock-paper-scissors would be the extreme case, but I think the self-play rating inflation is rather conclusive proof that in general, win rates are not transitive. So far, only isolated tests have been conducted between nets that were several generations apart, so cgos ratings are unfortunately the closest data we have to a real rating progression.

We could always use more anchors running on CGOS. 😊

Edit: Perhaps fpu reduction should only be used in matches.

Hmm. That seems like a reasonable possibility. Still, I would prefer some theoretical reasoning/justification for that, if it were up to me.

@wctgit There has actually been very little data on the strength of fpu reduction with randomization on. I will do more extensive tests to be sure. It may actually turn out not to be an improvement with -m 30 and -n

So yeah, it is SPRT-verified that fpu reduction keeps its strength when there is randomization:

The first net is better than the second
networks v networks ( 58 games)
              wins        black       white
networks   45 77.59%   22 78.57%   23 76.67%
networks   13 22.41%    6 21.43%    7 23.33%
                       28 48.28%   30 51.72%

So at least, if somehow training games using fpu reduction end up producing bad data, that will mean we need different settings between training and match/practical use. Which would be quite problematic.

@Eddh I don't see how that would be problematic, timemanage already works like that.

I was talking more about UCT parameters in particular.

After thinking about the idea of subtracting a constant from a probability, there are normally few real-life cases where that actually makes sense (although see note below about 'negative probabilities'), due to the nature of the axioms of probability theory. The usual exceptions are when dealing with mutually exclusive events and/or the unions/sums of mutually exclusive events. E.g. There's the addition law of probability, that:

P(A v B) = P(A) + P(B) - P(A ^ B)

However, those situations rarely involve a constant value for some probability P(A ^ B).

Therefore, I'm making the argument that the fpu reduction expression:

parent->get_eval() - some_constant

is fundamentally wrong as it does not seem to correspond to any reasonable version of subtracting a mutex event from a sum of mutex events. It's just a crude 1st order correction term that has to be tuned to some 'optimum' value, but which is not theoretically sound in any case.

I propose an alternative, that will also be a tuned constant, but at least will fit within the framework of probability theory. Multiplying probabilities, especially independent or conditional probabilities, is a very common operation.

P(A ^ B) = P(A) * P(B | A)

Or if A and B are independent variables, then that simplifies, because by independence, P(B | A) = P(B), so:

 P(A ^ B) = P(A) * P(B). 

The main point is that multiplying probabilities is much more natural than subtracting some constant from a probability. When you multiply any two numbers in [0, 1], you always end up with another number in [0, 1], whereas if you subtract a constant in [0, 1] from a variable in [0, 1], you have the potential to end up with a negative result, which usually does not make sense. And even if you use a minimum lower limit of 0, you will just as often end up with events which have 0 probability, which is just not that common of an occurrence in real life, and should be generally avoided in most machine learning applications. Multiplying two numbers in [0, 1] rarely gives a value of 0 unless one or both numbers are 0 themselves (or for floating point approximations, they are two very small numbers very close to 0).

Therefore, I propose a new expression for fpu reduction,

parent->get_eval() * some_constant

where some_constant is restricted to be within [0, 1]. Perhaps, for the sake of making optimization easier, it can actually be transformed from a float in [-oo, +oo] via a sigmoid transformation s = (1 / (1 + e^(-x))), which maps the Reals onto [0, 1]: when x = -oo, s = 0; when x = 0, s = 1/2; and when x = +oo, s = 1. [Perhaps such a transformation is not necessary for the optimizer; it's just an idea.]

I expect that if we were to adopt this strategy, we would end up with a constant of multiplicative fpu reduction somewhere around 0.3 to 0.8, probably closer to 0.5 to 0.7. This is just a ballpark estimate based on the current constant subtractive fpu reduction of around 0.22ish.

By adopting multiplicative reduction rather than subtractive, we will very likely avoid many/most/perhaps-all of the problems we seem to be running into with negative or zero probabilities. Instead of 0.1 - 0.22 leading to -0.12 (or 0.0 if limited by minimum of 0), we would end up worst case scenario with 0.1 * 0.3 leading to 0.03, which is still a possible choice for the tree search. Even 0.01 * 0.3 = 0.003 which, while very low, has an expectation of more than 1 visit if given 1600 or 2600 tries at it (using a naive assumption that fpu is treated exactly as a probability, which I understand is not actually the case; just a ballpark example).
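The two schemes side by side, using the worst-case numbers from the example above (the constants are illustrative, not the tuned values, and this is a sketch of the proposal rather than leela-zero's actual code):

```python
def fpu_subtractive(parent_eval, reduction=0.22):
    # Current scheme: subtract a tuned constant, clamped at 0.
    # A low parent eval can be driven all the way to zero.
    return max(parent_eval - reduction, 0.0)

def fpu_multiplicative(parent_eval, factor=0.3):
    # Proposed scheme: scale by a constant in [0, 1].
    # The result always stays in [0, 1] and is only zero
    # when the parent eval (or the factor) is itself zero.
    return parent_eval * factor

# Worst case from the text: parent eval 0.1.
# Subtractive: 0.1 - 0.22 -> clamped to 0.0 (move effectively never chosen).
# Multiplicative: 0.1 * 0.3 -> 0.03 (still a possible choice for the search).
```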

On the surface, if we tune this constant to the current networks, I expect a temporary slight drop in performance, but only slight. That is because I believe the current networks are already 'adapted' (in an evolutionary sense) or 'optimized' (if you prefer not using evolutionary metaphors) to the current subtractive fpu reduction constant. However, I expect that this slight drop will quickly be regained, and the new networks will produce stronger training games, especially in those unusual edge cases such as self-ataris, big dragons dying, and perhaps even ladders.

As a possible alternative to straight-out replacement of the subtractive constant, you could simply introduce a new multiplicative constant, initialize it to 1, and then perform a combined tuning on both constants, i.e.

parent->get_eval() * mult_const - sub_const

If my speculation is correct, you should expect to see a slight performance bump initially (since you are adding an extra degree of freedom to the optimization process), but the sub_const will probably reduce from 0.22ish to some value significantly closer to 0, such as perhaps 0.1. And subsequent tunings should later see this subtractive constant continue to decrease closer to 0 over time, while the multiplicative constant should stabilize to some number not close to 1 (or maybe both will converge to 0 and 1 respectively over time, if it turns out that fpu reduction from parent->get_eval() happens to not have been a good idea after all).

Personally, I'm confident enough about my reasoning based on probability theory, that I would jump straight to replacement of subtractive to multiplicative right away, but I could easily understand anyone's caution and wanting to try using two variables first. As a bonus, using two variables should produce a short-term slight increase in network strength, after combined tuning, so I think trying it out has few risks. (There is a slight/small risk of over fitting, but we already have that with the subtractive constant. Adding a second constant probably won't increase that risk very much, if at all. It could actually decrease over-fitting risk, if my hypothesis is correct, and subtraction is fundamentally the wrong operator.)

[Note about negative probabilities: I was going to say something like 'negative probabilities don't make any sense', but decided to check on that, just to be sure. It turns out that there is actually a sensible case to be made for [negative probabilities](https://en.wikipedia.org/wiki/Negative_probability). And who knows, it might actually be useful at some point for this project. However, I strongly doubt that it is applicable to the case of fpu calculation.]

Isn't a multiplicative factor almost equivalent to changing the softmax temperature? I'm not sure about the theoretical foundations of FPU reduction, but practically it turned out to do something which softmax tuning couldn't. As usual, we need data ^_^

@odeint I think your reasoning has merit but there are some assumptions that don't necessarily match reality. Mainly, using probability theory to reason about the action value. Here it is a winrate, so a probability, that is true, but UCT search isn't based on using a probability as action value. Originally it is just a reward from looking at a move. So the action value isn't necessarily a winrate. It can be anything. So applying probability theory to it seems like it might not be valid to me.

@Eddh While I understand your point, and it is mostly valid, you must admit that the _input_ to the 'action value' (in this case, an expected winrate) starts out as a probability, so at least it makes sense in terms of adjusting that probability by some other independent probability.

Thus, if you think about it as adjusting the 'expected winrate' rather than as adjusting the 'action value', even though in this particular case the 'action value' just starts off as equal to 'expected winrate', then at least it can be made sense of as operating on a probability, hence probability theory would apply.

[And also, even if we only think in terms of 'action value', it turns out that 'action values' less than or equal to zero cause a move to never be chosen, right? In that case, the 'action values' are acting like probabilities anyway, right? (Or maybe I'm mistaken about this part?)]

Action values that are less than zero can still be chosen because of the policy pushing them up basically. And I'm still not sure probability theory can apply in this case. Deepmind used action value between -1 and 1, so already we are out of standard probabilities that are from 0 to 1, even though you can do a mapping between them. Then with Alphazero in chess, they used -1 for loss, 0 for a draw, 1 for a win. What does 0 represent then ? Certainly not that you have half a chance to win. It could be, but you could have an absolutely certain draw instead. That means it isn't a direct probability. Just a score, a value, assigned to moves to rank them.

Action values that are less than zero can still be chosen because of the policy pushing them up basically.

But if the policy is too small, the result might still be less than or equal to zero, right? In which case, it won't be chosen, right? (Or again, maybe I'm wrong on that. Just trying to confirm my understanding.)

Then with Alphazero in chess, they used -1 for loss, 0 for a draw, 1 for a win. What does 0 represent then ? Certainly not that you have half a chance to win. It could be, but you could have an absolutely certain draw instead.

I think if you consider it in terms of _expected_ winrate, i.e E( WR ), then yes, a draw could be considered an expected winrate of 1/2.

Imagine playing N --> oo games, where half of the games are sure losses, and half are sure wins, that would equal E(WR) = 1/2. And if all the games were sure draws, that would also equal E(WR) = 1/2 (for the purposes of scoring). In other words, the value of N/2 sure-wins + N/2 sure-losses is equal to N sure-draws.
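The equivalence above can be stated as a tiny scoring function (draws counted as half a win, i.e. the -1/0/+1 convention rescaled to [0, 1]; a hypothetical helper for illustration, not leela-zero code):

```python
def expected_score(wins, draws, losses):
    # Draws are worth half a win, so N/2 sure wins + N/2 sure losses
    # scores the same as N sure draws.
    n = wins + draws + losses
    return (wins + 0.5 * draws) / n

# expected_score(5, 0, 5) and expected_score(0, 10, 0) both give 0.5.
```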

Also consider the sequence 0, 1, 0, 1, 0, 1, ..., which can be considered (given certain assumptions) to 'converge' to 1/2, even though intuitively and in normal circumstances it is considered non-convergent.

So, I'm starting to wonder, where did the idea to use parent->get_eval() (originally, parent->net_eval) come from? It was an alternative to using 1/2, right? Why use a number representing expected winrate in the first place? Is it roughly equivalent to what DeepMind did, or are we jumping to a conclusion about that?

I think if you consider it in terms of expected winrate, i.e E( WR ), then yes, a draw could be considered an expected winrate of 1/2.

Except it's probably more accurate to see it not as an expected winrate, but as an expected score. How can you have an expected winrate of 1/2 when you are sure to draw? The expected winrate should be 0, by my definition of winrate at least.
Basically my point is that the UCT algorithm treats these numbers as a score, a reward to pursue, not as a probability of winning.

Alright, my no-reduction vs reduction experiment is over. The no-reduction net finally beat the best fpu reduction net, but it did so with roughly the same number of games, so at the very least FPU reduction doesn't seem harmful to the rate of progress. Though Elo seems to be more inflated with FPU reduction.

That's a very interesting result, because even though fpu reduction produces higher-level games, it doesn't really help the network become better. So this seems to prove there is more to generating good-quality self-play games than strength.

My current hunch (and it's only a hunch) is that parent->get_eval() is the most accurate prior, and any reduction from that is theoretically 'incorrect', even if it happens to give us an apparent 'boost' in the short term.

I explained why I believe this is wrong in the commit comment when I added FPU reductions.

The parent->get_eval() is a very accurate prior...for good or the best moves only. For moves that are suboptimal, we expect our evaluation to drop compared to the parent. FPU reductions (with policy summing, as we use now) drop the prior according to how "suboptimal" the move is, and so it behaves very well in that regard.

If the parent eval is 0.5, and I play the worst possible move, the prior on the evaluation of my position is likely NOT that it is even but that I am now worse.
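For context, FPU reduction with policy summing can be sketched like this (a simplified model of the idea described above; the exact expression in leela-zero's source may differ):

```python
import math

def fpu_with_policy_summing(parent_eval, visited_policy_sum, fpu_reduction=0.25):
    # Unvisited children start from the parent's eval, reduced by an
    # amount that grows with the policy mass of already-visited siblings:
    # the more of the "good" policy has been explored, the more likely
    # the remaining moves are suboptimal, so their prior drops further.
    return parent_eval - fpu_reduction * math.sqrt(visited_policy_sum)
```

Before anything is explored the prior is exactly the parent's eval; once most of the policy mass has been visited, the remaining (low-policy, likely bad) moves get a noticeably lower first-play urgency.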
