Stockfish 🚀 - Negative regression???

Look at the number of games taken for most of the passed elo gainers. Totally unreliable, imho.

joergoster on 28 Jan 2020

SPRT elo estimates are not accurate, but the confidence of superiority is very good. It's designed to stop once there is a very high confidence it doesn't regress and a high confidence it progresses.

The regression test is far from finished, but at the very best it will be something like +1 or +2 elo. That's bad news.

Alayan-stk-2 on 29 Jan 2020

👍3

It would be valuable to repeat the RT with just the elo gainers after sf11 (master - (simpl + non funct)). In the past a couple of tests of this kind showed no problem, but that doesn't mean there will never be a problem. Now the bounds and statistics are also different.

NKONSTANTAKIS on 29 Jan 2020

Another interesting RT would be with the noob3 book, as the optimization we do for it might not correlate that strongly with 8-moves performance. If that's the case, it's probably a good idea to do RTs on noob3, for better tracking of progress, more accuracy and less panicking.

NKONSTANTAKIS on 29 Jan 2020

👍1

Another thing to examine is 0,2 LTC possibly being too easy, allowing too many false positives through the sheer number of tests. We basically rely on a single test; the current STC adds very little LTC confidence.

NKONSTANTAKIS on 29 Jan 2020

With the current settings, the probability of a <0 elo patch passing LTC is at most 5% according to https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0&elo-1=2&draw-ratio=0.61&rms-bias=0. NKONSTANTAKIS made a good point about the book being different. If it is indeed the book making a difference then it's gonna be fun...
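The 5% figure can be reproduced with the usual Brownian-motion (Wald) approximation of the SPRT. This is only a sketch, not the calculator's code: it assumes alpha = beta = 0.05 and that the expected score is linear in Elo over such a small range, and the name `sprt_pass_prob` is made up for illustration.

```python
import math

def sprt_pass_prob(elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate chance that a patch of true strength `elo` passes SPRT(elo0, elo1)."""
    lower = math.log(beta / (1 - alpha))   # about -2.94
    upper = math.log((1 - beta) / alpha)   # about +2.94
    # Normalised LLR drift: -1 at elo0, 0 at the midpoint, +1 at elo1
    # (assumption: score is linear in Elo near 0).
    theta = 2 * (elo - (elo0 + elo1) / 2) / (elo1 - elo0)
    if abs(theta) < 1e-12:
        return -lower / (upper - lower)
    return (1 - math.exp(-theta * lower)) / (
        math.exp(-theta * upper) - math.exp(-theta * lower))

# Pass probability at LTC (0, 2) for a few true Elo values.
for elo in (-1.0, -0.5, 0.0, 1.0, 2.0):
    print(f"{elo:+.1f} Elo -> {sprt_pass_prob(elo, 0, 2):.3%}")
```

In this approximation the draw ratio only changes how long a test runs, not its pass probability.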

Sopel97 on 29 Jan 2020

If we only look at the 15 tests between SF11 and 200128,
3 things stand out, and they all happened in the 5 most recent committed tests

a) bench jumped from 4725546 (test no 14) to 5545845 (test no 15) on 200128

b) 2 LTC finished before 10000 games (test no 12 and test no 14)

c) http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1
(test no 11) is currently struggling, although it already passed a STC regression
http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1

I therefore suggest to test SF11 against test no 10, which is
https://github.com/official-stockfish/Stockfish/commit/6d0eabd5fe2961551477820ab7619e2c31e01ffd

and then we can continue bisecting up or down from there.

Rocky640 on 29 Jan 2020

👍2

https://www.sp-cc.de/ tested Stockfish 11 official 200118 playing against a pool of AB engines and compared with Stockfish 200107 playing against same pool of engines, at 180+1

He says "AB-testrun of Stockfish 11 finished. Sadly a clear regression to Stockfish 200107 (-7 Elo)"

This is only one result, but there is a puzzling coincidence:
the first commits whose tests used the new pentanomial model, the 3-moves book and the new SPRT bounds were made on January 7. Maybe the problem started there?

Rocky640 on 29 Jan 2020

The error bars with 5K games against significantly weaker opponents do not allow us to tell with a reasonable degree of confidence that SF11 is inferior to SF 200107.

Alayan-stk-2 on 29 Jan 2020

👍1

@Rocky640 That is possible, but certainly would be surprising...our own regression testing showed a slight single-threaded gain from January 7 to SF11. While that is close-to-within error bars, we should have detected a 7-Elo regression...

31m059 on 29 Jan 2020

👍1

Should we temporarily disallow any new patch submissions to fishtest and stop all currently running ones, so we can focus our attention on this issue, since we'd probably want to run a bunch of regression tests to pinpoint the problem? Seems like this could have some pretty big consequences on the future of SF's improvement.

adentong on 29 Jan 2020

The fishtest server has a priority field (which could be increased for regression tests, if those are a priority).

ddugovic on 29 Jan 2020

let's look at this carefully, but without rushing ...

first, the results so far (RT not yet finished) are consistent with NCM https://nextchessmove.com/dev-builds where no single patch stands out.

second, as next thing I will test with the book which is now being used for development, out of curiosity to see if this matters.

vondele on 29 Jan 2020

👍1

I don't think it will be necessary to stop all other ongoing tests, or to elevate priority...those are drastic measures. We can use throughput for a "softer" approach.

But since the new throughput calculation prefers positive LLR, @snicolet's non-regression verification test is going to lose workers as it progresses towards LLR = -2.94 (the more informative result, since failure may mean reversion of that commit). Therefore, I've artificially raised its throughput to 150% for now. Hopefully, this represents a good compromise...

31m059 on 29 Jan 2020

@31m059 no need... failure of that test will not necessarily imply revert. First, I think starting those tests was premature, second, let's not forget the statistical nature of fishtest.

vondele on 29 Jan 2020

@vondele My apologies, I've now restored the default throughput. I agree with your approach of exercising restraint here.

31m059 on 29 Jan 2020

okay, I didn't notice this topic, I will repost there :)

Vizvezdenec on 29 Jan 2020

I think it's about time to respin this discussion after a quite disappointing regression test (it's not finished, but it's quite obvious that it most likely won't finish positive).
So, we made STC bounds really loose, and now the probability of a 0 elo patch passing both STC and LTC became 18% * 5%, so about 0.9%. It seems that this is too much: 7 elo gainers result in what seems to be a slightly negative elo gain.
I guess we should do smth about this.
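The 18% * 5% estimate can be checked with a small sketch. It assumes the bounds named elsewhere in this thread (STC {-1, 3}, LTC {0, 2}), alpha = beta = 0.05, independent stages and the same true Elo at both time controls; the helper names are hypothetical and the formula is the standard Wald approximation, not fishtest's own code.

```python
import math

def pass_prob(elo, lo, hi, alpha=0.05, beta=0.05):
    """Wald approximation of the chance to pass SPRT(lo, hi) at true strength `elo`."""
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    theta = 2 * (elo - (lo + hi) / 2) / (hi - lo)
    if abs(theta) < 1e-12:
        return -a / (b - a)
    return (1 - math.exp(-theta * a)) / (math.exp(-theta * b) - math.exp(-theta * a))

def two_stage_prob(elo, stc, ltc):
    # Assumption: STC and LTC outcomes are independent and share one true Elo.
    return pass_prob(elo, *stc) * pass_prob(elo, *ltc)

current = two_stage_prob(0.0, (-1, 3), (0, 2))
proposed = two_stage_prob(0.0, (-0.5, 3), (0.25, 2.25))
print(f"0 Elo patch, current bounds:  {current:.2%}")
print(f"0 Elo patch, STC {{-0.5; 3}} + LTC {{0.25; 2.25}}: {proposed:.2%}")
```

Swapping in other bound pairs makes it easy to compare proposals on the same footing.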

The most obvious step is that we should probably do simplification attempts for all 7 passed patches that made it into master since the sf11 release, probably just at LTC.
It seems that 0.9% is too high a chance for a negative patch to pass. Since we want loose STC bounds to give more patches a shot at LTC, we should probably tighten the LTC bounds themselves slightly, since a lot of negative patches go to LTC and each of them has a decent chance to pass.
My proposition is to change the LTC SPRT bounds to {0.25, 2.25} or {0.5, 2.5}. The second one is closer to the 0.25% regression chance we had like forever; the first one will allow more patches to pass.
I think that the stronger the engine gets, the stricter the % of non-regression should be (yes, it's sad, because fewer patches will pass), because the % of passed patches becomes lower and lower, thus more and more patches are tested and more and more patches lie in the "slightly negative" zone.
Maybe we can also slightly move the lower bound of STC.
I think a good compromise between everything can be
STC {-0.5; 3}, LTC {0.25; 2.25}.
The chance of a negative patch passing will be smth like 0.26%, more or less what we used for years. We will have slightly fewer LTCs (which is, imho, a good thing; nowadays we run endless LTCs and most of them are not even close to passing), the overall number of games wouldn't increase that much, and STC-LTC correlation will be slightly more reliable.
I guess that's all from me for now, your opinion is really appreciated :)

Vizvezdenec on 29 Jan 2020

👍3

@Vizvezdenec I appreciate your proposition, but IMO it's too early to take measures. Let us first try to understand (and measure) what went wrong.

pb00068 on 29 Jan 2020

👍1

Well, the first thing that should be done is trying to simplify the passed tests with {-1.5; 0.5}, imho.

Vizvezdenec on 29 Jan 2020

Trying to simplify the passed patches (but maybe not all at once!) seems
like a good idea.

Alternatively, with 15 patches since sf11 how about testing the first 8
patches for progression / regression? If there is a single patch causing a
problem this would be a first step towards identifying it.

xoto10 on 29 Jan 2020

👍1

The regression cannot have come from the LTC gainers. Even if we hit the 5% "jackpot" on 3 of 7, the rest would cover the tiny elo loss, landing us at +1-2 at worst.

Nevertheless, STC -0.5,3 is definitely agreed, + 15" for extra correlation. LTC may then be fine as is (viz's suggestion sounds good too, but it might make greens too rare), so STC is the first needed step, and then reassess.

More on topic,
Simplifications with the current -1.5,0.5 feel like they pass more easily than before, and one has to question their value in general. Imo if they are not removing serious code it's not worth it. Now, for example, we have a visible ocb weakness (from ccc observation), because the cycle of small regression into a new, better formula didn't succeed (yet).

A large number of untested non-functional patches carries the risk of random unforeseen side effects, as seems to be the case here. It's probably wise to do "better safe than sorry" type of tests like sn often does, as neither humans nor compilers can be 100% trusted.

NKONSTANTAKIS on 29 Jan 2020

👍2

Oh, again simplifications are the culprit, of course.
I've heard it a lot of times and have never seen tests of 20 simplifications together being negative elo. It was done multiple times and NEVER showed a result worse than -1 elo.
But "elo gaining tests" actually being negative, even with the old bounds, is nothing unheard of, and it's even more probable with the current ones.
Just saying that with the number of patches we test and the % of patches passing, we should actually decrease the % of negative patches that can pass.
Nowadays 0,25% threshold for passing patch is -0.3 elo and 0.1% is -0.5 elo , with previous bounds we had this as 0 elo (0,5 0,5 and 0,5;4,5 and 0; 3,5 respectively).
With our goal being to be better at 60+0.6, and mostly relying on this test as an indication of a patch being good, imho it's pretty logical to make the lower bound > 0 so LTC regressions will have a really low chance to pass, even at the cost of fewer patches passing in general.

Vizvezdenec on 29 Jan 2020

imo starting a RT with only simplifications wouldn't hurt too much, and can determine if the SPRT bounds are too loose.

owldw21 on 29 Jan 2020

I think we should wait for the tests to finish before discussing anything. Note that from the test by @vondele it may well follow that we are just witnessing a case of selection bias. This could be a confirmation of the existence of such a phenomenon (on which I have speculated in many posts).

vdbergh on 29 Jan 2020

👍1

Well, opening dependency for sure can be a thing. But tbh it's within error bounds.
Also I need to say: if LTC on the 2 moves book ends at 0 it's still pretty bad.

Vizvezdenec on 29 Jan 2020

Removing too many simplifications at once is an extremely noisy test, as in between a lot of elo gainers were adapted on the basis of the simplifications. Reverting them all at once would distort the elo gainers' performance unpredictably. 20 is a ridiculous number, a useless test. I don't understand the rationale for suspecting just the 1% chance of elo gainers regressing while at the same time -1.5,0.5 means that a -0.5 elo test is more or less a coin flip, so 25% to pass both and be merged.

Edit: An interesting usage of a new master - 20 simpl vs master test would be as comparison with an old master + 20 simpl. This way one can measure their relative dependency of that period.

NKONSTANTAKIS on 29 Jan 2020

Error 2 elo, book 2 elo, linux patch 2 elo, simpl 2 elo, sum 8 elo, hence -3 instead of +5. Easy.
Case 1 (error) is unavoidable.
Case 2 (book) is easily eliminated with a same-book RT.
Case 3 (linux patch) requires constant caution for untested changes.
Case 4 (simpl): Elo is much harder to get now than it was a decade ago when -3,1 was established. We can adapt our price-to-simpl ratio to current needs, for example with -1.25,0.75.

NKONSTANTAKIS on 29 Jan 2020

So from the regression tests it doesn't look like book is the problem. We also know that the linux large page commit failed to pass non regression.

adentong on 29 Jan 2020

Oof. Viz's test 1, 5, 6 already passed simplification, 3 is about to pass, 4 and 7 are still neutral, and 2 looks like the only one that's resisting the simplification.

adentong on 30 Jan 2020

@adentong I have some concerns about simplification 6, since it's not really a simplification (it reverts a complexity-neutral parameter tweak for TrappedRook). But overall you're right, this pattern is striking.

31m059 on 30 Jan 2020

well I tested all "elo-gaining" patches since sf11 release regardless of them being param tweak or not.
Next thing I want to do is to squash them into a single commit and test it on fixed 60k games against master on 8 moves book. Will do so closer to 17-18 Moscow time since I'm at work now :)

Vizvezdenec on 30 Jan 2020

@Vizvezdenec Which simplifications do you plan to combine for the fixed-games test? Just the ones that pass STC? (That would be my recommendation.)

You might also just run them as a LTC SPRT with simplification bounds...

31m059 on 30 Jan 2020

obviously the ones that pass STC :)
Yeah, but I want to see some estimate of elo on fixed games.

Vizvezdenec on 30 Jan 2020

Initial regression test is complete:

ELO: -2.47 +-1.3 (95%) LOS: 0.0%
Total: 60000 W: 7490 L: 7917 D: 44593
Ptnml(0-2): 330, 5657, 18424, 5285, 303
https://tests.stockfishchess.org/tests/view/5e307251ab2d69d58394fdb9
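As a rough cross-check, the Elo line can be recomputed from the Ptnml counts. This is a sketch under stated assumptions, not fishtest's implementation: it treats the five counts as game pairs scoring 0, 0.5, 1, 1.5 and 2 points, uses logistic Elo, and takes a normal 95% interval on the mean score. The function names are made up for illustration.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference corresponding to an expected score in (0, 1)."""
    return -400 * math.log10(1 / score - 1)

def pentanomial_elo(ptnml):
    """Return (elo, 95% half-width) from game-pair counts for 0, 0.5, 1, 1.5, 2 points."""
    pairs = sum(ptnml)
    per_game = [0.0, 0.25, 0.5, 0.75, 1.0]   # pair score divided by 2
    mean = sum(n * x for n, x in zip(ptnml, per_game)) / pairs
    var = sum(n * (x - mean) ** 2 for n, x in zip(ptnml, per_game)) / pairs
    half = 1.959964 * math.sqrt(var / pairs)  # 95% interval on the mean score
    elo = elo_from_score(mean)
    err = (elo_from_score(mean + half) - elo_from_score(mean - half)) / 2
    return elo, err

elo, err = pentanomial_elo([330, 5657, 18424, 5285, 303])
print(f"ELO: {elo:.2f} +-{err:.1f} (95%)")
```

The result should land close to the reported estimate and error bar; small differences are expected because the exact fishtest formulas are not reproduced here.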

31m059 on 30 Jan 2020

So 5 simplifications by Viz have passed stc, are we going to run ltc for
them individually or combine them somehow?

It seems as if something has changed and patches are passing stc and ltc
too easily allowing elo losing changes into the codebase. Or is the problem
that these simplification tests are passing too easily?
Tricky.

xoto10 on 30 Jan 2020

@Vizvezdenec's test is interesting IMO. 5 out of 7 'Elo gainers' can be simplified without regression (at least at STC); for quite a few of them, the Elo estimate for removal is positive... I think these can be individually rescheduled for LTC simplifications.

We'll have to reflect a little on what that means for our testing procedure.

vondele on 30 Jan 2020

👍2

I think that this test will tell us more.
http://tests.stockfishchess.org/tests/view/5e32b470ec661e2e6a340d66
If it also lands in firmly negative zone I think we should definitely rethink our SPRT bounds. :)

Vizvezdenec on 30 Jan 2020

At least this one failed https://github.com/Vizvezdenec/Stockfish/compare/a910ba7...2907081 . This was what appeared to be an unambiguous Elo gainer: http://tests.stockfishchess.org/html/live_elo.html?5e2f767bab2d69d58394fd04.

Elo: 7.71 [3.37,12.05] (LOS 99.974%).

vdbergh on 30 Jan 2020

2 tests failing simplification out of 7 is nothing to be proud of, imho...

Vizvezdenec on 30 Jan 2020

👍1

@Vizvezdenec I started LTC simplifications

vondele on 30 Jan 2020

okay I'm not at home now so can't do it myself anyway :)

Vizvezdenec on 30 Jan 2020

If we can regularly pass and then simplify away the same test, that
suggests our elo gainer and simplification bounds are too close together. I
guess the poor regression test results suggest it's the elo gainer bounds
that need to get tougher.

xoto10 on 30 Jan 2020

Let's see how all tests finish, but if it's the case... I already proposed some solutions :P

Vizvezdenec on 30 Jan 2020

and @noobpwnftw has added a heap of new machines to the framework now! Thanks a lot! :-)

snicolet on 30 Jan 2020

👍7 🎉3

thx @noobpwnftw
Now we can only wait for data to converge and then make something out of it...

Vizvezdenec on 30 Jan 2020

I repost here my comment from there:
https://github.com/snicolet/Stockfish/commit/b64a9bba9a4f5466c5b4795527684170fd2164a7

It's difficult. That some tests are false positives is normal, but that most tests fall into this category seems really odd and is not likely. So we should look for a common source for this. Perhaps the implementation of the new pentanomial model has some bug. What I have seen is that for few games the error bar seems to be only about half of what we get with simple statistics (2*standardDev). I don't know if this is normal. Perhaps we should also recheck the commits made under the new model before SF 11.

Someone proposed changing the SPRT bounds, but perhaps it's better to decrease the probability of false positives/negatives by reducing alpha and beta.

EDIT:
I had not considered the draw ratio, so the error bars are smaller. So I get around 20-30% more deviation for a few hundred games than was displayed at fishtest. But I think that could be explained by the better approximation with the pentanomial model.

locutus2 on 30 Jan 2020

It's a bit of both: elo gainers pass more easily than before and simplifications pass more easily too. As they are almost the reverse (-2,0 is too close to -1.5,0.5 !), not only does any elo gainer have a high chance to pass -1.5,0.5, but any simplification also has a high chance to pass 0,2 when reverted.

So besides testing all elo gainers with simpl bounds, I propose testing reverting all (or some, like the ocb one) simplifications since -1.5,0.5 with elo gaining bounds.

It's evident that the change to logistic elo altered the analogies and lowered confidence (as all tests resolve much faster).

NKONSTANTAKIS on 30 Jan 2020

well so far it looks like combo of 7 "elo-gainers" is firmly negative vs sf 11.
So stop blaming simplifications for it, at least for now.
We tried to reintroduce the OCB scalefactor with @locutus2, but it all failed LTC while passing STC a lot of times. There seems to be basically no elo there.

Vizvezdenec on 30 Jan 2020

How about lowering contempt to reduce jitter to the statistical model. Beforehand we needed extra resolution, now it seems we need more stability of the signal.

Not by a lot, just to fulfill the old rule of not regressing vs ct=0. This would also result in more accurate optimization. Optimizing everything for max gain in ct24 v ct24 self play inevitably introduces some bias, maybe too much.

NKONSTANTAKIS on 30 Jan 2020

The combo of 7 gainers has not been tested yet. @snicolet made 2 files; the one being tested atm includes all functional patches. It's basically the initial RT minus the non-functional ones.

NKONSTANTAKIS on 30 Jan 2020

@NKONSTANTAKIS
The issue with the simplifications is indeed a problem. Perhaps an increase of the LTC bounds, like [0,3], would help. This has several advantages:

Simplifying away an elo gainer is harder
currently STC and LTC have the same break-even point (1 ELO). With the new bounds LTC would have 1.5 ELO, so we probably increase the number of good scaling patches
through the wider window LTC would be faster (around 55%)
we can keep the STC bounds, which seem to have a good ratio of greens from a motivation standpoint

The disadvantage is that fewer LTCs pass, but if we avoid more regressions it seems reasonable. But this has to be simulated for real answers.

locutus2 on 30 Jan 2020

@locutus2 Logical direction, but a bit too much imo. Going halfway there with 0,2.5 and then assessing is safer, along with categorising simplifications into light (borderline parameter tweaks) and heavy (considerable code removal), with different bounds.

NKONSTANTAKIS on 30 Jan 2020

I firmly believe that it's better to increase the lower bound to a value > 0, since it's THE BEST (and the only) way to decrease the probability of a patch regressing at LTC. Everything else is half-measures.

Vizvezdenec on 30 Jan 2020

Considering the long-term perspective, I now agree with @Vizvezdenec. It ensures that added code will be worth its weight. Fewer elo gainers, less need for simplifications, cleaner code. So a lot of hidden economy involved. 0.25,2.25. It will be expensive though; it's a bad idea to feed ltc as much as now. Stc can still be easier, just not as much as now. -0.5,3 + tc increase

Edit: Just realised the proposed stc is not (much) easier, just more noisy. We would like it to be a tad easier than ltc, as a scaling cushion. So -0.5,2.5, as easy as now but with fewer false positives, so more accurate.

NKONSTANTAKIS on 30 Jan 2020

https://github.com/Vizvezdenec/Stockfish/compare/a910ba7...3e1d8c7
number one
https://cdn.discordapp.com/emojis/648699415099605002.png?v=1

Vizvezdenec on 30 Jan 2020

I believe it is increasingly clear that a number of 'Elo gainers' passed through our STC [-1, 3] and LTC [0,2] bounds, that turn out all to be slightly negative in true Elo. This is possible in the statistical framework that fishtest is. The most natural and simplest way to deal with this is to adjust the bounds.

However, let's first look carefully if there are other possible causes for what we observe.

vondele on 30 Jan 2020

👍2

I'm just saying that I was always concerned about these new bounds.
And now we are getting solid proof that at medium scale they can introduce regressions masked as "elo gains".
I think that our goal first of all is to be confident that what we introduce is an elo gain, and not to increase the number of code changes just for the sake of more passing patches. I honestly liked the older bounds more. Yes, it was much harder to get a positive STC to let LTC run, but at least you could be confident with high probability that the LTC was not trash.
Nowadays we are (more or less) picking random patches and letting them run at LTC to see which one flukes out to be green. It's okay, but if we do so we SHOULD make the lower passing bound stricter, so at least the LTC SPRT will give us quite high confidence that we are not accepting some junk (I'm not trying to be offensive, one of these "simplifications" is my patch, I'm just quite pissed about the clean regression we managed to push).

Vizvezdenec on 30 Jan 2020

👍1

As I already said, I had multiple cases where I had 3-4 passed STCs on the same idea that failed LTC badly, and the STC retest also showed negative perf. But with some % one of them could've been positive for sure.

Vizvezdenec on 30 Jan 2020

👍1

Even before sf11 we had statistically strange RT results with the new bounds. Maybe it's best for stc and ltc to have equal (or similar) confidence, like the old 0,5 0,5 era. Ltc confidence is more valuable but stc confidence is cheaper. But not an equal elo bar; stc needs to be easier.

0,2 stc + 0.25,2.25 ltc

NKONSTANTAKIS on 30 Jan 2020

well ngl these results were also with the old bounds.
But in my memory it's the first case in the past 3 years of a clean and not bug-related regression of master.

Vizvezdenec on 30 Jan 2020

Yes, no clear proof but the problem was probably masked by error margins + strong elo gainers carrying the bad ones. We are very lucky now to not have a +5 elo patch camouflaging the issues

NKONSTANTAKIS on 30 Jan 2020

I have a full proposition, a synthesis of all viewpoints: keep an adequate width of 2, and tweak only the elo bar in relation to code.

Code adders: 0,2 + 0.25,2.25
Tweaks: -0.5,1.5 + -0.5,1.5
Trivial simpl: -1,1 + -1,1
Real simpl: -1.5,0.5 + -1.5,0.5

NKONSTANTAKIS on 30 Jan 2020

Along with 3 more propositions:

Stc to 15+0.15, for more correlation
RT book = patch book , for less noise
Base ct decrease to old rule, for higher optimization coherency of CT0 to CTdef

How about initially voting on each one separately, in order to get the community signal, with decisions then falling to those responsible?

NKONSTANTAKIS on 30 Jan 2020

What is the theoretical probability of a test failing SPRT(0..2), given that it has already passed the same SPRT(0..2) once?

snicolet on 30 Jan 2020

I can't say that I'm all that for voting to decide stuff, not gonna lie.
Votes would include a lot of people that didn't write a single patch, and these people (no offence) naturally know and understand much less about how sf improvement works than people who try to improve sf on a constant basis.
Maybe I sound cocky but it's the truth.

Vizvezdenec on 30 Jan 2020

@snicolet... I just answered that question elsewhere... If we assume the patch has a true Elo of 1, it fails a second SPRT test with 50% chance.

In general, it depends on the prior distribution of patches you have.

To give an example:

assume you have 100 patches all of 0 Elo, those that pass a SPRT(0,2) by luck (about 5% of them), will fail a second SPRT(0,2) with 95% chance.
if you have 100 patches all of 1 Elo, those 50% that pass a SPRT(0,2), will pass the second SPRT with 50% prob. as well
if you have 100 patches all of 2 Elo, those 95% that pass a SPRT(0,2), will pass the second SPRT(0,2) with 95% chance as well
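The dependence on the prior can be sketched in a few lines. The pass probabilities use the standard Wald approximation with alpha = beta = 0.05 (an assumption, not fishtest's code), the helper names are hypothetical, and the mixed prior in the last line is an invented illustration, not measured data.

```python
import math

def pass_prob(elo, lo=0.0, hi=2.0, alpha=0.05, beta=0.05):
    """Wald approximation of the chance to pass SPRT(lo, hi) at true strength `elo`."""
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    theta = 2 * (elo - (lo + hi) / 2) / (hi - lo)
    if abs(theta) < 1e-12:
        return -a / (b - a)
    return (1 - math.exp(-theta * a)) / (math.exp(-theta * b) - math.exp(-theta * a))

def repass_prob(prior):
    """P(pass a second SPRT | passed the first), for a prior {true Elo: share of patches}."""
    passed_once = sum(w * pass_prob(e) for e, w in prior.items())
    passed_twice = sum(w * pass_prob(e) ** 2 for e, w in prior.items())
    return passed_twice / passed_once

print(repass_prob({0.0: 1.0}))                       # all patches are 0 Elo
print(repass_prob({1.0: 1.0}))                       # all patches are 1 Elo
print(repass_prob({2.0: 1.0}))                       # all patches are 2 Elo
print(repass_prob({0.0: 0.8, 1.0: 0.15, 2.0: 0.05})) # hypothetical mixed prior
```

With a mixed prior, the patches that passed once are enriched in stronger ones, so the re-pass rate sits between the single-Elo cases.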

vondele on 30 Jan 2020

👍1

Some thoughts :

It is beyond any doubt now that we got regressions in master.
The test with a different book is also failing, and confirms that blaming the regression on the book was incorrect. Different books show different sensitivity (poor sensitivity requires more games to validate a result), but the vast majority of changes that are good for one book are good for the others. How is e.g. a search tweak supposed to do great at noob_3moves and terribly at 8moves_v3 ?? One can simply look at chess960, which is much more different from standard chess than our different books are from each other, and see that the relative strength of engines in CCRL FRC is still quite similar to the relative strength in standard chess.
Preliminary results from the new tests show that some supposed elo gainers are between useless and elo-losing.
These merged regressions follow shortly after the introduction of the pentanomial model and the switch from bayes elo to logistic elo.
The pentanomial model allows computing more accurate error bars. Those happen to be lower than the error bars with the trinomial model, because counting an opening that was won twice by the same color just like two draws reduces noise. But while the theoretical stop rules have not changed, what this really means is that because trinomial overestimated the error bars, the implicit confidence level used by fishtest previously was higher. It appears this higher confidence was not wasted.
The STC bounds adopted a few months back, combining a low elo threshold and quick acceptance, are very noisy, making flukes much more likely.
The switch from bayes elo to logistic elo used roughly rounded values, because keeping values ending in .5 or .0 was deemed more important than doing a precise conversion. I'm not sure if this rounding made things better or worse, but we shouldn't consider the "prettiness" of the bounds as an important factor.
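On the pentanomial point above, the difference in error bars can be illustrated with the finished regression test quoted earlier in this thread (W: 7490 L: 7917 D: 44593, Ptnml 330, 5657, 18424, 5285, 303). This is a sketch with hypothetical helper names; it compares the standard error of the mean score when games are treated as independent (trinomial) with the one obtained from game pairs (pentanomial).

```python
import math

def trinomial_stderr(wins, losses, draws):
    """Standard error of the mean score, treating every game as independent."""
    games = wins + losses + draws
    mean = (wins + 0.5 * draws) / games
    var = (wins * (1 - mean) ** 2 + losses * mean ** 2
           + draws * (0.5 - mean) ** 2) / games
    return math.sqrt(var / games)

def pentanomial_stderr(ptnml):
    """Standard error of the mean score, treating each game pair as one sample."""
    pairs = sum(ptnml)
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # per-game score of a pair
    mean = sum(n * x for n, x in zip(ptnml, xs)) / pairs
    var = sum(n * (x - mean) ** 2 for n, x in zip(ptnml, xs)) / pairs
    return math.sqrt(var / pairs)

tri = trinomial_stderr(7490, 7917, 44593)
penta = pentanomial_stderr([330, 5657, 18424, 5285, 303])
ratio = penta / tri
print(f"pentanomial / trinomial error bar: {ratio:.3f}")
```

A ratio below 1 means the same stop rule is reached with fewer games than the trinomial error estimate would have required.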

The tests currently running should help identify the patches behind the regression, and remove them.

But moving forward, to avoid this happening again, the solution obviously is to revise the bounds to make a fluke less likely. The 0 elo passing probability is not the only metric to look at; at an equal 0 elo passing probability, noisy bounds make it much more likely for -0.5 elo and -1 elo patches to pass.

@NKONSTANTAKIS Please avoid publishing several messages in a row, prefer editing your latest message, or wait a bit before pressing the comment button to be sure you didn't forget anything important. It makes the discussion more readable.

EDIT: also, voting is a very bad idea. People who have thoughtful opinions can express them and contribute to the discussion, but in the end the maintainer should take the decision. People with minor involvement in the project or a poor understanding of the issue should not have the power to decide.

Alayan-stk-2 on 30 Jan 2020

❤1

My thoughts are:

the main requirement is to make the bounds higher for elo gainers
stc bounds could perhaps rise by more than ltc, and/or narrow a bit
as mentioned elsewhere, I would also like to see stc tests move to a longer tc, 15+0.15 seems the obvious candidate, only 1/4 of ltc instead of 1/6. I am not sure if this should be left until later, but perhaps it is best done while we are changing bounds

E.g. Something like :
STC {-0.5, 3.5} 15+0.15
LTC {0.25,2.25} 60+0.6

I based the increases on simple eyeballing of the sprt graphs linked from fishtest, no doubt the maths guys can come up with more rigorous assessments of the increases required, e.g. using the results of the tests snicolet is currently running

xoto10 on 30 Jan 2020

👍1

Sadly I was misunderstood. By voting I didn't imply any responsibility for those in charge to act upon a vote majority, or for votes to have equal value. It's more of a missing signal from people who don't post. Everyone has his own unique point of view. Somehow those need to interact. If some think that voting info will do more harm than good, that's a respected opinion. We could make a pre-vote about voting on bounds. The truth is that the effort I put in is not that respected because I don't write patches. Hence what I express is not judged equally, which is natural. Also I am fully aware that my nonconformist character is an annoyance to many. This is the reason I occasionally disappear, probably doing more self-rewarding stuff. But I have a hard time abstaining for long. So you could even vote on whether my involvement is overall positive or negative, I will use it wisely.

NKONSTANTAKIS on 30 Jan 2020

👍1

@Vizvezdenec could you please merge your 5 passed simplifications/reverts (1, 3, 4, 5, 6) into one branch and perform a regression test wrt. SF11 at our standard conditions ?

vondele on 30 Jan 2020

@snicolet did the sequence of tests Test how "..." yield some insight that you can share?

vondele on 30 Jan 2020

@vondele What about the simplifications? For example the linux large page commit failed to pass LTC nonregression: http://tests.stockfishchess.org/tests/view/5e30ab0bab2d69d58394fdf1. Though most of the simplifications were non functional, so they probably didn't matter.

adentong on 30 Jan 2020

Furthermore, do we also want to look at commits before SF11 release and after we switched to pentanomial and new book?

adentong on 30 Jan 2020

large pages also passed. http://tests.stockfishchess.org/tests/view/5e32c748ec661e2e6a340d96
this was already mentioned, failing to pass a non-regression test doesn't mean a test regresses.

Your second question, I think we can try to run simplification tests on pre-SF11 patches as well, but not right now, to keep some order.

vondele on 30 Jan 2020

@Vizvezdenec combined 5 patches tested here http://tests.stockfishchess.org/tests/view/5e334098708b13464ceea330

vondele on 30 Jan 2020

@vondele @Vizvezdenec The combined simplification has passed quickly:

LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 7829 W: 1079 L: 964 D: 5786
Ptnml(0-2): 52, 690, 2281, 781, 65
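The reported LLR can be roughly reproduced from the Ptnml counts with a normal approximation of the log-likelihood ratio. This is a sketch, not necessarily the formula fishtest uses: it assumes logistic Elo, alpha = beta = 0.05 for the stopping bounds, and that the five counts are game pairs. Note that the pair counts cover slightly fewer games than the Total line, so exact agreement is not expected.

```python
import math

def score(elo):
    """Expected score for a logistic Elo difference."""
    return 1 / (1 + 10 ** (-elo / 400))

def llr_normal_approx(ptnml, elo0, elo1):
    """Normal approximation of the LLR for H1: elo1 against H0: elo0."""
    pairs = sum(ptnml)
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # per-game score of a pair
    mean = sum(n * x for n, x in zip(ptnml, xs)) / pairs
    var = sum(n * (x - mean) ** 2 for n, x in zip(ptnml, xs)) / pairs
    s0, s1 = score(elo0), score(elo1)
    return pairs * (s1 - s0) * (2 * mean - s0 - s1) / (2 * var)

alpha = beta = 0.05                      # assumed error rates
lower = math.log(beta / (1 - alpha))     # SPRT fails at or below this LLR
upper = math.log((1 - beta) / alpha)     # SPRT passes at or above this LLR
llr = llr_normal_approx([52, 690, 2281, 781, 65], -1.5, 0.5)
print(f"LLR: {llr:.2f} ({lower:.2f},{upper:.2f})")
```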

31m059 on 30 Jan 2020

and ongoing regression test after the 5-fold revert:
http://tests.stockfishchess.org/tests/view/5e334851708b13464ceea33c

vondele on 30 Jan 2020

ugh sorry I was sleeping :)

Vizvezdenec on 31 Jan 2020

So even with the 5 patches reverted there's still no elo gain...

adentong on 31 Jan 2020

well you can't be sure there is none, maybe it's let's say like 1 elo.
But at least it's not a clean regression :)

Vizvezdenec on 31 Jan 2020

Do we continue trying to write new patches for Stockfish? Or should we halt development until we've made changes? I currently have a test pending, but I'm not sure myself...

The framework is currently mostly idle...

31m059 on 31 Jan 2020

We need to rethink our testing methodology, otherwise it's mostly meaningless when 5 out of 7 elo gainers can be simplified away just as easily.

adentong on 31 Jan 2020

@Vizvezdenec could you today turn these 5 patches into 1 PR, with a commit message stating what has been reverted and with the test results? I'll try to merge tonight.

vondele on 31 Jan 2020

👍1

This exercise (which required more than 1M LTC games to sort out, thanks @noobpwnftw ) has shown that the bounds to filter out(in?) the Elo gainers are not strict enough, which means the lower bounds of our SPRT tests need to go up. At the same time, I don't want to make it more difficult for a 1-Elo patch to pass our testing, on the contrary. This means the average of the two bounds has to stay the same or decrease. These two requirements imply narrower bounds for testing, and thus more resources need to be invested per patch. This will avoid tests that pass with <10k games, but obviously some will need >100k. I don't want to change the TC of testing at the same time, so we can clearly see the effect of the bounds changing. Yet, by reducing the STC requirement a little, we can make it easier for 'good scalers' to pass.

As a result I propose:

standard STC {-0.5, 1.5} -> 50% pass rate at 0.5 Elo (100k games), 1% pass rate at -1.0 Elo
standard LTC {0.25, 1.75} -> 50% pass rate at 1.0 Elo (137k games), 0.3% pass rate at -0.5 Elo
simplification STC/LTC {-1.5, 0.5} unmodified.
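
The pass rates quoted for these bounds can be roughly reproduced with a Brownian-motion approximation of the SPRT. The sketch below is an illustration, not the fishtest SPRT calculator: the function name `sprt_pass_probability` is hypothetical, it assumes alpha = beta = 0.05 (consistent with the LLR limits of ±2.94 shown in the test results above), and it treats Elo as linear in score over this small range.

```python
import math


def sprt_pass_probability(elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate chance that an SPRT with bounds {elo0, elo1} accepts
    a patch whose true strength is `elo` (Brownian-motion approximation)."""
    lower = math.log(beta / (1 - alpha))  # LLR limit for a failed test, about -2.94
    upper = math.log((1 - beta) / alpha)  # LLR limit for a passed test, about +2.94
    # 2 * drift / variance of the LLR random walk; the per-game variance cancels.
    theta = (2 * elo - elo0 - elo1) / (elo1 - elo0)
    if abs(theta) < 1e-12:
        # Zero drift: patch strength sits exactly at the midpoint of the bounds.
        return -lower / (upper - lower)
    return (1 - math.exp(-theta * lower)) / (
        math.exp(-theta * upper) - math.exp(-theta * lower)
    )


# Proposed bounds from the comment above.
for label, elo, bounds in [
    ("STC, +0.5 Elo", 0.5, (-0.5, 1.5)),
    ("STC, -1.0 Elo", -1.0, (-0.5, 1.5)),
    ("LTC, +1.0 Elo", 1.0, (0.25, 1.75)),
    ("LTC, -0.5 Elo", -0.5, (0.25, 1.75)),
]:
    print(f"{label}: {100 * sprt_pass_probability(elo, *bounds):.2f}% pass")
```

Under these assumptions the draw rate drops out of the pass probability; it only affects how many games a test needs before it stops.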

Give a thumbs up if this idea makes sense, even if the precise bounds deviate a bit from what you would have picked. If there is some community buy-in, we can merge these changes tonight or tomorrow in fishtest. Meanwhile we continue to use fishtest with the current bounds.

vondele on 31 Jan 2020

👍10

@vondele I think the expected number of games at LTC is a bit lower than your calculations there--unless I've made a mistake of course :)

The expected number of games is at least partly determined by the draw rate, which is very different between STC and LTC, and the default on the SPRT Calc corresponds closely to STC. Using a draw rate of .705 (based on the recent LTC tests), we have a more optimistic estimate of 137k games for a +1.0 Elo patch with those bounds. That would make a big difference in terms of computational feasibility.
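
A rough way to see how the draw rate enters the game count is the sketch below. It is an approximation under stated assumptions, not the SPRT calculator itself: the helper name `expected_games_at_midpoint` is hypothetical, alpha = beta = 0.05 is assumed, the per-game score variance is taken as (1 - draw_ratio) / 4 (a win/draw/loss model near a 50% score), and the patch is assumed to sit exactly at the midpoint of the bounds. The 0.61 draw ratio is the value that appears in the SPRT calculator link quoted earlier in the thread.

```python
import math

# Slope of the logistic Elo curve near 0 Elo: score change per Elo point.
SCORE_PER_ELO = math.log(10) / 1600


def expected_games_at_midpoint(elo0, elo1, draw_ratio, alpha=0.05, beta=0.05):
    """Approximate expected length of an SPRT when the true Elo is the
    midpoint of {elo0, elo1}, so the LLR walk has no drift."""
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    score_gap = (elo1 - elo0) * SCORE_PER_ELO
    score_variance = (1 - draw_ratio) / 4  # per game, near a 50% score
    llr_variance_per_game = score_gap ** 2 / score_variance
    # A driftless walk started at 0 needs on average -lower * upper / variance steps.
    return -lower * upper / llr_variance_per_game


print(round(expected_games_at_midpoint(0.25, 1.75, draw_ratio=0.705)))
print(round(expected_games_at_midpoint(0.25, 1.75, draw_ratio=0.61)))
```

More draws mean less variance per game, so each game carries more information about a small Elo difference and the test stops sooner.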

31m059 on 31 Jan 2020

👍1

Eh, I don't have time to do it now. Can probably do it somewhere near 10 hours from now :)

Vizvezdenec on 31 Jan 2020

👍1

Doesn't it also make more sense to take [-1, 1] for simplifications?

In the proposal of @vondele, a +1 Elo patch has a 50% chance of passing and its revert a 50% chance of being simplified away.

ddobbelaere on 31 Jan 2020

@ddobbelaere not quite right, a simplification that removes a 1 Elo feature has about a 4% chance to pass both STC and LTC (quick check, correct me if wrong).

vondele on 31 Jan 2020

👍1

@ddobbelaere There is currently already an issue that a neutral patch has relatively high probability of failing a simplification test. SPRT {-1,1} would make that worse.

The probability of simplifying away a 1 Elo patch is far less than 50%.

vdbergh on 31 Jan 2020

👍1

I prefer STC to be something like {-0.25, 1.75}:
50% at 0.75 Elo, a negative patch will overall have about a 0.18% chance of passing, and the upper bound being the same as for LTC also looks aesthetically more right :)

Vizvezdenec on 31 Jan 2020

@vondele @vdbergh You're right, sorry

My statement holds for a +0.5 elo patch, (0.5 lies in middle of [-0.5; 1.5], -0.5 lies in middle of [-1.5; 0.5], so 50% chance of passing STC). Actually with the proposed bounds, it's still less likely that a 0.5 elo patch passes both STC and LTC (with higher bounds than STC) than that a -0.5 elo simplification passes both STC and LTC.

Anyway, I think the additional 'constraint' that most simplifications don't just revert earlier patches but actually make the code more readable/smaller is helping us here.

ddobbelaere on 31 Jan 2020

If we focus on -0.5 Elo patches, my understanding is the numbers change from 10% STC pass and 1.2% LTC pass now, to 5% and 0.3%, is that correct?
The change sounds reasonable on that basis, but it doesn't feel like only 1.2% of our -0.5 Elo patches have been passing LTC - is the SPRT calculator underestimating this somehow? If it is, is this change likely to be good anyway, or can we refine it?

Edit: "have been passing LTC"

xoto10 on 31 Jan 2020

"Do we continue trying to write new patches for Stockfish? Or should we halt development until we've made changes? I currently have a test pending, but I'm not sure myself..."

Try some simplifications?

xoto10 on 31 Jan 2020

@vondele so what about new bounds?
I'm starting tests with the ones you proposed for now, want to really get back into writing something and not debugging ;) .

Vizvezdenec on 31 Jan 2020

preparing a fishtest commit right now...

vondele on 31 Jan 2020

Okay, I just started some tests with adjusted bounds (selecting them manually).

Vizvezdenec on 31 Jan 2020

The regressing tests have been reverted https://github.com/official-stockfish/Stockfish/commit/6ccb1cac5aaaf7337da8b1738448793be63fdfdb and the new SPRT bounds introduced https://github.com/glinscott/fishtest/commit/cc84d9c57e5224144b5219605f20f5b82c66c642

This issue has been resolved. I'm sure the process has been useful, and I hope the new bounds help to make progress quickly and in a robust fashion.

vondele on 31 Jan 2020

👍8

https://github.com/official-stockfish/Stockfish/issues/2538

mstembera on 1 Feb 2020

I am back after some months of absence, so sorry to react with a lot of delay.

This is just my personal feeling about this huge change in SPRT bounds:

the new bounds are better and much more robust than the older ones,
as always, the drawback is the long test queue with tests of 50k+ games.

So I would kindly suggest a small adjustment: [-0.5, 2] for STC. This will significantly decrease the number of games for ~0 Elo patches, which are not potential big improvements to SF. The drawback is a small increase in +1 Elo failures, but this should be acceptable if we concentrate our efforts and resources on better patches.

MJZ1977 on 21 Feb 2020

👍1

I'd like to keep the bounds like this for a while to check... in the past days, the number of tests running has been relatively modest, and overall throughput still OK. The bounds are such that 'patches that scale with TC' have a reasonable chance to make STC and perform at LTC.

vondele on 21 Feb 2020

@vondele For what it's worth, I agree with @MJZ1977. While it might seem true that a modest number of test submissions can support more precise bounds, the problem is that since developers are not blind to the currently running tests, _the number of tests running and the number of tests submitted are not independent_. The more tests are running, the fewer tests are submitted.

I know that I often deliberately forgo submitting tests, or reduce the number of variations I submit, when I see the framework is under heavy load (out of courtesy to other developers). I'm sure I'm not the only one who does this...

31m059 on 26 Feb 2020

👍3

Clearly, there is a correlation between running tests and tests submitted. That's natural and good, I think. On a nearly empty framework, there is lots of random stuff, and countless variations being run... that's not necessarily good. Right now, running time of a test requiring 100k STC games is still <24h.

I went once through the list of tests that ran since the introduction of the new bounds... if I did the counting correctly, we had a total of 750 tests running, 50 were Elo-gainer LTC tests, and 3 passed {0.25, 1.75}. So about 6% get promoted from STC to LTC and about 6% of Elo-gainer LTC tests are successful.

If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want?

vondele on 26 Feb 2020

I don't think that changing from {-0.5, 1.5} to {-0.5, 2.0} will dramatically kill 'good scalers'. Of the last 50 patches, how many would have been lost because we slightly changed these bounds?

Now if we had run 1500 tests instead of 750, my feeling is that we would have found more successful Elo gainers in the end. And of course we are not in the extreme of the last months, with STC tests finishing in 20 minutes on an empty framework.

We can notice that from STC finishing in 20 min to STC finishing in 20 hours there is a big gap :-)

MJZ1977 on 26 Feb 2020

@vondele

"If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want?"

(emphasis added)

I don't think this is correct, though. A smaller _proportion_ of tests pass, but this is offset by an increased number of tests. Counter-intuitively, we should expect more STC greens if we make the bounds stricter (up to the point that developers no longer can submit good tests fast enough to meet the added framework capacity).

Please allow me to clarify what I mean:

Current STC bounds: [-.5, 1.5]
Expected number of games for a neutral (+0) test: 86,900
Expected pass rate of +0.75 Elo tests: 67.6%
Expected pass rate of +1 Elo tests: 81.3%

Modified STC bounds: [-.5, 2]
Expected number of games for a neutral (+0) test: 52,300
Expected pass rate of +0.75 Elo tests: 50%
Expected pass rate of +1 Elo tests: 64.3%

The capacity of the framework occupied by running +0 tests would be 1 - 52300 / 86900 = 40% lower.

The pass rate for +0.75 Elo tests would only be 1 - 50/67.6 = 26% lower.
The pass rate for +1.00 Elo tests would only be 1 - 64.3/81.3 = 21% lower.

So the increase in capacity for new tests outpaces the decrease in pass rate. Assuming the distribution of submitted test Elo doesn't change (which is a potentially flawed but unavoidable assumption), we would expect 23% more +0.75 tests to pass, and 31% more +1.00 tests, unless I've made an error in the calculations.
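
The arithmetic behind this argument can be written out as a short sketch. The function name `relative_green_rate` is hypothetical, the inputs are the figures quoted in the comment above, and the model assumes, as the comment does, that framework capacity is dominated by neutral (+0) tests and that the mix of submitted patches stays the same.

```python
def relative_green_rate(games_old, games_new, pass_old, pass_new):
    """Ratio of passed tests per unit of framework time, new bounds vs. old.

    games_*: expected games for a neutral (+0 Elo) test under each bounds setting
    pass_*:  pass rate of the patch class of interest under each bounds setting
    """
    capacity_gain = games_old / games_new  # how many more tests fit in the same time
    pass_rate_ratio = pass_new / pass_old  # how much less often each test passes
    return capacity_gain * pass_rate_ratio


# Figures quoted above for [-0.5, 1.5] (old) and [-0.5, 2] (new).
for label, pass_old, pass_new in [("+0.75 Elo", 67.6, 50.0), ("+1.00 Elo", 81.3, 64.3)]:
    ratio = relative_green_rate(86_900, 52_300, pass_old, pass_new)
    print(f"{label}: {100 * (ratio - 1):.0f}% more passed tests")
```

With these inputs the ratios come out at about 1.23 and 1.31, so the result hinges on the capacity gain (about 1.66) being larger than the loss in pass rate.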

31m059 on 27 Feb 2020

Assuming neutral scaling, we currently harvest just 40.5% (81% * 50%) of +1 Elo patches.
With the new proposal this will go down to 32%.
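
As a small worked version of this estimate: the sketch below multiplies the STC and LTC pass rates, which assumes the two stages are independent and that the patch has the same Elo at both time controls ("neutral scaling"). The name `harvest_rate` is hypothetical, and the 50% LTC figure reflects +1 Elo being the midpoint of the {0.25, 1.75} bounds.

```python
def harvest_rate(stc_pass, ltc_pass):
    """Chance that a patch passes STC and then LTC, assuming independent stages."""
    return stc_pass * ltc_pass


current = harvest_rate(0.81, 0.50)    # STC {-0.5, 1.5}, LTC {0.25, 1.75}
proposed = harvest_rate(0.643, 0.50)  # STC {-0.5, 2.0}, LTC unchanged
print(f"current: {100 * current:.1f}%, proposed: {100 * proposed:.1f}%")
```

This is a per-patch figure; it does not account for any change in how many patches get tested.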

I think at current state strong elo gainers are almost non existent, and elo gainers in general too rare. Personally I feel sad when a parameter tweak or a modest code adder fails at high LTC count. Usually it represents a harmless clear improvement which fell a bit short, and with tons of resources invested. Even by skipping the game-count indication, the +0.5 to +0.8 elo range obviously has very low passing chances. Especially when a solid idea of about that elo range is tried in multiple versions, it consumes extreme resources.

I think that an important question is:
If we knew the exact elo gain of a patch what would be the requirement?

I assume around 0.5 elo. Obviously due to lack of perfect confidence the bounds also serve the purpose of making it less likely for useless or regressing code to sneak in. Hence the LTC mid-point target of +1 elo, with very high confidence.

I still don't understand why we use the same bounds for tweaks and code adders. When we are willing to use considerable resources and pay a little Elo for cleaner and less code via simplifications, shouldn't the reverse also be reflected?
Are a 0 Elo tweak and a 0 Elo code adder equally bad when fluking in?
What about a +0.2 patch? I can't think why someone would prefer the slightly inferior parameters, but sub-par code certainly hurts long-term.

The way I see it there are 2 safely profitable possibilities:

1. We could economize considerable resources by exploiting the harmless nature of tweaks through lower confidence.
2. We could enable continuous small Elo micro-optimization with a statistical model of average Elo parameter tweak acceptance.

To elaborate on 2.: the aim is to find the most efficient way to be, for example, 2-3 Elo up after 10 tweaks without measuring each one accurately. Currently this is rarely attempted, and only via combos, but that method is clearly inefficient because the normal bounds are tried first, and the combo selections are then chosen on the shaky basis of SPRT game count.
My proposition is basically to profitably combo the elements straight into master.

NKONSTANTAKIS on 27 Feb 2020

The logic of @31m059 applies even more strongly to +2 Elo and +3 Elo patches...

Edit: for me, we must avoid both extremes:

quick tests with low robustness (10k games) and an often empty framework
heavy tests with a long test queue and people discouraged from launching new tests

Of course all depends on available cores in the framework. If we have 5k cores, [-0.5, 1.5] is clearly the best choice.

MJZ1977 on 27 Feb 2020

👍3
