-2.38 +-4.7 after about 5000 games. obviously still very early, but after 7 elo gainers this is not what I was expecting...
Look at the number of games taken for most of the passed elo gainers. Totally unreliable, imho.
SPRT elo estimates are not accurate, but the confidence of superiority is very good. It's designed to stop once there is a very high confidence it doesn't regress and a high confidence it progresses.
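For reference, a minimal sketch of the stopping rule described above. It assumes Wald's classic thresholds with alpha = beta = 0.05, which is consistent with the (-2.94, 2.94) LLR interval fishtest prints with its results; the function names are made up for illustration and this is not fishtest's actual code.

```python
import math

def wald_llr_bounds(alpha=0.05, beta=0.05):
    """Return (lower, upper) LLR stopping thresholds for Wald's SPRT."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper

def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Decide whether a running test should stop, given its current LLR."""
    lower, upper = wald_llr_bounds(alpha, beta)
    if llr >= upper:
        return "pass"      # strong evidence for the upper Elo hypothesis
    if llr <= lower:
        return "fail"      # strong evidence for the lower Elo hypothesis
    return "continue"      # keep playing games

lower, upper = wald_llr_bounds()
print(f"{lower:.2f} {upper:.2f}")  # -2.94 2.94
```

The test stops on the LLR alone, which is why the Elo estimate at stopping time is biased and should not be read as an accurate measurement.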
The regression test is far from finished, but at the very best it will be something like +1 or +2 elo. That's bad news.
It would be valuable to repeat the RT with just the elo gainers after sf11 (master minus simplifications and non-functional patches). In the past a couple of tests of this kind showed no problem, but that doesn't mean there will never be one. Now we also have different bounds and statistics.
Another interesting RT would be with the noob3 book, as the optimization we do for it might not correlate that strongly with 8-moves performance. If that's the case, it's probably a good idea to do RTs on noob3, for better tracking of progress, more accuracy and less panicking.
Another thing to examine is 0,2 LTC possibly being too easy, allowing too many false positives through sheer number of tests. We basically rely on a single test, current STC adds very little LTC confidence.
With current settings a probability of a <0 elo patch passing LTC is 5% according to https://tests.stockfishchess.org/html/SPRTcalculator.html?elo-0=0&elo-1=2&draw-ratio=0.61&rms-bias=0. NKONSTANTAKIS made a good point about the book being different. If it is indeed the book making a difference then it's gonna be fun...
If we only look at the 15 tests between SF11 and 200128,
3 things stand out and they happened in the most recent 5 tests committed
a) bench jumped from 4725546 (test no 14) to 5545845 (test no 15) on 200128
b) 2 LTC finished before 10000 games (test no 12 and test no 14)
c) http://tests.stockfishchess.org/html/live_elo.html?5e30ab0bab2d69d58394fdf1
(test no 11) is currently struggling, although it already passed a STC regression
I therefore suggest to test SF11 against test no 10, which is
https://github.com/official-stockfish/Stockfish/commit/6d0eabd5fe2961551477820ab7619e2c31e01ffd
and then we can continue bisecting up or down from there.
https://www.sp-cc.de/ tested Stockfish 11 official 200118 playing against a pool of AB engines and compared with Stockfish 200107 playing against same pool of engines, at 180+1
He says "AB-testrun of Stockfish 11 finished. Sadly a clear regression to Stockfish 200107 (-7 Elo)"
This is only one result, but there is a puzzling coincidence:
the first commits of test which used the new pentanomial model, 3 moves book and new SPRT bounds were done on January 7. Maybe the problem started there ?
The error bars with 5K games against significantly weaker opponents do not allow to tell with a reasonable degree of confidence that SF11 would be inferior to SF 200107.
@Rocky640 That is possible, but certainly would be surprising...our own regression testing showed a slight single-threaded gain from January 7 to SF11. While that is close-to-within error bars, we should have detected a 7-Elo regression...
Should we temporarily disallow any new patch submissions to fishtest and stop all currently running ones, so we can focus our attention on this issue, since we'd probably want to run a bunch of regression tests to pinpoint the problem? Seems like this could have some pretty big consequences on the future of SF's improvement.
The fishtest server has a priority field (which could be increased for regression tests, if those are a priority).
let's look at this carefully, but without hectic ...
first, the results so far (RT not yet finished) is consistent with NCM https://nextchessmove.com/dev-builds where no single patch stands out.
second, as next thing I will test with the book which is now being used for development, out of curiosity to see if this matters.
I don't think it will be necessary to stop all other ongoing tests, or to elevate priority...those are drastic measures. We can use throughput for a "softer" approach.
But since the new throughput calculation prefers positive LLR, @snicolet's non-regression verification test is going to lose workers as it progresses towards LLR = -2.94 (the more informative result, since failure may mean reversion of that commit). Therefore, I've artificially raised its throughput to 150% for now. Hopefully, this represents a good compromise...
@31m059 no need... failure of that test will not necessarily imply revert. First, I think starting those tests was premature, second, let's not forget the statistical nature of fishtest.
@vondele My apologies, I've now restored the default throughput. I agree with your approach of exercising restraint here.
okay, I didn't notice this topic, I will repost there :)
I think it's about time to respin this discussion after a quite disappointing regression test (it's not finished, but it's quite obvious that it most likely won't finish positive).
So, we made STC bounds really loose and now probability of patch being not a regression became 18% * 5%, so like 0.9%, it seems that it's too much - 7 elo gainers result in what seems to be slightly negative elo gain.
I guess we should do smth with this.
Most obvious is that we probably should do simplification attempts for all 7 passed patches that made it into master since sf11 release. Probably just at LTC;
It seems that 0,9% is too high of a chance for negative patch to pass. Probably since we want loose STC bounds to give more patches shot at LTC we should slightly stricten LTC bounds themselves since a lot of patches that are negative go to LTC and each of them has decent chance to pass.
My proposition will be to change LTC SPRT bounds to {0.25, 2.25} or {0.5, 2.5} - second one is closer to 0,25% of regression chance we had like forever, first one will allow more patches to pass.
I think that the stronger engine gets the stricter should be % of non-regression (yes, it's sad, because less patches will pass) because % of passed patches becomes lower and lower thus more and more patches are tested and more and more patches lie in "slightly negative" zone.
Maybe we can also slightly move lower bound of STC.
I think good compromise between everything can be
STC {-0.5; 3}, LTC {0.25; 2.25}.
Chances of negative patch will be smth like 0,26% - more or less what we used for years, we will have slightly less LTCs (which is, imho, a good thing, nowadays we run infinite LTCs most of them are not even close to passing), overall game number wouldn't increase this much, STC-LTC correlation will be slightly more reliable.
I guess it's all from me for now, your opinion is really appreciated :)
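The percentages above can be roughly reproduced with the usual Brownian-motion approximation of SPRT. This is only a sketch under stated assumptions: alpha = beta = 0.05, score treated as linear in Elo near 0, STC and LTC outcomes independent, and the patch having the same true Elo at both time controls. The function names are hypothetical.

```python
import math

LLR_BOUND = math.log(0.95 / 0.05)  # about 2.94, assuming alpha = beta = 0.05

def pass_probability(true_elo, elo0, elo1, bound=LLR_BOUND):
    """Approximate chance that SPRT(elo0, elo1) accepts a patch of the given true Elo."""
    # h is -1 at elo0, 0 at the midpoint and +1 at elo1.
    h = (2.0 * true_elo - elo0 - elo1) / (elo1 - elo0)
    return 1.0 / (1.0 + math.exp(-h * bound))

def two_stage_pass_probability(true_elo, stc_bounds, ltc_bounds):
    """Chance of passing STC and then LTC, assuming the two tests are independent."""
    return (pass_probability(true_elo, *stc_bounds)
            * pass_probability(true_elo, *ltc_bounds))

# 0 Elo patch with the bounds in use at the time: STC {-1, 3}, LTC {0, 2}.
current = two_stage_pass_probability(0.0, (-1.0, 3.0), (0.0, 2.0))
# 0 Elo patch with the bounds proposed above: STC {-0.5, 3}, LTC {0.25, 2.25}.
proposed = two_stage_pass_probability(0.0, (-0.5, 3.0), (0.25, 2.25))
print(f"current: {current:.2%}, proposed: {proposed:.2%}")
```

Under these assumptions the two products come out at roughly 0.9% and 0.27%, in line with the figures quoted in the post.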
@Vizvezdenec I appreciate your proposition but IMO it's too early to take measures. Let us first try to understand (and measure) what went wrong.
Well first stuff that should be done is trying to simplify passed tests with {-1,5; 0.5}, imho.
Trying to simplify the passed patches (but maybe not all at once!) seems
like a good idea.
Alternatively, with 15 patches since sf11 how about testing the first 8
patches for progression / regression? If there is a single patch causing a
problem this would be a first step towards identifying it.
The regression cannot have come from the LTC gainers. Even if we hit the 5% "jackpot" 3/7 times, the rest would cover up the tiny elo loss, landing us at +1-2 at worst.
Nevertheless, STC -0.5,3 definitely agreed + 15" for extra correlation. LTC maybe then will be fine as is, (viz suggestion sounds good too, but it might rarify greens too much) so STC 1st needed step and reassess.
More on topic,
Simplifications with the current -1.5,0.5 feel like they pass more easily than before, and one has to question their value in general. Imo if they are not removing serious code it's not worth it. Now, for example, we have a visible ocb weakness (from ccc observation), because the cycle of small regression into a new, better formula didn't succeed (yet).
A large number of untested non-functional patches holds the risk of random unforeseen side-effects, as seems to be the case here. It's probably wise to do "better safe than sorry" type of tests like sn often does, as neither humans nor compilers can be 100% trusted.
Oh, again simplifications are the culprit, of course.
Heard it a lot of times and have never seen tests of 20 simplifications together being negative elo - it was done multiple times and NEVER had shown result worse than -1 elo.
But "elo gaining tests" being actually negative even with old bounds is nothing that is unheard of and it's even more probable with current ones.
Just saying that with number of patches we test and % of patches passing we should actually decrease % of patches that can pass being negative.
Nowadays 0,25% threshold for passing patch is -0.3 elo and 0.1% is -0.5 elo , with previous bounds we had this as 0 elo (0,5 0,5 and 0,5;4,5 and 0; 3,5 respectively).
With our goal to be better at 60+0.6 and mostly rely on this test as an indication of this patch being good, imho, it's pretty logical to make lower bound > 0 so LTC regressions will have really low chance to pass. Even at cost of lower number of patches passing in general.
imo starting a RT with only simplifications wouldn't hurt too much, and can determine if the SPRT bounds are too loose.
I think we should wait for the tests to finish before discussing anything. Note that from the test by @vondele it may well follow that we are just witnessing a case of selection bias. This could be a confirmation of the existence of such a phenomenon (on which I have speculated in many posts).
Well, opening dependency for sure can be a thing. But tbh it's within error bounds.
Also I need to say - if LTC on 2 moves book ends with 0 it's still pretty bad.
Removing too many simplifications at once is an extremely noisy test, as in between a lot of elo gainers were adapted on the basis of the simplifications. Turning them on at once would distort the elo gainers' performance unpredictably. 20 is a ridiculous number, useless test. I don't understand the rationale for suspecting just the 1% chance of elo gainers regressing while at the same time -1.5,0.5 means that a -0.5 elo test is more or less a coin flip, so 25% to pass both and be merged.
Edit: An interesting usage of a new master - 20 simpl vs master test would be as comparison with an old master + 20 simpl. This way one can measure their relative dependency of that period.
Error 2 elo, book 2 elo, linux patch 2 elo, simpl 2 elo, sum 8 elo, hence -3 instead of +5. Easy.
Case 1. Is unavoidable
So from the regression tests it doesn't look like book is the problem. We also know that the linux large page commit failed to pass non regression.
Oof. Viz's test 1, 5, 6 already passed simplification, 3 is about to pass, 4 and 7 are still neutral, and 2 looks like the only one that's resisting the simplification.
@adentong I have some concerns about simplification 6, since it's not really a simplification (it reverts a complexity-neutral parameter tweak for TrappedRook). But overall you're right, this pattern is striking.
well I tested all "elo-gaining" patches since sf11 release regardless of them being param tweak or not.
Next thing I want to do is to squash them into a single commit and test it on fixed 60k games against master on 8 moves book. Will do so closer to 17-18 Moscow time since I'm at work now :)
@Vizvezdenec Which simplifications do you plan to combine for the fixed-games test? Just the ones that pass STC? (That would be my recommendation.)
You might also just run them as a LTC SPRT with simplification bounds...
obviously the one that pass STC :)
Yeah, but I want to see some estimate of elo on fixed games.
Initial regression test is complete:
ELO: -2.47 +-1.3 (95%) LOS: 0.0%
Total: 60000 W: 7490 L: 7917 D: 44593
Ptnml(0-2): 330, 5657, 18424, 5285, 303
https://tests.stockfishchess.org/tests/view/5e307251ab2d69d58394fdb9
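The reported Elo and error bar can be approximately recovered from the Ptnml line. The sketch below is not fishtest's actual code: it treats each game pair as one sample, uses the plain sample variance, and converts score to Elo with the logistic formula. The function name is made up for illustration.

```python
import math

def elo_from_pentanomial(ptnml, z=1.96):
    """Elo estimate and confidence half-width from game-pair counts.

    ptnml[i] is the number of game pairs that scored i/2 points out of 2.
    """
    pairs = sum(ptnml)
    # Per-game score of each pair outcome: 0, 0.25, 0.5, 0.75, 1.
    scores = [i / 4.0 for i in range(5)]
    mean = sum(n * s for n, s in zip(ptnml, scores)) / pairs
    var = sum(n * (s - mean) ** 2 for n, s in zip(ptnml, scores)) / pairs
    stderr = math.sqrt(var / pairs)  # standard error of the mean score

    def to_elo(score):
        return -400.0 * math.log10(1.0 / score - 1.0)

    elo = to_elo(mean)
    half_width = (to_elo(mean + z * stderr) - to_elo(mean - z * stderr)) / 2.0
    return elo, half_width

elo, err = elo_from_pentanomial([330, 5657, 18424, 5285, 303])
print(f"ELO: {elo:.2f} +-{err:.1f} (95%)")
```

With the counts above this lands close to the reported -2.47 +-1.3. Working on pairs rather than single games is what makes the error bar tighter than a naive W/L/D calculation would suggest.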
So 5 simplifications by Viz have passed stc, are we going to run ltc for
them individually or combine them somehow?
It seems as if something has changed and patches are passing stc and ltc
too easily allowing elo losing changes into the codebase. Or is the problem
that these simplification tests are passing too easily?
Tricky.
@Vizvezdenec test is interesting IMO. 5 out of 7 'Elo gainers' can be simplified without regression (at least STC), for quite a few of them, the Elo estimate for removal is positive... I think these can be individually reschedule for LTC simplifications.
We'll have to reflect a little on what that means for our testing procedure.
I think that this test will tell us more.
http://tests.stockfishchess.org/tests/view/5e32b470ec661e2e6a340d66
If it also lands in firmly negative zone I think we should definitely rethink our SPRT bounds. :)
At least this one failed https://github.com/Vizvezdenec/Stockfish/compare/a910ba7...2907081 . This was what appeared to be an unambiguous Elo gainer: http://tests.stockfishchess.org/html/live_elo.html?5e2f767bab2d69d58394fd04.
Elo: 7.71 [3.37,12.05] (LOS 99.974%).
2 tests failing simplification out of 7 is nothing to be proud of, imho...
@Vizvezdenec I started LTC simplifications
okay I'm not at home now so can't do it myself anyway :)
If we can regularly pass and then simplify away the same test, that
suggests our elo gainer and simplification bounds are too close together. I
guess the poor regression test results suggest it's the elo gainer bounds
that need to get tougher.
Let's see how all tests finish, but if it's the case... I already proposed some solutions :P
and @noobpwnftw has added a heap of new machines to the framework now! Thanks a lot! :-)
thx @noobpwnftw
Now we can only wait for data to converge and then make something out of it...
I repost here my comment from there:
https://github.com/snicolet/Stockfish/commit/b64a9bba9a4f5466c5b4795527684170fd2164a7
It's difficult. That some tests are false positives is normal, but that most tests fall in this category seems really odd and is not likely. So we should look for a common source for this. Perhaps the implementation of the new pentanomial model has some bug. What I have seen is that for few games the error bar seems to be only about half of what we get with simple statistics (2*standardDev). I don't know if this is normal. Perhaps we should also recheck the commits under the new model before SF 11.
Someone proposed to change the SPRT bounds, but perhaps it's better to decrease the probability of false positives/negatives by reducing alpha and beta.
EDIT:
I had not considered the draw ratio, so the error bars are smaller. So I get around 20-30% more deviation for a few hundred games than was displayed at fishtest. But I think that could be explained by the better approximation with the pentanomial model.
It's a bit of both: elo gainers pass more easily than before and simplifications pass more easily too. As they are almost the reverse (-2,0 is too close to -1.5,0.5 !), not only does any elo gainer have a high chance to pass -1.5,0.5 but any simplification also has a high chance to pass 0,2 when reverted.
So besides testing all elo gainers with simpl bounds, I propose testing reverting all (or some, like the ocb one) simplifications since -1.5,0.5 with elo gaining bounds.
It's evident that the change to logistic elo altered the analogies and lowered confidence (as all tests resolve much faster).
well so far it looks like combo of 7 "elo-gainers" is firmly negative vs sf 11.
So stop blaming simplifications for it, at least for now.
We tried to reintroduce OCB scalefactor with @locutus2 but it all failed LTC while passed STC a lot of times - there seems to be basically no elo there.
How about lowering contempt to reduce jitter to the statistical model. Beforehand we needed extra resolution, now it seems we need more stability of the signal.
Not by a lot, just to fulfill the old rule of not regressing vs ct=0. This would also result in more accurate optimization. Optimizing everything for max gain in ct24 v ct24 self play inevitably introduces some bias, maybe too much.
The combo of 7 gainers has not been tested yet, @snicolet made 2 files, the one tested atm includes all functional patches. Its basically the initial RT minus non functional.
@NKONSTANTAKIS
That with the simplifications is indeed a problem. Perhaps a increase of LTC bounds like [0,3] would help. This has several advantages:
The disadvantage is that fewer LTCs pass, but if we avoid more regressions it seems reasonable. But this has to be simulated for real answers.
@locutus2 Logical direction but a bit too much imo. Halfway there with 0,2.5 and assess its safer, and to categorise simplifications into light (borderline parameter tweaks) and heavy (considerable code removal), with different bounds
I firmly believe that it's better to increase the lower bound to a value > 0, since it's THE BEST (and the only) way to decrease the probability of a patch regressing at LTC. Everything else is half-measures.
Considering the longterm perspective, I now agree with @Vizvezdenec. It ensures that added code will be worth its weight. Fewer elo gainers, less need for simplifications, cleaner code. So a lot of hidden economy involved. 0.25,2.25. It will be expensive though, its a bad idea to feed ltc as much as now. Stc can still be easier, just not as much as now. -0.5,3 + tc increase
Edit: Just realised proposed stc its not (much) easier, just more noisy. We would like it to be a tad easier than ltc, as scaling pillow. So -0.5,2.5, as easy as now but less false positives, so more accurate.
I believe it is increasingly clear that a number of 'Elo gainers' passed through our STC [-1, 3] and LTC [0,2] bounds, that turn out all to be slightly negative in true Elo. This is possible in the statistical framework that fishtest is. The most natural and simplest way to deal with this is to adjust the bounds.
However, let's first look carefully if there are other possible causes for what we observe.
I'm just saying that I was always concerned about this new bounds.
And now we are getting a solid proof that in medium-scale they can introduce regression masked with "elo-gain".
I think that our goal at first is to be confident that what we introduce is an elo gain and not to increase number of code changes just for the sake of more passing patches. I honestly liked older bounds more, yes, it was much harder to get positive STC to let LTC run but at least you could've been confident that LTC is not trash with big probability.
Nowadays we are (more or less) picking random patches and let them run on LTC to see which one flukes out to be green. It's okay but if we do so we SHOULD make lower passing bound more strict. So at least LTC SPRT will give us quite big confidence that we are not accepting some junk (I'm not trying to be offensive, one of this "simplifications" is my patch, I'm just quite pissed with clean regress we managed to push).
As I already said, I had multiple cases where I had 3-4 passed STCs on the same idea that failed LTC badly, and an STC retest also showed negative perf. But with some % one of them could've been positive for sure.
Even before sf11 we had statistically strange RT results with new bounds. Maybe its best for stc and ltc to have equal (or similar) confidence, like the old 0,5 0,5 era. Ltc confidence is more valuable but stc confidence is cheaper. But not equal elo bar, stc needs to be easier.
0,2 stc + 0.25,2.25 ltc
well ngl this results was also with old bounds.
But in my memory in past 3 years it's the first case of clean and not bug-related regression of master.
Yes, no clear proof but the problem was probably masked by error margins + strong elo gainers carrying the bad ones. We are very lucky now to not have a +5 elo patch camouflaging the issues
I have a full proposition, with a synthesis of all viewpoints: keep an adequate width of 2, and tweak only the elo bar in relation to code.
Code adders: 0,2 + 0.25,2.25
Tweaks: -0.5,1.5 + -0.5,1.5
Trivial simpl: -1,1 + -1,1
Real simpl: -1.5,0.5 + -1.5,0.5
Along with 3 more propositions:
How about initially voting on each one separately, in order to get the community signal, and then decisions fall to the area of responsibility?
What is the theoretical probability of a test failing SPRT(0..2), given that it has already passed the same SPRT(0..2) once?
I can't say that I'm all that for voting to decide stuff, not gonna lie.
Votes will be including a lot of people that didn't write a single patch and this people (no offence) naturally know and understand much less of how sf improvement works than people who try to improve sf on a constant basis.
Maybe I sound cocky but it's the truth.
@snicolet... I just answered that question elsewhere... If we assume the patch has a true Elo of 1, it fails a second SPRT test with 50% chance.
In general, it depends on the prior distribution of patches you have.
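A small sketch of how the prior enters the answer. It uses the same Brownian-motion approximation of SPRT with alpha = beta = 0.05; the prior below is invented purely for illustration and is not measured fishtest data, and the function names are hypothetical.

```python
import math

LLR_BOUND = math.log(0.95 / 0.05)  # about 2.94, assuming alpha = beta = 0.05

def pass_probability(true_elo, elo0, elo1, bound=LLR_BOUND):
    """Approximate chance that SPRT(elo0, elo1) accepts a patch of the given true Elo."""
    h = (2.0 * true_elo - elo0 - elo1) / (elo1 - elo0)
    return 1.0 / (1.0 + math.exp(-h * bound))

def retest_failure_probability(prior, elo0, elo1):
    """P(second SPRT fails | first SPRT passed) for a discrete prior {elo: weight}."""
    passed_once = 0.0
    passed_then_failed = 0.0
    for elo, weight in prior.items():
        p = pass_probability(elo, elo0, elo1)
        passed_once += weight * p
        passed_then_failed += weight * p * (1.0 - p)
    return passed_then_failed / passed_once

# A patch known to be worth exactly +1 Elo sits at the midpoint of SPRT(0, 2).
print(retest_failure_probability({1.0: 1.0}, 0.0, 2.0))  # 0.5

# Hypothetical prior: most submitted patches are neutral or slightly negative.
hypothetical_prior = {-1.0: 0.4, 0.0: 0.4, 1.0: 0.15, 2.0: 0.05}
print(retest_failure_probability(hypothetical_prior, 0.0, 2.0))
```

The more prior weight sits on patches near or below the lower bound, the larger the share of first-time passes that are flukes, and the more often a repeat test fails.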
To give an example:
Some thoughts :
The tests currently running should help identify the patches behind the regression, and remove them.
But moving forward, to avoid this happening again, the solution obviously is to revise the bounds to make a fluke less likely. The 0 elo passing probability is not the only metric to look at ; noisy bounds make -0.5 elo and -1 elo much more likely at equal 0 elo passing probability.
@NKONSTANTAKIS Please avoid publishing several messages in a row, prefer editing your latest message, or wait a bit before pressing the comment button to be sure you didn't forget anything important. It makes the discussion more readable.
EDIT : also, voting is a very bad idea. People who have thoughtful opinions can express them and contribute to the discussion, but in the end the maintainer should take a decision. People with a minor implication in the project or a poor understanding of the issue should not have the power to decide.
My thoughts are:
E.g. Something like :
STC {-0.5, 3.5} 15+0.15
LTC {0.25,2.25} 60+0.6
I based the increases on simple eyeballing of the sprt graphs linked from fishtest, no doubt the maths guys can come up with more rigorous assessments of the increases required, e.g. using the results of the tests snicolet is currently running
Sadly I was misunderstood. By voting I didn't imply any responsibility for those in charge to act upon the vote majority, or that votes should have equal value. More like a missing signal for people who don't post. Everyone has his own unique point of view. Somehow those need to interact. If some think that voting info will do more harm than good, it's a respected opinion. We could make a pre-vote about voting on bounds. The truth is that the effort I put in is not that respected because I don't write patches. Hence what I express is not judged equally, that is natural. Also I have full awareness that my nonconformist character is an annoyance to many. This is the reason I occasionally disappear, doing probably more self-rewarding stuff. But I have a hard time abstaining for long. So you could even vote on whether my involvement is overall positive or negative, I will use it wisely.
@Vizvezdenec could you please merge your 5 passed simplifications/reverts (1, 3, 4, 5, 6) into one branch and perform a regression test wrt. SF11 at our standard conditions ?
@snicolet did the sequence of tests Test how "..." yield some insight that you can share
@vondele What about the simplifications? For example the linux large page commit failed to pass LTC nonregression: http://tests.stockfishchess.org/tests/view/5e30ab0bab2d69d58394fdf1. Though most of the simplifications were non functional, so they probably didn't matter.
Furthermore, do we also want to look at commits before SF11 release and after we switched to pentanomial and new book?
large pages also passed. http://tests.stockfishchess.org/tests/view/5e32c748ec661e2e6a340d96
this was already mentioned, failing to pass a non-regression test doesn't mean a test regresses.
Your second question, I think we can try to run simplification tests on pre-SF11 patches as well, but not right now, to keep some order.
@Vizvezdenec combined 5 patches tested here http://tests.stockfishchess.org/tests/view/5e334098708b13464ceea330
@vondele @Vizvezdenec The combined simplification has passed quickly:
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 7829 W: 1079 L: 964 D: 5786
Ptnml(0-2): 52, 690, 2281, 781, 65
and ongoing regression test after the 5-fold revert:
http://tests.stockfishchess.org/tests/view/5e334851708b13464ceea33c
ugh sorry I was sleeping :)
So even with the 5 patches reverted there's still no elo gain...
well you can't be sure there is none, maybe it's let's say like 1 elo.
But at least it's not a clean regression :)
Do we continue trying to write new patches for Stockfish? Or should we halt development until we've made changes? I currently have a test pending, but I'm not sure myself...
The framework is currently mostly idle...
We need to rethink our testing methodology, otherwise it's mostly meaningless when 5 out of 7 elo gainers can be simplified away just as easily.
@Vizvezdenec could you today turn these 5 patches in to 1 PR, with a commit message what has been reverted and with the test results. I'll try to merge tonight.
This exercise (which required more than 1M LTC games to sort out, thanks @noobpwnftw ) has shown that the bounds to filter out(in?) the Elo gainers are not strict enough, which means the lower bounds of our SPRT tests need to go up. At the same time, I don't want to make it more difficult for an 1-Elo patch to pass our testing, on the contrary. This means the average of the two bounds has to stay or reduce. These two requirements imply narrower bounds for testing, and thus more resources need to be invested per patch. This will avoid tests that pass with <10k games, but obviously some will need >100k. I don't want to change at the same time the TC of testing, so we can clearly see the effect of the bounds changing. Yet, by reducing a little the requirement STC, we can facilitate 'good scalers' to pass.
As a result I propose:
Give a thumbs up if this idea makes sense, even if the precise bounds deviate a bit from what you would have picked. If there is some community buy-in, we can merge these changes tonight or tomorrow in fishtest. Meanwhile we continue use fishtest with the current bounds.
@vondele I think the expected number of games at LTC is a bit lower than your calculations there--unless I've made a mistake of course :)
The expected number of games is at least partly determined by the draw rate, which is very different between STC and LTC, and the default on the SPRT Calc corresponds closely to STC. Using a draw rate of .705 (based on the recent LTC tests), we have a more optimistic estimate of 137k games for a +1.0 Elo patch with those bounds. That would make a big difference in terms of computational feasibility.
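To see how strongly the draw rate moves the estimate, here is a rough sketch using the Brownian-motion approximation of SPRT. The assumptions are mine: alpha = beta = 0.05, a trinomial (win/draw/loss) model with a fixed draw ratio, overshoot ignored, and LTC bounds of {0.25, 1.75}, which are the LTC bounds mentioned later in the thread. The function name is hypothetical and this is not the calculator's code.

```python
import math

LLR_BOUND = math.log(0.95 / 0.05)        # about 2.94, assuming alpha = beta = 0.05
SCORE_PER_ELO = math.log(10.0) / 1600.0  # slope of the logistic Elo curve at 0 Elo

def expected_games(true_elo, elo0, elo1, draw_ratio, bound=LLR_BOUND):
    """Approximate expected length of SPRT(elo0, elo1), ignoring overshoot."""
    s0, s1, s = (0.5 + e * SCORE_PER_ELO for e in (elo0, elo1, true_elo))
    variance = s * (1.0 - s) - draw_ratio / 4.0   # per-game score variance
    llr_var = (s1 - s0) ** 2 / variance           # LLR variance per game
    llr_drift = (s1 - s0) * (s - (s0 + s1) / 2.0) / variance  # mean LLR change per game
    h = 2.0 * llr_drift / llr_var
    if abs(h) < 1e-6:
        # True Elo at the midpoint: driftless walk between -bound and +bound.
        return bound ** 2 / llr_var
    p_pass = 1.0 / (1.0 + math.exp(-h * bound))
    return bound * (2.0 * p_pass - 1.0) / llr_drift

for draw_ratio in (0.61, 0.705):
    games = expected_games(1.0, 0.25, 1.75, draw_ratio)
    print(f"draw ratio {draw_ratio}: about {games:,.0f} games")
```

Under these assumptions a +1.0 Elo patch needs roughly 181k games at a draw ratio of 0.61 (the value in the calculator link earlier in the thread) and roughly 137k at 0.705, since a higher draw rate lowers the per-game variance.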
Eh, I don't have time to do it now. Can probably do it somewhere near 10 hours from now :)
Doesn't it also make more sense to take [-1, 1] for simplifications?
In the proposition of @vondele a +1elo patch has 50% chance of passing and its revert 50% chance of being simplified away.
@ddobbelaere not quite right, a simplification that removes a 1Elo feature has about ~4% chance to pass both STC and LTC (quick check, correct if wrong).
@ddobbelaere There is currently already an issue that a neutral patch has relatively high probability of failing a simplification test. SPRT {-1,1} would make that worse.
The probability of simplifying away a 1 Elo patch is far less than 50%.
I prefer STC to be smth like {-0.25, 1.75}
0.75 elo 50%, negative patch overall will have like 0,18% chances of passing, also upper bound being the same looks aesthetically more right :)
@vondele @vdbergh You're right, sorry
My statement holds for a +0.5 elo patch, (0.5 lies in middle of [-0.5; 1.5], -0.5 lies in middle of [-1.5; 0.5], so 50% chance of passing STC). Actually with the proposed bounds, it's still less likely that a 0.5 elo patch passes both STC and LTC (with higher bounds than STC) than that a -0.5 elo simplification passes both STC and LTC.
Anyway, I think the additional 'constraint' that most simplifications don't just revert earlier patches but actually make the code more readable/smaller is helping us here.
If we focus on -0.5 Elo patches, my understanding is the numbers change from 10% STC pass and 1.2% LTC pass now, to 5% and 0.3%, is that correct?
The change sounds reasonable on that basis, but it doesn't feel like only 1.2% of our -0.5 Elo patches have been passing LTC - is the SPRT calculator underestimating this somehow? If it is, is this change likely to be good anyway, or can we refine it?
Edit: "have been passing LTC"
Do we continue trying to write new patches for Stockfish? Or should we
halt development until we've made changes? I currently have a test pending,
but I'm not sure myself...
Try some simplifications?
@vondele so what about new bounds?
I'm starting tests with the ones you proposed for now, want to really get back into writing something and not debugging ;) .
preparing a fishtest commit right now...
Okay, I just started some tests with adjusted bounds (customly selecting them).
The regressing tests have been reverted https://github.com/official-stockfish/Stockfish/commit/6ccb1cac5aaaf7337da8b1738448793be63fdfdb and the new SPRT bounds introduced https://github.com/glinscott/fishtest/commit/cc84d9c57e5224144b5219605f20f5b82c66c642
This issue has been resolved. I'm sure the process has been useful, and I hope the new bounds help to make progress quickly and in a robust fashion.
I am back after some months of absence, so sorry to react with a lot of delay.
This is just my personal feeling about this huge change in SPRT bounds :
So I will kindly suggest a small adjustment: [-0.5, 2] for STC. This will significantly decrease the number of games for ~0 ELO patches, which are not potential big improvements to SF. The drawback is a small increase in +1 ELO failures, but this should be acceptable if we concentrate our efforts and resources on better patches.
I'd like to keep the bounds like this for a while to check... in the past days, the number of tests running has been relatively modest, and overall throughput still OK. The bounds are such that 'patches that scale with TC' have a reasonable chance to make STC and perform at LTC.
@vondele For what it's worth, I agree with @MJZ1977. While it might seem true that a modest number of test submissions can support more precise bounds, the problem is that since developers are not blind to the currently running tests, _the number of tests running and the number of tests submitted are not independent_. The more tests are running, the fewer tests are submitted.
I know that I often deliberately forgo submitting tests, or reduce the number of variations I submit, when I see the framework is under heavy load (out of courtesy to other developers). I'm sure I'm not the only one who does this...
Clearly, there is a correlation between running tests and tests submitted. That's natural and good, I think. On a nearly empty framework, there is lots of random stuff, and countless variations being run... that's not necessarily good. Right now, running time of a test requiring 100k STC games is still <24h.
I went once through the list of tests that ran since the introduction of the new bounds.... if I did the counting correctly, we had a total of 750 tests running, 50 were Elo-gainer LTCs, 3 passed {0.25, 1.75}. So about 6% promote from STC to LTC and about 6% of Elo gainers LTC are successful.
If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want?
I don't think that changing from {-0.5, 1.5} to {-0.5, 2.0} will dramatically kill 'good scalers'. In the last 50 patches how many would have been lost because we have slightly changed these bounds?
Now if we had made 1500 tests instead of 750, my feeling is that we should find more successful ELO gainers at the end. And of course we are not in the extreme of last months with STC tests finishing in 20 minutes with empty framework.
We can notice that from STC finishing in 20 min to STC finishing in 20 hours there is a big gap :-)
@vondele wrote: "If we increase the current STC bounds from {-0.5, 1.5} to {-0.5, 2.0}, fewer tests will pass STC, and it will be more difficult for 'good scalers' to pass testing. Is that what we want?" (emphasis added)
I don't think this is correct, though. A smaller _proportion_ of tests pass, but this is offset by an increased number of tests. Counter-intuitively, we should expect more STC greens if we make the bounds stricter (up to the point that developers no longer can submit good tests fast enough to meet the added framework capacity).
Please allow me to clarify what I mean:
Current STC bounds: [-.5, 1.5]
Expected number of games for a neutral (+0) test: 86,900
Expected pass rate of +0.75 Elo tests: 67.6%
Expected pass rate of +1 Elo tests: 81.3%
Modified STC bounds: [-.5, 2]
Expected number of games for a neutral (+0) test: 52,300
Expected pass rate of +0.75 Elo tests: 50%
Expected pass rate of +1 Elo tests: 64.3%
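The pass rates listed above can be approximately reproduced with a continuous (Brownian-motion) approximation of the SPRT. The following is only a sketch under assumptions that are not stated in the thread: alpha = beta = 0.05, the log-likelihood ratio treated as a Brownian motion with overshoot ignored, and a hypothetical helper name `sprt_pass_probability`. It is not necessarily how the fishtest SPRT calculator computes its values.

```python
import math

def sprt_pass_probability(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate probability that an SPRT with bounds (elo0, elo1) passes.

    Assumptions (illustrative, not from the thread): the LLR is modelled as
    a Brownian motion and overshoot at the stopping bounds is ignored.
    """
    lower = math.log(beta / (1 - alpha))   # LLR value at which H0 is accepted
    upper = math.log((1 - beta) / alpha)   # LLR value at which H1 is accepted
    # Ratio 2 * drift / variance of the LLR; the per-game variance cancels.
    k = (2 * true_elo - elo0 - elo1) / (elo1 - elo0)
    if abs(k) < 1e-12:
        # Zero drift (true Elo at the midpoint): plain gambler's ruin.
        return -lower / (upper - lower)
    return (1 - math.exp(-k * lower)) / (math.exp(-k * upper) - math.exp(-k * lower))

for bounds in [(-0.5, 1.5), (-0.5, 2.0)]:
    for elo in (0.75, 1.0):
        p = sprt_pass_probability(elo, *bounds)
        print(f"bounds={bounds} true_elo={elo:+.2f} pass={p:.1%}")
```

In this approximation the pass probability depends only on where the true Elo sits relative to the two bounds, which is why a patch exactly at the midpoint passes half of the time.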
The capacity of the framework occupied by running +0 tests would be 1 - 52300 / 86900 = 40% lower.
The pass rate for +0.75 Elo tests would only be 1 - 50/67.6 = 26% lower.
The pass rate for +1.00 Elo tests would only be 1 - 64.3/81.3 = 21% lower.
So the increase in capacity for new tests outpaces the decrease in pass rate. Assuming the distribution of submitted test Elo doesn't change (which is a potentially flawed but unavoidable assumption), we would expect 23% more +0.75 tests to pass, and 32% more +1.00 tests, unless I've made an error in the calculations.
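The throughput argument can be written out as a short calculation. This is a sketch using only the rounded figures quoted in the post, and it assumes (as the post does) that framework capacity is dominated by neutral tests, so the number of tests that can be run scales inversely with the expected length of a neutral test. The dictionary layout and the name `relative_passes` are illustrative.

```python
# Rounded figures quoted in the post: expected games for a neutral test and
# expected pass rates for +0.75 and +1.00 Elo patches under each set of bounds.
current = {"games_neutral": 86_900, "pass": {0.75: 0.676, 1.00: 0.813}}
proposed = {"games_neutral": 52_300, "pass": {0.75: 0.500, 1.00: 0.643}}

def relative_passes(current, proposed, elo):
    """Ratio of expected passing tests (proposed / current) for a given Elo.

    Assumption: the number of tests the framework can run is inversely
    proportional to the expected game count of a neutral test.
    """
    tests_ratio = current["games_neutral"] / proposed["games_neutral"]
    pass_ratio = proposed["pass"][elo] / current["pass"][elo]
    return tests_ratio * pass_ratio

for elo in (0.75, 1.00):
    change = relative_passes(current, proposed, elo) - 1
    print(f"+{elo:.2f} Elo patches: {change:+.1%} passing tests")
```

With these rounded inputs the result is roughly +23% and +31%, close to the estimates given in the post; the small difference for +1.00 is presumably due to rounding of the inputs.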
Assuming neutral scaling, we currently harvest just 40.5% (81%*50%) of +1 elo patches.
With new proposal this will go down to 32%.
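The harvest figures are the product of the STC and LTC pass probabilities. A minimal sketch, assuming the two stages are independent and that "neutral scaling" means a +1 Elo patch sits at the midpoint of the LTC bounds {0.25, 1.75} and therefore passes LTC about half of the time (the function name `harvest_rate` is illustrative):

```python
def harvest_rate(stc_pass, ltc_pass):
    """Probability that a patch passes both stages, assuming independence."""
    return stc_pass * ltc_pass

LTC_PASS_PLUS_ONE = 0.50  # +1 Elo is the midpoint of the LTC bounds {0.25, 1.75}

current_rate = harvest_rate(0.81, LTC_PASS_PLUS_ONE)    # STC bounds {-0.5, 1.5}
proposed_rate = harvest_rate(0.643, LTC_PASS_PLUS_ONE)  # STC bounds {-0.5, 2.0}

print(f"current bounds:  {current_rate:.1%}")
print(f"proposed bounds: {proposed_rate:.1%}")
```

This counts the fraction of individual +1 Elo submissions that get through, not the absolute number of successful patches per unit of framework time.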
I think that in the current state strong Elo gainers are almost nonexistent, and Elo gainers in general are too rare. Personally I feel sad when a parameter tweak or a modest code adder fails at a high LTC game count. Usually it represents a harmless, clear improvement which fell a bit short, with tons of resources invested. Even ignoring the game-count indication, the +0.5 to +0.8 Elo range obviously has very low passing chances. Especially when a solid idea in about that Elo range is tried in multiple versions, it consumes extreme resources.
I think that an important question is:
If we knew the exact elo gain of a patch what would be the requirement?
I assume around 0.5 elo. Obviously due to lack of perfect confidence the bounds also serve the purpose of making it less likely for useless or regressing code to sneak in. Hence the LTC mid-point target of +1 elo, with very high confidence.
I still don't understand why we use the same bounds for tweaks and code adders. If we are willing to spend considerable resources and pay a little Elo for cleaner and shorter code via simplifications, shouldn't the reverse also apply, with code adders held to a stricter standard than tweaks?
Are a 0 elo tweak and a 0 elo code adder equally bad when fluking in?
What about a +0.2 patch? I can't think why someone would prefer the slightly inferior parameters but certainly sub-par code hurts long-term.
The way I see it there are 2 safely profitable possibilities:
To elaborate on 2.: the aim is to find the most efficient way to end up, for example, 2-3 Elo ahead after 10 tweaks, without measuring each tweak's individual performance accurately. Currently this is only occasionally attempted with combos, but that method is clearly inefficient, because normal bounds are tried first and the combo selections are then chosen on the shaky basis of SPRT game count.
My proposition is basically to profitably combo the elements straight into master.
The logic of @31m059 applies even more strongly to +2 Elo and +3 Elo patches...
Edit: in my view we must avoid extremes.
Of course all depends on available cores in the framework. If we have 5k cores, [-0.5, 1.5] is clearly the best choice.
This exercise (which required more than 1M LTC games to sort out, thanks @noobpwnftw) has shown that the bounds to filter out (in?) the Elo gainers are not strict enough, which means the lower bounds of our SPRT tests need to go up. At the same time, I don't want to make it more difficult for a 1-Elo patch to pass our testing; on the contrary. This means the average of the two bounds has to stay the same or decrease. These two requirements imply narrower bounds for testing, and thus more resources need to be invested per patch. This will avoid tests that pass with <10k games, but obviously some will need >100k. I don't want to change the TC of testing at the same time, so that we can clearly see the effect of changing the bounds. Yet, by slightly relaxing the STC requirement, we can make it easier for 'good scalers' to pass.
As a result I propose:
Give a thumbs up if this idea makes sense, even if the precise bounds deviate a bit from what you would have picked. If there is some community buy-in, we can merge these changes into fishtest tonight or tomorrow. Meanwhile we continue to use fishtest with the current bounds.