Stockfish: STC-LTC correlation?

Created on 28 Mar 2020 · 22 Comments · Source: official-stockfish/Stockfish

Out of the last 40 LTC tests with elo-gaining bounds, there have been 23 reds, 16 yellows, and 1 green.

That's an extremely low pass rate considering the STC bounds have been made stricter. Obviously, the elo perf required to pass LTC bounds is higher, but the likelihood of a green STC translating to a green LTC is truly abysmal as it stands.

Alayan-stk-2

👍2

Most helpful comment

So, the last few days, fishtest has been buzzing along, and we have had quite a few patches passing LTC. The elo gain / day is back to standard. I had the feeling there was a good dynamic: inspiration for nice patches from watching games, innovative ideas, and interaction among devs to turn these ideas into working code. I think that, together with @noobpwnftw's hardware support, is what brings us further along.

In my opinion the current bounds are working. So I'll keep them as is for at least the next few weeks/months.

vondele on 18 Apr 2020

👍5

All 22 comments

Imo it's due to LTC elo gainers of 0.8+ elo being almost nonexistent. With noisier bounds a lot of passed LTCs were <0.5 elo. They probably helped the elo progress, but at a cost of more code and useless code. Atm the bounds guarantee that only quality patches will make it. I think it's a good mindset regarding long-term evolution, seeking perfection so to speak.

On the other hand the rate of green patches is indeed abysmal. This however imo has little to do with stc: this stc is easy enough and also reasonably friendly to scalers.

My proposition is to adapt to the "proximity to elo ceiling" situation and lower the elo bar while keeping high confidence, i.e. to be happy with +0.5 elo patches getting in.

My recommendation is an initial mild LTC adjustment to {0.15 , 1.65}.

I consider crucial the lowering of the low bound, while the high one a matter of taste, being a tradeoff between economy and accuracy.
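
To get a feel for what moving the LTC bounds from {0.25 , 1.75} to {0.15 , 1.65} does, here is a rough sketch of the pass chance of an SPRT as a function of a patch's true elo. It is not Fishtest's own calculator: it assumes alpha = beta = 0.05, treats the log-likelihood ratio as a Brownian motion, takes score as linear in elo near 0, and ignores overshoot at the stopping bounds, so its output will not exactly match figures quoted elsewhere in this thread. The function name is made up for illustration.

```python
import math

def sprt_pass_probability(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate chance that an SPRT(elo0, elo1) test ends green.

    Assumptions (not taken from Fishtest's code): Brownian-motion LLR,
    score linear in elo near 0, no overshoot at the stopping bounds.
    """
    lower = math.log(beta / (1 - alpha))   # LLR bound at which the test fails
    upper = math.log((1 - beta) / alpha)   # LLR bound at which the test passes
    # Normalised drift: -1 at elo0, 0 at the midpoint, +1 at elo1.
    theta = (2 * true_elo - elo0 - elo1) / (elo1 - elo0)
    if abs(theta) < 1e-12:
        # Zero drift: plain gambler's-ruin ratio of the two distances.
        return -lower / (upper - lower)
    num = 1 - math.exp(-theta * lower)
    den = math.exp(-theta * upper) - math.exp(-theta * lower)
    return num / den

for bounds in [(0.25, 1.75), (0.15, 1.65)]:
    for elo in (0.0, 0.5, 1.0):
        p = sprt_pass_probability(elo, *bounds)
        print(f"bounds={bounds} true_elo={elo:+.1f} pass={p:.1%}")
```

Under these assumptions the pass chance depends only on where the true elo sits relative to the two bounds, which is why shifting both bounds down by 0.1 moves the 50% point from 1.0 to 0.9 elo.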

NKONSTANTAKIS on 29 Mar 2020

👍2

Good topic @Alayan-stk-2. The ELO gain per 30 days is at a record low. There was a false with the first regression where in fact SF lost some ELO due to the change in measurement methodology. But nonetheless, the ELO gain slope is much, much lower than for the last versions after 2 months of active development.

OuaisBla on 29 Mar 2020

Which is to be expected, as SF gets stronger and stronger. But that certainly doesn't mean we can't improve our testing methodology.

adentong on 29 Mar 2020

@NKONSTANTAKIS To test that assumption safely, one strategy would be to create a temporary DEV branch and stack the patches that fit the new criteria. Once we have, let's say, 10 or so of those, we can rebase it and launch a regression test between the MASTER and the DEV branch to see if the ELO gains are adding up, stalling or simply regressing.

OuaisBla on 29 Mar 2020

This is tricky: due to the interaction between changes, only the 1st result will be accurate. Consecutive results will be less and less correlated with the branch, especially if they affect the same part of the code/parameters. But SF is so interlinked that any change could affect another part of the code.

But also why would that assumption need to be safely checked? The bounds were/are chosen through discussion, judgement and will of the maintainer. After months of evaluating results, mild adjustments to a favorable direction are healthy. It's extreme changes that can be (and have been) dangerous.

NKONSTANTAKIS on 30 Mar 2020

The goal of working with a branch in parallel is to gather some data so that the assumption is either confirmed or not, making it easier for the maintainer to accept the changes or to refuse them.

Doing it in steps, we could think of the following workflow:

Create a protected DEV branch
Stack only patches that pass LTC at {0.15 , 1.65} but not {0.25 , 1.75}
Run a regression test against master.
If not conclusive, delete the branch. If yes, rebase it.

That way we could validate whether the interaction of a few lesser patches (up to 10 max) is positive overall, neutral or negative.

I would expect that to show a positive result of course. Also, this is only a simple suggestion.

OuaisBla on 30 Mar 2020

What I have been describing is that patches passing vs master are not patches passing vs branch. So the more patches you stack, the less indicative the results will be of master. With a different SF the results become invalid.

Furthermore there is no doubt that more patches of positive elo will give more elo in general to SF. But that comes with more code, which tends to obfuscate the code, create local maxima, and hurt the long-term progress. For example if we had perfect confidence of a patch to be 0.2 elo, we would definitely not accept it.

So the question is what is the minimum elo we are content with for accepting more code. If we define it, it will help sorting the rest.

NKONSTANTAKIS on 30 Mar 2020

Another idea which would naturally help with the green stc to green ltc ratio, supported by @31m059 and @MJZ1977 here https://github.com/official-stockfish/Stockfish/issues/2531

Is to make stc harder and faster with {-0.5, 2} in order to allow a greater volume of patches. The downside is that there will be more missed opportunities (unlucky failed stcs that would pass). But as those represent the lower end of STC, and given the already low pass ratio, I support this recommendation too.

NKONSTANTAKIS on 30 Mar 2020

The recent flurry of LTC greens definitely alleviates worries, but I still think that lowering the LTC 50% pass target bar by 0.1 elo, from 1 to 0.9, is a safe small improvement, enabling a small increase of accepted patches.

For LTC a +1 elo patch will have 60% chance from 50% , a 0.5 elo patch 18% from 16% and a 0 elo 3% from 2%. Combined with STC and for neutral scaling, a +1 elo patch will rise to 48% from 40%, +0.5 elo will have 9% from 8%, and 0 elo sneaking in chance rises from 0.4% to 0.6%.
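
The combined figures above are consistent with multiplying an STC pass chance by an LTC pass chance, which is what "neutral scaling" with independent STC and LTC results would give. The sketch below only re-derives the STC pass chance implied by the quoted percentages; the independence assumption and the variable names are mine, not something stated in the comment.

```python
# Pass chances quoted in the comment, stored as (current, proposed) fractions.
ltc = {1.0: (0.50, 0.60), 0.5: (0.16, 0.18), 0.0: (0.02, 0.03)}
combined = {1.0: (0.40, 0.48), 0.5: (0.08, 0.09), 0.0: (0.004, 0.006)}

def implied_stc_pass(elo):
    """STC pass chance implied by combined = STC * LTC (assumes independence)."""
    return tuple(c / l for c, l in zip(combined[elo], ltc[elo]))

for elo in ltc:
    now, proposed = implied_stc_pass(elo)
    print(f"{elo:+.1f} elo: implied STC pass {now:.0%} (current LTC), "
          f"{proposed:.0%} (proposed LTC)")
```

The quoted numbers imply STC pass chances of 80%, 50% and 20% for +1, +0.5 and 0 elo patches, unchanged by the proposal, so the whole difference comes from the LTC step.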

To cut it rough and simple, the elo gaining slope will rise from 5 to 10%, estimation including the very slight increase of LTC resources. The cost will be a slight increase of added code and a negligible increase of useless code.

@vondele Knowing that your philosophy dislikes low confidence models and risks, I hope this proposition or some alternative of your preference gets consideration. I deem that the progress of SF is extremely robust and can easily speed up a tad.

PS. Sorry that I hijacked with my usual style, I wanted to leave it to the real devs, but it seems that I can't abstain for long periods. Please feel free to dismiss my suggestions at will and continue discussion. In this case I hope that I represent the majority of SF fans and devs who wish a faster progress. Hard and natural as it might be at this level, it's painful to watch neutral elo RTs after months of SF11. I peace out.

NKONSTANTAKIS on 31 Mar 2020

I suggest we move STC to a slightly longer tc to improve the correlation with LTC results. It was first suggested by someone else (don't remember who) when we debated changing test bounds years ago, and I was interested but not in favour at the time. Since then I have seen quite a few cases where patches failed at STC but passed at LTC, they are far enough apart that they do play differently. I think we need STC to be longer now to help find the things that matter at longer time controls.

xoto10 on 31 Mar 2020

👍3

@xoto10 In this case a possibility to make up for the extra cost is economising resources by hardening the current STC bounds, which are designed to be scaling tolerant. This way negative & neutral patches will fail faster.

NKONSTANTAKIS on 31 Mar 2020

I think combining a slightly longer STC and moderate bound changes to reflect the fact that a bigger part of the final confidence goes in the STC part of STC+LTC could work well.

Picking numbers by guessing : 15s+0.15 STC with [-0.25 ; 1.60] and LTC with [0.2 ; 1.8]

I wouldn't mind 20s+0.2 ; but going through smaller steps may be safer.

Alayan-stk-2 on 1 Apr 2020

One suggestion for next time we change bounds, whether it's soon or ages away ... we could move to values with just 1 decimal place. It seems likely we need a higher resolution than 0.25 so let's lose the 2nd decimal place and move to multiples of 0.1

xoto10 on 1 Apr 2020

👍2

On one hand correlation is quite low.
On the other hand it will always be low.
What I mean... Since discovering that quite a lot of stuff in SF scales non-linearly (especially extensions in search, countermove pruning and initiative in eval) it's pretty logical to have more strict bounds for LTC than for STC - so patches that scale badly will have lower pass rate compared to patches that scale well. But this logical way to do it will lead to a lot of patches passing STC and failing LTC - this is sad but I don't see a way to fight it (other than make STC bounds stricter but then we will have like 2 STCs passing / month and people going for speculative LTCs 24/7 out of pure desperation of nothing passing).
And about low elo/month - well, engine is getting stronger + this data is badly corrupted with negative patches passing after sf11 release - it ate 1 month of work and brought like 0 elo. I think that we are recovering recently + progress was never gigastable for sf, periods of like 0 elo gained were 3 months at worst if my memory isn't failing me so it's nothing new to have dry period.
With this stuff like bounds there is never a haste to do anything because effect of their change can't be measured over span of day, week or even some months. Let's have current ones for now and see how it goes.
I do believe that correlation between STC and LTC has gotten better for me in the last 2 weeks out of complete randomness, for example :D

Vizvezdenec on 5 Apr 2020

👍2

Safest change is to make this ultra solid LTC a tiny tad easier to get 10% more patches. Worst thing that can happen is some +0.2 elo patch going in instead of +0.3 currently. Is that a disaster?

NKONSTANTAKIS on 6 Apr 2020

well confidence of 99,1% turned out to be a disaster.

Vizvezdenec on 6 Apr 2020

👍1

@NKONSTANTAKIS Relaxing bounds is very dangerous given the volume of tests on Fishtest. You will increase the number of negative Elo tests that pass by pure luck. I think that currently the bounds are barely strict enough to avoid regressions of this nature.
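
A back-of-the-envelope way to see the volume effect: even a small per-test pass chance adds up over many tests. The sketch below uses the 0.4% and 0.6% combined pass chances quoted earlier in this thread for a 0 elo patch; the number of tests is a made-up placeholder, not a Fishtest statistic, and tests are assumed independent. Truly negative patches would pass less often than these 0 elo figures.

```python
def lucky_pass_stats(n_tests, pass_chance):
    """Expected number of passes and chance of at least one pass,
    assuming every test is independent and has the same pass chance."""
    expected = n_tests * pass_chance
    at_least_one = 1 - (1 - pass_chance) ** n_tests
    return expected, at_least_one

# Hypothetical count of 0 elo patches submitted over some period.
N_NEUTRAL_TESTS = 500

for label, chance in [("current bounds", 0.004), ("relaxed LTC", 0.006)]:
    expected, any_pass = lucky_pass_stats(N_NEUTRAL_TESTS, chance)
    print(f"{label}: expected lucky passes={expected:.1f}, "
          f"P(at least one)={any_pass:.0%}")
```

The expected count scales linearly with both the pass chance and the test volume, which is the core of the concern: a change that looks tiny per test is multiplied by however many neutral or negative patches are submitted.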

vdbergh on 6 Apr 2020

Ok, what if we make the STC stricter then, so that bad/neutral patches also fail faster, thus reducing danger from lucky tests, and compensating for the reduced friendliness to scaling with an increase of TC, a highly supported and logical transition, offering a cleaner LTC correlation. This way the overall load can be kept roughly the same, but the STC passed patch quality will be higher. Less STC greens, but both less noisy & more correlated. With this strong upgrade of STC confidence, LTC could use a part of it to relax. An overall upgrade of confidence.

Because as it is now, I have a strong impression that there is a leak of economy at the STC-LTC link: by accounting for scaling through lax STC bounds, neutral and negative scalers abuse our STC (and consequently LTC) resources! I think that's also the main point of this thread, and interestingly every recommendation expressed reduces this leak.

Something like 15" {-0.3 , 1.7} , {0.2 , 1.7} , actually similar to the alayan proposition but maintaining current width. Basically +0.2 STC elo bar and -0.05 LTC
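
The "+0.2 STC elo bar and -0.05 LTC" summary can be checked from the bound midpoints. This small sketch takes the current bounds as [-0.5, 1.5] for STC and {0.25 , 1.75} for LTC, as quoted elsewhere in this thread; the helper name is only illustrative.

```python
def midpoint_and_width(bounds):
    """Return the 50% pass target (midpoint) and the width of an SPRT bound pair."""
    low, high = bounds
    return (low + high) / 2, high - low

current = {"STC": (-0.5, 1.5), "LTC": (0.25, 1.75)}   # bounds quoted in this thread
proposed = {"STC": (-0.3, 1.7), "LTC": (0.2, 1.7)}    # bounds proposed above

for tc in current:
    mid_now, width_now = midpoint_and_width(current[tc])
    mid_new, width_new = midpoint_and_width(proposed[tc])
    print(f"{tc}: midpoint {mid_now:.2f} -> {mid_new:.2f} "
          f"({mid_new - mid_now:+.2f}), width {width_now:.2f} -> {width_new:.2f}")
```

The STC midpoint moves from 0.5 to 0.7 and the LTC midpoint from 1.0 to 0.95, while the widths stay at 2.0 and 1.5 respectively.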

This proposition takes into account the confidence concerns of @vdbergh and @vizvezdenec and upgrades it, incorporates the consistent long term wishes of @Alayan-stk-2 @xoto10 & others for higher STC, and enables a considerably faster STC conversion as requested by @31m059 and @MJZ1977 but without adding randomness to STC, undesired by @vondele.

A pleasant side effect of 15" STC is the reduction of server load. With massive power surges from @noobpwnftw, after a point the server bottlenecks at game generation, wasting resources; it malfunctions and can crash. This point will be considerably higher and the capacity of resource usage expanded.

NKONSTANTAKIS on 6 Apr 2020

👍1

I fully support longer STC testing with stricter bounds for both Elo and simplification patches at the STC level. I also support stricter bounds for simplification patches at the LTC level. It is obvious that far too many Elo-regressive patches are making it through. I am also very suspicious of functional change patches that get passed as simplifications. It’s only common sense that as Stockfish gets stronger, Elo-gaining patches will be harder to find, so to me it only makes sense to protect the Elo-gaining patches from regressive, Elo-losing simplification patches. Just my $.02. I appreciate that everyone here wants to make SF stronger and we all do have a common goal.

Also, it is true that by the time a patch gets approved, master is often a different version from the one the patch was tested against. With every commit, there is a change in interaction between the submitted patch and the version it was tested against versus the version it is actually applied to. This approach may be fine for a non-game application, but in a game application it would only make sense to test against the current version. Committing 4 or 6 patches in one day, none of them tested for interaction with each other, seems to be inherently flawed and a process design defect. Unfortunately, I have no easy solution to fix that. Perhaps a slow and steady methodical process to approve patches “One by One” against the most current version, with all previous successful patches applied, would secure more Elo in the long run. Debatable of course, but treat my comments as food for thought to stimulate further deliberations. This is a healthy discussion for the SF team to deliberate.

MichaelB7 on 7 Apr 2020

👍2

If the question is why we have few passing LTC tests, my first answer is that it is not easy to find a good and simple patch worth +2 Elo at LTC.
To do that, we must try and test innovative ideas and not only little tweaks with 100k+ STC games.
So the first thing to do is to change the [-0.5,1.5] bounds, which let 0 Elo patches eat all the resources, and instead analyse games and think about new ideas that will at first be very bad before (perhaps) making their way to passing LTC.
To say it simply, I prefer [-0.5,2] or something like that, which in 40k STC games gives a vague assessment of the idea. If the idea is potentially good it will probably finish yellow or green.
Green = LTC test
Yellow = LTC with low throughput.
In the end, we will have fewer STC tests and more LTC tests compared with the current system. Statistically, it will give good-scaling patches more chance than today.