Stockfish: Safeguard against negative scaling

Created on 14 Jan 2020 · 6 Comments · Source: official-stockfish/Stockfish

Currently the scaling info usually consists of just 2 signals, the 10" STC and the 60" LTC.
Furthermore, STC is low confidence, so it's hard to draw any solid conclusions.
I guess validation is expensive and we have to live with the inherent risks of the adopted methodology.
But how could we frugally ensure that e.g. a patch that passes after 15K STC games + 100K LTC games is not a negative scaler that would potentially weaken high TCs and analysis?

How about using a VLTC check to test whether the Elo gain holds, just for a small percentage of the most dangerous situations (mainly a very high LTC game count, secondarily a low STC game count)?

All 6 comments

SPRT Elo estimates mean nothing.
Simple example: I had 3 patches on the same idea that passed STC, 2 of them with a +7 Elo performance.
They all failed LTC in the red zone.
I reran the "best" of them with [0, 2] STC bounds and it also failed red. So it was not actually a "bad scaler", just a "good STC fluke", nothing more.
There can be patches that scale badly, but mostly people define "bad" scaling by over-reading SPRT results. SPRT was NEVER meant to show any realistic Elo performance.

@Vizvezdenec

I agree with you that one cannot obtain any information regarding scaling from inspecting the STC/LTC tests.

But the Elo estimates are fine. They mean what they mean in the sense that they have a precisely defined statistical interpretation.

In particular it is important that the Elo estimates are considered with their confidence interval. For 95% of the tests the true Elo is inside the confidence interval.
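To make the role of the confidence interval concrete, here is a minimal sketch of a naive 95% interval computed from win/draw/loss counts. It assumes a fixed number of games and ignores the early stopping of an SPRT, so it is not the estimator fishtest reports. The function name and the example counts are hypothetical.

```python
import math

def elo_with_ci(wins, draws, losses, z=1.96):
    """Naive Elo estimate with a z-sigma interval (fixed-sample assumption).

    Scores of exactly 0 or 1 are not handled (the logistic formula diverges).
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score under the observed W/D/L frequencies.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score

    def to_elo(s):
        # Logistic Elo difference corresponding to an expected score s.
        return -400.0 * math.log10(1.0 / s - 1.0)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)

# Hypothetical counts, for illustration only.
elo, low, high = elo_with_ci(2100, 6000, 1900)
print(f"Elo {elo:+.2f}, 95% CI [{low:+.2f}, {high:+.2f}]")
```

Whether the interval contains zero is the point to check before reading anything into the point estimate.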

Which STCs are you referring to? If you mean the kawTuning family, then it seems that in that case the STC tests all had a confidence interval which included zero. But even if they didn't, one still has to take into account that for 1 in 20 tests, the true Elo will be outside the confidence interval.

Sure, but our bounds are such that basically any passing patch is, with 5% probability, close to 0 Elo :) And with 1% probability it's pretty negative. This is the trade-off for patches to converge "reasonably fast".

A 0 Elo patch has an 18.7% STC pass probability with the current [-1, 3] logistic bounds.

I find it even more worrisome that a +2 Elo patch has about the same chance of failing. Many missed opportunities.
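The 18.7% pass rate at 0 Elo and the matching failure rate at +2 Elo can be reproduced with the usual Brownian-motion (Wald) approximation of an SPRT. The sketch below assumes alpha = beta = 0.05 and that Elo is roughly linear in score near zero. Those assumptions and the function name belong to this sketch and are not taken from fishtest's code.

```python
import math

def sprt_pass_probability(elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate chance that SPRT(elo0, elo1) accepts H1 for a patch of true strength `elo`.

    Treats the log-likelihood ratio (LLR) as a Brownian motion.
    """
    lower = math.log(beta / (1.0 - alpha))   # LLR bound at which the test fails
    upper = math.log((1.0 - beta) / alpha)   # LLR bound at which the test passes
    # 2 * drift / variance of the LLR; zero at the midpoint of the bounds.
    theta = 2.0 * (elo - (elo0 + elo1) / 2.0) / (elo1 - elo0)
    if abs(theta) < 1e-12:
        # No drift: the pass probability depends only on the two LLR bounds.
        return -lower / (upper - lower)
    num = 1.0 - math.exp(-theta * lower)
    den = math.exp(-theta * upper) - math.exp(-theta * lower)
    return num / den

print(f"{sprt_pass_probability(0.0, -1.0, 3.0):.3f}")        # pass chance at 0 Elo: 0.187
print(f"{1.0 - sprt_pass_probability(2.0, -1.0, 3.0):.3f}")  # fail chance at +2 Elo: 0.187
```

Under this approximation the two numbers are equal by symmetry, because 0 and +2 are equally far from the midpoint of the [-1, 3] bounds.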

With new bounds negative scaling will not be a problem, closing.
