Stockfish: Big regressions since Thu Oct 22?

Created on 1 Nov 2020 · 9 Comments · Source: official-stockfish/Stockfish

Hi, it seems that all the latest versions of SF (since Thu Oct 22) are regressions.

https://m.nextchessmove.com/dev-builds

All 9 comments

Our regression test just finished with good results: https://tests.stockfishchess.org/tests/view/5f9e7c786a2c112b60691e39
Note that essentially all results at NCM are equal to the expected result, within error bars.

It appears likely that the runs at NCM which hit around +370 were "fluke highs". If this is true, it'd be possible for "fluke lows" to occur as well.

For completeness/interest/information (and if fishtest isn't busy), it'd be interesting to run an equivalent "NCM run" on fishtest to check that "net contempt" isn't also playing a role, as this may have implications for the future.

Ahh yes, so this patch gained 9 Elo https://nextchessmove.com/dev-builds/258af8ae44fc15407996e0a21a80ee8b9cfa12cb
Maybe we should complain about too much Elo gained for non-functional patches :)

Also the next patch is very likely a fluke too https://nextchessmove.com/dev-builds/2046d5da30b2cd505b69bddb40062b0d37b43bc7

You have to admit that it's quite disturbing to see a second site come to the same conclusion.

https://www.sp-cc.de/index.htm

-6 Elo for SF 201028 vs. Stockfish 201022.

So all those sites should stop testing, then?

Non-functional patches don't mean no Elo improvement. They are very often improved and optimized C++ code, giving a boost to speed in terms of nps and hence some Elo points. Logical.

Pohl's test is well within the error bars, which he clearly quotes. However, he often makes seemingly definitive statements about "progress" and "regression" that the statistics he provides sometimes don't support.

No one here is implying that "those sites" should stop testing. However, we all need to be careful in how we interpret the data.
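To make "within error bars" concrete, here is a minimal sketch of how an Elo difference and an approximate 95% interval can be estimated from a match result. It assumes independent games, the usual logistic Elo model, and a normal approximation; the function names and the win/draw/loss counts are illustrative only, not taken from NCM, SPCC, or fishtest.

```python
import math

def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error(wins, draws, losses, z=1.96):
    """Return (elo, low, high): point estimate and an approximate 95% interval.

    Assumes independent games and a normal approximation of the mean score.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the outcome (1 for a win, 0.5 for a draw, 0 for a loss).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))

# Hypothetical 1000-game match that ends dead even.
elo, low, high = elo_with_error(wins=300, draws=400, losses=300)
print(f"Elo {elo:+.1f}, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With these made-up numbers the interval is roughly 17 Elo on either side of zero, so a single-digit swing between two builds measured over a match of this size would not distinguish progress from regression.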

> Non-functional patches don't mean no Elo improvement. They are very often improved and optimized C++ code, giving a boost to speed in terms of nps and hence some Elo points. Logical.

Except that the 9 Elo gain patch that SFisGOD quoted is truly non-functional. It's like an update to the authors list gaining 5 Elo, which has happened not infrequently!

It would be great to have a progression test against SF12 for the Stockfish 201022 patch. Then we would see whether there is a regression or not. But this will cost time. I have an idea: why don't we run every patch commit vs SF12 before a patch is accepted or rejected at STC + LTC? The "goal" (LLR and LOS) would have to be changed: beat the current master's progression vs SF12.
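For readers unfamiliar with LOS (likelihood of superiority), here is a minimal sketch using a common normal approximation that depends only on the win and loss counts. This is an illustration under that assumption, not necessarily how fishtest computes it, and the counts below are hypothetical.

```python
import math

def los(wins, losses):
    """Approximate likelihood of superiority from decisive games only.

    Uses the normal approximation 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))).
    Draws do not enter this approximation.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical result: 120 wins and 100 losses (draws ignored).
print(f"LOS = {los(120, 100):.3f}")
```

An LOS near 0.5 means the result says little about which side is stronger; values close to 1 or 0 indicate stronger evidence in one direction.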

@rconstant42 Regressions are very rare. You can look at this FAQ entry on why tests sometimes show regressions.

https://github.com/glinscott/fishtest/wiki/Fishtest-faq#why-is-the-regression-test-bad

It is a bit fishtest-specific, but most of the principles apply to external tests as well.

I'm closing this. I think the regression test shows things are fine, and the results at NCM and elsewhere are equal to the expected numbers, within the error bounds.
