Suppose there's an additional test. Let the new version play a number of games against Komodo, Lc0, and Houdini, and record the points it scores. Then let master play against the same three, and compare its points with the new version's. If the new version scores more wins and fewer losses, it should definitely be accepted. Otherwise, whether to accept it should be reconsidered.
Such a test would, apparently, directly improve SF's competitiveness in tournaments such as CCC or TCEC.
I'll gladly install Houdini and Komodo on my testing rig if you pay for them.
I think instead of Houdini or Komodo we should test against Leela, since she's apparently quickly becoming our biggest competitor as well as being an NN engine.
This is really really really useless stuff.
1) Throughout the history of SF testing there has always been clear evidence that gaining Elo in self-play -> gaining Elo vs. any opponent. And the number is the same (if you take error bars into account). No sufficient proof of anything else was ever provided by anyone.
2) Testing vs. other engines will require a much bigger number of games because the error bars will be doubled. And it will provide basically zero additional data but will load fishtest with completely useless tests. Not to mention that you need to buy commercial engines/have a GPU for Leela.
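To make the error-bar argument in point 2 concrete, here is a rough sketch in Python. It is not how fishtest computes anything; the W/D/L counts are made up for illustration, the function names are hypothetical, and it assumes independent games, the standard logistic Elo model, and a normal approximation for the 95% interval. It compares the error bar of a direct self-play match with the error bar of an Elo difference inferred from two separate matches against a third engine.

```python
import math

def match_stats(wins, draws, losses):
    """Return (mean score, standard error of the mean score) for one match."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score, which takes values 1, 0.5 or 0.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)

def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error(wins, draws, losses, z=1.96):
    """Approximate 95% half-width in Elo (delta method on elo_from_score)."""
    mean, se = match_stats(wins, draws, losses)
    slope = 400.0 / (math.log(10) * mean * (1.0 - mean))
    return z * se * slope

# Made-up numbers: 1000 games per match.
# Self-play: new vs. master, measured directly.
selfplay_err = elo_error(260, 500, 240)

# Gauntlet: new vs. opponent and master vs. opponent, measured separately.
# The difference of two independent estimates adds their variances.
new_err = elo_error(260, 500, 240)
master_err = elo_error(250, 500, 250)
gauntlet_err = math.sqrt(new_err ** 2 + master_err ** 2)

print(f"self-play: +/- {selfplay_err:.1f} Elo from 1000 games")
print(f"gauntlet:  +/- {gauntlet_err:.1f} Elo from 2000 games")
```

Under these assumptions the gauntlet needs twice the games and still ends up with a wider interval on the new-vs-master difference. The exact factor depends on draw rates and on how close in strength the opponent is, so treat the numbers as an illustration of the direction only.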
@adentong Why not gate the promotion of Leela nets with matches versus Stockfish instead? Every machine there is perfectly capable of running Stockfish, and it is a more reliable source for measuring performance for whoever is interested.
This suggestion adds complexity, randomness, noise, confusion and costs where none is needed. The history of Elo gains over the last 10 years by Stockfish is unmatched by any other traditional A/B chess search engine. I do not believe there will be any tangible benefits from the suggestion that would increase Elo at a faster rate than what SF gains now. Just my $.02, YOMD (your opinion may differ).
Perhaps for endgame "start position" experimentation, if the opponent is similar in strength to SF (either before or after a patch), there may be some value in this. But in general I would expect noise to greatly increase.