Stockfish: New discussion regarding testing bounds, testing book, etc.

Created on 26 Aug 2019  ·  178 Comments  ·  Source: official-stockfish/Stockfish

There has been some discussion in https://github.com/official-stockfish/Stockfish/pull/2260 about the test bounds, how to optimize for resource usage and how to verify scaling and whatnot. So I opened a new issue for people to discuss. Also relevant is potentially switching to pentanomial statistics, which increases throughput by 10 to 15%, or so I've read in one of the discussions here.

Most helpful comment

By the way.
What about parameter tweaks?
Now we have them being harder to pass in both STC AND LTC which seems to be completely illogical.
Maybe unify bounds for this?

All 178 comments

Also since we're mostly done with the tuning patch now I propose we close https://github.com/official-stockfish/Stockfish/pull/2260.

I was hopeful about the new test bounds, but with hindsight:

  • the tests take much longer (obviously)
  • it seems there is an increased chance of merging patches with poor scaling with time (not sure how significant this issue is, and it will always be possible, but I think it is more likely now than under the old scheme)
  • I'm not sure we have gained anything with the new bounds

Overall, it seems the tests take longer with no gain.

Various tests since the change indicate that ltc is very different from stc in some areas, to such a degree that stc is not predicting the ltc performance. If stc was slightly longer we would benefit from stc being more representative of ltc performance, e.g. see these figures from the patch from the first big search tune:

10+0.1 th 1: (-3.7 Elo +/- 5.3)
LLR: -2.96 (-2.94,2.94) [0.00,4.00]
Total: 10152 W: 2237 L: 2362 D: 5553

20+0.2 th 1: (+1.8 Elo +/- 3.0)
LLR: 0.57 (-2.94,2.94) [0.00,4.00]
Total: 21250 W: 4359 L: 4249 D: 12642

60+0.6 th 1: (+5.1 Elo +/- 4.0)
LLR: 2.95 (-2.94,2.94) [0.50,4.50]
Total: 12954 W: 2280 L: 2074 D: 8600

180+1.8 th 1:
ELO: 12.52 +-2.6 (95%) LOS: 100.0%
Total: 19291 W: 3119 L: 2424 D: 13748

Requiring a pass at stc would have stopped us testing this at ltc, but it was positive at 20+0.2.
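The LLR and Elo figures quoted above can be reproduced from the W/L/D counts. The following is a minimal sketch, not fishtest's actual code: it assumes the SPRT bounds are expressed in BayesElo units with the draw parameter estimated from the sample, and that alpha = beta = 0.05 (which gives the ±2.94 thresholds shown). All function names are illustrative.

```python
import math

def bayeselo_to_proba(elo, drawelo):
    """Win/draw/loss probabilities under the BayesElo model (assumed here)."""
    win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return win, 1.0 - win - loss, loss

def estimate_drawelo(wins, losses, draws):
    """Draw parameter fitted to the observed win and loss frequencies."""
    n = wins + losses + draws
    pw, pl = wins / n, losses / n
    return 200.0 * math.log10((1 - pl) / pl * (1 - pw) / pw)

def sprt_llr(wins, losses, draws, elo0, elo1):
    """Log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    drawelo = estimate_drawelo(wins, losses, draws)
    w0, d0, l0 = bayeselo_to_proba(elo0, drawelo)
    w1, d1, l1 = bayeselo_to_proba(elo1, drawelo)
    return (wins * math.log(w1 / w0)
            + draws * math.log(d1 / d0)
            + losses * math.log(l1 / l0))

def sprt_thresholds(alpha=0.05, beta=0.05):
    """Lower (fail) and upper (pass) stopping thresholds for the LLR."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

def logistic_elo(wins, losses, draws):
    """Elo estimate from the average score, using the logistic model."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return -400.0 * math.log10(1.0 / score - 1.0)

# Counts from the 20+0.2 run quoted above, tested with bounds [0.00, 4.00].
llr = sprt_llr(4359, 4249, 12642, 0.0, 4.0)
lower, upper = sprt_thresholds()
elo = logistic_elo(4359, 4249, 12642)
print(f"LLR {llr:.2f} ({lower:.2f}, {upper:.2f}), Elo {elo:+.1f}")
```

Under those assumptions the 20+0.2 counts give an LLR of about 0.57 and an Elo of about +1.8, in line with the figures quoted for that run.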

I suggest we relax the bounds in some way to make stc tests quicker again, and use the extra time made available to increase stc a little.

Note: Perhaps for now the focus should be on Elo gaining patches for TCEC. We could discuss this for a while but leave any testing until the Premier Division is under way (or even the Superfinal, if we make it this time).

the tests take much longer (obviously)

Yes.

it seems there is an increased chance of merging patches with poor scaling with time (not sure how significant this issue is, and it will always be possible, but I think it is more likely now than under the old scheme)

I agree, I made the same observation months ago.

I'm not sure we have gained anything with the new bounds

With the STC ones, I'm not sure either.

But here are elo gainers which have been merged in SF since June with an LTC test perf below what would have been needed with the old bounds :

#2273, #2266, #2252, #2246, #2233, #2207, #2205, #2199, #2192, #2185, #2183

And there are many more in earlier months since the new bounds have been introduced, of course.

The new LTC bounds made LTC tests often last longer, but helped overall getting a bunch of elo gainers in.

Various tests since the change indicate that ltc is very different from stc in some areas, to such a degree that stc is not predicting the ltc performance. If stc was slightly longer we would benefit from stc being more representative of ltc performance

Yes, I agree.

I suggest we relax the bounds in some way to make stc tests quicker again, and use the extra time made available to increase stc a little.

15+0.15 is probably a more realistic increase for STC than 20+0.2 ; but even then, I don't think you can avoid a resource usage hit.

I doubt you can reduce the game numbers enough to compensate for 50% TC (the old [0, 5] bounds wouldn't be enough).

If you also wish to address the issue of STC being the main obstacle to clear instead of LTC, you also need something less tough to get a green from, so there will be the resource usage of the associated LTC tests to account for (but as many of those are already frequently run as speculative LTC, only in an inconsistent way, I'm not sure it really qualifies as a resource usage increase).

One theoretical idea I had a few months ago was to use 3 stages instead of the current 2, using laxer bounds and overall dropping resource usage (simulations using vondele's tool with some tweaks showed an increase in elo gain per resource used), but the impact on scaling behavior was dubious.

I also think we could try to relax the STC bounds (and _maybe_ compensate by increasing the STC a bit to keep some part of the filtering quality). When I arrived in the project they were [-1.5...4.5]. It was great, everybody got greens STC all the time and was super happy, it gave momentum to the community.

I was not aware that STC bounds were like that previously !

This actually makes a lot of sense when considering that the core role of STC is to filter promising patches to save on resources, with only a moderate part in adding confidence.

Maybe STC bounds like [-1, 4] or something similar could work out well.

With noobpwnftw being able to contribute periodically thousands of cores, we could maybe bump STC to 15+0.15 to address the concern of scaling issues ; limit tests in the normal periods with a few hundreds cores to only STC (LTC, tunes, regression tests being submitted, but set to low prio) ; then periodically have the core surge take care of clearing up the queue.

For information, here is how SF from early may scaled with TC on my computer (I didn't standardize the TC to fishtest's 1.6Mnps, but it's not too far off) :

   # PLAYER            :  RATING  POINTS  PLAYED   (%)
   1 Stockfish_90+0.9    :  3147.5   886.5    1200    74
   2 Stockfish_60+0.6    :  3109.1  1682.0    2400    70
   3 Stockfish_40+0.4    :  3053.6   760.0    1200    63
   4 Stockfish_25+0.25   :  2981.5  1205.5    2400    50
   5 Stockfish_15+0.15   :  2891.7   864.5    2400    36
   6 Stockfish_10+0.1    :  2816.7   601.5    2400    25
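One way to read this table is as Elo gained per doubling of thinking time. The sketch below simply divides each rating gap by the base-2 logarithm of the time ratio between adjacent rows; the ratings are copied from the table, everything else (names, the per-doubling normalisation) is an illustrative assumption, and the values inherit whatever noise the rating list has.

```python
import math

# (base time in seconds, rating) copied from the table; increment is base/100.
ratings = [
    (10, 2816.7),
    (15, 2891.7),
    (25, 2981.5),
    (40, 3053.6),
    (60, 3109.1),
    (90, 3147.5),
]

def elo_per_doubling(rows):
    """Rating gain between adjacent time controls, normalised per doubling."""
    steps = []
    for (t0, r0), (t1, r1) in zip(rows, rows[1:]):
        steps.append((t0, t1, (r1 - r0) / math.log2(t1 / t0)))
    return steps

steps = elo_per_doubling(ratings)
for t0, t1, gain in steps:
    print(f"{t0}+{t0 / 100:g} -> {t1}+{t1 / 100:g}: {gain:.0f} Elo per doubling")
```

With these numbers the gain per doubling shrinks steadily as the time control grows, which is the diminishing-returns pattern the table suggests.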

There are different discussions on this topic and it is not easy to give a right opinion without separating them :

1- using STC as first test filter for patches : in most cases, I think that it is OK. For special patches with depth dependency, we can admit that LTC and STC behave differently. If it is justified in these cases, I think that the best choice is to run a speculative LTC and cut the run if there is no improvement after 10k or 20k games.

2- current bounds and difficulty of getting green tests : in general STC passes with +1.8 Elo and LTC with +1 Elo. It seems OK to me, but if we cannot find a patch for several weeks, perhaps it is a good idea to change the bounds ... One idea anyway : the current parameter tweak bounds [0..4] are too high for me. If the parameters are better and we are quite sure of it, we should change them even for +0.5 Elo.

3- long test queue : this issue is connected to the amount of resources we have. If we have >3000 cores, the current bounds are good. If we have <1000 cores, they are too strict and we are losing time while probably good ideas wait behind almost surely bad ideas. In this last case, [0..5] STC seems better to me ...

@Alayan-stk-2
Here are my selections of test positions:
https://www.chessdb.cn/downloads/2moves.zip
https://www.chessdb.cn/downloads/3moves.zip
https://www.chessdb.cn/downloads/4moves.zip

The number of moves is in fact irrelevant, and I have decided not to remove "drawish" lines since I think doing so compromises diversity.

Some tests may have shown that mine have more sensitivity than the improved 2moves book, but with fewer positions and a higher draw rate. So now there are more positions, and I don't think a higher draw rate is a bad thing (engines need to be tested on not losing those drawn positions) as long as sensitivity also improves.

Those books need to be thoroughly re-tested though. I suggest that after local verification we put them on fishtest and run a few regression tests.

EDIT:
Links updated.

Here is a concrete suggestion concerning bounds and TC (leaving the book topic out for now) :

TC

STC duration : 15s + 0.15 instead of 10s+0.1.

We've seen on multiple occasions, especially with xoto's tune, that changing parameters, all the more adding search or eval features, can produce significantly different results depending on the TC.

While this doesn't apply equally to all patches, it is often hard to predict how much it matters, hence the point of a general increase.

Raw nps is still a major concern at 15s+0.15, but behavior is closer to LTC, which in turn should reduce false positives and negatives.

LTC duration : 60s +0.6

It would be great to be able to start tuning and tweaking SF with a longer LTC, but this would be a bigger commitment in resources. If we can keep those sweet 5000 cores at all times, this may be a possibility to explore, but it is better to proceed progressively; being able to test tons of ideas without clogging the queue is nice too.

1. Elo gainers

For the bounds, I suggest this :

STC : [-1, 4]

LTC : [0, 3.5]

This doesn't make a difference between param tweak and patches adding code, because param tweaks are less risky and don't add code complexity, but it's also easier to come up with param changes that are about elo neutral and that would go for extremely long LTCs with tighter SPRT bounds.

LTC tune results should always have a shot at LTC SPRT. This is already the unofficial practice because it only makes sense, but it should be acknowledged as legitimate.

2. Simplifications

Same STC and LTC duration as elo-gaining patches.

Two types of bounds :

  • Minor/trivial simplification. Examples : removing a multiplication or division in some formula, removing a "Them" or "Us" in a pos.pieces function call, etc. These don't really improve maintainability, and usually affect small enough things that they probably don't give more than +0.5 elo by themselves. Hence the need to guard against regressions, even at the cost of rejecting some of these simplifications that would actually not lose any elo.

STC : [-2.5, 1.5]
LTC : [-1.5, 2.5]

(If this is too complicated, just [-2, 2])

  • Major/significant simplification. Almost always remove several lines of code. Here a small regression is acceptable to limit eval/search features proliferation and enable easier elo gains later on. It should still be hard enough to pass that a patch having recently passed [0, 3.5] is unlikely to be simplified right away.

STC : [-3.5, 1]
LTC : [-2.5, 1]

(If this is too complicated, just [-3, 1])

3. Non-functional changes

These don't need as much care, as it's only a matter of code clarity and speed. In many cases, testing is not needed.
When required :

  • Unless depth is relevant, 10+0.1 is fine to evaluate speedups/slowdowns. No LTC needed.
  • For speedups : [0, 2.5]. The TC is short and there is no need to run a LTC afterwards, we can afford a longer STC for accuracy.
  • For simplifications : [-1.5, 1.5]. Same logic.

I just have one question:
why are accurate STC tests necessary when people still run "speculative" LTC and VLTC tests anyways?

It is all about how long people must wait to see their test results. Previously things were doing fine; then the bounds changed and the overall progress more or less stalled. Now, given 5x the amount of resources, there is some progress and it doesn't look like the queue will run dry very often. Why is that? Basically it discouraged people from queuing up their tests, and the extra accuracy doesn't justify the amount of resources it takes.
I wonder how many CPU hours must be wasted in order to prove that good looking paper theory is an utter failure in practice.

Spirit of fishtest: "I think my change makes perfect sense!" while most of them failed their tests.

I just have one question:
why are accurate STC tests necessary when people still run "speculative" LTC and VLTC tests anyways?

The only accurate STC in this proposal is for non-functional speedups. Regular elo-gainer STC go from [0.5, 4.5] to [-1, 4] which is less precise and demanding, and should eliminate altogether the regular use of spec. LTC (except when people want to try some high-depth-only stuff). It's too hard to get a green STC right now.

I agree that we don't want tests results taking forever to come or the queue filling up and discouraging new attempts.

Anyway, one major thing which would help is flexibility.

It's too hard to get a green STC right now.

Then we should start from right here, the rest will solve by itself.

The STC is meant to save resources but it seems to be wasting them, as not only is it too hard to pass, but the patches that pass it also need to pass LTC, which is a much different environment (we saw how a -elo STC can be a +12 elo VLTC). Passing both has become very rare, to the point of thinking we would be better off even by testing on LTC only.
On the other hand there are many promising ideas, but it's very difficult to guess the right values or formulas, requiring many tries. It's unrealistic to think we have the computing power to support LTC only.

Most of the progress came from tuning, SPSA with multiple parameters is doing wonders (obviously human intuition is inferior to result-derived values). And we see that only LTC tuning is working, STC tuning always fails. If we could afford VLTC tuning it would surely be awesome for TCEC TC.

So instead of keeping this nitpicking scheme of thorough testing of 1 change at a time, why not go grand scale: introduce holistic strategy design models (or just a combo of promising ideas), VLTC tune & test them, simplify them afterwards.

For improvement of the current scheme I am on the same page with Alayan and noobpwn. For code adders STC [-1,4] 15" sounds good but why not take it one more step and go [-2,5] 20", transferring accuracy to quality. This will reduce the misfortune of filtering out good scalers and the confidence will be derived solely from LTC anyway.

The conventional, and potentially outdated, wisdom was that evaluation changes that do not impact search could be tested at shorter time controls (say STC and LTC), and anything that touches search should ultimately be tested at VLTC or something longer than LTC. Perhaps bifurcate the testing parameters, so that eval changes are tested at STC and LTC and search changes (which we would need to define) are tested at slightly longer time controls: I would think 30+0.3 sec and 120+1.2 sec, or just something a little longer than standard. We need to stop running speculative patch runs, but that also requires us to get more greens on the first pass so we don't miss opportunities. A general observation: no test should run more than 100K or 150K games max (pick a number, but I have seen 200K runs, and that is simply not efficient). If we loosen the bounds a tad, then we can add a requirement that it almost pass in a certain number of games. I'm not an expert in this stuff, but some people here are, and it definitely looks like it needs some tuning: make it easier to pass the first test, reduce max games to xxxx, and test some changes at longer than 60+0.6, especially those dealing with search extensions, LMR reductions, etc., so we don't miss it when the search explodes at depth 35 or fail a test that does really great at VLTC. And I say all this, but I'm not saying it is worse than what it was before; as others have mentioned, it just hasn't quite lived up to expectations. Our 10+ year history of making steady Elo gains is still intact!

As I said before, all depends on fishtest resources. If we have 3000+ cores we can try a longer TC. With less than 1000 cores, it will be very hard.

Just another observation : currently, to pass STC we need 0.4% more wins than losses. This supposes that the patch is triggered at least 1% of the time to have an effect. So all special-position patches (like for the French defense, shuffling or special passed pawn configurations etc ...) will be very hard to pass. So lowering the STC criteria is not a bad idea, since we need to improve these special positions.

I also agree with MichaelB7 that playing 150k+ games is not so useful. Perhaps there are parameters in SPRT that avoid these situations (other than bounds).
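To put the "0.4% more wins than losses" figure in Elo terms, here is a small sketch of the conversion between an Elo difference and the win-loss margin. It assumes the logistic Elo model and reads the margin as (wins - losses) / games; the function names are illustrative.

```python
import math

def elo_to_score(elo):
    """Expected score (wins + draws/2) / games under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def score_to_elo(score):
    """Inverse of elo_to_score."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def win_loss_margin(elo):
    """(wins - losses) / games implied by an Elo difference."""
    # score = 0.5 + margin / 2, whatever the draw rate is.
    return 2.0 * elo_to_score(elo) - 1.0

def margin_to_elo(margin):
    """Elo difference implied by a (wins - losses) / games margin."""
    return score_to_elo(0.5 + margin / 2.0)

print(f"0.4% margin -> {margin_to_elo(0.004):.2f} Elo")
print(f"+1.8 Elo    -> {100 * win_loss_margin(1.8):.2f}% margin")
```

Under this model a 0.4% margin corresponds to roughly 1.4 Elo, and the +1.8 Elo typical STC pass mentioned earlier in the thread corresponds to a margin of about 0.5%, so the two figures are of the same order.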

There seem to be a number of people wanting to ease the stc elo gainer bounds, either to the previous [0,5] or, as Alayan even suggests, [-1,4]. Does anyone disagree?
Can we just change the stc bounds alone, or should we adjust ltc at the same time?

I think LTC bounds are fine, except for the incoherence between param tweaks and code-adders.

LTC bounds are fine for me too.
We can try STC [-1,4] as suggested, and for parameter tweak also.

I agree with all, and also propose LTC for parameter tweaks [-1,4].
They require periodical retuning, are harmless, and smaller gains help too. Can't find any reason to have as high bar as code adders and also to channel a lot of resources for accuracy.
Also, for 15" STC any objection?

Since [0, 3.5] bounds are quite resource-intensive, and retuning is something that must be done periodically, so that any worse-than-expected change would most likely get corrected the next time the value is tuned, there is an argument for going [-0.5, 4] if not [-1, 4] for param tweaks. However, such bounds are much more vulnerable to someone testing 10 minor variants and getting one through by pure luck though it brings nothing.

For the "not quite good enough" param tweaks, there is already the possibility of using combo patches. A mindful use of combo patches should work well enough. By mindful I mean e.g. for two tune results ; or for a tune result and one of a series of promising hand tweaks ; but not for random manual tweaks that got a yellow but which were accompanied by several very similar tweaks that all failed and are most likely lucky results.

Now, if two param tweaks go to 150K LTC while they would have passed much quicker in a combo or with laxer bounds, we still get a waste of fishtest resources... So there is room for improvement.

With noob's regular 9000 cores hardware injection, it may be useful to have a policy of putting LTCs on low prio so that the regular ~1000-1200 cores can be used for still getting quick STCs result.

I can see no drawback in occasionally accepting a param tweak which brings nothing. Why would it hurt? Code complexity is the reason we do resource-intensive high-confidence LTC, and we are willing to sacrifice tiny elo to simplify it. This should reflect on our bounds; using the same [0,3.5] makes no sense whatsoever. On the other hand, with [-1,4] we will both catch more small elo gainers and save resources. The faster resolution will allow more tests. Combos are currently a necessity, but it's also like admitting we are using too hard bounds. 3 long tests are used when 2 shorter ones would suffice. Also, it's statistical sloppiness to use yellows as a precondition for combos; it's like asking an expensive question when we care about a different answer. Some reds would pass the question we care about, and all yellows and greens would pass it much earlier.

The only risk I see with [-1,4] is if people, sweetened by the taste of greens and yellows, get addicted to gambling with param tweaks, thus diverting the focus from the more essential code category. More self-discipline would be required, it's gonna be fun.

@snicolet So, what do we do from here ?

• I have created a new repository in the official-stockfish github site, so that we can store new books there for testing purposes: https://github.com/official-stockfish

• about the new bounds, I had the curiosity to count the speculative LTC in the last 100 LTC tests for Elo gaining bounds submitted to fishtest: as of October 19th, the percentage of speculative LTC was 70%, from all active Stockfish developers.

I think that this gives some feedback after a few months for https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445429885, https://github.com/official-stockfish/Stockfish/issues/1859, https://github.com/glinscott/fishtest/pull/342, and shows that the STC new bounds were too strict for our community since even the most motivated members bypass them.

I shall open a pull request for [-1..5] bounds for STC.

I'm happy about this initiative.

The only slight fear I have is that [-1, 5] proves too wide (I've seen my share of very lucky & unlucky runs), but as the lowerbound gets lower the actual risk of a solid gainer being rejected doesn't really increase. Vondele's tool indicates that [-0.5, 4.5] wouldn't change much in the end.

Some stats with vondele's tool and its assumed patch elo distribution compared to strict rules now (i.e. no spec LTC) :

  • +86% patches applied with the STC change, with a predicted +61% elo
  • +42% likelihood of a +1 elo gainer passing
  • +0% for a +2 elo (beyond 2 elo, it gets lower than now)
  • +36% testing cost

In practice, as spec LTC are already very common, there probably won't be a testing cost increase, and the number of applied patches won't increase nearly as much either, but it should be more consistent.

A very positive change which saves a lot of wasted STC resources (lengthy STC + spec LTC).
The framework will operate much faster, and I suspect that humans instead of hardware will be the bottleneck, enabling a natural transition to higher TC STC. But let's first see that in practice.

Note that I would be fine with either [-1 , 4] , [-1 , 4.5], [-1 , 5] or even [-1.5 , 4.5] bounds.

cc @mcostalba @vondele @vdbergh @Chess13234 @Vizvezdenec

I'm okay with whatever bounds you like tbh.

I think we should indeed reduce the threshold for passing a patch, both stc and ltc, but I think it is a mistake to make the interval wider. Making the interval wider just says that we like to have more noise. [-1, 5] is like [0, 4] with low confidence.
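The point about wider intervals can be illustrated with Wald's classical approximation for the pass probability of an SPRT. This is a sketch under simplifying assumptions, not the code of the fishtest simulator: the true Elo of the patch is expressed in the same units as the bounds, overshoot at the stopping thresholds is ignored, and alpha = beta = 0.05. The function name is illustrative.

```python
import math

def sprt_pass_prob(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate probability that an SPRT with bounds [elo0, elo1] passes."""
    lower = math.log(beta / (1 - alpha))   # LLR threshold for failing
    upper = math.log((1 - beta) / alpha)   # LLR threshold for passing
    # Wald's exponent: +1 at elo0, 0 at the midpoint, -1 at elo1.
    h = (elo0 + elo1 - 2.0 * true_elo) / (elo1 - elo0)
    if abs(h) < 1e-9:
        # Driftless random walk: the exit side depends only on the distances.
        return -lower / (upper - lower)
    return (1.0 - math.exp(h * lower)) / (math.exp(h * upper) - math.exp(h * lower))

for elo in (0.0, 1.0, 2.0, 3.0, 4.0):
    narrow = sprt_pass_prob(elo, 0.0, 4.0)
    wide = sprt_pass_prob(elo, -1.0, 5.0)
    print(f"true Elo {elo:+.1f}: [0, 4] {narrow:6.1%}   [-1, 5] {wide:6.1%}")
```

Both intervals share the midpoint 2, so a patch worth exactly 2 passes half of the time in either case. The wider interval has the flatter curve: in this approximation a neutral patch passes about 12% of the time with [-1, 5] against 5% with [0, 4], and a patch worth 4 passes less often.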

@snicolet I'd rather shift the current STC bounds by 1 elo, e.g. [-0.5, 3.5] and the LTC upper bound by 0.5 to [0, 3].

@snicolet
Let's make a poll with only active users participating and 2 or 3 choices and then decide :-)
For example poll STC
[-0.5,3.5] or [-1,5] or [-1,4]

As we have seen, a lot of stuff behaves differently at different TCs. This is why I consider STC confidence untrustworthy, expensive and slowing down the tempo. Our high LTC confidence ensures quality. On the other hand we should not miss unlucky good patches due to wide STC intervals. By reducing the threshold in parallel we accomplish 3 things:

  1. Ensuring that even unlucky good patches will make it
  2. Saving a lot of STC resources
  3. Allowing a random selection of lucky STC runs to be tested at LTC

With the strategy to eliminate spec LTC, we save a lot of resources but lose its most valuable asset, to catch scalers which are weak at STC. Point 3. will allow some without extra cost.

@snicolet Hence out of the suggested bounds [-1.5 , 4.5] seem to be the most suitable for eliminating spec LTC. There is also fishtest experience of those from early years. [-1 , 4] is also good.

I'll post again the link to the SPRT optimization tool, so one can experiment a bit:

https://mybinder.org/v2/gh/vondele/Stockfish/tools?filepath=tools%2FFishtest_SPRT_opimization.ipynb

concerning speculative LTC, it is really for the patch authors to show some discipline. In a few cases, there are good reasons to assume some TC dependence, but actually this is less common than what is often claimed.

@vondele : your tool seems very interesting. Unfortunately I am not familiar with it and can't find the way to use it.

After following some links, I have found a comparison that Alayan made some months ago using the tool
https://github.com/official-stockfish/Stockfish/issues/1859#issuecomment-453751997

We can see that only changing STC to [-0.5,4.5] significantly increases the chance of +1 and +1.5 Elo patches passing, while the probability of +0 Elo patches passing is still almost null. The average test cost is higher, probably because more LTCs are run, but I think the simulator didn't take the spec LTCs into account :-)
There is also a proposal for 3-stage or 4-stage tests, but it is perhaps too complicated, at least compared to the current state.

In any case, if the final goal is to keep SF progressing, it seems obvious to me that we should accept more +1 Elo and +1.5 Elo patches, because +3 and +4 patches are more and more rare. We have to make SF progress in special positions which are present only 2 or 3 times in hundreds of games.

@MJZ1977 to use it, you can input bounds to be used for STC and LTC in input cell 8 (_proposed), and evaluate the full notebook (see kernel menu). The pass probabilities are shown in the graphs as a function of Elo of the patch, and various other related quantities are computed as well. The notebook still refers to the old [0,5] bounds for 'now' (this could be fixed editing input cell 5).

I do agree that we need to make sure that 1 Elo patches have a reasonable passing rate.

As @vondele just said, the tool is rather easy to use. Edit the bounds in the relevant input cells, and you're done. You can also add additional data points in the cells towards the end, I did so when I did my table.

Here are results using [0.5, 4.5] + [0, 3.5] as the reference.

Assuming 1 STC try per patch:

| Limits | [0.5, 4.5] + [0, 3.5] | [0,5] + [0,5] | [-0.5,4.5] + [0,3.5] | [-1, 4] + [0,3.5] | [-0.5, 3.5] + [0, 3] | [-1, 5] + [0, 3.5] |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| -0.5 ELO pass prob | 0.0091% | 0.037% | 0.041% | 0.072% | 0.030% | 0.069% |
| 0 ELO pass prob | 0.123% | 0.25% | 0.433% | 0.729% | 0.495% | 0.616% |
| 0.5 ELO pass prob | 1.407% | 1.524% | 3.726% | 5.846% | 6.082% | 4.534% |
| 1 ELO pass prob | 10.02% | 7.439% | 19.28% | 27.08% | 33.14% | 20.56% |
| 1.5 ELO pass prob | 34.22% | 24.60% | 48.31% | 59.84% | 69.04% | 47.37% |
| 2 ELO pass prob | 64.38% | 51.35% | 73.20% | 82.01% | 88.60% | 69.67% |
| 2.5 ELO pass prob | 85.04% | 74.77% | 87.63% | 92.44% | 96.11% | 83.70% |
| 3 ELO pass prob | 94.56% | 88.59% | 94.61% | 96.87% | 98.70% | 91.64% |
| total ELO gain ratio | 1.0 | 0.758 | 1.570 | 2.054 | 2.391 | 1.611 |
| -0 ELO acceptance ratio | 0.0091% | 0.025% | 0.036% | 0.061% | 0.034% | 0.054% |
| Avg. STC cost | 24456 | 18431 | 20586 | 22836 | 31590 | 15897 |
| Avg. total Cost (in STC games) | 38039 | 27931 | 50640 | 67677 | 83276 | 51844 |

What this table lacks : a proper simulation of how speculative LTC affect elo gaining and resource usage at fishtest, right now. I wouldn't be surprised if the elo gain ratio with spec. LTC is around 1.5 or 1.6 already, but with a worse resource efficiency for this result.

Something that isn't taken into account in all those computations is how valid elo transitivity is for very small gains. I suspect (but can't prove) that 100 patches scoring each +0.1 against the previous master (assuming real +0.1, in practice we can't know the real value precisely enough) would usually give less elo than 10 patches scoring +1.0 against the previous master.

This all also assumes that a patch performs identically at all TC, which is an incorrect approximation. This is the main reason I'm skeptical about my own suggested 3-stage testing, besides the fact that complicating the process may create some difficulties. Otherwise, multi-stage testing is unbeatable in the simulations for elo/resource.

Another element that isn't evaluated : the STC results often guide further attempts. Hence, an inaccurate lucky result means more resources will be put into variants of a bad idea, while an inaccurate unlucky result may prevent further attempts to find a gainer. So, generally speaking, of two solutions with similar elo passing and resource usage characteristics, one with tighter STC bounds will be better in practice.

Also, the total costs ratio shouldn't be considered as the total resource usage of fishtest : LTC tunes, and less importantly regression tests, are also a significant part of the load.

@Alayan-stk-2 : very nice and clear table !
can you please add {STC [-0.5,4.5] + LTC [-0.5,3]} or something like this to see the effect of making a negative bound @LTC ?

Long STC runs will be awfully slow. We want to speed up the tempo, not lower it. Many versions of the same idea are required to find the sweet spot, and many people have many ideas at the same time. That's just impossible with 150K runs; it's common sense. I reckon @snicolet is fully in this direction, and @noobpwnftw also expressed his strong dissatisfaction with tests taking forever to terminate. Many long yellows were retested with [-1.5 , 4.5]; none passed, and the longest one was a 57K yellow. This indicates it's working well. The progress will come through the sheer number of tries. It's impossible and naive to judge the quality of an idea based on an STC result; there is no way of knowing if the idea is good/bad or if the guessed parameters for it are off.

Also the quality of 10+0.1 with such low depths is awful, and so often unrelated to LTC. Many tests pass the difficult [0.5 , 4.5] and fail the easier [0 , 3.5]. By upping the STC the results will be more meaningful. With a higher level of chess it's more likely that fixes for the spotted strategic flaws of SF show. It's a very resource-hungry environment, hostile to quality enhancements that require some extra computation.

The change to 15+0.15 is not expensive, the increased quality is more than worth it. Performance will be much more indicative of LTC one. The likes of [-1.5 , 4.5] , [-2 , 4.5] , [-2 , 5 ] , [-1.5 , 4 ] will enable a cheap, swift and continuous selection of high class positives out of a much wider spectrum of tries. In the end it does not matter if a few good ones got lost or a few "bad" ones get promoted. As long as the flow to LTC is kept steady the elo will be rising.

Regarding the LTC, at [0 , 3.5] it feels very solid and secure. It's very expensive but it's our highest quality confidence and final call, so it's money well spent. It makes sense to make it a tad easier, as everyone noted that progress now comes from small gains, but [0 , 3] would be the heaviest thing we ever saw, no way it could support an increased number of promotions. [-0.5 , 3.5] and [-0.5 , 4] are sensible options.

http://tests.stockfishchess.org/tests/view/5dac6b920ebc590eca43e237

0.03 LLR at 70K games with [-1.5 , 4.5]

Question: why is it so important to spend possibly another 150K games (with narrower bounds) to identify if it's closer to -1.5 or 4.5 elo? Obviously it's very near the middle, 1.5 elo. It's a close call, hence promoting it or not to LTC is utterly trivial. Alas, both options are fine and the biggest misfortune is extension of the test.

I don't want 70K games and still being undecided, this makes me think that [-1.5 , 4.5] is not wide enough

it's 30% wider than current test so it's obviously converging faster.

Question: why is it so important to spend possibly another 150K games (with narrower bounds) to identify if it's closer to -1.5 or 4.5 elo? Obviously it's very near the middle, 1.5 elo.

Not that I like 200K STC runs, but your assumption about the test being "obviously" near the middle is wrong. Plain wrong.

For the linked test which got a +0.96 perf when the test stopped yellow : according to the error bars, there is a 17% chance of it being below -0.25 elo and a 2.5% one of it being below 1.5 elo.

But in practice, we know that when a test has positive results, if these are wrong, it's much more often an overestimated bad test than an underestimated good test. This is because most tested patches are bad/neutral. So actually, the above odds are too optimistic for a random patch showing this performance.

A poor patch is significantly less likely to sustain a +1 elo performance over 150K or 200K games than 70K games.

Wide bounds also make it more likely for an actual good patch to fail, because wide bounds make it more noisy.

That said, because of different behaviors at different TC, I'd prefer not being too strict on STCs, this is the issue the change is supposed to fix after all. I'd like to see [-1, 4] @ 15+0.15. The higher STC would increase testing cost by 15%, which considering the amount of tunes is probably less than 10% resource usage.

@Vizvezdenec Yes obviously, and I am happy to see that it terminated at 75K while being 0LLR at 70K, which is a very good sign for (-1.5 , 4.5) tempo.

@Alayan-stk-2 Yes I agree with the data of your analysis but not so much with the interpretation.
You say that "it's much more often an overestimated bad test than an underestimated good test"
In this context "good" and "bad" are relative to the target. We could for instance consider any patch which is above -0.5 elo at STC as a good candidate for LTC. In our strategy, by setting the middle ground at +1.5 elo STC, it is a given that most of the tested samples that pass will be lower than that, due to them being more common. But that is perfectly ok and desired. The STC purpose in this strategy is to be used as a filter, for saving resources, and not to attract any confidence by it. If we could afford to test everything straight at LTC it would be better, but as is we opt to do STC selection for cutting out useless stuff. Based on how hard it is to make improvements, I would not categorize a +0.5 , a neutral, or even a -0.25 elo STC patch as obviously useless. I am fine with any elo range I get by setting a practical bar in regards to resource economy and a reasonable % of LTC promotions (I'd say between 1/5 and 1/10 out of all STC tests). This can be adjusted as desired: raising or lowering the STC filter bar means moving the bottleneck from humans to hardware. A higher STC bar is more work for humans and a lower one is more work for cpus.

I also realize that for group projects, how important it is for everyone to be happy and in the same boat. Displeased people can disappear, like atumanian who could not digest contempt. Often I question myself if I should stay silent instead, but when I have a strong opinion I have trouble keeping it in. I hope this process is helpful.

I have an idea, what if we become more liberal and allow some flexibility of choice to the users?

Suitability of bounds depends a lot on the usage. For example someone who wants to test many versions of guessed values could use wider bounds so as to not deprive others of resources. This also suits people with limited time and a lot of creative ideas. Similarly someone who is reserved in testing, with a considered temperament and few, well studied tries, could opt for more certainty.

I don't see why the optimal methodology of one strategy should limit the optimal methodology of another.

This also means bestowing trust and responsibility to the users.
As long as everyone gets his share of resources, he can be free to use as he considers best from similar concepts. This could make everyone happy.

@snicolet What do you think of this option? [-1.5 , 4.5] , [-1 , 4] , [-0.5 , 3.5] are essentially the same regarding ease of promotion to LTC

For example someone who wants to test many versions of guessed values could use wider bounds so as to not deprive others of resources.

If you try many guessed values, and have wide bounds, you get garbage data, because the real elo variations between the different guesses will be typically significantly smaller than the error bars of the test with wide bounds.

If one changes values by very small amounts, this is similar to just trying the same STC test several times to get one passing, which goes right against the point of having testing bounds. If one changes values by bigger amounts, the big error margin will defeat the attempt to pinpoint the parameter interval where the attempted eval term does best (scaling concerns also hurt here).

@Alayan-stk-2 The STC data is not very credible anyway. If the values are too close to each other, I agree that it's futile to test many of the same. But the likelihood of the better patch passing is still higher. Adding variance to STC means cutting STC resources and allowing more LTCs. Compare rows 3 and 6 of your table: STC [-0.5 , 4.5] and [-1 , 5], both paired with [0 , 3.5] LTC. Similar average total resource cost, but much less STC cost for the wider bounds. LTC data are much more credible. In other words, I find that up to an extreme point of randomness, the transfer of resources to LTC is beneficial for the overall quality. I don't think that this point of variance is nearly reached with [-1.5 , 4.5] or even [-2, 5]

Having said that, I understand that [-1 , 4] is a very positive change compared to [0.5 , 4.5] and middle-ground compromise point. But it may not be resolving fast enough to allow a raise of STC. Between [-1 , 4] 10" and [-1.5 , 4.5] 15", I would consider the latter as more promising for LTC success.

@NKONSTANTAKIS can you try to write concise comments backed up with data? Otherwise, your contribution is not so helpful, at least to me. Statements like 'The STC data is not very credible anyway' really are pointless unless they are made precise and backed up with data.

@Alayan-stk-2 Seeing your analysis of the drawbacks of the 3-staged system, with which I agree, I am thinking that it partially also extends to the 2-staged. If we were to think of an ideal middle ground 1-staged system, how would that compare resource wise, confidence wise and scaling wise? I think a study on this is interesting.

@vondele Off the top of my head: are the data of a patch that was negative at STC and +12 elo at VLTC, and of the recent V1 patches, not credible? With such an extreme elo gaining example, is it so hard to imagine that we miss many scalers due to STC suppression? If STC data was very credible we wouldn't need LTC's at all, just extreme confidence STC's. This was actually the common methodology of development in the past, to run extreme amounts of games at very short TC's. But now they all turn to higher TC testing. Also in SF there were many representatives of this dogma, advocating that STCs are adequate for everything and LTCs a waste. Luckily mcostalba was a believer of quality in games and raised the LTC from 40" to 60". His words were "The sole purpose of STC is to act as a filter, for saving resources". Now let's also use chess related logic. I don't know if you closely watch SF games and if you are adept enough at chess to understand that SF misplays certain positions terribly. Numerous attempts were made to solve them, but how is it possible for them to show in game quality of average depth 12, when the flaws of SF's play are very deep? I understand your theoretical/academic/scientific approach and I admire it, it is to step only on solid, well-tested and proven foundations. My personality lies on the other side of the spectrum, for all my life I have taken calculated risks on everything I was doing. Experimenting instead of studying, doing things my own way, rebellious and anti-conformist. Having crossed so often the limits of beneficial risk taking, I developed a feel for it. For me theory is empty, I use the knife to cut the cake. By applying ideas in practice I check their value. I hate conservatism. If I was in charge of SF I would have tried many things, but fortunately I am allergic to tedious work. Due to my love of chess and chess engines, but most of all for my own enjoyment, I am watching and contemplating long hours. Also the stuff that I write takes me many hours.
I am fully aware that some are annoyed, probably mostly by my temperament, but I present my views anyway as I have received positive feedback from people whom I value. It's a close call for me, I tend to expose myself emotionally too much, and I have often taken long time off commenting. All in all I am thinking that if someone doesn't like it he can ignore it.

I love your passion @nkonstantakis ... no need to change a thing ...
‘... For me theory is empty ...’
Remember , ”In theory , practice and theory are the same, in practice they are different” 😊

sorry I understand your opinions etc but honestly I would like to hear more people that actually write patches. It's pretty easy to give advice when you have neither responsibility nor experience nor data to justify it.

I understand. I wish I had the motivation, but they say it's hard to teach an old dog new tricks. Maybe... But also in a football team it's the players, the coach, the managers, the scouts, the technical advisors, the psychologist, the gymnast, the masseur. Everybody does his thing. I am not a coder so I only voice an opinion where I have one. Like in the fight for default contempt that we won vs the purists, or insisting on the multicore pursuit of green for vondele's randomize draw eval (a great patch) which was about to be abandoned after the VLTC fail. In some other things I was proven wrong, like my disagreement with you for the amount of the contempt elo curve. In general because few things can be proven but often decisions have to be made with incomplete information, more abstract elements are used, like visualisation, inspiration, intuition, holistic thinking. What I like about stephane is that he is progressive but also takes into account elements like human psychology, group morale etc. Now he has a tough decision to take. I think all this helps, a battle of arguments, it's a creative process, like brainstorming, also enjoyable as long as it is kept within limits. I admit sometimes I cross them. If everybody just tests patches and keeps the communication typical, it feels very soulless and empty. Like a job that doesn't pay you. Why not play, have fun, promote enthusiasm. It's all being done about a game after all, not for career or self-accomplishment.

@Vizvezdenec I would also like to hear more people in general. This is a discussion, we have so many active devs but its up to them to voice an opinion. Maybe they don't have one (or a strong one), like I don't have in so many areas. Or they are just too busy writing patches :P

The things that bother me most in the current system are :

  • stc and ltc just play differently, several times I've had patches perform
    differently at stc and ltc. Resource limits mean we can't do all testing at
    ltc, but I think any increase in the stc definition would be very useful. I
    would like to see a move to 15+0.15 if at all possible, 20+0.2 would be
    even better. I think this would be very helpful for improving play in high
    level games.

  • stc tests can be quite long, but we still don't "believe" the results and
    often run speculative ltc. This suggests to me that we should make them
    easier to pass, either by lower and/or wider bounds.

I merged https://github.com/glinscott/fishtest/pull/420 to collect in the next weeks some data about framework/developers behavior and, if necessary, to open a new PR to adjust the bounds.

The lowering of middle-ground elo target from 2.5 to 1.5, compensates for the lowered confidence only partially on the upper side. In other words, we are more prone to accidentally missing a good patch. I think that this prospect, despite being relatively rare, is very unfortunate.

Hmm, could you explain to us on which calculations this assertion is based?

I have made some hand-simulations with the graphical Chess SPRT Calculator at http://chess-sprt-calc.azurewebsites.net (there is a link to it in our wiki, in the page "Creating your first test").
Using this simulator, I think that using DrawElo=220 is a good estimator of the observed pass rate of patches at STC during the last months. In other words, DrawElo=220 gives

• a pass probability of ~50% for 1.4 Elo patches using [0 , 4] bounds
• very slightly >50% pass probability for 1.8 Elo patches using [0.5 , 4.5] bounds

This is a very good match of what we have observed in fishtest during the last months, so I'll use drawElo=220 for the rest of my post.
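To see where figures like 1.4 and 1.8 Elo can come from, here is a minimal sketch that converts the midpoint of each SPRT interval to logistic Elo. It assumes the bounds are expressed in BayesElo while patch strength is quoted in ordinary logistic Elo, and that an SPRT with symmetric error rates passes about half the time at the midpoint of its bounds. The function names are illustrative and this is not the calculator's own code.

```python
import math


def bayes_score(bayes_elo, draw_elo):
    """Expected score per game under the BayesElo win/draw/loss model."""
    p_win = 1 / (1 + 10 ** ((-bayes_elo + draw_elo) / 400))
    p_loss = 1 / (1 + 10 ** ((bayes_elo + draw_elo) / 400))
    p_draw = 1 - p_win - p_loss
    return p_win + 0.5 * p_draw


def score_to_logistic_elo(score):
    """Invert the usual logistic Elo formula."""
    return -400 * math.log10(1 / score - 1)


def bayes_to_logistic(bayes_elo, draw_elo=220):
    return score_to_logistic_elo(bayes_score(bayes_elo, draw_elo))


for lo, hi in [(0.0, 4.0), (0.5, 4.5)]:
    mid = (lo + hi) / 2
    print(f"[{lo}, {hi}] midpoint {mid} BayesElo "
          f"~ {bayes_to_logistic(mid):.2f} logistic Elo")
```

With drawElo=220 the two midpoints land near 1.4 and 1.7 logistic Elo, which is in line with the roughly 50% pass probabilities quoted above.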

Now, fixing drawElo=220 and comparing the two curves with bounds
• A = [0.5 , 4.5]
• B = [-1.5 , 4.5]
I see that the curve for B is dominating curve A, in the sense that the pass probability for any Elo in curve B is greater or equal to the pass probability in curve A.

Thanks for fast response. Yes you are correct. But as we were testing tons of upper range STC through spec LTC which we intend to limit, we were covering this vulnerability through the back door (and with wasting resources by duplicate testing). By including a chunk of those old spec LTCs through bounds , we save all the STC excess games.

Lets allow more data in, and see how our STC pass rate goes.

By the way.
What about parameter tweaks?
Now we have them being harder to pass in both STC AND LTC which seems to be completely illogical.
Maybe unify bounds for this?

Yes this issue has been long neglected. In fact by every reasoning they should be easier and less costly.

@Vizvezdenec getting 3 green STC passes out of 4 attempts for the same general concept (his extStatsRespin series) and on the way for 3 LTC red is another argument to move from 10+0.1 to 15+0.15 for non-speedup STC (even 5+0.05 might be fine for most speedup tests).

That something like #2371 produces a measurable playing strength effect also just shows how big the pressure is on raw nps to not miss shallow tactics. 15+0.15 isn't out of the hyperscaling zone; but it should be better.

Yes I agree, and now we spare resources to afford it. On the other hand tunes & parameter tweaks use a lot of resources but all fail LTC (probably narrowly). In general LTC seems harder than ever to pass. I attribute this to SF being stronger than ever + a TC long enough not to blunder + the drawish nature of chess. We could help it with [-0.5 , 3.5] , [0 , 3] , or something in between like [-0.3 , 3.3], but imo the only solution lies in the opening book. We need to remove low resolution openings, as improvements get diluted into the high drawrate. Contempt helps tons here. Maybe it's time to try >24, SF has become so solid that I bet it can support it.

Analysis based on the data sample:

STC green rate: 20/322
STC yellow rate: 21/322
Avg STC length: ~20K (not counting respins)
LTC green rate: 1/18
LTC yellow rate: 8/18
spec LTC attempts: 0

  1. Spec LTCs at current pass rate are very profitable.

I am counting on average that 1 LTC test will use the resources of 12 STC's (since they run shorter). The STC pass rate is 6%, meaning that 1 green STC comes every 17 STC's. Then it also needs to pass LTC, passing it currently at 5.5% (1/18). It's highly possible that we are having an unlucky run, so let's use a very high expectation of 20% (1 LTC green for every 5 STC greens). Every LTC test is also using the resources of 12 STC's.
A spec LTC by using 12/(17+12)= 40% of LTC pass worth resources, would only require a pass rate of just 0.4 * 5.5% = 2.2% (!) current and 0.4 * 20% = 8% optimistically expected (worst case for spec LTC), in order to be profitable.

Just 1 out of 12 spec LTC would need to pass to break even, even if 1 out of 5 normal LTC passes.
If we also take into account our low STC confidence and "belief" of low STC-LTC correlation, it makes it even more appealing.

This means that if we want to continue with the same strategy, we can safely improve by making bounds easier so that a % of current yellow STC's will be green. Spec LTC is by default a waste of STC resources.
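The break-even estimate in this point can be re-derived with a few lines of arithmetic. The sketch below only restates the comment's own assumptions (one LTC costs about 12 STCs, one STC green per 17 STCs); the variable and function names are illustrative.

```python
# Costs are expressed in units of "one average STC test".
ltc_cost = 12          # one LTC ~ 12 STCs, as assumed in the comment
stcs_per_green = 17    # one STC green per 17 STCs, as assumed in the comment

# A normal LTC attempt pays for the STCs that produced the green plus the LTC.
normal_ltc_cost = stcs_per_green + ltc_cost   # 29 STC units
# A speculative LTC skips the STC filter and pays only for the LTC itself.
spec_ltc_cost = ltc_cost

cost_ratio = spec_ltc_cost / normal_ltc_cost  # ~0.41, rounded to 40% in the text


def break_even_pass_rate(normal_ltc_pass_rate):
    """Spec-LTC pass rate giving the same cost per LTC green as the normal route."""
    return cost_ratio * normal_ltc_pass_rate


for label, rate in [("observed 1/18", 1 / 18), ("optimistic 20%", 0.20)]:
    print(f"{label}: spec LTC breaks even at {break_even_pass_rate(rate):.1%}")
```

With the optimistic 20% figure the break-even rate is about 8%, i.e. roughly 1 spec LTC in 12, matching the comment.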

  2. The STC/LTC resource ratio is very high (around 65%-35%). This is very alarming for our low-confidence STC strategy. I elaborate: We took notice of low STC-LTC correlation, hence regarded high STC confidence as wasteful. This was done in order to increase the volume of STC's and also transfer resource usage from STC to LTC. As this is by far not happening, we need to adjust or change strategy.

After SF won the TCEC cup, we had a discussion with john dart, alayant, noobpawn and others. A point was made that if there is low STC-LTC correlation its probably better to skip STC altogether, as its results will not be trustworthy. The riddle to be cut instead of solved. This is feasible, jdart said that he develops his engine only on 60" and 120" tests. With our data it would translate into at least 60 same length LTC tests to our current 20. But in practice it would mean much more, probably around 100-120, as bad ideas would end faster. noobpawn supported this idea, (promising 10K cpus) and I also consider it as best, under the circumstances (hard to prove elo-gain even at [0 , 3.5], low resolution).

Another plan would be to increase the value of the STC by making it 15" or even better 20". This way its confidence could be "trusted". Our old high STC confidence had the blind spot of suppressing scalers, but it was focused. With an elo bar around 0.25-0.5 lower than the LTC target and medium-high confidence, I think that the STC to LTC pass ratio would be around 1/3 to 1/4.

Need to go, will recap later

In chess they say that a properly executed bad strategy beats a poorly executed good one.
I see 3 distinct strategies and will explain my viewpoint:

  1. The current one, initiated with the STC change to [-1.5 , 4.5].
    Allows a high volume of tests at a given time with an intentional low confidence as the STC to LTC correlation is regarded unsatisfactory. Its purpose is to cheaply create a statistically favorable sample for LTC promotion. Ideal when human creativity is in abundance and bottlenecked by the framework. The results are too fragile for conclusions.

First 8 days seem underwhelming, but it is a small sample and dry seasons are typical. So I am not yet regarding this strategy as bad. However it is very clear to me that an immediate adjustment is required. When you are using 60-65% of all resources on a filter which is used in order to save resources, something's definitely wrong. The fruit gets collected only with LTCs, more LTCs=more fruits. And also more quality LTCs = more fruits.

Safe improvements to this direction are [-2 , 4] and [-1.5 , 4]
Currently (with drawelo 220) 20% of +2 elo patches fail.
With [-1.5 , 4] it becomes 15% and with [-2 , 4] 13%
The latter seems more in sync with the strategy. Also by keeping game count low the transfer to 15" is easier.
The former has the benefit of more trustworthy conclusions. Also spending more at STCs but economizing at LTC's by cutting on low range positives.
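The failure rates quoted above can be approximated without the online calculator. The sketch below uses a normal (Brownian-motion) approximation of the SPRT with 5% error rates, assuming the bounds are in BayesElo with drawElo=220 and the patch strength is in logistic Elo. It is an approximation under those stated assumptions, not the calculator's or fishtest's implementation, and all names are illustrative.

```python
import math


def bayes_score(bayes_elo, draw_elo):
    """Expected score per game under the BayesElo win/draw/loss model."""
    p_win = 1 / (1 + 10 ** ((-bayes_elo + draw_elo) / 400))
    p_loss = 1 / (1 + 10 ** ((bayes_elo + draw_elo) / 400))
    return p_win + 0.5 * (1 - p_win - p_loss)


def logistic_score(elo):
    """Expected score for a given logistic Elo difference."""
    return 1 / (1 + 10 ** (-elo / 400))


def pass_probability(true_elo, elo0, elo1, draw_elo=220, alpha=0.05, beta=0.05):
    """Approximate chance that an SPRT with bounds [elo0, elo1] ends green."""
    s0, s1 = bayes_score(elo0, draw_elo), bayes_score(elo1, draw_elo)
    s = logistic_score(true_elo)
    lower = math.log(beta / (1 - alpha))    # about -2.94
    upper = math.log((1 - beta) / alpha)    # about +2.94
    # Ratio 2 * drift / variance of the LLR random walk in this approximation.
    h = (2 * s - s0 - s1) / (s1 - s0)
    if abs(h) < 1e-12:
        return -lower / (upper - lower)
    return (1 - math.exp(-h * lower)) / (math.exp(-h * upper) - math.exp(-h * lower))


for bounds in [(-1.5, 4.5), (-1.5, 4.0), (-2.0, 4.0)]:
    fail = 1 - pass_probability(2.0, *bounds)
    print(f"bounds {bounds}: a +2 Elo patch fails about {fail:.0%} of the time")
```

Under these assumptions the three failure rates come out close to the 20%, 15% and 13% stated in the comment.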

  2. A plan with adequate STC-LTC correlation, hence mixed confidence. Here the STC tests will be regarded as solid enough, and could relieve some burden out of LTC's.
    Suggested: [-1 , 4] 20" + [-0.5 , 4] LTC

  3. LTC-only approach. Slow and steady, with high quality results. No worries about scaling or missed chances.

I would like to hear opinions

Tbh current bounds suck.
I'm more discouraged in running 10 LTC/week and none of them even close to passing than having one LTC/week but which I know is most likely reasonably good because it passed quite strict STC bounds.
With current bounds I can't even interpret results. Does passed STC + failed LTC indicate scaling problems? Luck? Anything? Usually I have 0 idea what it means. Mainly because STCs pass on basically everything and they fail all LTCs pretty shortly.

Also because now we run enormous amounts of LTCs everyone runs less patches, so tries less ideas.
And pass rate of this LTCs is what, 5% now? At least I have 0/7 LTC passed over the course of one week (!).

It has been mentioned by others, but part of the problem imo is the large difference in time between STC and LTC - increasing STC to 20+0.2 or even 10+0.2 would make STC tests more representative of what matters - performance at LTC (of course at a cost).

With current bounds I can't even interpret results. Does passed STC + failed LTC indicate scaling problems? Luck? Anything? Usually I have 0 idea what it means.

This seems important so maybe it's worth rerunning 2 or 3 of those STC's, or running them at fixed 20k games, to see if there's an answer? [ Edit: 20k games]

It seems that we went from one extreme to the other, from too much STC confidence to too little.

Lets aim at the middle with [-1 , 4] ? And improve correlation with TC increase?

The current resources look fine to keep the queue length reasonable except when there are LTC tunes and multi-threaded tests.

I think that we should have STC and LTC centered around the same average.
So like [-1; 4.5] or [-0.5; 4]
I prefer latter one tbh for STC being more confident about being not a regression.

I previously suggested STC [-0.5, 3.5] and LTC [0, 3], which has this property (i.e. midpoint interval at +1.5 Elo).

The first question that needs answering should be really independent of resource usage (i.e. if we had infinite resources), what should be the Elo gain of patch to accept it in trunk? Once we have an answer to that, it is easier to adjust bounds to approximate that 'ideal situation' with the finite resources we have.

Note that the answer is not >~0 Elo, as we want the code to remain clean and insightful. The simplification bounds allow for some Elo loss for a good reason.

Sure, but since sf9 I tried to count - we average smth like 0,6-0,7 elo/gainer.
All these bounds give such patches a probability of passing well below 50%. Sure, there are really good patches, but most patches are some 0,2-0,3 elo being lucky.

Right, I agree that most 'green' patches are less than 3.5/2 Elo, which is why I proposed to change bounds also for LTC. Our bounds should reflect that reality.

Note there is a small difference between logistic and bayes Elo, and we use one for sprt and one for the normal Elo measurements, that accounts for part of the difference.

The point of STC having a lower midpoint interval is to offer a prophylactic pillow to scalers, as with infinite confidence between 2 equally scoring LTC alternatives we would always choose the worst scoring STC one. The high 1/6 TC ratio intensifies this effect; with 1/4 or 1/3, not only would we be less worried about scaling suppression, but we would also save resources through improved correlation:
patches that pass STC will be more likely to pass LTC, and
patches that fail STC will be more likely that they wouldn't have passed LTC, if tried

Hence, a 50% STC increase to 15" will mean much less than 50% STC resource increase, and that is just a part of total resource usage.
For those reasons I consider this top priority.

Regarding confidence, I agree that [-1.5 , 4.5] is probably too low, but [-0.5 , 3.5] will be slower than [0.5 , 4.5] which was already slow.
I am not ruling out that this ultra-methodical way could be the best way forward, but only paired with something like a 20" STC, so that all this confidence has high quality. That would require a lot of patience.
In that direction [-1 , 4] would be half the distance from where we are now, we can try it & reevaluate.

@snicolet Do you fancy [-1 , 4] and/or 15", or you'd like more time/data with current?

I think having the STC middle-point slightly below the LTC middle-point is alright. It makes the STC more of a filter rather than a main decider, and over the long run favors good scalers.

With current bounds I can't even interpret results. Does passed STC + failed LTC indicate scaling problems? Luck? Anything? Usually I have 0 idea what it means. Mainly because STCs pass on basically everything and they fail all LTCs pretty shortly.

Yes, that comes back to what I said earlier about STC results being used to guide future patch attempts and [-1.5, 4.5] being too wide.

Using more narrow bounds like [-1, 4] or [-1, 3.5] would make the STC more useful to orient further attempts, especially if combined with a 15" STC that would better predict longer TCs results.

@vondele Going from [0, 3.5] to [0, 3] LTC is good for passing rate, but is quite harsh on resources. Fishtest with 1K cores already seems clogged discouraging patch attempts. Maybe that would be the way forward, but I'd rather try first to find a good STC balance before re-evaluating LTC.

@vondele @Alayan-stk-2 @snicolet if these STC bounds are too wide, please open a PR with new bounds: in this way we can have the developers' feedback.

well in my opinion they pollute fishtest with useless LTCs.
I mean currently more than half of the tests are LTCs, which is not really good because
1) you probably wouldn't run a new test on an idea since you already have an LTC running;
2) it's an indication that the stc filter lets too much through;
3) usually it's good to test LTCs against the latest master, but most LTCs are not testing against it because there are too many LTCs and they test against what was the latest master at the time of their submission and not now.

I transfer updates here for easier use. Parameter tweaks not included.

19-11-12 | 31m | kingRing_a4h4

STC green rate: 38/730
STC yellow rate: 45/730

LTC green rate: 2/37
LTC yellow rate: 10/37

Avg STC length: ~20K (not counting respins)
spec LTC attempts: 2

STC aside, it's obvious that LTC is struggling anyway with [0 , 3.5]. The proposed [0 , 3] is logical but will require on average around 25-30% more games.
The main reason for this is that high level chess is so drawish that less good play is often adequate to hold the draw. SF is so developed that the improvements are small and hardly distinguishable.
Further indication is that STC's pass easier due to the much lower quality of chess at ~depth 15.

The only solution I can see to this is increasing resolution by a book specialized at magnifying elo differences. That would also free up a lot of resources everywhere so imo it should be top priority.

A mild but easily implemented remedy is to use a contempt around 30 for testing, and 18-20 as default. This would promote complexity and combative chess, decrease drawrate and increase resolution.

I don't think it's "LTC struggling"
I think it's "STC is really easy to pass so basically any elo-neutral test has decent chance to go to LTC and fail there"

But if you remember even with [0,5] [0,5] you had 20 STC greens in a row that failed LTC. Whats the chance for this?

Your point is valid though, as 1 in 5 neutral tests will pass current STC. That's a lot

@snicolet These are difficult times, the pass rates are extremely low. Huge amounts of ideas were tried in a short time, almost 800 patches, and we got just 1 elo gainer for C=0 and 1 for C=24. If you don't feel like initiating a TC and bound change yet, we should at least run spec LTC's as they help with progress. We have 45 yellows in the bank, and atm the framework is idle.

We could queue them all and take 1 week vacation :)

I have been known to be improper at times, but sincere. My persistence in this case derives from the responsibility that I feel (for the first time), being the only one advocating for these bounds out of the offered options. But soon after watching it in action I reconsidered, putting myself in an awkward position. This role makes me uneasy, so I'll just step out and enjoy some chess :)

@NKONSTANTAKIS Bounds are not the problem but the difficulty of finding new elo gaining ideas (the input side). Last time the elo prior for STC was measured it was a normal distribution with mu=-1 and sigma=1.1 (http://talkchess.com/forum3/viewtopic.php?f=2&t=71253#p804895 ). Probably it has gotten worse since then.
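To get a feel for what such a prior implies, here is a small sketch that computes the share of submitted patches above a given true Elo, taking the quoted normal distribution (mu=-1, sigma=1.1) at face value. The helper name `fraction_above` is illustrative.

```python
from statistics import NormalDist

# Elo prior for STC patches as quoted in the comment.
prior = NormalDist(mu=-1.0, sigma=1.1)


def fraction_above(threshold):
    """Share of submitted patches whose true Elo exceeds `threshold`."""
    return 1 - prior.cdf(threshold)


for threshold in (0.0, 1.0, 2.0):
    print(f"true Elo > {threshold:.1f}: {fraction_above(threshold):.1%} of patches")
```

Under this prior fewer than one patch in five is a real gainer at all, and only a few percent are worth +1 Elo or more, which illustrates why the input side dominates the outcome.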

You can discuss endlessly about the trade-off between resources and the probabilities of false positives and false negatives on fishtest but this will not change the input side.

EDIT.

You also mention the book and contempt. The 2moves book already seems to be quite selective (no one has come up with a better one). Contempt also appears to improve selectivity. Some evidence for this is given here

https://github.com/official-stockfish/Stockfish/issues/1853#issuecomment-451907259

(beware of the possibility of selection bias though). It might be worthwhile to do a similar experiment with higher values of contempt.

The pass rates being extremely low is related to the low amount of patches being good, this is true, but bounds can help or hinder in finding the good patches among those, and just as importantly to guide further attempts in a good direction.

Let's take Viz's bishopPsQueen1. Passed STC easily : Elo | 2.11 [-1.35,5.32] (95%)
But the LTC is on the way to fail poorly : Elo | -4.08 [-7.83,-0.32] (95%)

How to interpret this ? Lucky STC ? Unlucky LTC (unlikely) ? Bad scaling to LTC ? All of this ?

Based on those results, should the idea be abandoned entirely, or should further tweaks be attempted ?

With [-1.5, 4.5] @ 10+0.1, the STC elo estimates are especially unreliable, with scaling suspicions on top adding to the low confidence.


As for the book, there has been no serious effort to change it. 2moves_v2 was proposed, tested on fishtest dev, shown to be just as good but with more "normal" lines, and what's the result ? Because who is supposed to host the book where is a matter of contention, it has gone nowhere and the change was never done.

I pushed for noobpwnftw to produce an opening book for fishtest out of his DB. He did so, offering 2 moves, 3 moves and 4 moves sets. But once again, it didn't go anywhere.

He didn't filter out drawish openings, but in my limited local testing sensitivity was slightly better, and the lines were more sensible. However, to actually make one of those fishtest's opening book (3 moves might be the best balance, 2 moves hasn't enough positions), we still need someone with the hardware resources to do proper sensitivity testing on those, comparing it with 2moves_v1. Methodology for such testing might require standardization.

Filtering out the lines with eval closest to 0 should also be tested for sensitivity (maybe noob could put eval in an info field of the .epd to allow different filtering tests based on the full set without him having to do multiple exports). Actually playing games is ideal for sensitivity measurement, but it takes huge amount of resources and isn't realistic.

In the end we might get a measurable boost from a better book, but we don't know because the required steps to actually know haven't been taken.

I have uploaded the collection of books in the "books" repository of official-stockfish:

https://github.com/official-stockfish/books

@snicolet we should also fork this branch https://github.com/mcostalba/FishCooking/tree/setup in order to preserve the history and update the books/cutechess-cli for fishtest.

Improving STC-LTC correlation with a longer TC for STC and not as wide bounds would be a big help.

Viz's KFQuadRetry series gave yet another blatant example of how this low correlation can waste CPU resources and developer time. Normal noise isn't good enough to explain away those results.

STCs
KFQuadRetry1 : LLR | 2.97 --- Elo | 2.56 [-1.23,6.13] (95%) https://tests.stockfishchess.org/tests/view/5dd9ca29ac5c08470858acfa
KFQuadRetry2 : LLR | 2.96 --- Elo | 1.28 [-1.51,3.82] (95%) https://tests.stockfishchess.org/tests/view/5dda8c3ee75c0005326d20e2
KFQuadRetry4 : LLR | 2.96 --- Elo | 6.04 [0.28,11.71] (95%) https://tests.stockfishchess.org/tests/view/5ddb4a80e75c0005326d215c
KFQuadRetry5 : LLR | 2.95 --- Elo | 3.13 [-1.02,7.12] (95%) https://tests.stockfishchess.org/tests/view/5ddb4ac4e75c0005326d215e

LTCs
KFQuadRetry1 : LLR | -2.95 --- Elo | 0.19 [-1.51,1.99] (95%) https://tests.stockfishchess.org/tests/view/5dd9f8fbac5c08470858ad0b
KFQuadRetry2 : LLR | -2.20 --- Elo | 0.61 [-0.78,2.11] (95%) https://tests.stockfishchess.org/tests/view/5ddabe79e75c0005326d210d
KFQuadRetry4 : LLR | -1.74 --- Elo | 0.07 [-1.91,2.07] (95%) https://tests.stockfishchess.org/tests/view/5ddba445e0b4af579302ba83
KFQuadRetry5 : LLR | -2.96 --- Elo | -0.69 [-2.94,1.62] (95%) https://tests.stockfishchess.org/tests/view/5ddbbf23e0b4af579302ba8d

I think it's more of an example of patch that scales like trash.

Yes, that's the point. Patches that scale poorly would be less likely to get STC greens (especially impressive ones) with 15+0.15. Conversely, this would slightly increase chances for good scalers to get greens.

Improving STC-LTC correlation means that a STC green has higher chances to lead to a LTC green.

It will also eat even more resources. It's a tricky case, since the usual sample needed for a decision to show its effect is something like half a year, and even then we will have trouble correlating Elo gains with bounds because of the natural instability of Elo gains.

Looks like pentanomial statistics has been implemented!

There are currently some interesting opening book tests going on. It is of course much too early to draw any conclusions but "Drawkiller_balanced_big.epd" is performing quite well so far.

Each test now has a "raw_statistics" page. For the above test it is http://tests.stockfishchess.org/tests/stats/5dfc6a56e70446e17e45102f .

Interesting data is "RMS bias" which currently stands at 35 elo (more or less the side to move bias) and (pentanomial) "Sensitivity" which stands at 0.25. Sensitivity only has relative meaning, as it depends strongly on the test conditions, but at 0.25 it is currently the highest among all running tests. Observe that since Drawkiller_balanced_big.epd is a new book it cannot benefit from selection bias.

I am quite curious if these results will stand. If so it would be the first example of an extremely balanced book with high sensitivity. Note that for an extremely balanced book there is actually no benefit in using the pentanomial model but of course it doesn't hurt either.

@vdbergh yes very interesting... once results become more solid, I really would like your opinion on the statistics.

I had a look at the first few hundred positions in the Drawkiller book, and I'm somewhat sceptical we should have a book that contains positions that are 'not reachable' from the starting position. The requirements on books might be more than just sensitivity or balance.... I'd like to see improvement in 'standard opening play', which is presumably best trained on 'realistic' opening moves (with all trickiness in defining that).

I'm curious to see if the Elo difference between noob's 2moves and 3moves book will become statistically significant.

In a fast test the old cutechess-cli binaries (linux and windows) hang loading the "noob_3moves.epd" and "noob_4moves.epd", the new cutechess-cli binaries from the "books" repo are working fine.
I suppose that the problem extends to all *epd books.

What about Drawkiller3.1 EloZoom books? Are they going to be tested?

@vondele 4 moves noobbook seems to be not working

@Vizvezdenec : the @vondele 's test with "noob_4moves.epd" is working fine on DEV server with linux, windows and wsl workers using the latest cutechess-cli binaries
https://dfts-2.pigazzini.it/tests/view/5dfc6aa0e70446e17e451035

All the CPU contributors should update asap the cutechess-cli binaries:
https://groups.google.com/forum/#!topic/fishcooking/UhiGjf7BQi8
https://github.com/glinscott/fishtest/pull/472

@vondele

Sensitivity is not the only thing that is important, but sensitivity is what allows patches to pass. Without it there is no progress.

Personally I feel that a patch author should be free to select whatever opening book he wants (within reason) to prove that a patch has some benefit. But if the book does not have wide coverage (e.g. an endgame book or maybe even drawkiller) then a patch which is successful in this way should also be tested for non-regression on an uncontroversial book such as 8moves_v3.pgn.

Interesting comments.

Do we have any thoughts on how significant the differences are? If one book
gives 39 elo and another 42 is that a negligible difference or a useful
gain?

As I think Vondele said, I am also interested in move choice in the very
early moves, so I think a 2 or 3 move (4/6 ply) book should also include
some 0/1/2/3 ply openings. Is that the case here?

Another question, and apologies for my weak stats knowledge; is it possible / meaningful to measure the consistency of a test? e.g. results for every 1000 games, or 1/10 of the total test, or something like that. I'm thinking it might be interesting to see if some tests are very consistent until they pass/fail and some have results that vary a lot through the test.

I think what @vdbergh said is the main key to progress. At the point we are at now with generic books, it's almost impossible for specialized improvements to pass.

For example an OCB patch that gains 10 elo for OCB endgames would mean just 0.5 elo if they occur 5%.

So with this method of proving big elo gains in specialized book sets and passing non-regression in generic ones, SF can develop in all areas, even rare ones.

So we could have a wide area of books (different endgames, middlegames, closed positions, zugzwang positions, studies etc) and when a patch targets a specific area its effect will be visible.
This way some glaring weaknesses of SF's play could be easily cured, like those french defence games that SF likes to play with a forever dead rook.

But for this better to have higher specialized elo requirements, else SF could end up clumped with too much specialized code.
Also to guarantee progress a stricter generic non-regression test would be appropriate, like [-2 , 2] or [-1 , 3].

I would also like to note that expensive specialized functions are bound to scale well, as it's proven that speedups scale badly. Hence a slowdown with +10 specialized elo and 0 elo generic would be great. This way every corner case that is unsolvable by SF because it appears in 1% of games or less can be targeted.

@vdbergh sensitivity is what allows patches to pass, but this statement is relative to the SPRT bounds. So, while I agree we should use sensitivity as a way to optimize the system as a whole, we can use appropriate testing bounds as well.

From the looks of it, the noob books (e.g. 3 or 4 moves) have good sensitivity and are by construction easily reachable (with few moves) from the starting position. I think that's a positive thing.

@vondele You are right of course that one can play with bounds. However, the amount of resources required for tests with the same power to separate engines scales with the inverse square of the sensitivity. This is "proved" (somewhat heuristically) in section 5 (Theorem 5.1.4) of this document http://hardy.uhasselt.be/Toga/normalized_elo.pdf (in this document I refer to sensitivity as normalized elo).
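To put a number on that scaling, here is a minimal sketch. It assumes the number of games needed goes as 1 / sensitivity^2 and nothing more; the function name is made up for illustration, 0.25 is the sensitivity quoted earlier for the Drawkiller test, and 0.20 is an arbitrary comparison value, not a measured one.

```python
def relative_games_needed(sensitivity_a: float, sensitivity_b: float) -> float:
    """Games book A needs relative to book B for the same separating power.

    Assumes the inverse-square scaling: games ~ 1 / sensitivity**2.
    """
    if sensitivity_a <= 0 or sensitivity_b <= 0:
        raise ValueError("sensitivities must be positive")
    return (sensitivity_b / sensitivity_a) ** 2


# 0.25 is quoted in the thread; 0.20 is a hypothetical comparison value.
ratio = relative_games_needed(0.20, 0.25)
print(f"A book at 0.20 needs about {ratio:.2f}x the games of a book at 0.25")
```

So a seemingly small drop in sensitivity already costs roughly half again as many games under this assumption.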

yes, agreed, that's what I implied with 'optimize the system as a whole'.

But the point I'm making is that if we would like an engine that is playing better in the catalan, we shouldn't be training it on a book that's only king's gambit. Even if that latter book might be better at differentiating engines. While we're obviously far from doing that, I think we now have a reasonable opportunity to improve on what we have.

As a side-comment, I think we should also be aware of biases introduced by opening books we choose.

As an example, the drawkiller book feels very artificial. It forces opposite-side-castling-like positions as described in https://www.sp-cc.de/drawkiller-openings.htm . Logically, it could affect things like trapped rook themes, and other positional themes like one side waiting for the opponent to pick a side to castle on before choosing one for themselves (for defensive or aggressive purposes). So while it may seem like a good idea (less draws, etc.), I'd rather avoid using it extensively.

I'd be fine with a healthy mix of "close to real" positions and "artificial" positions, to ensure less bias but then it will be difficult to decide the correct proportions.

In my opinion, for testing purposes the book should always honor the natural distribution of potential outcomes as each side makes reasonable moves; that is, statistically there are more "drawn" openings than "sharp" ones. Sensitivity is not about making more test games "decisive", but about making our tests cover more different positions whose occurrence is universally sound, with as much ELO difference as possible, as happened going from my 2moves to 4moves book tests.

For that I will try to provide a 5moves book from the same method with 1% sample rate, keeping its size small, see if the trend holds.

EDIT:
Done(455,287 positions).
https://www.chessdb.cn/downloads/5moves_1pct.zip

@snicolet @vondele @noobpwnftw should be interesting to add a wiki page with info/recipes/tools for each book.

@vondele there are some really strange results there.
http://tests.stockfishchess.org/tests/view/5e01c06ac13ac2425c4a9aa8
Look at these residuals, sometimes reaching 15 without crashes (?????)
If anything I would've expected games from startpos have high correlation between workers but this looks like a huge mess.

120 | 44 | 86
45 | 80 | 125

yes, I've seen that. I assume that, since the book contains just the startpos, results are somehow correlated within a batch but different between batches, so leading to very 'unlikely' results (residuals)

The purpose of the test is to figure it out... ~unfortunately, no pgns saved for STC games~. Need to look at pgn stats.

But tbh that makes close to 0 sense. Why should results correlate within a batch and not correlate between batches? Up to 100+ elo difference between them?
Even drawrate is vastly different, one group of workers has it being 60%, other group (which gives +100 elo perf to master) is like 30%

one batch runs on the same machine... so similar speed, and thus similar moves? That's just a guess of course.

Tbh all noob machines are "the same" machines since they are built from the same hardware.
When ncm was testing from startpos I've never seen such a high variation between their workers.
Also I remember when we had 1 or 2 workers not loading book in regression tests and playing from no book, they always produced some strange results, one was always +100 elo to master (for multiple regression tests, 5+), other -70 elo to master.
I think there is something wrong with how fishtest operates with no book start.

Somehow we should urge to take a screenshot of the test with all the red residuals
in http://tests.stockfishchess.org/tests/view/5e01c06ac13ac2425c4a9aa8 , because
it will be auto-purged (the click box is on) and the information will be lost.

I set it to prio -1 for now.

I've all games downloaded, they look reasonable (i.e. start from startpos etc). About machines being the same... well, there are always differences. For example, looking at the (scaled) TC field: 7.35+0.07, 10+0.1, 10.01+0.1, 10.06+0.1, 10.06+0.1, 10.07+0.1, 10.08+0.1, 10.09+0.1, 10.12+0.1, 10.12+0.1, 10.12+0.1, 10.12+0.1, 10.12+0.1, 10.13+0.1, 10.14+0.1, 10.15+0.1, 10.16+0.1, 10.19+0.1, 10.76+0.11, 10.78+0.11, 10.78+0.11, 10.8+0.11, 10.8+0.11, 10.81+0.11, 10.81+0.11, 10.82+0.11, 10.83+0.11, 10.84+0.11, 10.84+0.11, 10.84+0.11, 10.84+0.11, 10.84+0.11, 10.85+0.11, 10.86+0.11, 10.86+0.11, 10.86+0.11, 10.87+0.11, 10.88+0.11, 10.88+0.11, 10.89+0.11, 10.89+0.11, 10.89+0.11, 10.9+0.11, 10.91+0.11, 10.91+0.11, 10.91+0.11, 10.91+0.11, 10.92+0.11, 10.92+0.11, 10.92+0.11, 10.93+0.11, 10.93+0.11, 10.93+0.11, 10.94+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.95+0.11, 10.96+0.11, 10.96+0.11, 10.97+0.11, 10.97+0.11, 10.97+0.11, 10.97+0.11, 10.97+0.11, 10.99+0.11, 10.99+0.11, 10.99+0.11, 10.99+0.11, 10.99+0.11, 10.99+0.11, 10.99+0.11, 10.99+0.11, 11+0.11, 11+0.11, 11+0.11, 11+0.11, 11+0.11, 11+0.11, 11+0.11, 11+0.11, 11.01+0.11, 11.01+0.11, 11.01+0.11, 11.02+0.11, 11.02+0.11, 11.02+0.11, 11.04+0.11, 11.04+0.11, 11.04+0.11, 11.05+0.11, 11.51+0.12, 11.51+0.12, 11.55+0.12, 11.61+0.12, 11.63+0.12, 11.64+0.12, 11.68+0.12, 11.7+0.12, 11.71+0.12, 11.86+0.12, 11.86+0.12, 11.89+0.12, 11.91+0.12, 11.95+0.12, 12.06+0.12, 12.1+0.12, 12.18+0.12, 12.66+0.13, 12.79+0.13, 13+0.13

I've started a test with 1+0.2 time control on start position.
It should have in theory less movetime variance.
http://tests.stockfishchess.org/tests/view/5e01e6e7c13ac2425c4a9abe (Auto-purge is set to off to examine residuals)
Alternative Time Controls are from this:
https://github.com/glinscott/fishtest/issues/338

For the games from startpos at 10+0.1TC, the following is how the games decorrelate

#ply #NdifferentPos #frequencyList
1   1 [38237]  (e4)
2   2 [19121, 19116] (e5 or e6)
3   2 [19121, 19116] (d4 or Nf3)
4   3 [19116, 18921, 200] (...)
5   7 [18940, 12242, 6671, 200, 168, 8, 8]
6   8 [18936, 12242, 6671, 200, 168, 8, 8, 4]
7   9 [18936, 12242, 6671, 200, 168, 8, 8, 3, 1]
8   10 [18936, 12242, 6605, 200, 168, 66, 8, 8, 3, 1]
9   15 [18936, 11922, 6604, 252, 200, 168, 68, 66, 8, 5, 3, 2, 1, 1, 1]
10   19 [18934, 8484, 6604, 3438, 252, 200, 152, 66, 66, 16, 8, 5, 3, 2, 2, 2, 1, 1, 1]
11   24 [18497, 6795, 6589, 3395, 1941, 437, 200, 152, 66, 66, 43, 16, 15, 5, 5, 3, 2, 2, 2, 2]
12   27 [18489, 6778, 6589, 3395, 1941, 437, 200, 152, 66, 66, 43, 17, 16, 15, 7, 5, 5, 3, 2, 2]
13   32 [18489, 6589, 6566, 3372, 2082, 437, 200, 152, 71, 66, 64, 56, 16, 15, 15, 10, 7, 5, 5, 3]
14   46 [17117, 6589, 6544, 3355, 1794, 1370, 334, 200, 161, 143, 120, 103, 67, 64, 54, 47, 22, 17, 16, 15]
15   64 [17117, 6544, 6535, 2396, 1794, 1370, 968, 334, 190, 143, 103, 95, 90, 66, 64, 54, 54, 47, 45, 43]
16   82 [17007, 6544, 3556, 2584, 2324, 1794, 1036, 968, 391, 334, 300, 190, 127, 110, 95, 72, 72, 72, 66, 64]
17   103 [17006, 6539, 3494, 2577, 2320, 1070, 1036, 1018, 745, 391, 332, 300, 146, 135, 125, 110, 72, 67, 64, 62]
18   126 [14665, 6535, 3391, 2577, 2341, 2320, 1068, 1031, 1018, 744, 391, 332, 300, 146, 135, 123, 109, 103, 72, 67]
19   152 [14655, 6534, 3386, 2577, 2341, 2320, 1068, 1029, 926, 744, 389, 332, 300, 146, 135, 121, 109, 103, 92, 72]
20   186 [14655, 6530, 2340, 2320, 2069, 1317, 1282, 1181, 1068, 926, 622, 604, 381, 332, 300, 185, 165, 145, 140, 114]
21   228 [14637, 6033, 2320, 2283, 2053, 1282, 1259, 1151, 1032, 926, 604, 520, 497, 380, 332, 290, 185, 165, 145, 115]
22   301 [14635, 5769, 2320, 2053, 1724, 1282, 1259, 1122, 1029, 817, 604, 558, 497, 413, 372, 332, 264, 249, 176, 145]
23   348 [13818, 4587, 2053, 1718, 1315, 1282, 1259, 1182, 1120, 1005, 995, 815, 813, 603, 553, 497, 413, 249, 205, 188]
24   401 [13817, 4324, 1708, 1301, 1282, 1259, 1196, 1120, 998, 995, 970, 858, 813, 594, 551, 545, 497, 413, 264, 263]
25   459 [13817, 3186, 1708, 1301, 1259, 1196, 1137, 1120, 998, 995, 970, 944, 853, 813, 594, 545, 497, 309, 287, 273]
26   518 [13817, 3186, 1708, 1301, 1259, 1196, 1112, 1099, 995, 970, 891, 879, 852, 813, 594, 545, 497, 309, 268, 262]
27   593 [13817, 3186, 2220, 1708, 1259, 1187, 1112, 1089, 995, 891, 879, 850, 813, 592, 545, 497, 264, 261, 247, 202]
28   680 [13817, 3186, 1708, 1287, 1187, 1125, 1089, 995, 978, 948, 933, 894, 850, 592, 545, 407, 405, 335, 261, 249]
29   788 [13817, 3186, 1708, 1286, 1184, 1089, 1078, 995, 933, 850, 814, 592, 551, 545, 541, 427, 407, 405, 348, 335]
30   903 [13817, 3185, 1708, 1283, 1184, 1089, 986, 966, 933, 850, 714, 592, 551, 545, 538, 427, 405, 373, 348, 335]
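For anyone who wants to reproduce this kind of table, a rough sketch of the counting is below. It is not the script used for the numbers above: it counts distinct move prefixes rather than positions (so transpositions are counted separately), the function name is made up, and the sample games are invented rather than taken from fishtest.

```python
from collections import Counter


def decorrelation_table(games, max_ply, top=20):
    """For each ply, count distinct move prefixes and list the top frequencies.

    `games` is a list of move lists (e.g. SAN strings). Prefixes stand in for
    positions here, so transpositions are counted as different entries.
    """
    rows = []
    for ply in range(1, max_ply + 1):
        # Only games that actually reached this ply contribute a prefix.
        prefixes = Counter(tuple(g[:ply]) for g in games if len(g) >= ply)
        freqs = sorted(prefixes.values(), reverse=True)
        rows.append((ply, len(prefixes), freqs[:top]))
    return rows


# Tiny invented sample, not fishtest data.
sample_games = [
    ["e4", "e5", "Nf3"],
    ["e4", "e5", "Nf3"],
    ["e4", "e6", "d4"],
    ["e4", "e5", "Nc3"],
]
table = decorrelation_table(sample_games, max_ply=3)
for ply, n_distinct, freqs in table:
    print(ply, n_distinct, freqs)
```

Real input would come from the saved PGNs; restricting the list to games with a given White header gives a per-colour split.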

Can you calculate these decorrelation numbers separately for odd games and even games too?

Added a test to re-check if 1+0.2 benefits ELO resolution with random book moves(hybrid book beta).
http://tests.stockfishchess.org/tests/view/5e01eef6c13ac2425c4a9ac1 +42.5ELO
original:
http://tests.stockfishchess.org/tests/view/5dfc7ba9e70446e17e45104d +57ELO

@snicolet sure, you mean, one set where white is always New, and one set where white is always base ?

@vondele yep :)

@snicolet:

game.headers["White"] == "New-b648247"  (Black Base-b4c239b)
1   1 [19131]
2   2 [19121, 10]
3   2 [19121, 10]
4   3 [18921, 200, 10]
5   6 [12242, 6671, 200, 8, 8, 2]
6   6 [12242, 6671, 200, 8, 8, 2]
7   7 [12242, 6671, 200, 8, 8, 1, 1]
8   8 [12242, 6605, 200, 66, 8, 8, 1, 1]
9   13 [11922, 6604, 252, 200, 68, 66, 8, 5, 2, 1, 1, 1, 1]
10   15 [8484, 6604, 3438, 252, 200, 66, 66, 8, 5, 2, 2, 1, 1, 1, 1]
11   19 [6795, 6589, 3395, 1941, 200, 66, 66, 43, 15, 5, 5, 2, 2, 2, 1, 1, 1, 1, 1]
12   20 [6778, 6589, 3395, 1941, 200, 66, 66, 43, 17, 15, 5, 5, 2, 2, 2, 1, 1, 1, 1, 1]
13   24 [6589, 6566, 3372, 2082, 200, 71, 66, 64, 56, 15, 15, 10, 5, 5, 2, 2, 2, 2, 2, 1]
14   34 [6589, 6544, 3355, 1794, 200, 161, 120, 67, 64, 54, 47, 22, 17, 15, 12, 9, 9, 7, 6, 5]
15   51 [6544, 6535, 2396, 1794, 968, 190, 95, 90, 66, 64, 54, 54, 47, 45, 43, 21, 11, 9, 9, 7]
16   64 [6544, 3556, 2584, 2324, 1794, 968, 391, 190, 95, 72, 72, 66, 64, 56, 54, 47, 36, 22, 21, 19]
17   80 [6539, 3494, 2577, 2320, 1070, 1018, 745, 391, 146, 135, 67, 64, 62, 56, 54, 44, 36, 28, 21, 21]
18   94 [6535, 3391, 2577, 2320, 1068, 1018, 744, 391, 146, 135, 103, 67, 56, 56, 54, 43, 36, 28, 25, 23]
19   116 [6534, 3386, 2577, 2320, 1068, 926, 744, 389, 146, 135, 103, 92, 56, 54, 43, 36, 30, 27, 26, 24]
20   143 [6530, 2320, 2069, 1317, 1282, 1181, 1068, 926, 604, 185, 165, 145, 140, 114, 103, 84, 71, 64, 56, 50]

game.headers["Black"] == "New-b648247"
0   0 []
1   1 [19106]
2   1 [19106]
3   1 [19106]
4   1 [19106]
5   2 [18938, 168]
6   3 [18936, 168, 2]
7   3 [18936, 168, 2]
8   3 [18936, 168, 2]
9   3 [18936, 168, 2]
10   5 [18934, 152, 16, 2, 2]
11   6 [18497, 437, 152, 16, 2, 2]
12   8 [18489, 437, 152, 16, 7, 2, 2, 1]
13   9 [18489, 437, 152, 16, 7, 2, 1, 1, 1]
14   13 [17117, 1370, 334, 143, 103, 16, 9, 7, 2, 2, 1, 1, 1]
15   13 [17117, 1370, 334, 143, 103, 16, 9, 7, 2, 2, 1, 1, 1]
16   18 [17007, 1036, 334, 300, 127, 110, 72, 34, 31, 16, 16, 9, 7, 2, 2, 1, 1, 1]
17   23 [17006, 1036, 332, 300, 125, 110, 72, 33, 31, 16, 15, 9, 7, 2, 2, 2, 2, 1, 1, 1]
18   32 [14665, 2341, 1031, 332, 300, 123, 109, 72, 31, 31, 16, 14, 9, 7, 4, 2, 2, 2, 2, 1]
19   36 [14655, 2341, 1029, 332, 300, 121, 109, 72, 31, 31, 16, 14, 10, 9, 7, 2, 2, 2, 2, 2]
20   43 [14655, 2340, 622, 381, 332, 300, 107, 94, 72, 31, 31, 27, 20, 16, 15, 10, 9, 7, 6, 2]

Another test to re-examine residual values and ELO(auto-purge off):
http://tests.stockfishchess.org/tests/view/5e01f4f6c13ac2425c4a9ac4 Result:+4 ELO
So far SF10 has a lead (was 4 Elo) at 10+0.1, versus a 57 Elo loss at 1+0.2.
Update: SF10 started losing, master has +3ELO now.
I will run a third test after with -1 prio, just to be sure it wasn't a fluke.
http://tests.stockfishchess.org/tests/view/5e01ff1ec13ac2425c4a9ac7 Result:+ 9.66 ELO
(below results as test progressed)
ELO: -4.07 +-4.6 (95%) LOS: 4.3%
Total: 8212 W: 1810 L: 1887 D: 4515
Ptnml(0-2): 135, 1068, 1721, 1027, 108

ELO: -0.84 +-3.9 (95%) LOS: 33.9%
Total: 11272 W: 2489 L: 2513 D: 6270
Ptnml(0-2): 177, 1434, 2373, 1449, 156

ELO: -0.57 +-3.5 (95%) LOS: 37.6%
Total: 14034 W: 3118 L: 3127 D: 7789
Ptnml(0-2): 228, 1774, 2964, 1805, 201

ELO: 0.31 +-3.3 (95%) LOS: 57.2%
Total: 15826 W: 3528 L: 3494 D: 8804
Ptnml(0-2): 258, 1988, 3338, 2054, 232

ELO: 0.89 +-3.2 (95%) LOS: 70.7%
Total: 17292 W: 3867 L: 3806 D: 9619
Ptnml(0-2): 275, 2167, 3656, 2249, 256

ELO: 1.37 +-3.0 (95%) LOS: 81.0%
Total: 19147 W: 4301 L: 4209 D: 10637
Ptnml(0-2): 302, 2404, 4033, 2495, 294

ELO: 1.88 +-3.0 (95%) LOS: 89.3%
Total: 20277 W: 4571 L: 4441 D: 11265
Ptnml(0-2): 315, 2545, 4264, 2648, 318

ELO: 2.21 +-2.9 (95%) LOS: 93.4%
Total: 21523 W: 4852 L: 4700 D: 11971
Ptnml(0-2): 332, 2692, 4535, 2818, 337

ELO: 2.36 +-2.8 (95%) LOS: 95.0%
Total: 22631 W: 5125 L: 4955 D: 12551
Ptnml(0-2): 355, 2827, 4759, 2972, 359

ELO: 2.53 +-2.7 (95%) LOS: 96.7%
Total: 24483 W: 5543 L: 5360 D: 13580
Ptnml(0-2): 384, 3056, 5159, 3212, 395

ELO: 3.09 +-2.6 (95%) LOS: 98.9%
Total: 25457 W: 5799 L: 5565 D: 14093
Ptnml(0-2): 399, 3164, 5372, 3348, 420

ELO: 3.53 +-2.6 (95%) LOS: 99.6%
Total: 26133 W: 5992 L: 5710 D: 14431
Ptnml(0-2): 405, 3240, 5503, 3449, 433

ELO: 4.00 +-2.6 (95%) LOS: 99.9%
Total: 26914 W: 6210 L: 5871 D: 14833
Ptnml(0-2): 414, 3343, 5634, 3580, 450

ELO: 4.31 +-2.6 (95%) LOS: 100.0%
Total: 27459 W: 6352 L: 5977 D: 15130
Ptnml(0-2): 421, 3399, 5753, 3657, 462

ELO: 4.59 +-2.5 (95%) LOS: 100.0%
Total: 27803 W: 6437 L: 6051 D: 15315
Ptnml(0-2): 426, 3439, 5813, 3715, 471

ELO: 4.95 +-2.5 (95%) LOS: 100.0%
Total: 28419 W: 6617 L: 6177 D: 15625
Ptnml(0-2): 432, 3511, 5943, 3797, 491

ELO: 5.41 +-2.5 (95%) LOS: 100.0%
Total: 28895 W: 6739 L: 6260 D: 15896
Ptnml(0-2): 435, 3563, 6031, 3880, 501

ELO: 5.79 +-2.5 (95%) LOS: 100.0%
Total: 29567 W: 6922 L: 6399 D: 16246
Ptnml(0-2): 439, 3639, 6171, 3980, 514

ELO: 6.08 +-2.4 (95%) LOS: 100.0%
Total: 30274 W: 7085 L: 6540 D: 16649
Ptnml(0-2): 448, 3718, 6323, 4089, 527

ELO: 6.47 +-2.4 (95%) LOS: 100.0%
Total: 30945 W: 7268 L: 6666 D: 17011
Ptnml(0-2): 458, 3785, 6463, 4194, 541

ELO: 7.00 +-2.3 (95%) LOS: 100.0%
Total: 33438 W: 7888 L: 7179 D: 18371
Ptnml(0-2): 498, 4062, 6972, 4562, 584

ELO: 7.71 +-2.3 (95%) LOS: 100.0%
Total: 35345 W: 8374 L: 7562 D: 19409
Ptnml(0-2): 529, 4260, 7366, 4861, 620

ELO: 8.09 +-2.2 (95%) LOS: 100.0%
Total: 36309 W: 8618 L: 7744 D: 19947
Ptnml(0-2): 535, 4365, 7576, 5007, 636

ELO: 8.31 +-2.2 (95%) LOS: 100.0%
Total: 38429 W: 9145 L: 8203 D: 21081
Ptnml(0-2): 567, 4616, 8015, 5305, 681

ELO: 8.78 +-2.1 (95%) LOS: 100.0%
Total: 39441 W: 9397 L: 8396 D: 21648
Ptnml(0-2): 579, 4721, 8228, 5474, 700
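For anyone wanting to check the snapshots above by hand, here is a rough sketch of how an Elo estimate and 95% interval could be derived from the Ptnml game-pair counts. It assumes a plain normal approximation on the pair scores and logistic Elo; the function names are made up for illustration and this is not necessarily how fishtest computes it. Fed with the counts of the first snapshot it should land close to the reported -4.07 +-4.6.

```python
import math


def logistic_elo(score: float) -> float:
    """Convert a score fraction in (0, 1) to a logistic Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def pentanomial_elo(ptnml, z=1.96):
    """Return (elo, half_width) from pair counts for 0, 0.5, 1, 1.5, 2 points."""
    pairs = sum(ptnml)
    # Per-game score of each pair outcome: 0, 1/4, 1/2, 3/4, 1.
    values = [i / 4.0 for i in range(5)]
    mean = sum(v * n for v, n in zip(values, ptnml)) / pairs
    var = sum(n * (v - mean) ** 2 for v, n in zip(values, ptnml)) / pairs
    stderr = math.sqrt(var / pairs)
    # Map the score interval to Elo and take half of its width.
    lo = logistic_elo(mean - z * stderr)
    hi = logistic_elo(mean + z * stderr)
    return logistic_elo(mean), (hi - lo) / 2.0


# Pair counts of the first snapshot listed above.
elo, err = pentanomial_elo([135, 1068, 1721, 1027, 108])
print(f"ELO: {elo:.2f} +-{err:.1f}")
```

The variance is taken over game pairs rather than single games, which is the whole point of the pentanomial model: correlated results within a pair no longer inflate the error bar.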

For reference, a similar analysis for the 20k games (white == New) using noob_5moves book (455k positions). Mostly unique games, still about 8% are duplicate games up to ply 39, ~10% start from the same book position (all positions are unique in the book, but picking randomly should lead to duplicates) :

0   18263 [3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1   18265 [3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
2   18267 [3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
3   18279 [3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
4   18287 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
5   18300 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
6   18308 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
7   18321 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
8   18327 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
9   18346 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
10   18353 [3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
11   18359 [3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
12   18372 [3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
13   18385 [3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
14   18399 [3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
15   18412 [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
16   18419 [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
17   18430 [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
18   18440 [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
19   18452 [3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
20   18462 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
21   18469 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
22   18476 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
23   18481 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
24   18488 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
25   18495 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
26   18502 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
27   18507 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
28   18512 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
29   18517 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
30   18525 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
31   18528 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
32   18525 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
33   18530 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
34   18527 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
35   18524 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
36   18521 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
37   18521 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
38   18515 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
39   18514 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

On DEV with a couple of workers (just restarted to have a bigger number of workers) the test with startpos has ELO 63 after 4k games. High residuals (3).
https://dfts-2.pigazzini.it/tests/view/5e01dabf9613c94fedf1aa6c

@vondele what ELO are you expecting?

@ppigazzini I think the testing shows that one can get widely different elo based on the testing conditions. This is not too much of a surprise, now that we know the correlation is so strong. If by luck (i.e. TC, other things etc) one regularly picks a variant that is winning/losing, the elo result at the end will reflect this.

So, in my opinion, we've done enough testing on the startpos book :-) ....

If the noob_5moves.epd book has d=450.000 distinct positions and we are picking randomly n=20.000 positions from it, then the number of positions being chosen at random at least twice is given approximately by r = n^2 / d (see the wikipedia page on the birthday paradox, section 'Average number of people' in https://en.wikipedia.org/wiki/Birthday_problem ). Plugging the numerical values, this gives a value of r = 880, or 4.4% repeated games among the 20k games (white == New).

Questions:
1) Why is this different from the 10% repeated games reported by Joost from experience?
2) Do these 4.4% or 10% repeated positions invalidate our independence hypothesis for SPSA?
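The estimate can be cross-checked with a few lines. This is only a sketch under the assumption that every game draws its opening uniformly and independently from the book (which may not be how cutechess actually picks); the function name is made up, and 455,287 is the position count quoted for the 5moves book above.

```python
def expected_repeated_games(n_games: int, n_positions: int) -> float:
    """Expected number of games sharing their opening with at least one other game.

    Assumes uniform, independent opening draws from the book.
    """
    # Probability that none of the other games picked this game's opening.
    p_alone = (1.0 - 1.0 / n_positions) ** (n_games - 1)
    return n_games * (1.0 - p_alone)


n, d = 20_000, 455_287  # games in the sample, positions in the book
exact = expected_repeated_games(n, d)
approx = n * n / d
print(f"exact: {exact:.0f} games ({100 * exact / n:.1f}%), n^2/d: {approx:.0f}")
```

Both figures stay well below 10%, so under this model the uniform-draw assumption alone does not account for the observed duplicate rate.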

It might however be interesting to consider decorrelation within SF itself

@vdbergh Yes, I was thinking along this way that we could randomize a very little bit the evaluation during the first 5 moves of the game (say) to get more opening variations and alleviate the correlation problem.

I would prefer not to randomize evals... at least not till we have exhausted the other options.

It should be possible to use the opening book to solve it... for example, cutechess has options to go through the book in a different order, which could be used to make sure we play unique openings. I'll have a look.

Stockfish has more move diversity if it runs with more threads(e.g. 4) on opening position(similar effect with multipv>2).

With startpos master and sf10 are playing over and over 1. e4 e5 2. Nf3 (master) and 1. e4 e6 2. d4 (sf10)

The dev server result seems close to the +56 ELO of the 1+0.2 time control test.
http://tests.stockfishchess.org/tests/view/5e01e6e7c13ac2425c4a9abe
and +57 ELO on the hybrid book beta
http://tests.stockfishchess.org/tests/view/5dfc7ba9e70446e17e45104d
The 10+0.1 test is the anomaly.

@Chess13234 perhaps the ChessDBCN workers have too little noise for the highly correlated positions originated from startpos.

They seem to have no such issues with 1+0.2 TC, so the problem has to be in time management and only in DBCN workers - they also seem to be running some ancient kernel from 2013, so their thread time scheduling is not as good as modern linux.

Linux kernel 3.10.0-1062.9.1.el7.x86_64 is dated as Fri Dec 6 15:49:49 UTC 2019, probably the most recent version from the CentOS 7 branch. I'm quite sure they are not going to have some "thread time scheduling" problems that are only fixed in modern linux.

3.10.0-1062.9.1.el7
It's a patched version of the 2013 kernel.
https://kernelnewbies.org/Linux_3.10
Reached end of life November 4th, 2017 after 108 maintenance releases.
https://wtarreau.blogspot.com/2017/11/look-back-to-end-of-life-lts-kernel-310.html

And which exact patch that it does NOT contain might affect us here?
For me it's good enough as it still gets updated within this month after its 2017 "end of life", so whoever patched it must have ignored all important improvements and only changed version numbers.

By the way, this kernel has Spectre mitigation support from last year, so if it is really that ancient I'd suspect it may actually run a little faster in a multi-threaded environment.

Spectre mitigation overhead can reach up to 17%
on newest kernels.
https://www.phoronix.com/scan.php?page=article&item=linux50-spectre-meltdown&num=6

@vondele If you mean by "randomizing evals" adding a random number to the evaluation function then it is not as drastic as that I think. I think one basically adds a small random number at root to the result of searching a move. So SF's search would not be affected but there would be some variability in the moves that are selected at root.

Although to be honest I do not very much like this idea of introducing extra noise myself.

What happened to DBCN machines between this 192MB test
http://tests.stockfishchess.org/tests/view/5dbec3830ebc5925b64f11aa
and recent 128MB/192MB tests
http://tests.stockfishchess.org/tests/view/5e043975c13ac2425c4a9bdd
http://tests.stockfishchess.org/tests/view/5e020576c13ac2425c4a9aca
http://tests.stockfishchess.org/tests/view/5df9054dcde01bf360ab78db
In the first one, the 192MB test was joined by DBCN workers; the recent ones seem to be ignored?
Has something changed with the memory requirements in fishtest/cutechess?

I think it's because of different throughput.

From recent testing we know that since SF10, SF gained more elo at LTC than STC (58 elo vs 45 in the 2moves_v1 tests) . This is true for both search and eval patches (vondele tests showed ~14 elo at LTC and ~11 elo at STC for eval patches).

We've also merged yet another search tune patch, which did poorly at STC but managed to get a VLTC green.

10+0.1 is right in the zone where SF gets enormous amounts of elo with more time and isn't as great of a predictor of how things go beyond this zone.

Those add to the argument for a longer STC (15+0.15 or even 20+0.2), besides the beneficial stronger STC-LTC correlation that we could expect from such a move.

If STC predicted LTC results, the fishtest LTC page would have lots of greens, but it's actually <5%. STC obviously is a poor filter that doesn't predict much.

@Chess13234 You cannot have it both ways. If STC has strict bounds (trying to be a predictor for LTC) then people complain and resort to "speculative LTC". This is encouraged by the maintainers' policy that LTC (or even VLTC now) is the definite result and STC does not really count.

So the STC bounds were relaxed to get more greens to make people happy. But then you have to accept that STC is not a predictor for LTC, but only serves to quickly weed out the really bad stuff.

@vdbergh it would then make more sense to have a Medium TC like 30+0.3/20+0.2 to have a "predictor for LTC" instead of STC->LTC 5% chance of success.
If e.g. we have STC-MTC-LTC chain then most of STC patches will be rejected at MTC, instead of wasting LTC resources. Bonus: it would ensure the patch scales to 3 time controls.

If the STC was increased to 15+0.15 or 20+0.2 we would expect the pass rate at LTC to increase from the current <5% as the play at the two tc's would not be so far apart in ability. Surely this would be a good thing?
Speedups would become less likely to pass stc and fail ltc because of the large tc difference, and refined eval terms more appropriate to higher quality games would stand a greater chance of passing stc.

> Speedups would become less likely to pass stc and fail ltc because of the large tc difference, and refined eval terms more appropriate to higher quality games would stand a greater chance of passing stc.

I assume that by speedups, you mean functional simplifications ; not non-functional speedups (which imho we could test at VSTC, as long as it proves to be faster and is functionally identical...).

If so, I fully agree.

I can't speak for the maintainers. But I think nobody will object if someone does a "predictive" STC of SPRT{0,2} at say TC 20+0.2, on condition that if the predictive STC fails, one doesn't do a "speculative LTC" anyway. While this has the potential of saving some resources, it also substantially reduces the probability of a 1 Elo patch passing since it now has to jump three hurdles. One may look down on 1 Elo patches but the Elo prior was N(-1,1) some time ago. Probably it is worse now. So even 1 Elo patches should be considered very rare.

FWIW here is the new SPRT calculator which uses standard Elo.

http://hardy.uhasselt.be/Toga/SPRTcalculator.html

The corresponding python script is here

http://hardy.uhasselt.be/Fishtest/sprta.py

Draw ratios at STC and LTC are currently 56% and 68%. One can interpolate this for other untried TCs. Some scaling information is available in this thread #2459 but no one knows how this reflects on the scaling behavior of individual patches (the testing procedure probably has influence on the scaling of the end product, due to selection bias).

Since it is mainly a question of taste perhaps users should be given the choice between a "filter STC" of SPRT{-1,3} at TC 10+0.1 or a "predictive STC" at SPRT{0,2} at TC 20+0.2. Of course calculating the resource consumption of such a procedure only works if one does not resort to speculative LTC if STC fails.

Personally I feel that speculative LTCs should be disallowed (i.e. aborted), unless they are pre-approved by the maintainers.

A patch that loses speed passed STC recently with ~2 Elo.
http://tests.stockfishchess.org/tests/view/5e0e28b187585b1706b683d3
It also passed 1+0.2 with +3.39 Elo.
http://tests.stockfishchess.org/tests/view/5e0e460a62fb773bb7047ea2

This doesn't feel right...
http://tests.stockfishchess.org/tests/view/5e0cfbc487585b1706b68346
http://tests.stockfishchess.org/tests/view/5e0d2c6f87585b1706b6835b
http://tests.stockfishchess.org/tests/view/5e0d2cdc87585b1706b6835e
resulted in 3 red LTCs.
I retried with non-negative SPRT on LTC and it was also red
http://tests.stockfishchess.org/tests/view/5e0e29e487585b1706b683d6
Idk how to interpret these STC results at all now. I can't tell whether the idea is reasonable at all, even with 3 green STCs on it.

Yes STC bounds are too easily passed by luck (~20% chance for a 0 Elo patch).

I do start to have some local data that shows that there is rather strong TC dependence in Eval as well (in particular for a tuning of the complexity parameters).

@Vizvezdenec @Chess13234

With current STC bounds there is 19% probability of a 0 Elo patch passing. So that causes a lot of spurious greens at STC. And there is Murphy's law...
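For anyone who wants to reproduce the 19% figure, here is a minimal sketch. It is not the code behind the SPRT calculator linked above; it assumes the usual Brownian-motion (Wald) approximation of the LLR random walk, treats Elo as locally linear in score, and ignores overshoot at the bounds. The function name and defaults (alpha = beta = 0.05, giving the familiar ±2.94 LLR bounds) are chosen for illustration.

```python
import math

def sprt_pass_probability(elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate chance that SPRT{elo0, elo1} accepts a patch of true strength `elo`.

    Assumption: Brownian-motion approximation of the LLR walk, no overshoot correction.
    """
    lower = math.log(beta / (1 - alpha))   # about -2.94
    upper = math.log((1 - beta) / alpha)   # about +2.94
    # 2 * drift / variance of the LLR increments, expressed through Elo values only.
    h = (2 * elo - (elo0 + elo1)) / (elo1 - elo0)
    if abs(h) < 1e-12:
        # Driftless walk: the exit probability is just the ratio of distances to the bounds.
        return -lower / (upper - lower)
    return (1 - math.exp(-h * lower)) / (math.exp(-h * upper) - math.exp(-h * lower))

if __name__ == "__main__":
    print(f"0 Elo at STC {{-1,3}}: {sprt_pass_probability(0, -1, 3):.1%}")
    print(f"1 Elo at STC {{-1,3}}: {sprt_pass_probability(1, -1, 3):.1%}")
    print(f"0 Elo at LTC {{0,2}}:  {sprt_pass_probability(0, 0, 2):.1%}")
```

Under these assumptions a 0 Elo patch passes STC {-1,3} a bit under 19% of the time, a 1 Elo patch passes about half the time, and a 0 Elo patch passes LTC {0,2} about 5% of the time, which is in line with the numbers quoted in this thread.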

Concerning the Elo estimates. First of all one has to take the confidence interval into account. For example in this test

http://tests.stockfishchess.org/html/live_elo.html?5e0e460a62fb773bb7047ea2

one sees that the confidence interval is [-0.90,7.52]. So it still contains negative numbers. One cannot ignore this.

Finally: the Elo point estimates are _median unbiased_. This means that _over all patches_ (passing and failing) they have 50% probability of being lower than the actual value and 50% probability of being higher.

The key phrase here is _over all patches_. If you take the subset of all passing patches the Elo estimates are biased. This is called selection bias and it is impossible to correct for unless one knows an Elo prior.
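A toy simulation can make the selection-bias point concrete. This is only a sketch under simplifying assumptions, not a model of Fishtest: true patch strengths are drawn from the N(-1,1) prior mentioned earlier in this thread, each test returns the true Elo plus Gaussian noise, and "passing" is a crude threshold on the estimate instead of a real SPRT. The noise level and threshold are hypothetical values picked for illustration.

```python
import random
import statistics

def simulate_selection_bias(n_patches=100_000, noise_sd=2.0, pass_threshold=2.0, seed=0):
    """Return (median error over all patches, median error over passing patches)."""
    rng = random.Random(seed)
    all_errors, passing_errors = [], []
    for _ in range(n_patches):
        true_elo = rng.gauss(-1.0, 1.0)                  # assumed Elo prior N(-1, 1)
        estimate = true_elo + rng.gauss(0.0, noise_sd)   # unbiased but noisy measurement
        error = estimate - true_elo
        all_errors.append(error)
        if estimate > pass_threshold:                    # crude stand-in for "test passed"
            passing_errors.append(error)
    return statistics.median(all_errors), statistics.median(passing_errors)

if __name__ == "__main__":
    median_all, median_passing = simulate_selection_bias()
    print(f"median error, all patches:     {median_all:+.2f} Elo")
    print(f"median error, passing patches: {median_passing:+.2f} Elo")
```

Over all patches the median error sits near zero, while among the passing subset the estimates are clearly too optimistic, even though every individual measurement is unbiased.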

The issue of course is that there is no true Elo prior (Fishtest development is very chaotic, with a lot of dependence between tests). However perhaps this is ok. In Bayesian statistics the probabilities are not empirical probabilities, but rather "degrees of belief". So the prior does not have to correspond to reality.

@vdbergh How can I have confidence, when I'm not passing STC tests, that the patch is bad, without a speculative LTC? I suspect many patches fail because of STC randomness (move time variance, thread time allocation and bad time management).
This may also extend to 20+0.2 tests to a lesser degree, as this TC is barely above STC.

I understand all of this. What I'm trying to say is that, judging by 2 STC passes and 2 LTC fails, I can't say that the idea just doesn't scale at all (while previously I could make this claim with pretty high probability), and I can't fully ditch it either, because 2 failed LTCs are normal for a patch of, let's say, 0.5 Elo, which is actually our average Elo gain per patch.
Maybe it's only a personal preference, but I find this kinda annoying. I would propose moving the lower STC bound to -0.5 to reduce the number of false positives and make STC at least slightly more reliable.

@Chess13234 when a patch doesn't pass STC it is likely bad at that time control. The bounds are such that it is easy to pass. You can see pass ratios here: http://hardy.uhasselt.be/Toga/SPRTcalculator.html

(even though one can argue that having ~50% chance for a 1 Elo patch is low as well).

11 passed STC (LEminroot11 pull request)
10 fails STC
http://tests.stockfishchess.org/tests/view/5e0c26339d3fbe26f672d52a
12 fails STC
http://tests.stockfishchess.org/tests/view/5e0c27329d3fbe26f672d52e
9 passed STC
http://tests.stockfishchess.org/tests/view/5e0c25a59d3fbe26f672d528
6 passed STC
http://tests.stockfishchess.org/tests/view/5e0bba3b9d3fbe26f672d500
7 failed STC
http://tests.stockfishchess.org/tests/view/5e0bbacb9d3fbe26f672d502
All these patches are nearly equal in strength. That doesn't explain how depths 9 and 11 win while 10 and 12 lose. (By that I mean they should all either pass or fail, or show a strict pattern like failing below or above a certain depth - there isn't any pattern here.)
When I test 4-5 variants of the same patch I want to know if I found a good one. I'm not running several STC tests on the same patch to establish statistical confidence within some bounds - I just want some confidence before starting a long LTC test.

That's what is to be expected if the pass rate is around 50% (i.e. 1Elo patch).

> That's what is to be expected if the pass rate is around 50% (i.e. 1Elo patch).

The STC tests (LEMinRootX, to establish the rootDepth at which LazyEval can be safely disabled) show wildly different Elo:

test name/depth | ELO | Bound | LOS | WLD
---------------- | --------- | --------- | --------- | ---------
LEMinRoot16 | -1.32 | -5.42,2.92 | 26.6% | w:21.8%,l:22.3%,d:55.9%
LEMinRoot15 | -2.64 | -7.51,2.34 | 14.8% | w:21.2%,l:22.1%,d:56.7%
LEMinRoot14 | -7.38 | -14.38,-0.36 | 2.0% | w:20.8%,l:23.1%,d:56.1%
LEMinRoot13 | -2.24 | -6.89,2.53 | 17.6% | w:21.6%,l:22.4%,d:56.0%
LEMinRoot12 | -3.14 | -8.27,2.10 | 11.8% | w:21.0%,l:22.1%,d:56.9%
LEMinRoot11 | 4.64 | -0.34,9.52 | 96.6% | w:22.8%,l:21.2%,d:56.0%
LEMinRoot10 | 0.39 | -2.38,3.44 | 60.8% | w:21.9%,l:21.9%,d:56.2%
LEMinRoot9 | 6.50 | 0.64,12.31 | 98.5% | w:23.2%,l:21.1%,d:55.7%
LEMinRoot8 | 4.53 | -0.40,9.34 | 96.4% | w:22.5%,l:21.0%,d:56.5%
LEMinRoot7 | -4.50 | -10.31,1.36 | 6.6% | w:20.7%,l:22.2%,d:57.1%
LEMinRoot6 | 2.57 | -1.19,6.13 | 91.3% | w:22.2%,l:21.3%,d:56.5%
LEMinRoot5 | -1.83 | -6.24,2.71 | 21.2% | w:21.4%,l:22.1%,d:56.5%

(9 has the best bounds and LOS%, 11 cuts the most of early depth)

Elo estimates must be considered with their uncertainties e.g. LEMinRoot12 : -3.1368 [-8.2741, 2.0962], so all are consistent with 0 Elo (or 1 Elo for that matter).
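As a quick mechanical check, the bounds from the LEMinRoot table above can be tested against 0 Elo. This is just a sketch that copies the bounds from the table; the dictionary and helper name are illustrative.

```python
# (lower, upper) Elo bounds copied from the LEMinRoot table above.
intervals = {
    "LEMinRoot16": (-5.42, 2.92),
    "LEMinRoot15": (-7.51, 2.34),
    "LEMinRoot14": (-14.38, -0.36),
    "LEMinRoot13": (-6.89, 2.53),
    "LEMinRoot12": (-8.27, 2.10),
    "LEMinRoot11": (-0.34, 9.52),
    "LEMinRoot10": (-2.38, 3.44),
    "LEMinRoot9": (0.64, 12.31),
    "LEMinRoot8": (-0.40, 9.34),
    "LEMinRoot7": (-10.31, 1.36),
    "LEMinRoot6": (-1.19, 6.13),
    "LEMinRoot5": (-6.24, 2.71),
}

def covers(interval, elo):
    """True if `elo` lies inside the closed interval (lower, upper)."""
    low, high = interval
    return low <= elo <= high

# Tests whose reported bounds do not include 0 Elo.
excluding_zero = [name for name, ci in intervals.items() if not covers(ci, 0.0)]
print(excluding_zero)
```

Ten of the twelve intervals contain 0. The two exceptions are LEMinRoot14 and LEMinRoot9, and even those come within a fraction of an Elo of zero, so the table gives little ground for ranking the variants.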

The SPRT tests are designed to give the quickest possible answer to the question asked (i.e. closer to bound 1 or bound 2), not to be accurate in Elo. That's just a fact that should not be ignored.

http://talkchess.com/forum3/viewtopic.php?f=6&t=72741
About the 8 moves book: it seems that one of the openings is basically a forced win. Can anyone check some stats for it in regression tests?

If the normal STC bounds give a 20% pass chance for a 0 Elo patch,
maybe this is not what we need when measuring speed-ups.
What STC bounds would be more acceptable?

I think the first thing to do is to measure an actual speedup with perf as described here:
https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#speed-optimization

Afterwards, I would suggest our current LTC bounds {0,2} but with STC TC.

Good. I have respun http://tests.stockfishchess.org/tests/view/5e14941961fe5f83a67dd85f with the suggested bounds, because IMHO 5000 games is not enough for any conclusion.

On the abrok site, the Elo (displayed in red) for the tuned null search STC is +4.34;
however, the value is different when we look at
http://tests.stockfishchess.org/html/live_elo.html?5e0ba4159d3fbe26f672d4e6

So the Elo calculation has changed...
Does this mean that the latest regression Elo cannot be compared with previous regressions?

@Rocky640 fishtest switched to logistic Elo with this PR https://github.com/glinscott/fishtest/pull/479

@vdbergh could confirm (just in case something wrong sneaked in the conversion to logistic)... but I would expect that abrok uses a simple calculation that is only correct for fixed number of games results, while live_elo is using the correct Elo calculation in the case of a SPRT run.

@Rocky640 Yes, what has been said is correct. You cannot use the same calculation for SPRT tests as for fixed length tests. I can give references to the literature, but it is a bit complicated.

The elo estimates and error bars given by live_elo.html have been checked by simulation.

OK, I found that abrok has always displayed a different Elo than the live Elo calculator,
and it has been like this for months, so this is a non-issue.

However can we compare today's regression test (which is using new logistic-elo and pentanomial)
http://tests.stockfishchess.org/tests/view/5e1472da61fe5f83a67dd84f
http://tests.stockfishchess.org/tests/view/5e14734d61fe5f83a67dd851

with the previous (which was using old bayes-elo and trinomial)
http://tests.stockfishchess.org/tests/view/5def70363cff9a249bb9e4c5
http://tests.stockfishchess.org/tests/view/5def70ae3cff9a249bb9e4c8

Fixed number of games runs have always used logistic Elo ... and I believe the Elo calculation for a fixed number of games is essentially identical between tri- and pentanomial?

I believe the same as vondele on this. No adjustment is made to the Elo derived from the raw results with fixed games, and Elo is a direct function of the WDL results, so pentanomial is not changing anything here afaict.
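To illustrate "Elo is a direct function of the WDL results" for a fixed number of games, here is a minimal sketch of the standard logistic Elo formula. It is not taken from the fishtest source, the function name is illustrative, and the counts in the usage lines are hypothetical. As noted above, this simple calculation should not be applied to SPRT runs.

```python
import math

def logistic_elo(wins, losses, draws):
    """Logistic Elo difference implied by raw W/L/D counts of a fixed-length match."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games          # average score per game, strictly in (0, 1)
    return -400.0 * math.log10(1.0 / score - 1.0)

if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"{logistic_elo(1000, 1000, 3000):+.2f}")   # equal wins and losses
    print(f"{logistic_elo(1100, 1000, 2900):+.2f}")   # slight plus score
```

Only the average score enters the formula, so grouping the same games into pairs (pentanomial) leaves the point estimate unchanged; what changes is the variance estimate and therefore the error bars.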

@vondele The Elo is the same but the error bars will be 5% smaller using pentanomial due to the fact that 8moves_v3 is still somewhat unbalanced (various tests show an RMS bias of around 60 Elo).

Thank you for those explanations! It clarifies a few things.

I think it's about time to respin this discussion after a quite disappointing regression test (it's not finished, but it's quite obvious that it most likely won't finish positive).
So, we made STC bounds really loose, and now the probability of a 0 Elo patch passing both stages became 18% * 5%, so about 0.9%. It seems that this is too much - 7 Elo-gaining patches result in what seems to be a slightly negative Elo gain.
I guess we should do something about this.
1) The most obvious step is that we should probably do simplification attempts for all 7 passed patches that made it into master since the SF11 release. Probably just at LTC;
2) It seems that 0.9% is too high a chance for a negative patch to pass. Since we want loose STC bounds to give more patches a shot at LTC, we should probably make the LTC bounds themselves slightly stricter, since a lot of negative patches go to LTC and each of them has a decent chance to pass.
My proposal is to change the LTC SPRT bounds to {0.25, 2.25} or {0.5, 2.5} - the second one is closer to the 0.25% regression chance we had like forever, the first one will allow more patches to pass.
I guess that's all from me for now, your opinion is really appreciated :)
I think that the stronger the engine gets, the stricter the non-regression percentage should be (yes, it's sad, because fewer patches will pass), because the percentage of passed patches becomes lower and lower, thus more and more patches are tested and more and more patches lie in the "slightly negative" zone.

Maybe we can also slightly move the lower bound of STC.
I think a good compromise between everything could be
STC {-0.5; 3}, LTC {0.25; 2.25}.
The chance of a negative patch passing will be something like 0.26% - more or less what we used for years. We will have slightly fewer LTCs (which is, imho, a good thing; nowadays we run endless LTCs, most of which are not even close to passing), the overall game count wouldn't increase that much, and the STC-LTC correlation will be slightly more reliable.
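The combined numbers can be sanity-checked with a short sketch. It assumes the Brownian-motion (Wald) approximation of the SPRT with alpha = beta = 0.05, a patch that is exactly 0 Elo at both time controls, and independent STC and LTC outcomes; none of this comes from the fishtest code, and the names are illustrative.

```python
import math

def pass_probability(elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate SPRT{elo0, elo1} acceptance probability (no overshoot correction)."""
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    h = (2 * elo - (elo0 + elo1)) / (elo1 - elo0)
    if abs(h) < 1e-12:
        return -lower / (upper - lower)
    return (1 - math.exp(-h * lower)) / (math.exp(-h * upper) - math.exp(-h * lower))

def two_stage_pass(elo, stc_bounds, ltc_bounds):
    """Chance of passing STC and then LTC, assuming the same Elo at both and independence."""
    return pass_probability(elo, *stc_bounds) * pass_probability(elo, *ltc_bounds)

if __name__ == "__main__":
    schemes = {
        "current  STC{-1,3}   + LTC{0,2}":       ((-1.0, 3.0), (0.0, 2.0)),
        "proposed STC{-0.5,3} + LTC{0.25,2.25}": ((-0.5, 3.0), (0.25, 2.25)),
    }
    for label, (stc, ltc) in schemes.items():
        print(f"{label}: {two_stage_pass(0.0, stc, ltc):.2%}")
```

Under these assumptions the current scheme lets a 0 Elo patch through a little over 0.9% of the time, and the proposed bounds bring that down to roughly 0.27%, close to the figures quoted above.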

@Alayan-stk-2 @vondele @snicolet @xoto10
