Stockfish: SPRT parameters improvement

Created on 9 Dec 2018  路  132Comments  路  Source: official-stockfish/Stockfish

After a very long and interesting discussion https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445429885 I think it's time to summarize in quantitative way what people have found and proceed in a practical / executive fashion.

The topic is the improvement of SPRT parameters aimed at:

  • Improve precision, i.e. discard zero or negative patches and accept positive ones with higher margin

  • Manage resource consumption.

It is immediately clear that the 2 goals go in opposite directions, anyhow from the discussion it seems there is some consensus to allow a reasonable increase of test resources in exchange for stricter parameters.

To allow the discussion to be quantitative and on the point, I'd like people post here only simulation results in a simple and standardized format:

| Limits | SPRT[0, 5] STC + SPRT[0, 5] LTC |
| --- | --- |
| 0 ELO pass prob | xxx |
| 1 ELO pass prob | xxx |
| <= 0 ELO pass prob | xxx |
| >= 1 ELO fail prob | xxx |
| Avg. STC cost | xxx |
| Avg. STC + LTC Cost | xxx |

The first 2 records give a taste of the sensibility of the scheme, the 3 and 4 give a taste of the failure rate of the scheme and the last 2 of the cost of the scheme.

Now, on the cost. Patch's ELO is not uniformly distributed, namely the biggest part of tested patches are within [-2, 0] ELO (neutral or slightly negative). We have to consider this to compute a standardized cost.

I propose the following. Given:

ag(x) = Average number of games for pacthes with x ELO

We define the cost as:

STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2)

For the STC + LTC cost we have to consider that:

LTC Cost = STC Cost * 6

Given that all the above is not trivial to implement, I would split this project in 2 phases;

  1. First find a standard tool to reliably compute all the above. Publish it and let people refer to the same measuring tool.

  2. Once we have an open, shared and validated tool then compute the different schemes.

The standard tool should be sound. I would prefer a tool based on simulations more than on formulas, it sounds to me simpler and easier to review for a wider audience of people, and also more flexible. Everybody is encouraged to submit their script / simulator. Once the process has stabilized on a shared and validated single tool, then we will proceed to the second phase.

It won't be quick, but this is an engineering approach to the problem: solid, slow and boring :-)

Most helpful comment

All 132 comments

@mcostalba did you have a look at the actual tool I made available https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-442254913 ? Just click the mybinder icon, and have a look.

yes, it took about two weeks to write, so at least the slow bit is there ;-)

@vondele Great job! The tool is here: http://nbviewer.jupyter.org/github/vondele/Stockfish/blob/tools/tools/Fishtest_SPRT_opimization.ipynb

Some questions:

  • You use Gaussian distribution for the ELO of the tested patch, but this is not a god model because ELO is clearly asymmetric, more patches in [-2, 0] than in [0, 2] ELO intervals. So could your tool be changed to account for real distribution in [-4, 4]? Outside of this we don't care in practice.

  • You lost me on total cost calculation, indeed I understand the cost for a patch of a given X ELO, but I don't understand if you weight the total cost along real ELO distribution.

  • Would be possible to fill the above table (at the top of the post) with the data of your tool? (for the cost it is ok to use avg. number of games as you have defined it).

In addition to changing the bounds, it would probably also make sense to increase the time control of the STC to make the difference between STC and LTC smaller, and to reduce the need and surprises with "speculative LTCs". I would propose 20 + 0.2.

@mcostalba Having done tweaks and bound testing with vondele's tool, I think I can give decent answers to your questions :

  • The gaussian distribution is indeed not a good model because elo losing patches exceed elo-gaining patches. The tool can absolutely be updated to use a different model. As I wrote here : https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-442378770 ; I could easily replace the simple gaussian with a function returning the gaussian value for x/3 when x<0. This quick hack helped to better approximate real distribution,and the rest of the tool adapted fine to the change. I think we can devise a better formula than this for the patch elo distribution, and then it's very easy to put it in the code and have everything use the new distribution.
  • The total cost in vondele's tool depends on the elo distribution of the patches. Doing my quick hack mentionned above to get a better approximation of the elo distribution of the patches did change the total cost result, and in the expected way. I also had a look at that code then and I noticed nothing wrong.
  • All the values in the table can be computed with it.

Also notice that the tool focus on the cost of 5xSTC + LTC more than STC + LTC (LTC only cost resources if the STC pass, the tool accounts for this) ; because it is observed in practice that several similar variations of a tweak often get submitted to fishtest. But this can be changed easily.

@Yery While longer TC for testing is attractive in reducing low-depth flukes, it is also more costly in resources. As much as I'd like fishtest to use longer STC and LTC, we have to be careful with this. A ratio of only 3 between STC (which is supposed to act first as a filter) and LTC is likely not a good compromise between resource use and result quality. STC 15/LTC 75 would likely consume similar or smaller resources than 20/60 with better overall results.

While discussing TC increase fits here (the optimal bounds depend directly on the time ratio between STC and LTC), the increased precision which is the main focus here already will increase resource usage. Adding another multiplicative resource usage increase on top of it may be too much.

@Yery thanks for your comment but I'd like firstly to land on a validated and shared tool, one we have it we can do all the discussion we want, but for the moment, to avoiding lacking of focus, please let's stick on the tool implementation details.

@Alayan-stk-2 for the real distribution I would suggest a real array of values (we just need less then 10 of them) that correspond to the real distribution. Easy and explicit.

@vondele could you please copy / paste your tool and simplify it down to just compute the values of a single SPRT combination? We will call it the official simulation tool, and it should be simple: no need to compute both new and old SPRT parameters, just one set, whose results we will report here.

@Alayan-stk-2 for the real distribution I would suggest a real array of values (we just need less then 10 of them) that correspond to the real distribution. Easy and explicit.

Vondele's tool work using continuous functions at many places. Turning it into the evaluation of a few discrete elo patch value would likely be more work for a worse result.

The array idea based on observed fishtest data is still applicable, however. We can get a table with the values hardcoded, then do a linear interpolation between the two closest value in the table.

Trivial example in pseudocode to get the value of the patch elo probability function at x :

array[] = {0.1,0.15,0.08} // values at -1, 0, 1 elo

if (x <  -1 ||聽x > 1)
    return 0

x++
for (i=0;i<3;i++)
{
    if (x > i)
    {
        return ((x-i)*array[i] + (i+1-x)*array[i+1])
    }
}

@mcostalba

let me have a look, the table can be 'easily' computed with the notebook.

your: STC Cost = 10 * ag(-2) + 35 * ag(-1) + 40 * ag(0) + 25 * ag(1) + 5 * ag(2)
is quite close to a slightly different Gaussian model (mu = -0.21 sigma = 1.112) of patch distributions, which I will use (it is quite realistic, even if optimistic). It is easier to work with distributions.

What are arguments against tighter time controls like 1+0.2 and 2+1? They should in theory use less time for shorter games and have less variation in searched depth(more reliable results for SPRT).
https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445472297

I'm currently running this test on low throughput to see how variable win-loss records lead to SPRT distortion(random depths searched due time slice in 10+0.1 being widely different)
http://tests.stockfishchess.org/tests/view/5c0c43c10ebc5902bceed541
and its comments here:
https://github.com/Chess13234/Stockfish/commit/23c7a4101afd6e407ced0b2e92ecd5002cc19434

@vondele thanks. Actually my numbers are just out of my hat. I'd guess the best ones are the histograms you plotted in the "ELO estimates of STC tests" graph, in case you got them from real tests (indeed I don't know the source of them).

@vondele I am not an expert in statistic but maybe we could find a distribution that better resembles the data, for instance a skewed one: https://en.wikipedia.org/wiki/Skewness

@mcostalba yes, I got them from the fishtest analysis (elo estimates of the past X-thousand tests or so). However, there is no need to be overly concerned to fit that data (it is easy), the part below zero matters little (it is quickly all rejected with the testing schemes we have, most resources go to positive part of the elo spectrum).

The most important aspect is that the distribution of patches tested on fishtest is not absolute, but depends on the test scheme anyway (i.e. unavoidably, patches close to passing will be tweaked and resubmitted). Therefore the notebook is just testing reasonable assumptions and can guide a policy decision, but can't predict the output of a bounds change (that's more a social experiment than natural science).

I have the notebook with the table implemented in a couple of minutes, but will avoid the simplification of it, that's more work, and as you mentioned the graphs are IMO more informative than the table.

I've updated the notebook, at the end there is the table Marco asked for. (the links posted previously should still work, but note that these are cached for a few minutes) Feel free to mention possible errors in the notebook.

Now, the table for a couple of bounds is below.
(row three and for slightly redefined, and all based on 1 STC try, and Marco's patch distribution)

Limits | [0, 5] STC + [0, 5] LTC | [1,4] + [0,3] | [0.5, 4] + [0, 3]
-- | -- | -- | --
0 ELO pass prob | 0.0025 | 0.00036 | 0.001
1 ELO pass prob | 0.0743 | 0.08051 | 0.145
+1 ELO rejectance ratio | 0.705 | 0.613 | 0.532
-0 ELO acceptance ratio | 0.0004 | 3.5e-05 | 0.0001
Avg. STC cost | 23622 | 47822 | 41507
Avg. STC + LTC Cost | 47505 | 80060 | 90712

@vondele How did you compute the elo estimates? I am asking since there are basically three options: the MLE one (which is the naive one), the median unbiased estimator (which is the one currently used and which is too high/low in 50% of the cases) and the mean unbiased estimator (whose expectation value is equal to the true elo value). For a fixed length test these three estimators coincide but they are (quite) different for SPRT tests.

The median biased estimator (which you proposed!) is intuitively quite satisfying and easy to compute. However for further statistical processing a mean unbiased estimator is in principle more desirable. I have a tool which computes the mean unbiased estimator for SPRT tests but I have not released it yet since currently it only works robustly for small elo differences. However it could already be useful in this case.

To have an idea of what I am talking about here are estimates for the expected values of the three estimators for an SPRT(0,5) with elo=-1 (they were obtained by simulation so they are not completely accurate).

mean=-0.99 (2.27) median=-1.52 (2.36) MLE=-2.04 (2.56)

(the values between parenthesis are the RMS errors between the estimator and the true value, tautologically the mean unbiased estimator has minimal RMS error).

Here is a sample for elo=2. Now the difference between the median/mean unbiased estimators is less dramatic.

mean=2.00 (2.00) median=2.15 (2.31) MLE=2.33 (2.66)

@vondele thanks. Interesting results.

I see that we double the cost in terms of Fishtest efficiency. This is not an easy choice.

For complex patches +1 ELO is really border line because we may not want to add a lot of complex stuff for a +1 ELO increase. This is just my point of view of course.

Also going from STC[0, 5] to STC[1, 4] doubles the STC requested resources, so maybe makes sense to keep STC[0, 5] and then going for a stricter windows for LTC.

@vondele could you please:

  • I don't understand what it means: 1 ELO pass prob = 0.0743. It means 7%? Could you please print % instead of floating point values?

  • Compute also with the follwing cases: STC[0, 5] + LTC[1, 5] and STC[0, 5] + LTC[2, 5]

  • Add a row in the table:
    +2 ELO pass prob

Indeed the +2 ELO patches are the threshold between good stuff and almost useless but burdening stuff. So I am curious of what happens there.

@vdbergh, the elo estimates are based on your script (i.e. sprt_elo.py simply run on all results), so the median unbiased estimator... IIUC. Note that the challenge is even more difficult than the estimation of the elo value of a given test. We would like to get the underlying distribution of all tests.

@mcostalba the notebook is there to do all kind of experimentation... there are unlimited possibilities. With the mybinder link, it will open in your browser, no need to install anything.

concerning 0.0743 yes converting to percent means 7.43 %.

BTW, I think there are essentially no 2 Elo patches that are still being found.

@vondele I agree it is an interesting challenge. Do you still have the raw data (or is there an easy way to get it)?

EDIT. With whatever estimator we use we can calculate the transformation

true elo distribution -> (expected) observed elo distribution

So the challenge is to invert this. I wrote earlier that it is a convolution but this is not true. But in any case it seems to be just linear algebra (since we can discretize the problem) which numpy can do very efficiently.

Less naive and probably more robust would be to compute the MLE for the elo distribution directly. I have not thought how hard this would be.

@vondele thanks. Interesting results.

I see that we double the cost in terms of Fishtest efficiency. This is not an easy choice.

You have to divide the elo gains by the resource costs.

Also, as was pointed out in the other thread, the elo-losers have practically no importance when it comes to the elo impact of passing test, but they have an importance when it comes to resource usage.

The proposed new bounds would reject elo-losing patches as fast or faster than now, so the relative increase in total resource cost is smaller than in vondele's data in the table.

You can get about +50% elo passing for +80% resource usage (and good patches are the bottleneck much more than resources at fishtest currently, especially as resources sitting unused push people to try random things).

For complex patches +1 ELO is really border line because we may not want to add a lot of complex stuff for a +1 ELO increase. This is just my point of view of course.

This is certainly a valid reasoning, but what defines "complex patch" ? Is a one-liner a complex patch ?

I think it is a better approach to ask for more tests when a complex change gets submitted and passes testing (which is not often) than to make fail many simple good patches.

7,43% of +1 elo patches passing is quite low, and this also means a majority of +1.5 and +2 elo patches failing. In practice, we will often get multiple minor variations, which increase the likelihood to pass but also the resource usage at testing (which isn't more efficient at the end).

And as SF improves, elo gains get harder. Regression tests are showing that the average elo-gaining patch is much closer to +1 elo than to +3 elo (count the number of elo-gainer during SF10's development cycle, substract some for simplifications, and compare to the total elo gain).

This indicates that elo-gaining patches merged are very often +1/2 elo gainer which gets lucky during STC/LTC rather than outstanding +3/4 elo gainers. Having a lottery to decide which good patches get merged and which get rejected isn't great.

Also going from STC[0, 5] to STC[1, 4] doubles the STC requested resources, so maybe makes sense to keep STC[0, 5] and then going for a stricter windows for LTC.

See my points above. The elo/resource ratio goes down with the newer bounds proposal, but not as much as a division by 2, far from it.

* Compute also with the follwing cases: STC[0, 5] + LTC[1, 5] and STC[0, 5] + LTC[2, 5]

This will reject even more good patches all the while not improving resource usage (closer bounds usually means more, but higher means quicker rejection for patches below the lower bound so not sure which effect dominates). It's 100% guaranteed that those bounds would be awful.

@vdbergh https://github.com/vondele/Stockfish/blob/tools/tools/stc_tests_elo is the raw elo data for 6719 tests.

@vondele Thx. This is in principle already useful. But you did not keep the original (W,D,L) data?

EDIT. I guess I can collect this data myself but some indication how you did it might be useful (then I do not have to think).

here we go https://github.com/vondele/Stockfish/blob/tools/tools/stc_tests
that was obtained from downloading all results and grepping for the useful part

@vondele Great! Thx.

EDIT: This is fascinating data. I did some preliminary processing. It appears that the average submitted patch is -2.5 elo (this should be a statistically correct result but it is probably heavily influenced by outliers). Next weekend or so I will try to get an estimate for the true underlying elo distribution (I am not yet sure if it is feasible).

BTW, I think there are essentially no 2 Elo patches that are still being found.

+1

I don't think there are many "complex" patches, either, although different people might have different views on what counts as "complex".

well, we just had 2 elo gaining patches resulting in +4.4 elo in RT, so at least one of them is 2 elo :)

@Vizvezdenec ... hopefully, but don't forget the error bars on the RT. It is 4.4 +- 2

I've just counted actually.
We had 80 elo gainers in sf9->10 that resulted in 55 elo on 8 moves.pgn and probably slightly more than 60 on 2 moves.pgn.
So, IMHO, borderline of garbage patch should be 1 elo or maybe even lower - since on average our elo gainers from latest times are passing with this elo or less.

Limits | [0, 5] + [0, 5] | [0.5, 4.5] + [0,3.5] | [1,4] + [0,3] | [0.5, 4] + [0, 3]
-- | -- | -- | -- | --
0 ELO pass prob | 0.0024 | 0.0012 | 0.00036 | 0.001
1 ELO pass prob | 0.0744 | 0.1229 | 0.08051 | 0.145
+2 ELO rejectance ratio | 0.3101 | 0.2034 | 0.1423 | 0.1224
-0 ELO acceptance ratio | 0.0004 | 0.0001 | 3.5e-05 | 0.0001
Avg. STC cost | 23622 | 32401 | 47822 | 41507
Avg. STC + LTC Cost | 47505 | 67451 | 80060 | 90712

I would say that STC[0.5, 4.5] + LTC[0,3.5] is a good compromise between resources used and performance. It increases average test time by a 42% (that is not small) but detects 4 good patches out of 5 instead of the currently 2 out of 3 (reject ratio 0.2034 vs 0.310).

The tool is still using a gaussian curve to simulate elo patch distribution

I went ahead and used the array method for patch elo distribution :

def patch_elo_dist(x):
    #linear array from -5.25 to 5.25, every 0.5
    elo_array = [140, 150, 200, 230, 260, 300, 330, 410, 530, 570, 600, 680, 610, 490, 80, 45, 35, 30, 25, 20, 15, 10]
    z = (x+5.25)*2
    for i in range (0, 22):
        if (z > i and z <= i+1):
            return ((i+1-z)*elo_array[i] + (z-i)*elo_array[i+1])/5690

    return 0

The granularity is quite problematic around 2 elo, the result is still interesting.

| Limits |聽[0,5] + [0,5] | [0.5, 4.5] + [0,3.5] | [1,4] + [0,3] |聽[0.5, 4] + [0, 3] |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 ELO pass prob |聽0.0025 | 0.00123 | 0.00036 | 0.0011 |
|聽1 ELO pass prob |聽0.0744 | 0.1002 | 0.0805 | 0.1451 |
| +1 ELO rejectance ratio | 0.6647 | 0.5972 | 0.5950 | 0.5207 |
| +1.5 ELO rejectance ratio | 0.3705 | 0.2866 | 0.2563 | 0.2116 |
|聽+2 ELO rejectance ratio |聽0.1589 | 0.0968 | 0.0620聽| 0.0553 |
|聽total ELO gain ratio | 1.0 | 1.1354 | 1.1436 | 1.2841 |
| -0 ELO acceptance ratio |聽0.00022 |聽8.1e-05 | 2.0e-0.5 | 0.0001 |
| Avg. STC cost | 10107 | 13799 | 20293 | 17595 |
|聽Avg. STC + LTC Cost |聽20332 | 28959 | 34261 | 38942 |

Note that while elo-loser acceptance ratio goes down, the number of elo-loser goes up with this distribution.

It appears that while most of the potential elo gain in submitted patch is in +1 elo patches, the bulk of the "easy to grab" elo gain is still in the +2 elo patches.

@Alayan-stk-2 thanks. It confirms [0.5, 4.5] + [0,3.5] as the most balanced IMHO.

On one hand with these new parameters we will have more green LTC, as is the main target behind this change.

I am a bit worried of +1 ELO big patches that not always are worth committing, but this is counterbalanced by simplifications that will eventually remove later a weak complex patch. Instead weak simple patch will stay.

Indeed the interplay between new patches and simplifications is a strong one. It is a kind of dialectic process, a live dynamic that allows to strengthen SF in the long term in a good way: it values more than the sum of the single parts.

The previous suggested bounds got most of their increase from patches between 2 and 3 elo, but with the array distribution we get a sharp drop of patches at 2+ elo (the numbers in the arrays are eyeballed to correspond to vondele's measured elo estimates graph with some smoothing for +2 elo-gainer). [1,4] + [0, 3] and [0.5, 4] + [0, 3] are then too costly.

I've tried some bounds tweaking. but it's hard to get a significant improvement, and not worth it considering another set of data for patch distribution may change where the optimal point lies. It would likely be better to just apply the [0.5, 4.5] + [0, 3.5] bounds and gather new data about fishtest usage in a few months to evaluate if another tweak may be worth it.

Meanwhile, maybe we can also get improvements to the opening book used in fishtest as suggested in #1853 and in other elements which add noise to the testing data.

@snicolet what do you think?

I'm fine with the proposed [0.5, 4.5] + [0,3.5] : indeed we can evaluate the change in a few months to estimate if it works in practice.

The most impressive feature in the discussion has been the quality of the argumentation, as usual! :-)

I'm also fine with [0.5, 4.5] + [0,3.5] .

I'd like to make one remark on speculative LTC (or testing how this scales, etc) in this context... these will be the most expensive tests to run, and have more risk to regress. As such I would propose a) to approve the only exceptionally, and if they pass, require a second LTC to pass.

I assume we keep the simplifications at [-3, 1].
Do we change bounds on param tweaks? [0, 4] + [0, 3.5] could make sense.

@vondele I'd suggest one change at a time, let's start with this one as a "beta", in some weeks we can eventually promote to "production" or, in the worst case, to revert.

@snicolet how do you suggest to deploy this change? Should we contact fishtest devs and ask them to change the defaults? Maybe even better is to open a PR on fishtest repo.

It appears a skew normal distribution

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html

is a reasonable model for the elo prior. We can use this prior to compute the theoretical
distribution of the median unbiased estimator used by Vondele and compare it with the empirical distribution obtained from Vondele's raw data. I did some visual fitting and it appears parameters a=-4,scale=2.4,loc=0.7 yield a good visual fit (I did not actually calculate the theoretical distribution but used simulation instead). The graph of this prior is here

http://hardy.uhasselt.be/Toga/skew_norm.png

The moments of the prior are given by mu=-1.158 sigma=1.520 skewness=-0.784 excess kurtosis=0.632. Note that this graph is less spread out than Vondele's graph since it aims to correct for the noise in the elo estimator (which is quite large for SPRT tests).

Of course none of this should be used since visual fitting is unreliable. I will make a more formal fit later. One thing I am worried about though is that the fitting will place too much emphasis on the less important negative elo part since there is so much more data there. We'll see.

@vdbergh in addition to the prior would be interesting to see how the simulation with that prior matches the data. For a more formal fit you could consider weighting with e.g. the cost of the combined sprt simulation. That should put the emphasis where it is needed most.

@vondele Today I am busy the whole day but I will post the simulation result tomorrow. Note that this is just a proof of concept and I did not try to push it very far since running the simulations is very time consuming. For formally matching the theoretical and empirical distributions of the observations I was planning to use this method https://en.wikipedia.org/wiki/Maximum_spacing_estimation since it seems easiest.
Some form of weighing might indeed be desirable (informally: we look for a model predicting costs and we want to give more weight to the correct prediction of higher costs). I have to think about it.

Here is the result of a simulation of the adhoc visual fit

http://hardy.uhasselt.be/Toga/adhoc_model_sim.png

(blue is the model, green is the data). The simulation was done with this elo prior

http://hardy.uhasselt.be/Toga/skew_norm.png

which is a skew_normal distribution

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html

with parameters a=-4,scale=2.4,loc=0.7.

To get a feel for the predictive power of such a model I also compared the average length/pass rate of the data with the simulated ones. The result is (with 95% confidence intervals)

data_length=18270(17884,18657) 
sim_length= 18526(18128,18924) 
data_pass=   0.031(0.027,0.036) 
sim_pass=    0.042(0.037,0.047)

So the prediction of the expected length is surprisingly good. The pass rate prediction is a bit less good but still not dramatically bad.

Of course I know it is not theoretically correct to validate a model on the same data that was used to fit it, but since the fitting did not use these two particular bench marks (length and pass rate) there may be some value in it. Once I have done a proper fit I will use theoretically correct cross validation

@vdbergh very nice results and excellent fit... I'll see if I can update the notebook to use this form later tonight.

The elo prior also confirms the statement I've made above, that >2 elo patches are very few.

A suggestion for the color-coding; we could halve the size of the existing yellow band, so instead of everything with W>L being yellow or green, only the top half of the current yellow band would be yellow, the bottom half would either join the red "fail" band, or we could create a new "neutral" result (no color?) for results where W~=L.
We could store the result in the db as well, which would be useful for later retrieval.

Notebook updated with the Elo prior by @vdbergh

Notebook updated with the Elo prior by @vdbergh

Thx! I am doing the formal fitting now. It is an interesting learning experience.

EDIT: Well after some experimenting it seems that formally fitting a skew normal elo prior to the observations (using some reasonable loss function) does not decisively improve its predictive power (usually the cost prediction is quite good, but the pass prediction is overly optimistic compared to the date we have). So it seems the adhoc, visually fitted, elo prior might be good enough for now (given that fishtest is too erratic anyway to expect a rigid underlying model governing it).

I ran the tool updated by the efforts of @vdbergh and @vondele ; the skewed normal distribution is a nicer model than the very rough linear array interpolation. Though, for some reason, it still shows much less patches losing more than 2 elo than the linear array interpolation - is it because of the bias of the raw SPRT elo results ?

Once again, with a normal-like distribution, we get a smaller number of high-elo gain patches represented and so a higher elo-gain ratio.

I get those results :

| Limits | [0,5] + [0,5] | [0.5, 4.5] + [0,3.5] | [1,4] + [0,3] | [0.5, 4] + [0, 3] | [0.5, 4] + [0, 3.5] | [0.4, 4.4] + [0, 3.2] |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 ELO pass prob | 0.0025 | 0.00123 | 0.00036 | 0.0011 | 0.0011 | 0.0014 |
| 1 ELO pass prob | 0.0744 | 0.1002 | 0.0805 | 0.1451 | 0.1183 | 0.1276 |
| +1 ELO rejectance ratio | 0.8201 | 0.7547 | 0.7625 | 0.6718 | 0.7049 | 0.7137 |
| +1.5 ELO rejectance ratio | 0.6310 | 0.5160 | 0.4865 | 0.3968 | 0.4249 | 0.4679 |
| +2 ELO rejectance ratio | 0.3936| 0.2709 | 0.2026 | 0.1700 | 0.1819 | 0.2374 |
| total ELO gain ratio | 1.0 | 1.3192 | 1.2373 | 1.7596 | 1.5664 | 1.5523 |
| -0 ELO acceptance ratio | 2.5 e-04 | 9.1e-05 | 2.2e-0.5 | 1.0 e-04 | 7.77 e-05 | 1.0 e-04 |
| Avg. STC cost | 18431 | 24456 | 34574 | 30936 | 30936 | 25110 |
| Avg. STC + LTC Cost | 27931 | 38039 | 45286 | 50604 | 46093 | 42990聽|

[0.5, 4.5]聽+ [0, 3.5] is clearly superior to [1, 4] + [0,3], consuming significantly less resources while getting more elo (albeit by passing somewhat inferior patches on average).

Meanwhile, [0.5, 4] + [0,3] gets an awesome elo gain ratio with this patch distribution, albeit at the highest overall cost.

This pushed me to try some other variations. [0.5, 4] + [0, 3.5] reduces the overall cost by around 10% but gives a big hit to elo gain ratio compared to [0.5, 4] + [0, 3]. Instead, [0.4, 4.4] + [0, 3.2] strike a better balance. Compared to [0.5, 4.5] + [0, 3.5], it increases costs by +12% while the expected elo gain with this patch distribution is +17% (and usually, squeezing out more elo out of the same patches require a disproportionate increase in resource use).

I still think it's fine to do [0.5, 4.5] + [0, 3.5] now to see how the behavior of people submitting tests adapt to this and collecting new and more relevant data on patch distribution. But if this proves successful, then it will likely be worth it to push further.

Though, for some reason, it still shows much less patches losing more than 2 elo than the linear array interpolation - is it because of the bias of the raw SPRT elo results ?

It is not caused by bias but by the noise of the elo estimator (fixed lengths tests would have the same issue).

Assume every patch would be exactly -1 elo. The elo estimator will return values according to a certain probability distribution whose median is -1. If one pretends this observed distribution is the true underlying elo distribution (which is this example is concentrated at -1) one gets incorrect results.

So the challenge is to invert the transformation underlying distribution -> observed distribution.
This is what I tried to do.

So the challenge is to invert the transformation underlying distribution -> observed distribution.
This is what I tried to do.

Great, I had concerns about this very issue, it's nice that you was able to address it. This means that those latest estimates and the resulting table I produced above are the most reliable we have for now.

@vondele @Alayan-stk-2

In the end it turns out that a simple normal distribution is sufficient to explain the observations!! The very skewed nature of the elo measurements is probably simply due to the behavior of SPRT tests and not to the underlying elo distribution. Here is my proposal for the new prior.

Elo prior

As indicated it has mu=-1.013735, sigma=1.101212. I have done the fit using the cross entropy statistic after first filtering for outliers (although this turns out not to affect the result in the end).

Here are the results of a simulation for the median unbiased elo estimator.

Simulation

Of course a slightly better fit can be obtained by using a skew normal distribution with a non-zero shape parameter, but the optimal value for the shape parameter seems to be rather unstable. I have verified this using resampling https://en.wikipedia.org/wiki/Resampling_(statistics) . In other words fitting a skew normal distribution to the data is not so much fitting but rather _over_ fitting instead. And of course Occam's razor also dictates that one should go for the simplest model that fits the data.

Finally the prediction the prior predicts gives as expectation values 18220, 0.037 for respectively the length of a test and its pass rate. This is very close to the data which has 18270(17884,18657)
and 0.031(0.027,0.036).

EDIT. I succeeded in calculating the theoretical distribution of the median unbiased elo estimator. Here is the comparison of the theoretical distribution (assuming the above prior) with the observed data

Theoretical comparison

I think this is even more convincing than the simulation!

EDIT2. Here is a graph illustrating the filtering effect of STC(0,5) and LTC(0,5). The resulting distributions seem to be very close to normal (which is nice).

Patches submitted to LTC

In principle it is testable if the LTC prior corresponds to reality but the common use of speculative LTC obscures the available data.

Note that the model predicts that the expected elo for a patch which passes STC(0,5)+LTC(0,5) is equal to 1.44. This is quite a bit higher than the figure of 55/80=0.69 mentioned above by @Vizvezdenec which is obtained from SF9->SF10 development. One should note however that many things are unmodeled.

  • The use of speculative LTC.
  • The use of SPRT(0,4).
  • The 8moves book versus the 2moves book.
  • Scaling when going from STC to LTC.
  • The extent to which the additivity of the elo model holds.

Finally the figure 0.69 comes with its own uncertainty which I would not know immediately how to quantify.

(Here I am recording a thought, perhaps for later reference.) With a reasonable elo prior the Bayesian approach for estimating elo actually becomes feasible (with the usual - unrealistic - uniform prior it gives heavily biased results for SPRT tests). The Bayesian approach has the potential to be substantially more accurate than the currently available methods which are purely frequentist and hence do not use historical information. Note that as the elo prior is likely to change over time it would have to be continuously updated based on the last N tests. Another complicating issue is that the elo prior depends of course on the type of test (simplification or elo mining).

Estimating elo using an elo prior can be done on the client side (it is trivial mathematically for a normal prior) but updating the hyper parameters of the elo prior(s) (mu,sigma for the normal model) is something that would have to be done on the fishtest side (one can use a Bayesian method for this as well).

When will the bounds changes take place?

For the record: I am not terribly convinced that the new bounds are better. The elo_gain/STC_game ratio actually seems to go down a bit (see e.g. the data by @Alayan-stk-2 above). Of course if one considers fishtest resources to be infinite then the situation is different.

EDIT. A good argument for the change though (proposed by @vondele I think) is that narrower bounds give more accurate elo estimates. This is of scientific value.

EDIT2. Also fewer very low or negative elo patches will make it through. This should also be a good thing since apparently it can be done without affecting the metric elo_gain/STC_game very much.

EDIT3. For those who are wondering. For the normal prior the current elo_gain/STC_game is (if I did not make a mistake) 5.7e-6. After the change it would be 5.5e-6.

Actually my 2 cents, but improtant two (imho).
Right now we are able to utilize fishtest hardware mostly because I'm writing a crapton of patches.
http://tests.stockfishchess.org/users/monthly
Vizvezdenec 718 - it's more than third of total number of patches on fishtest at this time.
With this amount of written patches I'm basically trying almost random things - mainly because I think that testing something is better than testing nothing.
With tighter bounds I'm perfectly fine to write like 3-4 times less and free up a lot of hardware for other devs - right now this way fishtest will go idle pretty often... Which wouldn't be the case with tighter bounds.
So, imho, tighting bounds will also have side effect of increasing % of elopositive patches :)

The elo_gain/STC_game is in my opinion a poor metric, because the optimal point for this ratio are large bounds which let some easy gains through while quickly rejecting the rest.

For example ; with [0. ; 6.0] + [-0.5 ; 5.5] ; the tool gives me 79% of the elo gain for 68% of the resources usage. The ratio is better, but the bounds are worse, because they reject more good patches and create more uncertainty about the quality of patches.

It's hard to model what are the optimal bounds because it depends a lot on how people are using fishtest.

We have two fundamental "resources" : human finding ideas to test, and CPUs to test them.

When the bottleneck is on CPU resources, we want to increase the quality of patches submitted/reduce random tests/random tunes, and/or do quicker but less accurate tests.

When the bottleneck is on human finding ideas, we want to increase the testing quality, in order to 1)have a better idea of what a patch is worth, and if the idea may warrant further exploration ; 2)accept more "good but not outstanding" patches.

As @Vizvezdenec is pointing out above, what is bottlenecking SF progress is much more finding good ideas than testing resources, which is a situation where SF would benefit from increased testing accuracy.

EDIT : Also, let me remind everyone that when you want to extract more elo from a pool of ideas, each additional elo becomes harder to get. Hence, bounds with a similar/slightly lower elo/cost ratio but with a noticeably higher elo acceptance ratio are much superior.

It seems to me that everybody agreed upon those bounds and is just waiting for the implementation.

I'm fine with the proposed [0.5, 4.5] + [0,3.5] : indeed we can evaluate the change in a few months to estimate if it works in practice.

The most impressive feature in the discussion has been the quality of the argumentation, as usual! :-)

@vondele I'd suggest one change at a time, let's start with this one as a "beta", in some weeks we can eventually promote to "production" or, in the worst case, to revert.

@snicolet how do you suggest to deploy this change? Should we contact fishtest devs and ask them to change the defaults? Maybe even better is to open a PR on fishtest repo.

I think its very simple, just everyone to use the new bounds. What are we waiting for? @snicolet @mcostalba

For example ; with [0. ; 6.0] + [-0.5 ; 5.5] ; the tool gives me 79% of the elo gain for 68% of the resources usage. The ratio is better, but the bounds are worse, because they reject more good patches and create more uncertainty about the quality of patches.

A surprising fact nonetheless! I too get 6.3e-6 elo/STC_game which is much higher than the current 5.7e-6. But the average elo of an accepted patch drops from 1.44 elo to 1.32 elo. Of course softening the current hard barrier at 0 elo is not a good idea.

EDIT. Indeed it seems one can get very high elo/STC_game by using very permissive bounds. This is achieved by the tests finishing very quickly. So it is only a meaningful metric if CPU resources are the bottleneck (which apparently they are not currently).

I am wondering about scaling issues. With the normal prior I was using (which seems to fit the STC data very well) the expected elo of a patch that passes STC is 0.70. I have now downloaded the results of 286 LTC tests from fishtest and it seems the average elo of a patch submitted to LTC is only 0.27. It could however be that this is simply the effect of speculative LTC... The LTC data covers a longer period than the STC data which may also make the results less reliable.

EDIT. I redid the calculation with STC and LTC data covering the same period and the results were the same. I wonder now if it is a modeling issue (this has some relevance for the current discussion). There were some scaling tests in the past

http://tests.stockfishchess.org/tests/view/5910497a0ebc59035df34402
http://tests.stockfishchess.org/tests/view/59100d7c0ebc59035df343ef
http://tests.stockfishchess.org/tests/view/590ffa180ebc59035df343e7

So the scaling STC->LTC appears to be good. Sadly the tests were for the 8moves book. I think I will do a similar test for the 2moves book (only STC and LTC). I hope it is ok.

EDIT2. Removing the 17% of "Speculative LTC" tests (by filtering on the descriptions) raises the average elo of patches submitted to LTC a bit, but not spectacularly. The discrepancy with the model remains. Next thing to try is the nTries model by @Vondele (people submit several trivial variations of the same patch until one passes STC).

EDIT3. There appear to be no scaling issues for logistic elo for the 2moves book.

http://tests.stockfishchess.org/tests/view/5c1aa9610ebc5902ba127aed
http://tests.stockfishchess.org/tests/view/5c1ac93d0ebc5902ba127eba

(I checked that the scaling happens to be perfect for "normalized elo" http://hardy.uhasselt.be/Toga/normalized_elo.pdf but that is a different discussion).

@NKONSTANTAKIS we should open a PR against Fishtest repo.

@mcostalba so new bounds should be [0.5; 4.5] and [0; 3.5]?

I guess so, if someone does not anticipate me, I will open a PR around Christmas time

Please someone anticipate him :D
By the way, what will the bounds for tuning patches become?

Tuning patches have not been discussed here, but it wouldn't make sense to have harder to pass bounds than now or than the new bounds for more complex patches.

[0, 4] is easier to pass than [0.5, 4.5] already, and the somewhat higher probability of a patch being only a very minor gain isn't a concern for value tuning. But LTC should be dropped to [0, 3.5] at least.

Since we will use [0,3.5] LTC for code adders, I think its not logical to use exactly the same for tuning.
[0,3] LTC seems an easy choice to make, safe and conservative.
For STC tuning I would recommend [0.5,4] as a safe and conservative option.
This way we keep an analogy between coder adders and tuning, like in the past:
Old ones same lower bound and -1 upper, new ones same lower bound and -0.5 upper.

We have said to make one step at a time, but let it be a full step not some sloppy change.
Indeed it makes no sense to use at the same time harder bounds for tuning than for adding code, it has to move along, at least at the slightest possible.

@mcostalba @snicolet What do you think?

@NKONSTANTAKIS I think that the discussion on possible new tuning patch bounds, currently SPRT[0, 4] should proceed in the same way of the previous one.

Namely, to post the simulation tables using @vondele / @vdbergh tool and quantitatively comparing current specs against proposed ones for the people to see and evaluate.

I have now checked that the nTries model (nTries is the number of trivial variations of the same patch) by @vondele can resolve the apparent contradiction between the expected and observed elo of patches submitted to LTC which I reported on above.

Using the data of 199 LTC tests (manually filtered for speculative LTC) the mean unbiased estimator (MUE) for elo of SPRT test gives an estimate for the average elo of patches submitted to LTC of 0.31. The variance of the MUE (as measure from the sample) is 1.67 which yields a 95% confidence interval [0.08,0.54] for the expected elo.

With nTries=1 the model (with normal elo prior with mu=-1.148511, sigma=1.161688) predicts the expected elo of patches submitted to LTC to be 0.66 which is outside the above confidence interval (this number does not appear to be very sensitive to the prior). However repeating the calculation with nTries=5 we find the expected elo to be 0.37 which fits comfortably in the confidence interval.

So at this point it seems more reasonable to me to use a model with nTries>1. Sadly there is probably not enough data to determine the "best" value for nTries.

EDIT. My estimate of the confidence interval was incorrect (I had neglected the variance of the underlying elo distribution). However we can simply measure the standard deviation from the sample. I have edited the above text. The confidence interval is now a bit wider but not enough to make nTries=1 work.

EDIT2: Inspecting the descriptions of the LTC tests manually it appears there are many more speculative LTC tests than those that I had filtered out mechanically. The final list contains only 199 regular entries out of the original 282. I also failed to account for the conversion STC elo->LTC elo. I am using the normalized elo model for this (multiplication by sqrt((1-d_LTC)/(1-d_STC)) ) but I do not have a lot of empirical evidence that this model is reliable. Anyway I have updated the text. With all the adjustments the "contradiction" in the case of nTries=1 becomes less and less severe and it could be just a case of bad luck.

EDIT3: Scaling tests (see below) indicate that the normalized elo model is not suitable for predicting scaling STC->LTC. The correct scaling model is anyone's guess but for now just assuming 1:1 does not contradict the data. I have once again updated the above text using this assumption. The contradiction for nTries=1 now becomes more severe again. Note that assuming an average scaling ratio of 1:1 says nothing about possible variation in scaling so the model is incomplete in this respect.

interesting ... yes, I've picked nTries=5 in the first version of that model, since that's roughly what I observe on submitted patches. It certainly is a reasonable guess. I guess there is no single optimal version of the number it correlates probably a bit with the author name ;-)

Here is the result of some scaling tests with the 2moves book. 40000 games each.

| | sf7->sf8 | sf8->s9 | sf9->sf10 |
| - | - | - | - |
| elo STC | 95.91 +-2.3 | 58.28 +-2.3 | 71.03 +-2.4 |
| elo LTC | 100.40 +-2.1 | 68.55 +-2.1 | 65.55 +-2.2 |

So we see that the common wisdom that increased TC causes elo compression (I also believed this) is not always true (look at the columns sf7->sf8 and sf8->sf9). One possible model seems to be to assume that there is a lot of variation in scalability of individual patches, so much that it totally dominates the average behaviour. Unfortunately it is impossible to quantify this as at this point the available data is totally insufficient. Note that @NKONSTANTAKIS points out below that sf8->sf9 may be an outlier since sf9 uses contempt and sf8 does not. I will test this theory by creating a contemptless version of sf9 and redo the test.

The tests are here.

http://tests.stockfishchess.org/tests/view/5c1fc1540ebc5902ba12b7c6
http://tests.stockfishchess.org/tests/view/5c1fc01a0ebc5902ba12b790
http://tests.stockfishchess.org/tests/view/5c1e0f050ebc5902ba129abc
http://tests.stockfishchess.org/tests/view/5c1e0e570ebc5902ba129aaa
http://tests.stockfishchess.org/tests/view/5c1ac93d0ebc5902ba127eba
http://tests.stockfishchess.org/tests/view/5c1aa9610ebc5902ba127aed

Its crazy how the scaling trend is in opposite direction and at such elo amounts. I have always been a believer of scaling abnormalities but I would expect both LTC's to shrink the elo difference. My explanation (and this is justified by early tests) is that contempt scales well, or another way to put it is "contempt is much more riskier at very short TC's". This makes total sense since at low depths you can hardly get a foothold in draw so neglecting depth at drawish evals backfires more. Hence contemptless SF8 closes the gap at STC.

A way to verify this is to do sf8 vs sf10 tests and compare STC to LTC. This test could provide useful information about future actions. A smaller STC gap would indicate that my observation is accurate, while a smaller LTC gap would indicate that SF10 has somewhat worse scaling. If the latter is true, then we should probably consider ways to accept more positive scalers and reject more negative ones.

@NKONSTANTAKIS Another test of your theory would be sf7->sf8. Since these are both contemptless they should again exhibit the expected scaling behaviour.

@vdbergh Indeed, and the expected elo gap there will be smaller at LTC. But it would be contemptless vs contemptless which could behave to LTC in different analogy than contempt vs contempt to LTC, probably with increased drawrate and compression. Would be interesting. Through the SF8->SF10 we would also be comparing the relative scaling behavior of SF10 to SF9 to the same opponent. In general I think its a good idea to try and net in the positive scalers, mostly by not cutting them out at harsh STC.

Regarding the probable backfiring of contempt at low times I opened a thread:
https://groups.google.com/forum/?fromgroups&pli=1#!topic/fishcooking/r_Nv8J34VoA

@NKONSTANTAKIS I replied to it - we have quite a lot of contempt data actually which I measured in fishtest idle times ;)

@vdbergh your elo with TC data has been added to the wiki here:

https://github.com/glinscott/fishtest/wiki/UsefulData

I'm curious about how many games would be needed to conclude this test:
http://tests.stockfishchess.org/tests/view/5c210ab60ebc5902ba12dc20

If the trend is to have single patches providing lesser elo gain on average, I wonder if this would happen quite often at LTC: each of them taking 4x more resources than regular regression tests and still growing.

Looks like we will need to tune the process - my thought is that is to cap it at x games. Not sure what the right number - but somewhere near 150,000. After that , the Elo gain would not be worth the resources. Just my $.02. Others may and will differ.

with old bounds this test will already red, because wide bounds require less games on average

The test is exactly doing what is expected. Somewhat smaller Elo gaining patches (~1Elo) have a chance to pass, and patches that are close to that bound will need significantly more games to be resolved. Overall, if all models discussed in depth above are relevant, we'll need on average 50% more CPU time per submitted patch... which is will presumably just imply people will submit a little less patches. The additional gain we have is a more precise elo estimate of all patches, which could help direct ideas and tests.

The point is that we may have more than enough confidence to conclude that this patch is a > +0.5 elo gainer, yet it might still fail to pass in the end.

The patch is almost certainly an elo-gainer at STC and LTC - the STC passed twice quite easily (once with old bounds, once with new bounds), while the LTC is struggling but clearly positive. But there can be a scaling concern : does the gain hold at longer TC considering the STC>LTC trend ?

@Alayan-stk-2 If scaling is of concern, why are people using a smaller bound for LTC? It implies that now it is even more easier to pass a patch at LTC than before compared to STC changes.

I wonder what can we get out of a same patch running the equivalent of 7 regression tests, maybe a max number of games like 160k should be applied to prevent future unlucky runs like this.

I think we need a couple of months to judge the new scheme.

People have been calling for truncated SPRTs since forever but they are mathematically a bit more complicated (enough tools exist to handle them though, see e.g. http://hardy.uhasselt.be/Toga/wald.py, with a bit of hacking one can get it also out of http://hardy.uhasselt.be/Toga/sprt_elo.py).

When I was working on GnuChess I used the 2-SPRT (see https://projecteuclid.org/euclid.aos/1176343407) which is optimal for elo=(elo0+elo1)/2 (recall that the SPRT is optimal for elo in {elo0,elo1}). Like the truncated SPRT the 2-SPRT has the property that the number of games is bounded.

Here is a tool that can design and analyze 2-SPRTs http://hardy.uhasselt.be/Toga/sprt2.py.

Example

$ python 2sprt.py --help

This script computes the continuation region for the triangular
approximation to a 2-SPRT sequential test. It also computes
the corresponding approximate probability of rejecting H0 (OC)
and the expected running time (ASN).  Finally it can also compute 
the (OC,ASN) exactly (very slow) or by simulation (slow). 

A game is scored as 1,1/2,0 depending on whether it is a win, draw or loss.
If the upper bound of the continuation region is reached, H1 is 
accepted. If the lower bound is reached then H0 is accepted.

  --alpha       probability of making the wrong decision under H0,H1
  --elo0        elo corresponding to H0
  --elo1        elo corresponding to H1
  --draw_elo    draw parameter from BayesElo model
  --expected    compute the approximate expected running time of the test (ASN)
  --power       compute the approximate probability of rejecting H0 (OC)
  --exact       compute the OC and ASN exactly
  --simulate    compute the OC and ASN by simulation
  --sample      sample size for simulation
  --real_elo    elo at which to compute the OC and ASN 
  --help        print this message

$ python 2sprt.py --elo0 0 --elo1 3.5 --draw_elo 282

Test parameters:
alpha=0.05
beta=0.05
elo0=0.00
elo1=3.50
draw_elo=282

Continuation region:
1.00207932*N-136.827688 < S < 1.00069311*N+136.827688

Worst case expected running time: 83049
Maximal running time: 197413

In fact the 2-SPRT _may_ be more suitable for LTC with the new parameters. The model (with nTries=5) predicts that with the new parameters the expected elo of a test submitted to LTC is 0.67 (Bayes Elo= 1.2). If we ignore variation and assume it is exactly 0.67 then the 2-SPRT(0,3.5) is more efficient than the SPRT(0,3.5) (I have not done the calculation taking into account variation). With nTries=1 it would be even better.

I'm not sure this patch is an elo gainer at all tbh.
Sure, it passed STC twice with different parameters but STC with intermediate and bigger parameter (and bigger parameter means just less difference thus elo positivity should be removed with it) failed and pretty fast, also 1 LTC failed pretty fast.
Maybe it actually was a fluke and patch is elo neutral or like 0,2 elo worth.

@Vizvezdenec There is a problem: many patches will prove out to be elo-neutral or slightly positive/negative(near-yellow). This is my understanding of the problem:

1.Move time variability creates random "deep moves" that sometimes prove to be critical(amplified by large base time allocation for time managment) or conversely steal time from "average moves" to lower ELO. i.e. time is extremely unequally distributed creating random wins/losses.
This is the primary random factor.

2.The opening book is fairly unbalanced and not that good for testing engines:
The initial idea of making lots of unbalanced positions to detect flaws in eval isn't working anymore, just measures noise(forced wins/draws) since both engines are smart enough to know what position is a forced win/draw is in most cases, unless the patch is particularly bad(though it would lose in more neutral positions as well).

3.The hash table pressure of having only 4MB for STC also introduces a random factor(there is a PR for fishtest to fix this, though i'd prefer 16MB hash vs 8MB for patches(and especially patches that need depth or involve pruning changes)). STC reputation for unreliability needs to be gone, or people will not consider STC a true filter.

4.The usefullness of most positions:
The actual percent of useful positions that detect subtle flaws in eval/pruning is not that large and the SPRT is dependent on them getting randomly selected from the book to create meaningful results: however, there if there are lots of forced win positions the speed at which SPRT converges to bound is reduced.
Forced draws reduce the convergence slightly, but more importantly they prolong tests(LTC in particular is very drawish).
This synergizes with random move times to makes the weaker engine randomly win a few games and break the SPRT gain streaks.

5.The machine factor: Since fishtest doesn't run on ideal "standard" computers, the engine time will be affected by CPU speed and thread scheduling.

A.Another task/thread may occupy the CPU shared with an engine to have an impact on its performance, giving the impression that a stronger engine is weaker due time lost at scheduling and other threads there will be a random factor of reduced performance. cutechess-cli timings and resource use for managing games may also have an underrated factor which changes actual speed.

B.The core CPU speed creates a limit to search depth, which makes some machines search deeper than others and consequently provide more accurate eval. The patches which win at speed with slower CPUs start losing at depth once on faster CPU because both engines reach similar higher depths(e.g. mid-20). This means "speed gain"/"simplification" patches will look better on slower CPUs but revert to neutral score once run on faster CPU(equivalent to extending time control),which is also why no LTC tests of speed patches are run.

i.e. Fishtest actually doesn't run on "time controls", it runs on different nodes/second speeds dependent on CPU speed: even consideting the bulk of fishtest is near 1.5mnps ChessDB cpus they too demonstrate speed differences leading to different search depth/"time controls".
This is particularly noticeable at LTC, since any speed variability drastically changes the allocated nodes per move.
Example:
Machine A searches 1400knodes/second
Machine B searches 1600knodes/second
For each 100ms(STC increment) move the B machine searches 20knodes deeper.
For each 600ms(LTC increment) move the B machine searches 120knodes deeper.
For a 1200ms move the B machine will search 240knodes deeper.

  1. Depth factor:
    In effect this means game results are coming in from different depths.Thats wouldn't be that bad if depths were equally allocated, but depths are dependent on which machines are allocated for the test how the test changes pruning/eval and which book positions are randomly selected(due some positions having critical depths at which the PV changes: faster CPUs and good time management uncover these positions).

Same book position searched slightly deeper(faster CPU, more time allocated due different eval) suddenly change results for nearly identical engines: This is not due ELO gain, its purely speed difference and timing allowing the luckier engine to win.
Edit:disregard the "depth factor" rant, i've checked the fishtest code it fixes it by changing time control to account for NPS of the machine.

Near neutral patches slightly changing eval/search will make time allocation for identical book positions different because the PV is constructed from different searches/depths(especially at STC, where pruning is critical and speed at which the PV depth is reached is very important), this turns the test into "book tuning" competition, where patches rely on lucky machine allocation and book positions to pass STC.

  1. Neutral patch problem:
    A neutral patch will waste resources because the above random factors will create illusions of meaninful ELO changes but in effect these are random factors at work.
    These illusory "timing wins" being dominant in neutral patches force SPRT to fluctuate and not converge, wasting resources and prolonging tests(especially noticeable with new bounds).

@Chess13234

There is a problem: many patches will prove out to be elo-neutral or slightly positive/negative(near-yellow).

You could have stopped right there. Through the use of statistics (see above) we now know objectively and _independently of the testing procedure_ that the patches that gain enough elo to be detectable form a very small fraction of all patches submitted.

So yes the 'problem' you mention is there. And it is only solvable by throwing large amounts of resources at it. Nothing else will help.

@vdbergh Something should be done with increasing the data quality for SPRT and not brute-force solutions like adding more time/hardware.
Statistical probabilities are not objective facts and only represent a chance, and of course are dependent on testing procedures, bounds and testing conditions: you will have wildly different results with other time controls and especially other books(fishtest is essentially tuning Stockfish towards performance in 2moves book).
Ultimately of course there are factors beyond our control such as caches misses and branch prediction misfiring influencing timings, but there are factors that can be improved.

@Chess13234

Statistical probabilities are not objective facts and only represent a chance, and of course are dependent on testing procedures, bounds and testing conditions: you will have wildly different results with other time controls and especially other books(fishtest is essentially tuning Stockfish towards performance in 2moves book).

SPRT tests give indeed a very distorted view of the world and the distortion depends strongly on the chosen bounds. But the good news is that this distortion can be understood mathematically. So if you have enough data you can remove the distortion and look at the underlying elo distribution (for a fixed book and time control). This is precisely what was done. So yes, the conclusion of the statistical analysis is independent of the testing procedures that have been employed on fishtest.

As to scaling to different time controls. So far there is nothing to assume that SF on average scales badly. Indeed the tests are compatible with the assumption that the scaling ratio is simply 1 from STC to LTC which is still a 6 fold increase in TC.

As to books. If you want to claim that the 2moves book is bad in some way then you should give arguments (not hand waving) to back that up. The 2moves book was tested by various people and it turned out to be just as good or better than other books (the suitability of books for engine testing can be measured objectively, see here http://hardy.uhasselt.be/Toga/normalized_elo.pdf).

Actually, the better question is how much accuracy do you really need with LTC for those sightly <1 elo patches, SPRT or not.
So far, I see the demonstration of those worst case scenarios on a daily basis, let's see if it continues.

@vdbergh

  1. SPRT doesn't magically "remove distortion", it reduces distortion as long as the data is somewhat reliable.
    If the distortion is significant enough, which it is in neutral patches, it will measure the random win/loss probabilities.
  1. scaling to different time controls: this is easy to observe as most STC passing patches don't scale to LTC. This means all these engine variations behave differently at different TC. I.e. any minor variation of Stockfish is highly likely to behave differently in STC vs LTC.

3.". If you want to claim that the 2moves book is bad in some way then you should give arguments"
see
https://groups.google.com/forum/#!topic/fishcooking/oljs6EvQ6Iw
https://github.com/official-stockfish/Stockfish/issues/1853

As a part of improvement, what are the chances of those passing tests having also [-3, 1] passing to remove them?

@vdbergh

  1. SPRT doesn't magically "remove distortion", it reduces distortion as long as the data is somewhat reliable.
    If the distortion is significant enough, which it is in neutral patches, it will measure the random win/loss probabilities.

Obviously you are making no attempt whatsoever to understand what I write.

  1. scaling to different time controls: this is easy to observe as most STC passing patches don't scale to LTC. This means all these engine variations behave differently at different TC. I.e. any minor variation of Stockfish is highly likely to behave differently in STC vs LTC.

Actually this is very difficult to observe for individual patches (measuring a 1 elo difference takes 160000 games, for multiple measurements one needs many more games). But what one can at least do is to test the average behavior (see above). And these tests shows that at least on average there are no scaling problems.

3.". If you want to claim that the 2moves book is bad in some way then you should give arguments"
see
https://groups.google.com/forum/#!topic/fishcooking/oljs6EvQ6Iw

1853

AFAICS these threads don't mention any tests that show that the resolution of the "cleaned book" is actually better than the original one.

@vdbergh
1.I mean that probabilities derived from SPRT are based on purely random factors in neutral patches.
It doesn't make sense to ignore these random factors(mentioned earlier) of distortion, because they make SPRT meaningless and random game of chance, with only significant elo gain/loss patches being quickly resolved.
I'm not critiquing SPRT itself, but the testing procedure.

  1. " And these tests shows that at least on average there are no scaling problems."
    Greens are dominated by STC tests which don't scale.
    http://tests.stockfishchess.org/tests?success_only=1
    3.Its better to check some other books, like drawkiller one.
    I don't think a new 2moves version with <0.1% of positions changed would have much effect(unless testing period is huge to cover these <0.1% positions enough).
  1. " And these tests shows that at least on average there are no scaling problems."
    Greens are dominated by STC tests which don't scale.
    http://tests.stockfishchess.org/tests?success_only=1

Even if the random probability of a neutral patch passing is low, when you have a lot of tests, you'll get a big part of passing patches which are just flukes. This isn't unexpected by itself, though ofc the more noise the worse it is.

3.". If you want to claim that the 2moves book is bad in some way then you should give arguments"
see
https://groups.google.com/forum/#!topic/fishcooking/oljs6EvQ6Iw

1853

AFAICS these threads don't mention any tests that show that the resolution of the "cleaned book" is actually better than the original one.

The issue 1853 mention no such test (and considering it wants to remove 37 positions out of many thousands, it would be impossible to measure), but e.g. a +1.5 starting position is definitely creating signal noise.

Also, I know that the openings used in fishtest are selected randomly, I assume they are always played with reversed colors ?

@Alayan-stk-2 If they wouldn't be repeated with opposite colors thats would be very unfair, even if perfectly neutral positions are chosen. Here is the relevant code(see -repeat):
https://github.com/glinscott/fishtest/blob/master/worker/games.py#L474

Edit: I noticed that far more disturbing that fishtest uses draw adjudication at 20 centipawns from zero at 8 consecutive moves!!! How is this even allowed?
Edit: this prrameter problem might be more important than a book. It also resigns at -400 centipawns in three consecutive moves... fishtest is deeply flawed.
Added a separate issue for adjudication here;

https://github.com/official-stockfish/Stockfish/issues/1904

@vondele

@vdbergh your elo with TC data has been added to the wiki here:

https://github.com/glinscott/fishtest/wiki/UsefulData

Thx! It occurred to me that while these tests are _not incompatible_ with the hypothesis of a scaling ratio of 1:1, they are also _not a lot of evidence_ for it because of selection bias.

If we accept the hypothesis that there are well scaling and badly scaling patches then the well scaling patches are more likely to pass LTC after passing STC, causing the LTC rating to be inflated. For SF this is a good thing of course but it messes up the stats.

The only way I see to measure scaling somewhat objectively is to create a sf_STC which is some version of sf with a number of patches applied which passed STC regardless of whether they passed LTC or not. However to get a statistically significant result in reasonable time the number of patches must be quite large which is difficult because of conflicts. And this would only tell us something about the average behavior. The variation is still another matter.

@vdbergh That actually sounds cool, something like "Blitzfish" tuned for short time control, but the patches should have at least +1elo impact(preferably with largest elo gains) to actually measure something(STC elo is very unreliable).

The recent experience has not been very positive. We got the multi-cut patch passing properly, so elo gainer can absolutely pass, but the "small gain patches" (at +1 or +1.5 elo) have had a very hard time to clear STC even with flukes.

This experience at least confirms that SF's progress comes from small gainer, and that the idea that only +2 elo code adder are worth the complexity cost (which was behind the old bounds) is outdated in SF's current state. Going back to the old bounds would not make sense at all considering it.

But - and while the experiment has only lasted a few weeks so too short yet to reach definitive conclusions - it appears we can do better.

There is an unfortunate side effect of the new bounds which we overlooked during the discussions. With the old bounds, a patch gaining at STC/LTC 1/1.5 had the same chance to pass than a patch gaining 1.5/1+1 elo.

I just used the simulation tool to evaluate this situation. Now, the 1.5/1 case (bad scaler) has 19.4% chance of passing, compared to 13.87% for the 1/1.5 case (good scaler).

Of course, for individual patches the accuracy of SPRT is way too low to be able to properly know if a patch is a good scaler or not. However, in the long run over hundreds of patches, the way this will shape the prior distribution into the posterior distribution is quite clear, and it's in a way which harms SF's progress and undermines the expected benefits of the new bounds.

To save on resources, increasing passing rates of patches around +1.5 elo, keep bad patches in check and favor good scaling, maybe a 3-staged testing could be designed. First, a patch would have to clear a VSTC designed as a cheap filter, then an easier STC than now (which would act as a second filter, of higher quality than VSTC but costing more), then a LTC.

Below is a table with the results of a simulation using such a 3-staged approach, with a 4+0.04 VSTC, compared to the old bounds, the new bounds, and 2 other proposals. The "STC" and "STC+LTC" costs include the VSTC costs.

| Limits | [0,5] + [0,5] | [0.5, 4.5] + [0,3.5] | [0.5, 4] + [0, 3.5] | [0.4, 4.4] + [0, 3.2] | [-1.0, 3.5] + [-0.5, 3.5] + [0, 3.5] |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 ELO pass prob | 0.0025 | 0.00123 | 0.0011 | 0.0014 | 0.00081 |
| 1 ELO pass prob | 0.0744 | 0.1002 | 0.1183 | 0.1276 | 0.1443 |
| 1.5 ELO pass prob | 0.2460 | 0.3422 | 0.4185 | 0.3946 | 0.4667 |
| 2 ELO pass prob | 0.5135 | 0.6438 | 0.7442 | 0.6852 | 0.7538 |
| total ELO gain ratio | 1.0 | 1.3192 | 1.5664 | 1.5523 | 1.7646 |
| -0 ELO acceptance ratio | 2.5 e-04 | 9.1e-05 | 7.77 e-05 | 1.0 e-04 | 4.7 e-05 |
| Avg. STC cost | 18431 | 24456 | 30936 | 25110 | 31590 (?) |
| Avg. STC + LTC Cost | 27931 | 38039 | 46093 | 42990 | 33418 |

According to the simulation, this combines best elo/resource ratio ; best non-regression probability, best total expected elo gain, best pass probability in the [1, 2] elo range.

However, this setup does not completely address the scaling behavior, because it gives an additional weight to a very short TC. Let's take a 1.7/1.5/1.0 (vstc/stc/ltc) patch and a 0.8/1.0/1.5 one (so about the same as my previous example but with vstc added). The first one (1.7 vstc) gets 23,6% chance of passing while the second one (0.8 vstc) gets 13,5%. Of course, at equal vstc+stc+ltc elo sums, this setup favors better scaling behavior.

I used a modified version of vondele's tool to get the results. Below is the code I changed to compute pass probability and cost for the 3-staged system. It isn't very "clean", but it does the job.

vstc_lower = -1.0
vstc_upper = 3.5
stc_lower = -0.5
stc_upper = 3.5
ltc_lower =  0.
ltc_upper = 3.5

def vstc_prop(x):
    return sprt_pass_prob(x, 196.0, vstc_lower, vstc_upper, 0.05)

def cost_vstc_prop(x):
    return 0.4 * sprt_cost(x, 196.0, vstc_lower, vstc_upper, 0.05)

def stc_prop(x):
    return sprt_pass_prob(x, 226.0, stc_lower, stc_upper, 0.05)

def cost_stc_prop(x):
    return sprt_cost(x, 226.0, stc_lower, stc_upper, 0.05)

def ltc_prop(x):
    return sprt_pass_prob(x, 288.0, ltc_lower, ltc_upper, 0.05)

def cost_ltc_prop(x):
    return 6 * sprt_cost(x, 288.0, ltc_lower, ltc_upper, 0.05)

def combined_stc_ltc_prop(x, ntries):
    vstcPass = vstc_prop(x)
    stcPass = stc_prop(x)
    # for i in range(ntries):
    #    stcPass = stcPass + stc_prop(x) * (1 - stc_prop(x))**i
    return ltc_prop(x) * stcPass * vstcPass

def cost_combined_stc_ltc_prop(x, ntries):
    vstcPass = vstc_prop(x)
    stcPass = stc_prop(x) * vstc_prop(x)
    #for i in range(ntries):
    #    stcPass = stcPass + stc_prop(x) * (1 - stc_prop(x))**i
    cost_vstcPass = cost_vstc_prop(x)
    cost_stcPass = cost_stc_prop(x) * vstcPass + cost_vstcPass
    #for i in range(ntries):
    #    cost_stcPass = cost_stcPass + cost_stc_prop(x) * (1 - stc_prop(x))**i
    return cost_ltc_prop(x) * stcPass + cost_stcPass

I agree that small gainers seem to be rejected at STC now (as before) where we would like to see more of the 1.0 - 1.5 Elo patches getting passed.
Also agree that the current bounds appear to discourage patches that scale well. For example, I'm interested in Stephane's current "windmill" tests because I would have thought that was more of a search thing rather than eval ... and the current LTC does seem to be performing worse than the STC but could still pass. Now obviously Stephane is probably right and I am probably wrong regarding these particular tests (and the STC pass was strong), but it does seem to be an illustration of the potential issue.

Not sure what the solution is, but it seems to me that we're maybe over-complicating it. Perhaps a simple change from [0,5] to [0,4] for both STC and LTC is the way forward?
If we do use different bounds for STC and LTC, I would be more comfortable if the average (or lower?) bound was higher for LTC than STC because of the possible impact on scaling of the average merged patch.

Here is a new table with [0, 4] [0, 4] for reference.

| Limits | [0,5] + [0,5] | [0,4] + [0,4] | [0.5, 4.5] + [0,3.5] | [0.4, 4.4] + [0, 3.2] | [-1.0, 3.5] + [-0.5, 3.5] + [0, 3.5] |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 ELO pass prob | 0.0025 | 0.0025 |聽0.00123 | 0.0014 | 0.00081 |
| 1 ELO pass prob | 0.0744 | 0.1433 |聽 0.1002 | 0.1276 | 0.1443 |
| 1.5 ELO pass prob | 0.2460 | 0.4446 | 0.3422 | 0.3946 | 0.4667 |
| 2 ELO pass prob | 0.5135 | 0.7477 | 0.6438 | 0.6852 | 0.7538 |
| total ELO gain ratio | 1.0 | 1.7497 |聽1.3192 | 1.5523 | 1.7646 |
| -0 ELO acceptance ratio | 2.5 e-04 | 2.0 e-04 | 9.1e-05 | 1.0 e-04 | 4.7 e-05 |
| Avg. STC cost | 18431 | 27886聽|聽24456 | 25110 | 31590 (?) |
| Avg. STC + LTC Cost | 27931 | 46381 |聽38039 | 42990 | 33418 |

[0, 4] + [0, 4] performs actually quite well to get more patches passing, but at the cost of a big resource usage hit that we're trying to limit and without increasing confidence that the patch is better than neutral.

STC tests are meant to be fast and not accurate. We used to have spare time to run speculative LTCs if some potential good ones did not pass STC, and several patches had passed this way.
Now we no longer have extra resources to do that, there is one test waiting for 3 days and not yet getting many games done. In return, we now get a long list of 20k failing STCs just waiting for more games to conclude their SPRT. I don't think there are good chances that the tides can turn for them, even if they eventually turned into some 90k yellows, they'd still need to fight the chance for a speculative LTC run.

STC tests are meant to be fast and not accurate.

This is definitely true.

However if they are too inaccurate either we miss on too many good patches, or we fill up LTC with poor patches to test, so finding the best balance is not trivial. That's where the VSTC idea above come from, a really cheap filter with really poor accuracy, but good enough to reject a big proportion of useless patches, and allow a somewhat costlier STC filter before investing on LTC.

So to avoid the extra resource hit how about [0,4.5] for both STC and LTC?

one way to save resources is to submit less variations.. a quick grep/sort (apologize for mistakes) in the last 100 games shows:

 11 kingweak
  9 windmil
  9 ShDepthP
  8 rebaseBadBish
  8 outpost
  6 mcT
  5 PV_PosA

note that the current stc bounds [0.5, 4.5] are equivalent to the old [0, 5], but with higher confidence levels. We have not really changed the SPRT bounds on STC, just made it return less random answers. LTC is what has changed, making it significantly easier to pass.

note that the current stc bounds [0.5, 4.5] are equivalent to the old [0, 5], but with higher confidence levels. We have not really changed the SPRT bounds on STC, just made it return less random answers.

The higher confidence levels is a meaningful change. It appears that most of elo-gaining patches are below the 1.75 threshold which gives 0 for both old and new STC bounds. Hence, higher confidence means that the flukes which powered SF's progress have a harder time to pass STC (though they have an easier one at LTC).

Going down the rabbit hole, I decided to do some simulations to see if 4-staged patch validation has potential.

Obviously, such a setup would require an update to fishtest to allow the automatic scheduling of the higher tier test in case the previous tier test pass, to avoid the hassle of having to submit several times the tests (especially as the VSTC would pass quite often).

The TC for the simuls are 4+0.04, 10+0.10, 25+0.25, 60+0.60, so roughly 2.5x the TC between each stage.

The fact that a patch has to clear several stages allows to use reduced accuracy in individual stages, keeping resource usage in check while the pass probability of patches between 1 and 1.5 elo increases a lot. This yields the best elo/resource ratio I've seen so far using the simulation tool without sacrificing non-regression probability.

| Limits | [0,5] + [0,5] | [0,4] + [0,4] | [0.5, 4.5] + [0,3.5] | [0.4, 4.4] + [0, 3.2] | [-1.0, 3.5] + [-0.5, 3.5] + [0, 3.5] | [-1.5, 3.0] + [-1.1, 3.2] + [-0.9, 3.5] + [-0.9, 3.7] |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 ELO pass prob | 0.0025 | 0.0025 | 0.00123 | 0.0014 | 0.00081 | 0.00112 |
| 1 ELO pass prob | 0.0744 | 0.1433 | 0.1002 | 0.1276 | 0.1443 | 0.1945 |
| 1.5 ELO pass prob | 0.2460 | 0.4446 | 0.3422 | 0.3946 | 0.4667 | 0.5291 |
| 2 ELO pass prob | 0.5135 | 0.7477 | 0.6438 | 0.6852 | 0.7538 | 0.7919 |
| total ELO gain ratio | 1.0 | 1.7497 | 1.3192 | 1.5523 | 1.7646 | 2.1326 |
| -0 ELO acceptance ratio | 2.5 e-04 | 2.0 e-04 | 9.1e-05 | 1.0 e-04 | 4.7 e-05 | 5.8 e-05 |
| Avg. STC cost | 18431 | 27886 | 24456 | 25110 | 31590 (?) | VSTC : 11991 ; STC : 7820 ; MTC : 8096 ; LTC : 10114 |
| Avg. total Cost (in STC games) | 27931 | 46381 | 38039 | 42990 | 33418 | 38021 |

When it comes to measuring scaling, I'll use again my 1STC/1.5LTC and 1.5STC/1LTC example. Assuming linear scaling with TC, the first one would have 0.75/1/1.25/1.5 while the second would have 1.75/1.5/1.25/1. I'll also use a good scaler starting at 1.0 VSTC and ending at 1.75 LTC.

Good scaler (0.75 VSTC) : 25.4%
Good scaler (1 VSTC) : 41%
Bad scaler : 38.8%

Hence, the setup slightly favor good scaling but compared to now has a bias towards what work at a shorter TC due to the use of VSTC.

Considering that 4-stage do better than 3-stage in the simulation tool, I will also test 5-stage.

As I mentioned above, I did a 5-stage simul. Be aware that besides the uncertainty for the total results due to patch elo distribution, the elo compression with higher TC is not exact either. But this doesn't change the big picture.

The TC I used are 5+0.05 ; 10+0.1 ; 20+0.2 ; 40+0.4 ; 80+0.8.

It appears that this doesn't work well : testing various possibilities, I either get a too high resource cost, a too high regression probability, or an insufficient pass probability. Increasing the pass probability in that setup means increasing the number of too costly high TC tests, and if wanting to favor good scaling the first two stages can't filter effectively enough to control resource usage.

I don't know guys but my number of tests dropped by 150 in some weeks and queue is full. I'm pretty happy since it was pretty exhausting, now I have more time to think about what I write than just writing something to fill up the queue.

Having a full queue is not necessarily a bad thing, but using too much resources on the STC filter definitely is. Right now we have increased the consumption on both STC and LTC, very few patches pass STC, but STC tests in general eat up the majority of our resources. STC's are supposed to help us save resources, and currently they are not. At the same time its much better for an STC to fluke green and "waste" an LTC test than to fluke red and deprive ourselves from a good patch. So in my opinion for STC we should be looking for bounds with similar resource usage like our old (0,5), if not smaller. Hence what comes to mind is the likes of (-1,4), (-2,5), (-1,5) and such.

Why is having a full queue of crap better than letting it go idle?
One of the benefits of having spare resources is you can test something that you wouldn't try when there are something more useful to test with. If people are spending more time to prove crap is crap more confidently then this is going to the wrong direction.

For a multi step strategy (pre-filter + filter) it would be in our best interest to gradually increase confidence + narrow selection of the lower end. For example a nice 3 step plan would be:

  1. Prefilter, (-3 , 3) at 5"+ 0.05" // Cheaply cutting off about half of tested patches.
  2. Filter, (-2 , 3.5) at 20" + 0.2" // Gradual rise of requirements + quality.
  3. Selection, (-1 , 4) at 80" + 0.8" // High quality selection.

This scheme is also very healthy for scaling. Starting with a fast precondition of a requirement to just be more likely positive on low TC and gradually increasing requirement, it gives opportunity for the good scalers to show their worth. The filters are doing their job of saving resources but with minimal oppression. In the end the selection is of very high 80" quality with middle ground confidence.

In general this scheme seems to me both cheaper and better than the current one. I would welcome some validations with the tools. @vondele @Alayan-stk-2

For the current 2-phase scheme, I regard the new LTC bounds very appropriate, but the STC ones unnecessarily expensive, too thick filtering and hostile to scaling.

We are currently spending so much in order to have a very accurate estimation on very low quality STC information, why?

The STC would be better in the opposite direction, faster and less strict, to save us resources in general but to block as few potentially good stuff as possible.

But, to give credit where its worth, for neutral scaling stuff the derived confidence from the STCs is of equal worth and obtained cheaply. So the sheer amount of increased confidence gives a solid flukeless progress in this patch category.

The benefited target group are patches with positive elo + neutral/mildly negative scaling. The (0.5 , 4.5) is more demanding than the (0, 3.5) LTC, but the LTC exists, safeguarding us from negative scalers

It feels like the new STC is not at all a filter, but the basis of our selection process, being harder to pass than the LTC. In this viewpoint the LTC acts as a precaution measure, maybe overly expensive. So the change has been from: STC filter + LTC selection
to: STC selection + LTC verification

I am not aware of the distribution regarding neutral/negative/positive scaling, but we have observed far too many STC greens + LTC reds with the old bounds. Naturally, part of this inconsistency is due to statistical flukes and part is due to bad scaling. Our methodology was and is hostile towards good scalers. I believe that this category, even if uncommon, is the most important one. But how to spot it? Speculative LTC's has been a mild workaround STC suppression, the equivalent of lowering STC requirements. Now the combination of not using speculative LTCs + high STC confidence will completely block the slow-starting high-scaling category. This is very concerning. I can think of 2 solutions:

1. To lower the STC elo requirements. This will allow more patches to be tested at LTC, but it will also lower the elo value of STC-derived confidence.

2. To use a VSTC scaling check after mediocre STC results. Instead of costly speculative LTCs, a VSTC check can cheaply spot positive scaling. The more negative the result, the better for scaling. New STC offers a very solid base regarding confidence for comparison with the VSTC. Here not only negative results will be promising but also positive results would be alarming. Maybe its a good idea to check how often STC greens which fail LTC perform too good at VSTC. If this is the case (and I think its illogical for the scaling curve to behave chaotically) we can save ourselves from too expensive LTCs by locating scaling through cheap 2" VSTCs.

@noobpwnftw I agree that there is no big problem having the queue go idle, and there is no need to fill it with crap.. I think it is good (and fun!) to take some time to think about a patch.

However, the increase in time is relatively small for patches that are crap. Patches < 0 fail quickly also in the new scheme. The patches that now run (much) longer are those in the middle of the interval (i.e. near 2.5 BayesElo at STC). The hope is that the more careful testing will increase the certainty with which patches > 2.5 Elo pass.

What I think we need to do now is to look into how fishtest assigns its priorities. The current situation in which patches are in a pending state for days just because they need >80k games (which is now more common, and actually means they are likely good) is annoying. Fishtest should be more like a FIFO system.... possibly with a priority related to the number of patches by the same author that are running simultaneously.

Let me post a link to fishcooking where this discussion on priorities is taking place:
https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/BxWmrJ9yK64

From this, it looks like just increasing the throughput would weight the starting time more, and avoid the issues mentioned above. Still needs a closer look.

@vondele I agree on limiting resubmitting the (almost) same patch again and again.

I see many people is going wild with repeated attempts on the same idea. Limiting this misbehavior it should be the first target to both decrease the average waiting time and to avoid the people to abuse the donated and free CPU resources.

Sorry to be rude, I intend to be provocative here, but I don't care about complex schemes to handle the quantity of tests, I care about avoiding people to create useless tests in first place, the latter point is more valuable to me because it goes beyond a pure technical merit, and addresses the fairness and conscious usage of free donated resources. Many people seem to not care about this and this is bad.

There is a rationale for testing many times the same idea because if the idea is promising but just not enough to pass, the quest is to find the perfect scheme that will make it.

Inspiration is limited and if we are to abandon ideas only after a couple of tests, we will never know if there was elo gain in some form of the idea or not. Imo its much better to be thorough and sure.

Everybody is sharing the same goal, to make SF better, a lot of people are working 24/7 for this, and SF is indeed progressing in one way or the other. Not only the resources are free but people are also working for free, lets not forget that.

Having said that, there are of course cases where the resources are put into bad use. Those imo are not used maliciously but unconsciously, ie the intention is good but there is lack of judgment so for those its good for everyone to be addressed specifically and to the point, exactly when they take place. A generic call for all to behave is usually interpreted subjectively, everyone may think that he is using resources well :)

The problem isn't that people make lots of test of the same idea, the real problem is running them all at the same time.
There could be a mechanism to reduce priority of tests of a single user dependent on their number, so resources are shared equally.
I.e. current scheme:
user A runs 10 tests
user B runs 1 tests
90% of resources will go to A.
Instead of allocation resource by tests, it should allocate resource by users so A and B got both 50% resources, and
in case of N users they all get 1/N resources.

To not discourage patch writing, I think we can emphasis on proper priority when submitting tests: if they are just wild ideas, put them on -1, if you are trying a few variants, run one at a time. For those long promising tests, we can manually put them on high priority and finish them first.

Since we have more confident STC tests now, I would recommend all LTC tests get high priority by default.

Yes. I generally limit myself to 2 tests at a time and put later ones at
prio -1 so that they can get approved and then i can release them later.

-1 priority tests may take a lot of time to be tested, because as long as there is at least one 0 priority test in the queue, it won't be run.

A potential solution is indeed to run variants one after another to get insight from the first attempts. Otherwise, setting low throughput should work too.

I agree LTC tests should be run with a higher priority to avoid them being stalled for days (like for the voting2 patch).

I will rerun some more 3-4 stage simuls with more parameter tweaks attempts. The formula for the elo distribution of submitted patches is not perfect (and doesn't attempt to simulating scaling because we don't have data on this), but rather good, enough for the results to be relevant.

Here are a few simulations results.

The table contains :

  • The old and current bounds for comparison
  • [-0.5, 4.5]聽+ [0, 3.5] as requested by James_TCEC. As can be seen, this improves passing rates noticeably, but increase regression/useless patch probability and make LTC costs explode
  • My first 3-staged and 4-staged parameters (VSTC : 4+0.04 ; STC : 10+0.10 ; MTC (4-staged only) : 25+0.25 ; LTC : 60+0.6)
  • Two new 3-staged parameters tests where I sacrificed the non-regression probability in favor of reducing resource cost and increasing expected elo gain.

No 4-staged parameters new test yet as the table wouldn't fit here.

| Limits | [0,5] + [0,5] | [-0.5,4.5] + [0,3.5] (new) | [0.5, 4.5] + [0,3.5] | [-1.0, 3.5] + [-0.5, 3.5] + [0, 3.5] | [-1.3, 3.0] + [-0.9, 3.25] + [-0.5, 3.5] (new) | [-1.5, 3.2] + [-1.0, 3.5] + [-0.7, 3.8] (new) | [-1.5, 3.0] + [-1.1, 3.2] + [-0.9, 3.5] + [-0.9, 3.7] |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 ELO pass prob | 0.0025 | 0.00433 | 0.00123 | 0.00081 | 0.00374 | 0.00486 | 0.00112 |
| 0.5 ELO pass prob | ? | 0.03773 | ? | 0.01593 | 0.04674 | 0.04663 | ? |
| 1 ELO pass prob | 0.0744 | 0.1928 | 0.1002 | 0.1443 | 0.2551 | 0.2258 | 0.1945 |
| 1.5 ELO pass prob | 0.2460 | 0.4831 | 0.3422 | 0.4667 | 0.5928 | 0.5323 |聽0.5291 |
| 2 ELO pass prob | 0.5135 | 0.7320 | 0.6438 | 0.7538 | 0.8284 | 0.7790 |聽 0.7919 |
| total ELO gain ratio | 1.0 | 2.0706 | 1.3192 | 1.7646 | 2.5934 | 2.3489 | 2.1326 |
| -0 ELO acceptance ratio | 2.5 e-04 | 3.56 e-03 | 9.1e-05 | 4.7 e-05 | 2.4 e-04 | 3.5 e-04 |聽5.8 e-05 |
| Avg. STC cost | 18431 | 20586 | 24456 | VSTC : 10887 ; STC : 5803 ; LTC 16727 | VSTC : 12649 ; STC : 7537 ; LTC 21119 | VSTC : 11001 ; STC : 6707 ; LTC 16449 | VSTC : 11991 ; STC : 7820 ; MTC : 8096 ; LTC : 10114 |
| Avg. total Cost (in STC games) | 27931 | 50640 | 38039 | 33418 | 41306 |聽34157 | 38021 |

I have not done computations for the scaling behavior of the newly proposed bounds. I expect the "favors better scalers but has a bias towards shorter TC" to stay true. For 3-staged testing, 15+0.15 STC would only moderately increase the overall cost (it's the cheapest phase) and would improve scaling, but maybe not enough to justify the cost.

For scaling confirmation, it could be advantageous rather than running a SPRT at a very long TC (which can be very expensive) to run a fixed number of games and only reject patches going below a certain threshold. This step would be a safety net and would only reject a small number of patches/simplifications.

I think we have all experienced a painful resource bottleneck of the workflow: everything takes too long, while the creative human availability is in abundance.
It seems that @snicolet and @rocky640 where correctly concerned for this.

On the other hand the elo progress looks decent, at the same levels with before.

I think that the new LTC is nice, but the STC should resolve faster so that the developers are not limited.
Such high STC confidence imo is not worth it, because there is the higher quality LTC test as safeguard.
STC flukes can be a window for good scalers. An occasional double STC+LTC fluke will not be a disaster: 0.5 elo gainers are not generally wanted but are also not too bad of a deal, they will eventually be simplified away. Also with @protonspring that active on simplifications makes adding code for lower elo less dangerous, maybe even beneficial, while with a very high passed simplification to elo gainer ratio slow regression is a bigger concern.

With the old bounds framework was often idle and poor @Vizvezdenec had to throw whatever he could think off, but new LTC already uses much more resources than old one and its rightfully with lower requirements because with the old bounds the LTC greens to STC greens ratio was incredibly low.

Contemplating all this, and regarding that enough time has passed for evaluating the new scheme, the propositions are:

1. Rollback of STC to (0,5) + Keeping LTC to (0,3.5) and/or

2. Research for faster pacing STC, ideally with less suppression of low starting good scalers.

@vondele Could you add to your tool something to simulate patches not having the same elo at STC and LTC ?

Because we only submit to LTC patches having passed STC, we have very limited data on the actual distribution, but even with this caveat some parametrized formula allowing to test a few hypothesis would help in designing bounds better suited for good scalers.

Also maybe could use an update considering @vdbergh findings on correlation in game pairs which trinomial frequencies miss ? I'd expect e.g. that the confidence level for STC being in reality higher than predicted by the tool is reducing a lot the rate of "flukes" for patches between 1 and 1.5 elo and so the expected elo gain. This invalidates pretty much all the previous tables...

Ideally too, native 3-staged testing support (with 1 stage being easy to make inactive with guaranteed pass to support 2-staged) so I don't have to paste manually my code tweaks. :slightly_smiling_face:

@Alayan-stk-2 right now I want to focus on other stuff (cluster), but feel free to fork that branch. This should be relatively easy to add.

However, TC scaling of patches is not quite clear.. there is some data here:
https://github.com/glinscott/fishtest/wiki/UsefulData#elo-change-with-respect-to-tc
showing it can go either way.

And unfortunately these tests do not necessarily show that the average scaling ratio may be assumed to be 1:1 because of selection bias (well scaling patches are more likely to make it into master). So about scaling absolutely nothing concrete is known. It would require a dedicated effort consuming a large amount of ressources to get reliable information.

@vdbergh Actually what is helped to make it into master is neutral scaling. Both good scalers and bad scalers are blocked, the former at STC and the latter at LTC. For an equal LTC performance of 2 patches, the better scaler would most likely be the one which performs worse at STC, but from a statistical confidence perspective we prefer 2 positive results.

I find it very annoying that 2 beneficial concepts lie in opposite directions, scaling promotion and cheap confidence.

@NKONSTANTAKIS You make a good point but you are talking about something different than I was. I assume you are trying to compare the behaviour with respect to scaling of the new bounds versus the old bounds. I didn't say anything about this.

I was pointing out that within a given set of bounds a better scaling patch is more likely to pass STC+LTC. This seems a tautology to me. The pool of patches submitted to LTC is independent of the scaling (obviously) but then the LTC filter has a preference for the better scaling patches (again obviously).

I was pointing out that within a given set of bounds a better scaling patch is more likely to pass STC+LTC. This seems a tautology to me. The pool of patches submitted to LTC is independent of the scaling (obviously) but then the LTC filter has a preference for the better scaling patches (again obviously).

I think you are mistaken.

Yes, for patches having equal STC performance, better scalers will be selected by LTC. This is completely obvious.

But if you look another step back to the patches which are submitted at STC, you will find that patch scaling matter for passing STC. Basically, this is wrong :

The pool of patches submitted to LTC is independent of the scaling (obviously)

It is not.

A patch giving +1 elo at STC and +2 elo at LTC will fail much more often at the STC step than a patch giving +2 elo at STC and +1 elo at LTC. Hence compared to the patches submitted for STC (all fishtest patches), the pool of patches submitted at LTC will be biased towards better STC performance.

@Alayan-stk-2 I understand your point. It depends on the model. If you assume that STC and LTC elo are _negatively correlated_ (e.g. STC elo+LTC elo=contant as in your example) then you are right. But this is an _extremely pessimistic_ assumption. I find it more reasonable to assume that LTC elo=r*STC elo+(some random variable with expectation value 0) where r is some constant near 1.

So for me it is more realistic to assume that the elo distribution is something like (1,0.5),(1,1),(1,1.5),(2,1.5),(2,2),(2,2.5). Among those the ones with higher scaling are more likely to pass.

But I am ready to be proven wrong :)

If SE, LE are Stc and Ltc elo then the scaling model I would like to propose is

LE=r*SE+X

where X~N(0,sigma^2) and r is a constant near 1.

Assuming additivity of elo then given sigma2 and the elo prior one may get an estimate for r by doing STC+LTC tests of sf10 vs sf9 (as I have already been doing). However the difficulty lies in determining sigma^2. I don't see an approach which would not cost a large amount of resources.

In any case the total number of parameters in the "fishtest model" would be 5.

  • mu of the elo prior
  • sigma^2 of the elo prior
  • the nTries parameter (to compensate for the fact that many patches are trivial variations of each
    other).
  • the average scaling ratio r.
  • sigma^2(scaling), to account for the variation in scaling behaviour.

Sadly 5 parameters is too many to fit with the available data I think.

https://web.ma.utexas.edu/users/mks/statmistakes/ovefitting.html

Of course STC elo + LTC elo is not a constant. I've never assumed patches submitted at fishtest to follow this, and my point doesn't depend on this.

This was to give a quick example of how two patches with different scaling behavior get (with old bounds) equal STC+LTC passing behaviour with higher STC passing (and so higher proportion of LTC runs) for the bad scalers. With new (current) bounds, the STC/LTC 2/1 is actually more likely to pass STC+LTC than the 1/2.

Your fundamental assumption here seems to be :

"For any given STC level of performance, the average scaling behavior at LTC of patches submitted at fishtest is ~ neutral."

Let's suppose for the purposes of the demonstration that SF is at the ceiling of STC efficiency (best elo achievable given this amount of computations). We know that this will be different from the optimal code to reach LTC ceiling. Functional code change improving LTC behavior will hurt STC behavior, and the average scaling behavior assumption can't hold. Tons of things losing STC elo will lose LTC elo, but an handful of things losing STC elo will gain LTC elo.

Now, of course, SF is not at that ceiling, and likely still far from it. But it is also very far from a random mover, so the tension between the optimums exists. Tuning also commonly shows that 20+0.2 results can be noticeably different from 60+0.6. Protonspring recent serie of TM patch tries, with several passing STC before failing LTC, and one LTC tune result passing LTC clearly but failing STC, is a good example.

Things which can help SF spot something a ply earlier but not many plies earlier will be better at STC where each ply of search is critical, while things which help SF spot things many plies in advance (often allowing to spot it at all when shuffling can get the engine to push the truth beyond the horizon) get better with a longer TC.

I forked @vondele SPRT simul tool as suggested. So far, I have only added a few more pass probabilities in the final results to not have to add them manually, and basic support for up to 4-staged testing (vstc and mtc can be disabled by switching a bool so 2-staged and 3-staged are easy to test too, I actually set the default to the current fishtest bounds to confirm I get the same results as with vondele's tool in that case).

The binder is available here : https://mybinder.org/v2/gh/Alayan-stk-2/Stockfish/SPRT_tool?filepath=tools%2FFishtest_SPRT_opimization.ipynb

I will update it later to get more stats related to the different stages at the end and to have adapted graphs.

Recent patches being merged into Stockfish confirm the trend : progress is driven by small gainers. There was 5 elo gainers between the 22nd January regression test and the one of 3rd February, with an overall gain of ~4 elo (big error bars). A similar picture emerges with a longer time frame.

New LTC bounds have definitely helped to validate and merge some of these recent gainers, but the STC requirements continue to be very high with at least 1.67 elo with a high confidence degree.

A large majority of STC tests do not even manage to break even to get yellow result. Hence, laxer bounds at 5+0.05 TC would certainly not reject more good patches (especially good scalers) than current STC bounds and would be much cheaper resource-wise as a first filter. Then better STC and LTC testing could be done.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Technologov picture Technologov  路  3Comments

d3vv picture d3vv  路  5Comments

fun8 picture fun8  路  4Comments

nguyenpham picture nguyenpham  路  4Comments

rayoh123 picture rayoh123  路  5Comments