Data.table: New function: topn for efficiently doing sorted head/tail

Created on 30 Aug 2019  Â·  23Comments  Â·  Source: Rdatatable/data.table

Inspired by Matt's observation here: https://github.com/Rdatatable/data.table/pull/3604/commits/e1ac66373f7f3d20dfa0299f2accddc1b40a0e5d

DT[topn(score, 5L)] also looks nicer than DT[order(score)[1:5]] or DT[order(score)][1:5].

A quick search suggests two possible implementations which might be better in one situation or another:

https://stackoverflow.com/questions/4956593/optimal-algorithm-for-returning-top-k-values-from-an-array-of-length-n

Will take a look at feasibility to implement cleanly.

Just looked now and dplyr also has top_n but seems they implement it inefficiently:

dplyr:::top_n_rank
function (n, wt) 
{
    if (n > 0) {
        min_rank(desc(wt)) <= n
    }
    else {
        min_rank(wt) <= abs(n)
    }
}
feature request performance

All 23 comments

Questions:

  • I am wrong in saying that you want something more like DT[order(score, decreasing = TRUE)[1:5]] ?
  • order works for character vector..etc do you want the same or just numeric columns/vectors?
  • Do you have 4-5 examples of usage so I get a good idea of what you would like?
    Thanks

Yes, order(score, decreasing = TRUE)[1:5], however, it's inefficient to (1) sort and (2) return millions of observations when we're only after the top 5.

The link suggests either a tournament or a min-heap implementation

Ok thank you. I think I can suggest something different from what is in your link but not far. My initial experiment suggests a 76 times speed-up.

Awesome, that's exactly the sort of thing I had in mind! 🚀

and why topn and not head? we could easily optimize head(order(.), 5) or order(.)[1:5]

That looks wrong to me... head(x, by = col_spec) maybe looks more natural. But topn fits more naturally into i:

head(x, 5L, by = .(V1, -V2))
# vs
x[topn(V1, -V1, 5L)]

Looking at this, I'm not sure how safe it is to have ... signature, and topn(..., n=5L) reads funny, maybe top(..., n = 5L) is more natural.

but x has particular order already so doing head(x, by = cols) does not really asks for any new order.
topn sounds nice but it is always a new function for users to learn vs just base R known functions

Agree, but I think we're stretching head pretty far to accommodate here. Certainly I wouldn't understand what's going on if I read head(order(.), 5)

We can optimize it, but users would have to know that we're optimizing it to even try writing like that. Users knowing internal optimizations like this is I would guess harder than knowing a new function.

If we just document that then I think it should be fine, users who don't read documentation cannot expect to write efficient code. And because it is build base R way it should be easy to remember.

I disagree with that... already the manual on CRAN is 120+ pages:

https://cran.r-project.org/web/packages/data.table/data.table.pdf

Is reading this really required to get the most out of data.table?

Anyway, even with long-time familiarity with base R, I highly doubt I would write order(...)[1:5] as head(order(...), 5) -- I would have to know that's optimized, and change my code to take advantage.

When we do auto-optimizations, I would expect we catch common coding style, and optimize it, rather than making user change coding style to get optimization. Once user is required to change coding style to get optimization, I no longer see the benefit of avoiding a new function.

Could you clarify the signature of the function topn ? From the stackoverflow link you provided I understood it would be topn(vec, n) not with ... which I don't understand ? Do you have 4-5 examples of how you would use such a function? Do you want the function to work for character vector because I only have it for integer and double for now? Thank you.

My vote would be for a new topn function. Regarding the signature, I think it is reasonable to only allow one field for ordering. dplyr::top_n only accepts one field.

I think starting with (x, n) signature is OK, but I'm worried that it will break backwards compatibility later if we decide to move to (..., n) and people already rely on top(x, n) shorthand

If you are thinking about the future, reordering the arguments makes sense.
That is topn(n, vec).

On Fri, Jan 10, 2020, 7:57 PM Michael Chirico notifications@github.com
wrote:

I think starting with (x, n) signature is OK, but I'm worried that it will
break backwards compatibility later if we decide to move to (..., n) and
people already rely on top(x, n) shorthand

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/3804?email_source=notifications&email_token=AN2OKKNPLPXWOPJZLZA5FEDQ5EKOPA5CNFSM4ISILYVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIVULRI#issuecomment-573261253,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AN2OKKJ56O2KBWPMTOC6ZDLQ5EKOPANCNFSM4ISILYVA
.

When discussing api it would be useful to provide complete example. It is not obvious if n and vec are numeric, character or symbol.

Here is some code:

topn = inline::cfunction(
  language = "C",
  sig = c(vec = "SEXP", n = "SEXP"),
  includes = "#include<stdint.h>",
  body = "
int nprotect = 0;
int64_t i, j, idx = 0;
const int64_t len0 = asInteger(n);
const int64_t len1 = xlength(vec);

SEXPTYPE tvec = TYPEOF(vec);
SEXP ans = PROTECT(allocVector(INTSXP, len0)); nprotect++;
int *restrict pans = INTEGER(ans);
int tmp;

switch(tvec) {
case INTSXP: {
    const int *restrict pvec = INTEGER(vec);
    int min_value = pvec[0];
    for (i = 0; i < len0; ++i) {
        pans[i] = i;
        if (pvec[i] < min_value) {
            min_value = pvec[i];
            idx = i;
        }
    }
    for (i = len0; i < len1; ++i) {
        if (pvec[i] > min_value) {
            min_value = pvec[i];
            pans[idx] = i;
            for (j = 0; j <len0; ++j) {
                if (min_value > pvec[pans[j]]) {
                    min_value = pvec[pans[j]];
                    idx = j;
                }
            }
        }
    }
    for (i = 0; i < len0; ++i) {
        tmp = pans[i];
        for (j = i; j>=1 && pvec[tmp] > pvec[pans[j-1]]; --j) {
            pans[j] = pans[j-1];
        }
        pans[j] = tmp;
    }
    for (i =0; i < len0; ++i) {
        pans[i]++;
    }
} break;
case REALSXP: {
    const double *restrict pvec = REAL(vec);
    double min_value = pvec[0];
    for (i = 0; i < len0; ++i) {
        pans[i] = i;
        if (pvec[i] < min_value) {
            min_value = pvec[i];
            idx = i;
        }
    }
    for (i = len0; i < len1; ++i) {
        if (pvec[i] > min_value) {
            min_value = pvec[i];
            pans[idx] = i;
            for (j = 0; j <len0; ++j) {
                if (min_value > pvec[pans[j]]) {
                    min_value = pvec[pans[j]];
                    idx = j;
                }
            }
        }
    }
    for (i = 0; i < len0; ++i) {
        tmp = pans[i];
        for (j = i; j>=1 && pvec[tmp] > pvec[pans[j-1]]; --j) {
            pans[j] = pans[j-1];
        }
        pans[j] = tmp;
    }
    for (i =0; i < len0; ++i) {
        pans[i]++;
    }
} break;
default:
    error(\"Type %s is not supported.\", type2char(tvec));
}
UNPROTECT(nprotect);
return ans;                  
")

x = rnorm(1e6)

microbenchmark::microbenchmark(
  head(order(x, decreasing = TRUE), 10),
  order(x, decreasing = TRUE)[1:10],
  topn(x, 10L),
  times = 10L  
)
# Unit: milliseconds
#                             expr            min        lq      mean    median       uq        max neval
# head(order(x, decreasing = TRUE), 10) 85.966673 86.035948 89.558628 86.155044 86.44733 117.405524    10
# order(x, decreasing = TRUE)[1:10]     85.876443 86.300224 93.401301 89.960299 93.78672 113.243412    10
# topn(x, 10L)                           1.105422  1.110553  1.488534  1.502047  1.54032   2.762056    10


# https://stackoverflow.com/questions/5569038/fastest-way-to-find-the-index-of-the-second-third-highest-lowest-value-in
# https://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column

microbenchmark::microbenchmark(
  Rfast::nth(x, 10L, descending = TRUE, index.return = TRUE),
  topn(x, 10L)[10L],
  times = 10L
)
# Unit: milliseconds
#        xpr       min        lq      mean    median       uq       max neval
# Rfast::nth 21.402828 22.291441 24.167364 22.417378 25.57093 30.783297    10
# topn        1.104994  1.108415  1.128642  1.121244  1.14177  1.179829    10

microbenchmark::microbenchmark(
  Rfast::nth(x, 10L, descending = TRUE),
  x[topn(x, 10L)[10L]],
  times = 10L
)
# Unit: milliseconds
#       expr       min        lq      mean    median       uq       max neval
# Rfast::nth 15.721778 15.768818 16.808768 15.843653 16.24220 24.114424    10
# topn        1.109697  1.123382  1.222292  1.228792  1.31838  1.347887    10

Can I move on with a PR?

topn() would be a fantastically useful function. I feel like it should parallel frank(), but with two modes:

1) one for returning indices (e.g. order(c(3, 2, 10, 5, 1, 1, 2, 10), decreasing = TRUE)[1:5])
2) and one for returning values (e.g. sort(c(3, 2, 10, 5, 1, 1, 2, 10), decreasing = TRUE)[1:5])

Above examples also highlight the need for a ties.method param, as in frank(), perhaps with options {'first', 'last', 'random', 'all'}. NAs need to be managed as well.

I may be wrong but I don't think it is difficult to implement 1 and 2, just don't see much value in it. Dealing with NA slows down massively the function but it remains 20 times faster than order. I would prefer to leave the ties.method out of it for now.

Only tangentially related, dplyr is retiring top_n() and going to slice_max() and slice_min().

#' \Sexpr[results=rd, stage=render]{lifecycle::badge("retired")}
#' `top_n()` has been retired in favour of [slice_min()]/[slice_max()].
#' While it will not be deprecated in the near future, retirement means
#' that we will only perform critical bug fixes, so we recommend moving to the
#' newer alternatives.
#'
#' `top_n()` was retired because the name was fundamentally confusing as
#' it returned what you might reasonably consider to be the _bottom_
#' rows. Additionally, the `wt` variable had a confusing name, and strange
#' default (the last column in the data frame). Unfortunately we could not
#' see an easy way to fix the existing `top_n()` 

My use case similar with this is mostly done within groups and ordered by another column, e.g.

dt[order(-x), .(m = head(y, 3)), by = grp]

Or aggregate the top n y ordered by x grouped by grp:

dt[order(-x), .(m = sum(head(y, 3))), by = grp]

@renkun-ken , fyi and if you are interested, you will find the topn function in the package kit on CRAN.

@2005m Your kit::topn fits my use case nicely. Thanks!

@renkun-ken , you are welcome. Any issue please let me know. If you have any other need like this, please let me know.

Was this page helpful?
0 / 5 - 0 ratings