Data.table: New function: topn for efficiently doing sorted head/tail

Created on 30 Aug 2019 · 23Comments · Source: Rdatatable/data.table

Inspired by Matt's observation here: https://github.com/Rdatatable/data.table/pull/3604/commits/e1ac66373f7f3d20dfa0299f2accddc1b40a0e5d

DT[topn(score, 5L)] also looks nicer than DT[order(score)[1:5]] or DT[order(score)][1:5].

A quick search suggests two possible implementations which might be better in one situation or another:

https://stackoverflow.com/questions/4956593/optimal-algorithm-for-returning-top-k-values-from-an-array-of-length-n

Will take a look at feasibility to implement cleanly.

Just looked now and dplyr also has top_n but seems they implement it inefficiently:

dplyr:::top_n_rank
function (n, wt) 
{
    if (n > 0) {
        min_rank(desc(wt)) <= n
    }
    else {
        min_rank(wt) <= abs(n)
    }
}

feature request performance

Source

MichaelChirico

👍3

All 23 comments

Questions:

I am wrong in saying that you want something more like DT[order(score, decreasing = TRUE)[1:5]] ?
order works for character vector..etc do you want the same or just numeric columns/vectors?
Do you have 4-5 examples of usage so I get a good idea of what you would like?
Thanks

2005m on 8 Jan 2020

Yes, order(score, decreasing = TRUE)[1:5], however, it's inefficient to (1) sort and (2) return millions of observations when we're only after the top 5.

The link suggests either a tournament or a min-heap implementation

MichaelChirico on 8 Jan 2020

Ok thank you. I think I can suggest something different from what is in your link but not far. My initial experiment suggests a 76 times speed-up.

2005m on 8 Jan 2020

Awesome, that's exactly the sort of thing I had in mind! 🚀

MichaelChirico on 8 Jan 2020

and why topn and not head? we could easily optimize head(order(.), 5) or order(.)[1:5]

jangorecki on 9 Jan 2020

That looks wrong to me... head(x, by = col_spec) maybe looks more natural. But topn fits more naturally into i:

head(x, 5L, by = .(V1, -V2))
# vs
x[topn(V1, -V1, 5L)]

Looking at this, I'm not sure how safe it is to have ... signature, and topn(..., n=5L) reads funny, maybe top(..., n = 5L) is more natural.

MichaelChirico on 9 Jan 2020

but x has particular order already so doing head(x, by = cols) does not really asks for any new order.
topn sounds nice but it is always a new function for users to learn vs just base R known functions

jangorecki on 9 Jan 2020

Agree, but I think we're stretching head pretty far to accommodate here. Certainly I wouldn't understand what's going on if I read head(order(.), 5)

We can optimize it, but users would have to know that we're optimizing it to even try writing like that. Users knowing internal optimizations like this is I would guess harder than knowing a new function.

MichaelChirico on 9 Jan 2020

👍1

If we just document that then I think it should be fine, users who don't read documentation cannot expect to write efficient code. And because it is build base R way it should be easy to remember.

jangorecki on 9 Jan 2020

I disagree with that... already the manual on CRAN is 120+ pages:

https://cran.r-project.org/web/packages/data.table/data.table.pdf

Is reading this really required to get the most out of data.table?

Anyway, even with long-time familiarity with base R, I highly doubt I would write order(...)[1:5] as head(order(...), 5) -- I would have to know that's optimized, and change my code to take advantage.

When we do auto-optimizations, I would expect we catch common coding style, and optimize it, rather than making user change coding style to get optimization. Once user is required to change coding style to get optimization, I no longer see the benefit of avoiding a new function.

MichaelChirico on 9 Jan 2020

Could you clarify the signature of the function topn ? From the stackoverflow link you provided I understood it would be topn(vec, n) not with ... which I don't understand ? Do you have 4-5 examples of how you would use such a function? Do you want the function to work for character vector because I only have it for integer and double for now? Thank you.

2005m on 9 Jan 2020

My vote would be for a new topn function. Regarding the signature, I think it is reasonable to only allow one field for ordering. dplyr::top_n only accepts one field.

ColeMiller1 on 11 Jan 2020

I think starting with (x, n) signature is OK, but I'm worried that it will break backwards compatibility later if we decide to move to (..., n) and people already rely on top(x, n) shorthand

MichaelChirico on 11 Jan 2020

If you are thinking about the future, reordering the arguments makes sense.
That is topn(n, vec).

On Fri, Jan 10, 2020, 7:57 PM Michael Chirico notifications@github.com
wrote:

I think starting with (x, n) signature is OK, but I'm worried that it will
break backwards compatibility later if we decide to move to (..., n) and
people already rely on top(x, n) shorthand

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/3804?email_source=notifications&email_token=AN2OKKNPLPXWOPJZLZA5FEDQ5EKOPA5CNFSM4ISILYVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIVULRI#issuecomment-573261253,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AN2OKKJ56O2KBWPMTOC6ZDLQ5EKOPANCNFSM4ISILYVA
.

ColeMiller1 on 11 Jan 2020

When discussing api it would be useful to provide complete example. It is not obvious if n and vec are numeric, character or symbol.

jangorecki on 11 Jan 2020

Here is some code:

topn = inline::cfunction(
  language = "C",
  sig = c(vec = "SEXP", n = "SEXP"),
  includes = "#include<stdint.h>",
  body = "
int nprotect = 0;
int64_t i, j, idx = 0;
const int64_t len0 = asInteger(n);
const int64_t len1 = xlength(vec);

SEXPTYPE tvec = TYPEOF(vec);
SEXP ans = PROTECT(allocVector(INTSXP, len0)); nprotect++;
int *restrict pans = INTEGER(ans);
int tmp;

switch(tvec) {
case INTSXP: {
    const int *restrict pvec = INTEGER(vec);
    int min_value = pvec[0];
    for (i = 0; i < len0; ++i) {
        pans[i] = i;
        if (pvec[i] < min_value) {
            min_value = pvec[i];
            idx = i;
        }
    }
    for (i = len0; i < len1; ++i) {
        if (pvec[i] > min_value) {
            min_value = pvec[i];
            pans[idx] = i;
            for (j = 0; j <len0; ++j) {
                if (min_value > pvec[pans[j]]) {
                    min_value = pvec[pans[j]];
                    idx = j;
                }
            }
        }
    }
    for (i = 0; i < len0; ++i) {
        tmp = pans[i];
        for (j = i; j>=1 && pvec[tmp] > pvec[pans[j-1]]; --j) {
            pans[j] = pans[j-1];
        }
        pans[j] = tmp;
    }
    for (i =0; i < len0; ++i) {
        pans[i]++;
    }
} break;
case REALSXP: {
    const double *restrict pvec = REAL(vec);
    double min_value = pvec[0];
    for (i = 0; i < len0; ++i) {
        pans[i] = i;
        if (pvec[i] < min_value) {
            min_value = pvec[i];
            idx = i;
        }
    }
    for (i = len0; i < len1; ++i) {
        if (pvec[i] > min_value) {
            min_value = pvec[i];
            pans[idx] = i;
            for (j = 0; j <len0; ++j) {
                if (min_value > pvec[pans[j]]) {
                    min_value = pvec[pans[j]];
                    idx = j;
                }
            }
        }
    }
    for (i = 0; i < len0; ++i) {
        tmp = pans[i];
        for (j = i; j>=1 && pvec[tmp] > pvec[pans[j-1]]; --j) {
            pans[j] = pans[j-1];
        }
        pans[j] = tmp;
    }
    for (i =0; i < len0; ++i) {
        pans[i]++;
    }
} break;
default:
    error(\"Type %s is not supported.\", type2char(tvec));
}
UNPROTECT(nprotect);
return ans;                  
")

x = rnorm(1e6)

microbenchmark::microbenchmark(
  head(order(x, decreasing = TRUE), 10),
  order(x, decreasing = TRUE)[1:10],
  topn(x, 10L),
  times = 10L  
)
# Unit: milliseconds
#                             expr            min        lq      mean    median       uq        max neval
# head(order(x, decreasing = TRUE), 10) 85.966673 86.035948 89.558628 86.155044 86.44733 117.405524    10
# order(x, decreasing = TRUE)[1:10]     85.876443 86.300224 93.401301 89.960299 93.78672 113.243412    10
# topn(x, 10L)                           1.105422  1.110553  1.488534  1.502047  1.54032   2.762056    10


# https://stackoverflow.com/questions/5569038/fastest-way-to-find-the-index-of-the-second-third-highest-lowest-value-in
# https://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column

microbenchmark::microbenchmark(
  Rfast::nth(x, 10L, descending = TRUE, index.return = TRUE),
  topn(x, 10L)[10L],
  times = 10L
)
# Unit: milliseconds
#        xpr       min        lq      mean    median       uq       max neval
# Rfast::nth 21.402828 22.291441 24.167364 22.417378 25.57093 30.783297    10
# topn        1.104994  1.108415  1.128642  1.121244  1.14177  1.179829    10

microbenchmark::microbenchmark(
  Rfast::nth(x, 10L, descending = TRUE),
  x[topn(x, 10L)[10L]],
  times = 10L
)
# Unit: milliseconds
#       expr       min        lq      mean    median       uq       max neval
# Rfast::nth 15.721778 15.768818 16.808768 15.843653 16.24220 24.114424    10
# topn        1.109697  1.123382  1.222292  1.228792  1.31838  1.347887    10

Can I move on with a PR?

2005m on 14 Jan 2020

topn() would be a fantastically useful function. I feel like it should parallel frank(), but with two modes:

1) one for returning indices (e.g. order(c(3, 2, 10, 5, 1, 1, 2, 10), decreasing = TRUE)[1:5])
2) and one for returning values (e.g. sort(c(3, 2, 10, 5, 1, 1, 2, 10), decreasing = TRUE)[1:5])

Above examples also highlight the need for a ties.method param, as in frank(), perhaps with options {'first', 'last', 'random', 'all'}. NAs need to be managed as well.

ben519 on 14 Jan 2020

I may be wrong but I don't think it is difficult to implement 1 and 2, just don't see much value in it. Dealing with NA slows down massively the function but it remains 20 times faster than order. I would prefer to leave the ties.method out of it for now.

2005m on 15 Jan 2020

Only tangentially related, dplyr is retiring top_n() and going to slice_max() and slice_min().

#' \Sexpr[results=rd, stage=render]{lifecycle::badge("retired")}
#' `top_n()` has been retired in favour of [slice_min()]/[slice_max()].
#' While it will not be deprecated in the near future, retirement means
#' that we will only perform critical bug fixes, so we recommend moving to the
#' newer alternatives.
#'
#' `top_n()` was retired because the name was fundamentally confusing as
#' it returned what you might reasonably consider to be the _bottom_
#' rows. Additionally, the `wt` variable had a confusing name, and strange
#' default (the last column in the data frame). Unfortunately we could not
#' see an easy way to fix the existing `top_n()`

ColeMiller1 on 2 Feb 2020

👍1

My use case similar with this is mostly done within groups and ordered by another column, e.g.

dt[order(-x), .(m = head(y, 3)), by = grp]

Or aggregate the top n y ordered by x grouped by grp:

dt[order(-x), .(m = sum(head(y, 3))), by = grp]

renkun-ken on 25 May 2020

@renkun-ken , fyi and if you are interested, you will find the topn function in the package kit on CRAN.

2005m on 25 May 2020

👍1

@2005m Your kit::topn fits my use case nicely. Thanks!

renkun-ken on 26 May 2020

@renkun-ken , you are welcome. Any issue please let me know. If you have any other need like this, please let me know.

2005m on 26 May 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

include PostgreSQL's %ilike% into data.table

andschar · 3Comments

The foverlaps() function seems produce an incorrect answer (data.table 1.12.0)

lux5 · 3Comments

why data.table is faster with vectorized column subset than list column subset

sbudai · 3Comments

SHM size exceeded Error

tcederquist · 3Comments

GForce should be able to work with `:=` as well.

arunsrinivasan · 3Comments