Data.table: Reference original table when specifying .SDcols

Created on 1 Aug 2016 · 12 comments · Source: Rdatatable/data.table

This arose on SO recently in connection with dplyr but I was wondering about it with reference to data.table.

This data.table code works to operate only on the numeric columns:

library(data.table)
iris.dt <- as.data.table(iris)
iris.dt[, .SD / rowSums(.SD), .SDcols = sapply(iris.dt, is.numeric)]

but it requires that we know the name of the table in order to specify it in the .SDcols portion; however, often we don't know this when writing cascades like DT[...][...]. I would have liked to write:

iris.dt[, .SD / rowSums(.SD), .SDcols = sapply(.SD, is.numeric)]

where the last .SD (the one in the third argument) refers to the entire table and the others are modified by .SDcols. Is there a good way to do this or is there something already available for this? If not, I suggest adding the possibility of referencing .SD in the .SDcols argument.

Also, what I really want is to perform .SD / rowMeans(.SD) on the numeric columns without dropping the non-numeric columns, and neither of these "solutions" does that.

This one does preserve the non-numeric columns but it seems ugly and verbose:

iris.dt[, { is.num <- sapply(.SD, is.numeric)
            SDnum <- .SD[, is.num, with = FALSE]
            replace(.SD, is.num, SDnum / rowMeans(SDnum))
          } ]

This also works but does not seem very data.table like:

iris.dt <- as.data.table(iris)
nums <- which(sapply(iris.dt, is.numeric))
iris.num <- iris.dt[, nums, with = FALSE]
iris.dt[, nums] <- iris.num / rowMeans(iris.num)
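Until something like .SD in .SDcols exists, one way to avoid needing the table's name in the middle of a chain is to pass the intermediate result to a function, where it has a name again. The following is only a minimal sketch: `scale_by_row_mean` is a hypothetical helper, not part of data.table, and it assumes the numeric columns are of type double (as in iris), since assigning doubles into integer columns with := would not behave the same way.

```r
library(data.table)

# Hypothetical helper (not a data.table function): divide every numeric
# column by the row mean of the numeric columns, keeping all other columns.
scale_by_row_mean <- function(dt) {
  num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]
  if (!length(num_cols)) return(dt)
  out <- copy(dt)  # copy so the caller's table is not modified by reference
  out[, (num_cols) := .SD / rowMeans(.SD), .SDcols = num_cols]
  out[]
}

iris.dt <- as.data.table(iris)

# The argument is the result of an earlier link in a chain, so it has no name
# of its own at the call site.
res <- scale_by_row_mean(iris.dt[Sepal.Length > 5])
```

The copy() keeps the helper free of side effects; it could be dropped if modifying the input by reference is acceptable.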

enhancement


ggrothendieck

All 12 comments

A bit hacky, but:

iris.dt[ , lapply(.SD, function(x) if (is.numeric(x)) x)][ , .SD / rowSums(.SD)]

Just a matter of splitting the .SDcols and the operation into two steps -- if we're already chaining, then shouldn't be a big deal.

So I guess I'd call this a convenience shorthand more than a missing functionality... not sure how hard it would be to get .SD to work in .SDcols (I think there are some issues about trying to get .N/.I to work in by, e.g.).

MichaelChirico on 1 Aug 2016

I hear you. I have a feeling there's a cleaner way with with = FALSE but it isn't coming to me

MichaelChirico on 2 Aug 2016

I have revised the first post in light of your comments. The question is now slightly different.

ggrothendieck on 2 Aug 2016

So I believe this is another case where #795 would help immensely to simplify things.

MichaelChirico on 2 Aug 2016

Yes! That would be very useful here, but note that it would still need .SDcols to be able to reference the .SD we started out with, which is the subject of this issue. That is, with the feature in #795 plus the ability to use .SD in .SDcols, we could write this:

iris.dt[, names(.SD) := .SD / rowMeans(.SD), .SDcols = sapply(.SD, is.numeric)]

ggrothendieck on 2 Aug 2016

Thinking about this some more, and as an aside: the following works now and is more streamlined than the previous solutions I showed, though it is still not as streamlined as the solution using names(.SD) a la #795 together with the .SD in .SDcols feature.

iris.dt <- as.data.table(iris)
nums <- which(sapply(iris.dt, is.numeric))
if (length(nums)) iris.dt[, eval(nums) := .SD/rowMeans(.SD), .SDcols = nums]

ggrothendieck on 2 Aug 2016


Cool approach... IINM this doesn't allow for an on-the-fly solution either (since we still need to know the name iris.dt, which is unavailable at the fourth or fifth link of a [][] chain).

PS I think (nums) is fine (don't need eval(nums)).
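A minimal sketch comparing the two spellings of the left-hand side on the iris example; the object names `a` and `b` are just for illustration. Both forms make := evaluate the expression to get the target columns rather than treating it as a new column name.

```r
library(data.table)

# Positions of the numeric columns (a named integer vector)
nums <- which(sapply(iris, is.numeric))

# Left-hand side written with eval(), as in the snippet above
a <- as.data.table(iris)
a[, eval(nums) := .SD / rowMeans(.SD), .SDcols = nums]

# Left-hand side wrapped in parentheses instead
b <- as.data.table(iris)
b[, (nums) := .SD / rowMeans(.SD), .SDcols = nums]

# Compare the two results
all.equal(a, b)
```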

MichaelChirico on 2 Aug 2016

Yes, we still need the .SD in .SDcols feature to optimally write this.

ggrothendieck on 2 Aug 2016

We have patterns now for .SDcols which is nice.

What about .SDcols = Filter(bool_function)? We would capture the RHS there and translate it to sapply(x, bool_function), where x is the argument name for the data.table within [.

The sapply is only there to get output in the usual format for .SDcols; the name Filter is meant to evoke the base function.

It's also possible to just do Filter(is.numeric, .SD) in j (e.g.) but usages of .SDcols cited here are more general than that.
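A minimal sketch of the Filter-in-j variant on the iris example, used at the second link of a chain so that no table name is needed. The row filter on Sepal.Length is only there to stand in for an earlier step. Like the first snippets in this issue, it returns only the numeric columns.

```r
library(data.table)

iris.dt <- as.data.table(iris)

res <- iris.dt[Sepal.Length > 5][, {
  # Keep only the columns of .SD for which is.numeric() is TRUE
  num <- Filter(is.numeric, .SD)
  # Divide each numeric column by the row sum of the numeric columns
  num / rowSums(num)
}]
```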

MichaelChirico on 8 Sep 2019


In recent devel it is also possible to simplify that syntax:

iris.dt[, .SD / rowSums(.SD), .SDcols = is.numeric]

just by providing a function to .SDcols.
This of course does not solve the original issue.
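For predicate-style selection, a function in .SDcols does at least work in the middle of a chain, because the function is applied to the columns of whatever table that link receives. A minimal sketch, assuming a data.table version that accepts a function in .SDcols; the row filter is only a stand-in for an earlier step.

```r
library(data.table)

iris.dt <- as.data.table(iris)

# The second link never refers to iris.dt by name
res <- iris.dt[Sepal.Length > 5][, .SD / rowSums(.SD), .SDcols = is.numeric]
```

This still returns only the numeric columns, so keeping the non-numeric ones needs a := form or one of the workarounds above.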

jangorecki on 4 Feb 2020

I would vote to close unless we see another use case not solved by .SDcols=function

MichaelChirico on 30 Apr 2020


Agree, the less NSE massaging in .SDcols the better. If there are use cases not addressed by a function passed to this argument, please post here.

jangorecki on 30 Apr 2020
