Data.table: Reference original table when specifying .SDcols

Created on 1 Aug 2016 · 12 Comments · Source: Rdatatable/data.table

This arose on SO recently in connection with dplyr but I was wondering about it with reference to data.table.

This data.table code works to operate only on the numeric columns:

library(data.table)
iris.dt <- as.data.table(iris)
iris.dt[, .SD / rowSums(.SD), .SDcols = sapply(iris.dt, is.numeric)]

but it requires that we know the name of the table in order to specify it in the .SDcols portion; however, often we don't know this when writing cascades like DT[...][...]. I would have liked to write:

iris.dt[, .SD / rowSums(.SD), .SDcols = sapply(.SD, is.numeric)]

where the last .SD (the one in the third argument) refers to the entire table and the others are modified by .SDcols. Is there a good way to do this or is there something already available for this? If not, I suggest adding the possibility of referencing .SD in the .SDcols argument.

Also, what I really want is to perform .SD / rowMeans(.SD) on the numeric columns without dropping the non-numeric columns, and neither of these "solutions" does that.

This one does preserve the non-numeric columns but it seems ugly and verbose:

iris.dt[, { is.num <- sapply(.SD, is.numeric)
            SDnum <- .SD[, is.num, with = FALSE]
            replace(.SD, is.num, SDnum / rowMeans(SDnum))
          } ]

This also works but does not seem very data.table like:

iris.dt <- as.data.table(iris)
nums <- which(sapply(iris.dt, is.numeric))
iris.num <- iris.dt[, nums, with = FALSE]
iris.dt[, nums] <- iris.num / rowMeans(iris.num)

All 12 comments

A bit hacky, but:

iris.dt[ , lapply(.SD, function(x) if (is.numeric(x)) x)][ , .SD / rowSums(.SD)]

Just a matter of splitting the .SDcols and the operation into two steps -- if we're already chaining, then shouldn't be a big deal.

So I guess I'd call this a convenience shorthand more than missing functionality... not sure how hard it would be to get .SD to work in .SDcols (I think there are some issues about trying to get .N/.I to work in by, e.g.).

I hear you. I have a feeling there's a cleaner way using with = FALSE, but it isn't coming to me.

I have revised the first post in light of your comments. The question is now slightly different.

So I believe this is another case where #795 would help immensely to simplify things.

Yes! That would be very useful here, but note that it would still need the feature this issue is about: letting .SDcols reference the .SD we started out with. With the feature in #795 plus the ability to use .SD in .SDcols, we could write this:

iris.dt[, names(.SD) := .SD / rowMeans(.SD), .SDcols = sapply(.SD, is.numeric)]

Thinking about this some more: as an aside, the following works now and is more streamlined than the solutions I showed earlier, though it is still not as streamlined as the names(.SD) approach a la #795 combined with the .SD in .SDcols feature.

iris.dt <- as.data.table(iris)
nums <- which(sapply(iris.dt, is.numeric))
if (length(nums)) iris.dt[, eval(nums) := .SD/rowMeans(.SD), .SDcols = nums]

Cool approach... IINM this doesn't allow for an on-the-fly solution either (since we still need to know the name iris.dt, which is unavailable at the fourth or fifth link of a [][] chain).

PS I think (nums) is fine (don't need eval(nums)).
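For reference, a minimal sketch of the parenthesized form suggested above, assuming the same iris.dt setup as the earlier snippets. It still depends on knowing the table's name when computing nums, so it does not address the chaining case.

```r
library(data.table)

# as.data.table() copies, so the built-in `iris` data frame is left untouched
iris.dt <- as.data.table(iris)

# integer positions of the numeric columns
nums <- which(sapply(iris.dt, is.numeric))

# (nums) on the left of := means "the columns identified by nums",
# not a new column literally called "nums"
if (length(nums)) iris.dt[, (nums) := .SD / rowMeans(.SD), .SDcols = nums]

names(iris.dt)  # the non-numeric Species column is preserved
```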

Yes, we still need the .SD in .SDcols feature to optimally write this.

We have patterns() now for .SDcols, which is nice.
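A small sketch of what that looks like, assuming a data.table release that accepts patterns() in .SDcols. The regular expression "^Sepal" is only an illustration; note that patterns() selects columns by name, not by type, so it is not a drop-in replacement for the is.numeric test.

```r
library(data.table)

iris.dt <- as.data.table(iris)

# select columns by a regular expression on their names;
# no reference to the table's own name is needed inside .SDcols
sepal <- iris.dt[, .SD / rowSums(.SD), .SDcols = patterns("^Sepal")]

names(sepal)
```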

What about .SDcols = Filter(bool_function)? We would capture the RHS there and change it to sapply(x, bool_function), where x is the argument name for the data.table within [.

sapply is used only to get the output in the usual format for .SDcols; Filter is there to evoke the base function...

It's also possible to just do Filter(is.numeric, .SD) in j (e.g.) but usages of .SDcols cited here are more general than that.
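A sketch of that j-only variant, assuming the iris.dt table from earlier in the thread. The local name num is just for illustration. Because everything happens inside j, it needs no reference to the table's name and so works at any link of a chain, but it drops the non-numeric columns.

```r
library(data.table)

iris.dt <- as.data.table(iris)

props <- iris.dt[, {
  # keep only the numeric columns of whatever table arrives at this step
  num <- Filter(is.numeric, .SD)
  num / rowSums(num)
}]

names(props)  # only the four numeric columns remain
```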

In recent devel it is also possible to simplify that syntax:

iris.dt[, .SD / rowSums(.SD), .SDcols = is.numeric]

just by providing a function to .SDcols.
This, of course, does not solve the original issue.
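For the chained use case from the first post, here is a hedged sketch that combines a function in .SDcols with names(.SD) on the left of := (the #795 idea). It assumes a data.table release recent enough to support both; the filter on Species is only there to make the chain longer than one link.

```r
library(data.table)

# the intermediate table produced by the first [...] has no name,
# yet the second link can still pick out its numeric columns
out <- as.data.table(iris)[Species != "setosa"][
  , names(.SD) := .SD / rowMeans(.SD), .SDcols = is.numeric]

names(out)  # Species is kept alongside the rescaled numeric columns
```

Since := updates by reference, the non-numeric columns are carried along unchanged rather than being dropped.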

I would vote to close unless we see another use case not solved by .SDcols = function.

Agree, the less NSE massaging in .SDcols the better. If there are use cases not addressed by a function passed to this argument, please post here.
