Data.table: Default names in j expression when using a function

Created on 15 Jul 2015 · 26 Comments · Source: Rdatatable/data.table

Say we have a data.table DT.
Could we replace DT[,.(myLongVarName=f(myLongVarName))] with DT[,.(f(myLongVarName))]?
Right now a V1 column is created unless the name is repeated, which is quite a waste of space. This would replicate the behaviour of kdb+.
DT[,.(f(myLongVarName,varName2))] would still create a myLongVarName column, and if there is a column name clash we can fall back to the V names.

feature request

statquant
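
To make the request above concrete, here is a minimal sketch of the current behaviour next to the explicit naming it asks to avoid. It assumes data.table is installed; `f` is a hypothetical stand-in, not a function from the thread.

```r
library(data.table)

# Hypothetical stand-in for f(); any function of one column would do.
f <- function(x) x + 1L

DT <- data.table(myLongVarName = 1:3)

# Current behaviour: an unnamed call in j gets the default name V1.
names(DT[, .(f(myLongVarName))])
# "V1"

# What has to be typed today to keep the original column name.
names(DT[, .(myLongVarName = f(myLongVarName))])
# "myLongVarName"
```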

All 26 comments

I think it would break a lot of existing code.
Also, if f() is the only call in j = .(...), you can set the names inside f and return a list, so .() is no longer needed. See the example.

library(data.table)
f <- function(x){
    xsub <- substitute(x)
    res <- x+1L # calculate what is required
    setNames(list(res), deparse(xsub))
}
dt <- data.table(myLongVarName=1:3)
dt[, f(myLongVarName)]
#    myLongVarName
# 1:             2
# 2:             3
# 3:             4

jangorecki on 15 Jul 2015

Not sure why it would "break" anything... can you give examples ?

statquant on 16 Jul 2015

Any code which relies on V1, V2 etc. I have seen things like dt[...]$V1 quite a lot.

jangorecki on 16 Jul 2015

👍2

I'd rather it incorporated function name, e.g. myLongVarName.f. Would come in handy for smth like:

dt[, .(mean(myVar1), sd(myVar1), mean(myVar2), sd(myVar2)), by = smth]

eantonya on 16 Jul 2015
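
The naming scheme suggested in the comment above can be sketched in base R by working on the unevaluated expressions. `auto_name` is a hypothetical helper written for illustration, not part of data.table, and it only handles simple calls such as `mean(myVar1)`.

```r
# Hypothetical helper: build a default name of the form "<variable>.<function>"
# from an unevaluated call such as quote(mean(myVar1)).
auto_name <- function(expr) {
  stopifnot(is.call(expr), is.name(expr[[1L]]))
  fun  <- as.character(expr[[1L]])   # function name, e.g. "mean"
  vars <- all.vars(expr)             # variables used, e.g. "myVar1"
  paste(c(vars, fun), collapse = ".")
}

exprs <- list(quote(mean(myVar1)), quote(sd(myVar1)),
              quote(mean(myVar2)), quote(sd(myVar2)))
vapply(exprs, auto_name, character(1L))
# "myVar1.mean" "myVar1.sd" "myVar2.mean" "myVar2.sd"
```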

A good practice is to name your columns, unless you're working interactively. So I don't think there's any issue in implementing this feature. I'd go with a variation of Eddi's proposal.

arunsrinivasan on 16 Jul 2015

👍1

I think the default behavior should be the name itself. In the vast majority of cases you'll want DT[,.(f(x),g(y),v(z))]; in the case where one calls the same column multiple times, any disambiguation is fine... BTW in kdb the default is:

q)select first time, last time by date from alpha1
date      | time         time1       
----------| -------------------------

I think sticking to industry default is a fair default choice

statquant on 17 Jul 2015

@statquant "Tradition" is never a good argument.

eantonya on 17 Jul 2015

@eantonya ... not sure how that applies to the current thread ...

statquant on 17 Jul 2015

@statquant that's the argument you're pushing for naming - tradition.

To me, it's pretty clear that since you'll want to disambiguate in some situations, e.g. .(fun(x), bar(x)), you'll want to have the naming method be consistent between just .(fun(x)) or .(fun(x), bar(x)) or .(bar(x), fun(x)). And to do that you have to tie in the name of the function always, instead of some random numbering scheme that kdb does.

eantonya on 17 Jul 2015

@eantonya I disagree, I can see many cases (in the ordinary case) where leaving x,y is better, like for a data.frame:::merge based on some DT[,.(f(x),g(y))]: why rename when there is no problem? For _disambiguation_ I think anything would do, using the function name in the column name is fine with me... But just in the ambiguous case, not in (hopefully) the majority of cases. Anyway the bosses will decide...

statquant on 17 Jul 2015

@statquant Understood, and I think that's not a good idea. The values you get from f(x) and g(y) are not in any sense 'x' and 'y', despite whatever naming convention you're used to using, and it makes things extremely messy in mixed use scenarios - I shouldn't have to think/guess what naming convention data.table will use next time I use f(x).

To elaborate a little, kdb's convention is at least internally consistent, even if not very descriptive (and also bad for cases where you switch functions around). It makes sense and is reasonably easy to track names such as "x", "x1", "x2". However if you change "x1" to have function name instead, you can't keep "x" anymore and stay consistent.

eantonya on 17 Jul 2015

👍1

I think data.frame() style column naming would be fine (which integrates both the function and the variable name). I don't like either of the other ideas: variable only or function only (not sure if the latter was proposed).

It could also be extended so that

DT <- data.table(a=rnorm(5),b=rnorm(5)) 
DT[,c(lapply(.SD,min),lapply(.SD,max))]

disambiguates, like "min.a", "min.b", "max.a", "max.b" -- the same as one sees with

DT[,c(min=lapply(.SD,min),max=lapply(.SD,max))]

However, I would like to disable such auto-naming by a global option, since I like the DT[...]$V1 syntax Jan mentioned and foresee no trouble in naming columns manually when I need to.

franknarf1 on 17 Jul 2015
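
For reference, the two base R behaviours mentioned in the comment above can be checked directly. This is only a small illustration of how `data.frame()` and `c()` name their results, using made-up values.

```r
x <- c(1, 2, 3)

# data.frame() deparses unnamed arguments and cleans them with make.names(),
# so both the function and the variable end up in the column name.
names(data.frame(mean(x), sd(x)))
# "mean.x." "sd.x."

# c() on named lists prefixes the inner names with the outer name.
names(c(min = list(a = 1, b = 2), max = list(a = 3, b = 4)))
# "min.a" "min.b" "max.a" "max.b"
```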

@eantonya f(x) can be x in many cases, typically DT[,.(first(x),last(y))] or DT[,.(sum(pnl)),.(month(date))] which makes sense for dates, times, identifiers (so 90%+ of cases in data oriented finance)... As I said I do not really care about x1 or f.x as I expect I won't get caught in misspecifying names like this...

statquant on 17 Jul 2015

@statquant having function identifiers in all the cases you mentioned would only make them clearer.

So far the only argument I can see _for not_ having function identifiers in some cases is: the cost of typing them later on.

The arguments _for_ having them always are: clarity/readability, consistency when changing expression details, i.e. you don't have to guess what the name will be in .(..., sum(x), ...) depending on what else is in that expression.

And here's an example messy expression that would require unnecessary mental exercise with only occasional function identifiers: .(sum(x), sum(y), mean(x)). Interesting to note that kdb convention accidentally works for this (giving the two sums the same identifier), but fails once you rearrange a tiny bit: .(mean(x), sum(x), sum(y))

eantonya on 17 Jul 2015
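
The order dependence described in the comment above can be sketched in base R. `number_duplicates` is a hypothetical emulation of the numbering scheme as the thread describes it (first use keeps the variable name, later uses get a running suffix), not actual kdb+ code.

```r
# Hypothetical emulation: the first use of a variable keeps its name,
# each later use gets a running numeric suffix.
number_duplicates <- function(vars) {
  seen <- ave(seq_along(vars), vars, FUN = seq_along) - 1L
  ifelse(seen == 0L, vars, paste0(vars, seen))
}

number_duplicates(c("x", "y", "x"))   # .(sum(x), sum(y), mean(x))
# "x" "y" "x1"

number_duplicates(c("x", "x", "y"))   # .(mean(x), sum(x), sum(y))
# "x" "x1" "y"
```

In the first ordering both sums happen to get the plain variable name; after the rearrangement sum(x) becomes x1 while sum(y) stays y.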

@eantonya "the cost of typing them later on." => that's precisely what we want to shave off. Else let's simply keep things as they are.
With your suggestion, the time you save here: DT[,.(mean(x))] is lost again when you have to type the following line DT[,.(sum(mean.x))] ... and what about chaining?? After DT[,.(mean(x),...)][,.(std(x))] should I expect mean.std.x as a result??? This makes things even longer to write; if that's the default behaviour, for sure I won't use it and will keep assigning x to f(x)...

statquant on 17 Jul 2015

@statquant you're not seeing the bigger picture. If your only concern was indeed cost of typing, then you'd actually want the existing system - going back to your first example, V1 is much easier to type than myLongVarName.

But that's not your only concern - you're obviously looking for clarity, but your suggestion adds clarity in selected cases only, while making things much less clear in others.

I understand that you're thinking of specific use cases, and perhaps for those that convention would be ok, but yours is not the only use case, and as I demonstrated in several examples above that convention doesn't work in general.

eantonya on 17 Jul 2015

@eantonya I disagree on all assertions, data.table is about speed and concise syntax, that's why it's DT[i,j,by] and not select j by from DT where, and why we have . now where we had list etc...
When you write code you want clarity of course, so no V1324 kind of names, but you also do not want mean.myUserDefinedFunctionName.myVarName... Using function names in the variable names merely gives you a traceback of the functions you've used, which is of little interest anyway. If one does not mind long names then the current behavior is ok.

Well my 2 cents, you contribute, your code, your way...

statquant on 17 Jul 2015

@statquant the answer is very simple - if the traceback is irrelevant to _you_ - rename the variable. Or as @franknarf1 suggested, have a switch for keeping the current convention.

Fwiw, while we were having this argument, I actually went through a few scripts I have, and in almost all of them I'm doing things like dt[, .(myvar.mean = mean(myvar), myvar.min = min(myvar), myothervar.mean = mean(myothervar), ...), by = whatever]. And in places where I do chain, the code would benefit a lot in terms of clarity if I had predictable naming that makes sense.

On the flip side, there are also quite a few places where I just use the "V"'s, because it's not worth the bother and I just want to get to the result. But so far I found very very few cases where I'd name a variable with the same name, ignoring the applied function - it's either smth meaningful that explains what the hell I'm doing, or nothing because I'm rushing.

eantonya on 17 Jul 2015

I must admit I more often use different functions on the same variable than the same function on different variables. I think this is too closely tied to personal preferences/use cases to arbitrarily decide which way would be better.

jangorecki on 17 Jul 2015

@statquant To state the obvious: there are different ways to be concise. Yours gets in the way of mine ($V1), which is plenty clear in the context of a line or two of code. And while chaining is possible with data.table, I rarely use it for more than two steps (maybe because I do not work in finance). Nor have I ever seen a long var name that isn't one among many suitable for being thrown together into .SDcols or some other vector of var names.

Also, I like explicitly writing my (few) var names so that I can search my code for where they originated. (In notepad++, all appearances of a selected word get highlighted, for example.) In contrast, auto-names would obscure where variables originate, dragging us towards the dark ages of Stata, where essentially everything works like

assign(paste0(a,b),x)

No thanks. I'm fine with something like this being implemented if the data.table authors are interested, and hope they are kind enough to include a toggle in that case. However, I'm not sold on your contention that autonaming (including the special case of reusing names) brings us closer to coding nirvana.

franknarf1 on 17 Jul 2015

👍1

That's really strange, people around me also advocate my way :) That must be a finance thing... Curious to know what matt will think of it if he passes by. Anyway if this gets done, hopefully we'll have an option of .data.table.namingConvention=c('default','reuse','extend')

statquant on 18 Jul 2015

fyi, I'm in finance ;)

eantonya on 19 Jul 2015

It seems I'm the only one thinking that this is a good idea so I'm closing this.
I was going to file this FR as it came up again when I was showing data.table to a colleague, then realized it had already been there for quite some time.

Still, here is a real life example comparing the syntaxes of data.table and kdb that demonstrates why I find this FR appealing

set.seed(1L)
rics <- sprintf('%s%s.N', sample(LETTERS), sample(LETTERS))
library(data.table)
day <- function() data.table(position_usd = 1e6*runif(26L), ric = rics, perf = 0.01*runif(26L, min = -1))
DT <- replicate(n = 5L, expr = day(), simplify = FALSE)
DT <- rbindlist(DT, idcol = "date")
DT[, `:=`(turnover_usd = c(NA_real_,diff(position_usd)), pnl_usd = position_usd*perf), ric]
fwrite(DT, file = "DT.csv")

# In R
DT <- fread("DT.csv")
DT[, .(pnl_usd = sum(pnl_usd),
       turnover_usd = sum(abs(turnover_usd), na.rm = TRUE),
       gmv_usd = sum(abs(position_usd), na.rm = TRUE)), date
   ][, .(pnl_usd = sum(pnl_usd),
         turnover_usd = mean(turnover_usd),
         gmv_usd = mean(gmv_usd)), date %% 2][]
#    date   pnl_usd turnover_usd  gmv_usd
# 1:    1  22285.12      5510249 12531305
# 2:    0 -12974.41      8223648 12294753

# That could be (with the proposed default names)
DT[, .(sum(pnl_usd),
       sum(abs(turnover_usd), na.rm = TRUE),
       gmv_usd = sum(abs(position_usd), na.rm = TRUE)), date
   ][, .(sum(pnl_usd), mean(turnover_usd), mean(gmv_usd)), date %% 2][]


// In q
DT: ("IFSFFF"; 1#",") 0:`DT.csv
select sum pnl_usd, avg turnover_usd, avg gmv_usd by date mod 2 from select sum pnl_usd, sum abs turnover_usd, gmv_usd: sum abs position_usd by date from DT                    
\                                 
x| pnl_usd   turnover_usd gmv_usd       
-| -----------------------------------  
0| -12974.41 8223648      1.229475e+07  
1| 22285.12  5510249      1.25313e+07

Note that the by clause of data.table already works exactly as I suggest; I'm not sure why j should behave any differently, or what the reason for this inconsistency is.

statquant on 24 Aug 2019

actually by works just the opposite -- if you do by=.(f(x)) you'll get out f as the auto name :)
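
Both auto-naming cases for by that come up here can be checked with a small example. This is a sketch assuming data.table is installed; the table and its column names are made up for illustration.

```r
library(data.table)

# Made-up data: four consecutive days in July with a numeric day index.
DT <- data.table(date = as.Date("2015-07-15") + 0:3,
                 day  = 1:4,
                 pnl  = c(1, 2, 3, 4))

# A function call in by is auto-named after the function.
names(DT[, .(pnl = sum(pnl)), by = .(month(date))])
# "month" "pnl"

# An operator expression in by is auto-named after the variable.
names(DT[, .(pnl = sum(pnl)), by = day %% 2])
# "day" "pnl"
```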