It's always been a bit confusing to me that .I is "global" in the sense that it doesn't change with by, while .N is "local" in the sense that it _does_.
I understand (some of) the advantages of this arrangement, but I think there are ample situations for using a local .I (see, e.g. 1, 2, 3, etc.) or a global .N (e.g., 1).
I'm not sure how easy this is to build into the source code, but having .i and .n be "local" while .I and .N are "global" seems like an intuitive alternative. On the other hand, it could be painful to switch the behavior of .N given that it's so ubiquitous in data.table code.
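To make the asymmetry concrete, here's a minimal illustration (toy data, column names just for the example):

```r
library(data.table)
DT = data.table(g = c("a", "a", "b"), v = 1:3)

# .I is "global": inside by, it still gives row numbers in the full table
DT[, .(idx = .I), by = g]   # idx is 1, 2, 3 across the groups

# .N is "local": inside by, it gives the size of the current group
DT[, .(n = .N), by = g]     # n is 2 for group "a", 1 for group "b"
```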
Throwing a hat in the ring for .SD and .sd as well, since I've been tempted a few times to try .SD with the intention of getting the full table within by, specifically here.
I agree. It'd be quite a break from backward compatibility, but that notation would be useful and a lot more intuitive.
Quite a big change. Not sure if the performance gains are big enough to justify .i over the current 1:.N; has anybody measured it? The 2.0.0 release is going to have some breaking changes, so that could be the place to ship such a change.
Like the idea very much, but not sure if it's possible at this stage, as it'd break _a lot_ of code. Where were you when this was first implemented? :P
Marked as FR for now.
In my R swaddling blankets, I suppose ;)
I understand .N->.n is a big push, but I rarely need that.
.i, however, shouldn't break any code and I would use it all the time!
Right. But seq_len(.N) is the .i you're looking for. Is that not okay? I ask because I find the intent quite clear and understandable. .i and .I could get confusing quickly. If it's really necessary, then maybe .seqN? Not sure. I'm always on alert when we have to add more symbols :-).
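For reference, the seq_len(.N) workaround in action (made-up data for illustration):

```r
library(data.table)
DT = data.table(g = c("a", "a", "a", "b"), v = 1:4)

# seq_len(.N) plays the role the proposed .i would: a within-group row index
DT[, .(i = seq_len(.N), v), by = g]   # i restarts at 1 for each group
```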
It feels pretty natural to me, and writing, e.g., var[.i < 5] is certainly more intuitive (and cleaner!) than var[seq_len(.N) < 5] (or even var[.seqN < 5]), but maybe that's just me. FWIW I have more of a math than a programming background, which may be why I'm comfortable compartmentalizing capital vs. lowercase symbols.
I understand (and appreciate!) the aversion to overloading data.table with arcane symbols, but <opinion>I think that anyone who can handle .I and .N can conquer .i quickly, given the tight relationship to .I.</opinion>
Just one more parallel to draw: var[.N] is redundant with var[length(var)], but .len was eschewed in favor of the (I think) clearly superior .N; .end would also have worked but would seem more obtuse in other contexts (e.g., var := .end).
Food for thought! Thanks for the consideration.
Or... .GRPI?
Personally, I think it would be easier for new users if the shortcuts were revamped to not only include this extra one but also to be consistent in some sense, like
- .I & .N / .GRPI & .GRPN, or
- .I & .N / .GI & .GN (also making .G an alias or replacement for .GRP), or
- .DTI & .DTN / .I & .N

Breaking compatibility like this isn't so great, and I'd settle for .i or .GRPI or .GI or even .seqN (though it strikes me as too R-ish) alone. I'd use that shortcut all the time.
I've added an example to the main post of a case where my instinct was to use .n and .N but I needed to use nrow(dt) instead.
Maybe related: it might be nice to have .NGRP for the number of groups. E.g., here it could use the condition if (.GRP != .NGRP) instead of if(.I[.N] < nrow(DT)). http://stackoverflow.com/a/43615843/
This would also be nice for easily tracking progress by throwing a print(.GRP/.NGRP) into j.
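A small sketch of that progress-tracking idea, assuming a data.table version where .NGRP is available (it was added in later releases), with made-up data:

```r
library(data.table)
DT = data.table(g = rep(c("a", "b", "c"), each = 2), v = 1:6)

# .NGRP holds the total number of groups, so .GRP/.NGRP tracks progress through j
DT[, {cat(sprintf("group %d of %d\n", .GRP, .NGRP)); sum(v)}, by = g]
```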
How about the following scheme
Current | New symbol | Meaning
-------- | ------------ | --------
.I (if no groups) | .I | row number in the resulting data.table
.N (if no groups) | .N | number of rows in the resulting data.table (may not always be computable)
? | x.I | row number in DT
? | x.N | number of rows in DT
? | i.I | row number in i data.table when joining
? | i.N | number of rows in i data.table when joining
.SD | .SD | data.table with subset of data within the current group
.I | .SD.I | row number within the current group
.N | .SD.N | number of rows within the current group
.BY | .BY | data.table with all groupby keys, OR current key within the current group
.GRP | .BY.I | group counter
? | .BY.N | number of groups
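For comparison, rough current-syntax equivalents of a few of the proposed symbols (the x.*, .SD.*, and .BY.* names above are proposals, not existing syntax; the data here is made up):

```r
library(data.table)
DT = data.table(g = c("a", "a", "b"), v = 1:3)

DT[, .(
  sd_i = seq_len(.N),  # proposed .SD.I: row number within the current group
  sd_n = .N,           # proposed .SD.N: number of rows in the current group
  x_i  = .I,           # proposed x.I: row number in DT
  x_n  = nrow(DT)      # proposed x.N: number of rows in DT
), by = g]
```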
Not sure when _symbol overload_ kicks in... certainly most seem intuitive (though I admit I don't immediately get .BY.I/.BY.N; why not .GRP.I and .GRP.N?).
And why wouldn't .N always be computable? Unless there's some plan for distributing data.table?
The primary concern remains the introduction of code-breaking behavior.
The idea is that ?.I is always an index within some data.table, where ? explicitly states which one (and an empty ? means the data.table being constructed, which hence has no name yet). Similarly, ?.N always denotes the number of rows in that data.table.
The symbols .BY.I and .BY.N refer to the data.table .BY, which is a currently existing symbol denoting the data.table of all unique group-by keys. By contrast, .GRP currently means the "group counter", so .GRP.I/.GRP.N would require changing the meaning of .GRP.
I was trying to make a suggestion that is least breaking and most logically consistent. As it stands, it only changes the meaning of .I and .N, and only within the group-by context.
.N may not be computable if j expression returns a data.table with unpredictable number of rows. So if you have an expression like DT[, {if(.N>5) .SD else data.table()} ] then it is impossible to know how many rows there will be in the resulting data.table (which is the new meaning of .N) until you actually construct that data.table.
Adding this potentially confusing syntax to this issue (not sure if worth fixing):
```r
testDT = data.table(full1 = LETTERS, full2 = letters)
# here, .N = nrow(testDT)
testDT[seq(1, .N, by = 2L),
       # but here, .N = .5 * nrow(testDT)
       stagger1 := rnorm(.N)]
```