Data.table: Inconsistent semantics after setDT

Created on 27 Oct 2020 · 15Comments · Source: Rdatatable/data.table

(This is a cleanup and improvement of some of the #4589 discussion.)
Take this code:

> d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
> d2 <- d1
> setDT(d2)  # At this point d2 is a shallow copy of d1, pointing to the same columns

Do modifications to d2 impact d1? We could live with both 'yes' or 'no', but the answer is _sometimes_:

d2[, b:=3:7]         # (1) impacts only d2
d2[, c:=4:8]         # (2) impacts only d2
d2[!is.na(a), b:=5:9] # (3) impacts both
d2[, b:=30]          # (4) impacts both

In cases 1&2 d2 'plunks' the full columns into itself and d1 isn't affected. In cases 3 & 4 it seems that operation-in-place optimization kicks in (address(d2$b) is unchanged), so there is no copy-on-write and data still pointed to by d1 is overwritten.

These semantic discrepancies make (the otherwise great) setDT unusable to us except in the most trivial scripts.

# Output of sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default

> packageVersion("data.table")
[1] ‘1.13.0’

Source

OfekShilon

Most helpful comment

@OfekShilon
I think it is a bit harsh to describe this behaviour as a "fundamental bug". It is absolutely clearly stated in the manual of setDT() that setDT() modifies its input by reference. So what do you want to achieve in your example? If you want to modify d1 by reference, why do you need d2? If you want to modify d2 without modifying d1, you have to follow the advice above and use copy or as.data.table. It seems you want to rely on some consequences of (undocumented and unexported) implementation details. Not a good idea. If your codebase contains such setDT calls, you have to fix them instead of offending the authors of data.table.

Note also that data.table:::shallow() is not (yet) exported (see also #2323), probably not by accident.

Note 2: Maybe you shall reformulate your issue as a feature request instead of a "bug report".

tdeenes on 29 Oct 2020

👍2

All 15 comments

One way out of this would be to disable the modify-in-place optimization. To us the little extra memory consumption is certainly worth the gain in consistency.
I think it's in everyone's best interest to disable it always, but if not - at least add an option to do it.

OfekShilon on 28 Oct 2020

👎1

as.data.table is guaranteed to copy, and of course there's copy. If you are trying to get around modify-in-place, either of these are fine:

d2 = copy(d1)
d2 = as.data.table(d1)

MichaelChirico on 29 Oct 2020

@MichaelChirico Thank you. I'm well aware of the alternatives - but I'm not doing exploratory interactive work. I work in a team of R developers with a large R codebase riddled with data.tables and setDT, and still hope this fundamental bug can be solved and not worked around at the user side.

OfekShilon on 29 Oct 2020

Note also that data.table:::shallow() is not (yet) exported (see also #2323), probably not by accident.

Note 2: Maybe you shall reformulate your issue as a feature request instead of a "bug report".

tdeenes on 29 Oct 2020

👍2

@tdeenes I know how to work around this behaviour, with the advice above and in other ways. This does not make this reported behaviour not a bug. Is the expectation for consistency really a 'feature request'?
I'm afraid the usage of 'by reference' in the documentation and discussions on data.table is often ambiguous, and careful distinction is needed. It might mean two different things:
(1) A data table object (pointing to multiple column data) is not copy-on-write. Therefore, when multiple names are bound to the same data.table - every modification to the data.table via one name manifests in all the names.
(2) The column data is _sometimes_ modified in place (cases 3&4 in the original bug), perhaps you mean this too when you write 'by reference'. When no data.frames are involved - this is not a problem.

Both these behaviours seem reasonable design choices - and usually are. Specifically, when the data.table was created either by data.table(***) or as.data.table. Where things go awry is when a data.table is generated by setDT.
The documentation for setDT indeed absolutely clearly states that it operates by reference, but this report is about _the following operations on the resulting dt_ , and whether _these_ are by reference or not.

dt<-df; setDT(dt) performs a shallow copy, so now dt and df are distinct objects pointing to the same columns. What does 'by reference' even mean in this context? Would you expect modifications to dt or its columns (these are different things!) to manifest in df?
Whatever you choose 'yes' or 'no', sometimes the current implementation would adhere to your expectation and sometimes not. This _cannot_ be accepted as 'by design'.

Not sure what made you say that I want to rely on undocumented and unexported implementation details. I don't. My examples were entirely public and very basic (even fundamental) data.table interfaces, and the results are inconsistent.

OfekShilon on 29 Oct 2020

dt<-df; setDT(dt) performs a shallow copy, so now dt and df are distinct objects pointing to the same columns. What does 'by reference' even mean in this context? Would you expect modifications to dt or its columns (these are different things!) to manifest in df?

This is the crucial point I guess; my interpretation of the documentation is that you shall not ask these questions when using setDT on data.frames (or lists). Just accept that you can not know which modifications of dt or its columns will manifest in df. I really can not imagine a legitimate use case where one wants to use a data.frame for breaking the standard copy-on-write semantics of R. So you have only two options:

You want to keep df as it is -> you have to protect it before calling setDT on it or on any objects which keeps a reference to it.
You do not bother what happens to df, you need a data.table -> you call setDT on it and continue with the return value of setDT.

setDT, :=, and other set* functions give you great power which comes with great responsibility. So when you use them, as you did in the following operations on the resulting dt, you have to be sure that you had not left behind any list or data.frame objects which you want to re-use later and whose elements are shared with dt.

If you think your use case does not fit into 1) or 2), please provide us with a minimal reproducible example. Note that your current example belongs to 2) unless you wanted to keep df as it is; in the latter case your example is an example for the misuse of setDT and :=.

tdeenes on 29 Oct 2020

Sorry, I failed to follow the link which points to the original issue (#4589) in which @mattdowle and others gave you a pretty exhaustive explanation for the proper use of setDT. Upon a quick check of your comments there and in this issue, it seems you want to get data.table-features but keep your object as a data.frame. I still do not get the idea behind this, but I definitely find it inappropriate to refer to the current behaviour of setDT as a "fundamental bug" after the first author of the package gave you a clear description of why this is not a bug and a major contributor of the package improved the documentation based on your original issue.

tdeenes on 30 Oct 2020

The code example is of course simplified, but lots of very real use cases exist. A prominent one is using setDT inside a function - in that discussion a data.table maintainer (Arun) expressed the will to have such cases resolved.

OfekShilon on 6 Nov 2020

I think I just hit a very related point, with a similar use case here? https://github.com/Rdatatable/data.table/issues/4816#issuecomment-731569939

MatthieuStigler on 22 Nov 2020

We keep getting bit by this. Perhaps the original example (d1<-data.frame(a=1); d2 <- d1; setDT(d1)) seemed contrived? It is just an extra simplification of real life, where the issue often surfaces when trying to operate on function arguments:

> df <- data.frame(a=1:2)
> f <- function(x) {
+  setDT(x)
+  x[ , b := 5:6]   # doesn't leak to df
+  x[!is.na(a), a:=3:4]  # leaks to df :(
+  setDF(x)
+ }
>   
> f(df)
> df
   a
1: 3
2: 4

OfekShilon on 14 Jan 2021

I think the solution to this might just be to warn users (via ?setDT) with something along the lines of: "use of setDT inside of function definitions, especially on objects that were passed as arguments, may cause unpredictable modification of objects outside the scope of the function. When writing a function, we recommend either A) using as.data.table (not setDT) inside functions. This guarantees a side-effect free ("pure") function or B) Write functions that expect (and are documented as expecting) data.tables as inputs. This allows creating "functions" which are pass-by-reference (ie, they avoid copying) but behave like procedures (in that there may be side-effects). "

As a bit of an aside, I'll add that I have some experience with a third approach, in the package intervalaverage, where data.tables are explicitly required as inputs (and thus are passed by reference) but the function is carefully written to restore any changes on.exit(). This results in a pure function that benefits from pass-by-reference speed without any side-effects (which are unpredictable to the typical R user who expects functions to behave like pure functions). The approach used in intervalaverage (pure functions using pass-by-reference under the hood), only makes sense if you want to return an entirely new table. If you want to modify the original table, approach B is "best" (although potentially unfamiliar to R users).

related post I made nearly a decade ago on SO: https://stackoverflow.com/questions/13756178/writings-functions-procedures-for-data-table-objects

myoung3 on 28 Jan 2021

@myoung3 this is pretty much what we try to do now. However in an enterprise-size codebase like ours (~600K R LOC, in ~12 large in-house packages) if you can't transition to data.table gradually or use it eliminate specific bottlenecks in a pipeline - it's very hard to use it at all. We tried various techniques and conventions, but this data.table inconsistent behavior is a major, major pain for us (and I suspect for others).
Also, warnings as you suggest don't solve other similar situations.

d1 <- data.frame(a=c(1,2,3,4,5), b=c(2,3,4,5,6))
d2 <- d1
setDT(d2)  
d2[, b:=3:7]         # impacts only d2
d2[!is.na(a), b:=5:9] # leaks to d1

One technical solution might be for setDT to mark the input columns as 'setDT generated', and then use this marking to bypass the memrecycle call in the C function assign.

OfekShilon on 28 Jan 2021

👍1

@OfekShilon with an R footprint that large, it would certainly make sense for some engineering effort to be "donated" to support us in fixing setDT. We accept PRs.

MichaelChirico on 29 Jan 2021

@MichaelChirico I can try - but do you guys now agree that this is a problem to be solved? Seems most of this thread doesn't.

OfekShilon on 29 Jan 2021

I mean, will you merge such a PR?

OfekShilon on 29 Jan 2021

Was this page helpful?

0 / 5 - 0 ratings