Consider the following example:
l = list(a = 1:10, b = 10:1)
tracemem(l$a)
# "<0x10cd266e8>"
df = data.frame(a = l$a, b = l$b)
# no copy
setDT(df)
# no copy
dt = data.table(a = l$a, b = l$b)
# tracemem[0x10cd266e8 -> 0x103812a88]: data.table
# copy!
I don't think this copy is expected.
Notice that after your setDT call, running df[5, a := 999] will also change l.
I find that behavior unexpected, so my issue with your example is the first expression, not the second: IMO a copy is missing in the setDT/data.frame path.
Yeah, R core may have taken the R 3.1.0 "changes to reduce copying of objects" further than they wanted by giving data.frame() that behavior.
It's not a problem for data.frame, since base R copies on assignment.
The behaviour of data.table() is as expected. It always returns a copy. That's the reason setDT() was implemented.
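A minimal sketch of that point, assuming a recent data.table release. The small list `l` is an illustrative stand-in for the one in the original post:

```r
library(data.table)

# Illustrative input (stand-in for the list in the original post).
l <- list(a = c(1L, 2L, 3L, 4L, 5L), b = c(5L, 4L, 3L, 2L, 1L))

# data.table() copies its inputs, so dt owns its columns.
dt <- data.table(a = l$a, b = l$b)

# Sub-assign by reference: only dt's own copy of column a is modified.
dt[5, a := 999L]

print(l$a)   # the source list is left untouched
print(dt$a)
```

The price of this isolation is the extra memory for the copied columns, which is exactly what setDT() is meant to avoid.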
@arunsrinivasan I agree that the data.table behavior is expected, but I don't think the setDT behavior here is expected (at least not without deeper understanding of R and data.table internals, which I think is an unreasonable expectation to set on users).
Imo, in this particular example, setDT should actually copy the two columns.
As I wrote the above, it became less clear to me. It feels wrong to modify l when you modify df in the OP, but so does copying on setDT, so I don't know if there is a clear resolution to this conflict.
@eantonya, I don't agree that setDT should perform a deep copy (data.table() does make a deep copy: I can see RAM usage rise when I call gc()).
@arunsrinivasan, there should be some mechanism to defer the copy until the underlying data is modified.
@dselivanov can't do that with data.table's reference semantics since that'd modify the original object as well. This comes back to exporting shallow() which when used will behave like how base R does things now. But IIUC R will get better at reference counting in the near future, and it makes sense to work on shallow() after that.
OK. Can I rely on setDT for now? (I work with very large objects, and copying them can easily eat all the RAM.)
You can always rely on setDT() -- what did you think was the issue? All set* functions modify the input object by reference. But in this case, of course, your list will also be modified if the result is _sub-assigned by reference_, and that shouldn't come as a surprise if you're familiar with reference semantics.
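When the source list has to stay untouched, one possible workaround is an explicit `copy()` before `setDT()`. This is only a sketch: it pays the memory cost of a deep copy, which may not be acceptable for very large objects, and the small `l` here is illustrative rather than taken from the thread.

```r
library(data.table)

# Illustrative input (not the data from the original post).
l <- list(a = c(1L, 2L, 3L, 4L, 5L), b = c(5L, 4L, 3L, 2L, 1L))
df <- data.frame(a = l$a, b = l$b)

# copy() deep-copies df, so setDT() converts the copy in place
# and dt shares no columns with df or l.
dt <- setDT(copy(df))

# Sub-assignment by reference now touches only dt.
dt[5, a := 999L]

print(l$a)   # unchanged
print(df$a)  # unchanged; df is still a plain data.frame
print(dt$a)
```

Without the `copy()`, `setDT(df)` converts `df` itself and may leave its columns shared with `l`, which is the situation discussed above.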
I asked because the comments from @eantonya also make a lot of sense. I just wanted to confirm that the behaviour of setDT will remain the same in the future.