Data.table: Unnecessary copy in `data.table()` constructor

Created on 16 Jun 2016 · 11 Comments · Source: Rdatatable/data.table

Consider the following example:

library(data.table)
l = list(a = 1:10, b = 10:1)
tracemem(l$a)
# "<0x10cd266e8>"
df = data.frame(a = l$a, b = l$b)
# no copy
setDT(df)
# no copy
dt = data.table(a = l$a, b = l$b)
# tracemem[0x10cd266e8 -> 0x103812a88]: data.table 
# copy!

I don't think this is expected.

All 11 comments

Notice that after your setDT, if you do df[5, a := 999] that will change l.
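A minimal sketch of that side effect, reusing the values from the original post. Whether the columns really are shared depends on the R and data.table versions in use, so the script prints what it observes instead of assuming a result.

```r
library(data.table)

l  <- list(a = 1:10, b = 10:1)
df <- data.frame(a = l$a, b = l$b)
setDT(df)                        # turns df into a data.table in place

# Compare the memory address of the list element with that of the column
print(address(l$a) == address(df$a))

df[5, a := 999L]                 # sub-assign by reference
print(l$a[5])                    # shows whether l was affected as well
```

If the two addresses match, the sub-assignment writes into the same vector that `l$a` points to.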

I find that behavior unexpected, so looking at your example my issue is with the first expression, not the second one, i.e. imo there is a copy missing in the setDT/data.frame path.

Yeah, R core may have taken the R 3.1.0 "changes to reduce copying of objects" further than they wanted by giving data.frame() that behavior.

It's not a problem for data.frame since it will copy upon assignment.

The behaviour of data.table() is as expected. It always returns a copy. That's the reason setDT() was implemented.
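A small sketch of the difference between the two paths, assuming a plain numeric vector `x` (a made-up example, not from the original post). It only prints address comparisons; the actual outcome may vary with the data.table version.

```r
library(data.table)

x <- c(1.5, 2.5, 3.5)

# Constructor path: per the comment above, the column should be a copy
dt_copy <- data.table(x = x)
print(address(x) == address(dt_copy$x))

# setDT path: the list is converted by reference
cols <- list(x = x)
setDT(cols)
print(address(x) == address(cols$x))
```

`address()` is exported by data.table and returns the memory address of an object as a string, which makes it a convenient alternative to `tracemem()` for this kind of check.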

@arunsrinivasan I agree that the data.table behavior is expected, but I don't think the setDT behavior here is expected (at least not without deeper understanding of R and data.table internals, which I think is an unreasonable expectation to set on users).

Imo, in this particular example, setDT should actually copy the two columns.

As I wrote the above, it became less clear to me. It feels wrong to modify l when you modify df in the OP, but so does copying on setDT, so I don't know if there is a clear resolution to this conflict.

@eantonya, I don't agree that setDT should perform a deep copy (data.table() does make a deep copy; I can see RAM usage rise when I call gc()).
@arunsrinivasan, there should be some mechanism to defer the copy until the underlying data is modified.

@dselivanov can't do that with data.table's reference semantics since that'd modify the original object as well. This comes back to exporting shallow() which when used will behave like how base R does things now. But IIUC R will get better at reference counting in the near future, and it makes sense to work on shallow() after that.

Ok. Can I rely on setDT for now? (I work with very large objects, and copying them can easily eat all the RAM.)

You can always rely on setDT() -- what did you think is the issue? All set* functions modify the input object by reference. But in this case, of course, your list will also get modified if _sub-assigned by reference_, and that shouldn't come as a surprise if you're familiar with reference semantics.
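For cases where the original list must stay untouched, one option (not proposed in this thread, so treat it as an assumption about what fits your use case) is to take an explicit deep copy with data.table's `copy()` before converting. This costs the extra memory that setDT() otherwise avoids.

```r
library(data.table)

l  <- list(a = 1:10, b = 10:1)
df <- data.frame(a = l$a, b = l$b)

df_own <- copy(df)       # deep copy: df_own gets its own column vectors
setDT(df_own)            # convert the copy by reference
df_own[5, a := 999L]     # modifies only the copy

print(l$a[5])            # original list element
print(df_own$a[5])       # modified copy
```

The trade-off is the one discussed above: either the columns are shared and sub-assignment by reference is visible through every alias, or the data is duplicated up front.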

I asked because the comments from @eantonya also make a lot of sense. I just wanted to confirm that the behaviour of setDT will remain the same in the future.
