Data.table: Unclear semantics across function call

Created on 2 Jul 2020  ·  15 comments  ·  Source: Rdatatable/data.table

(1) setDT modifies the class of the caller's object too?

> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z) ; z[, b:=3:4] ; z } 
> y <- foo(x) 
> print(class(x))
[1] "data.table" "data.frame"

Note that x's value isn't modified by foo, just its class.

(2) a data.table object is passed as an argument _by reference_??

> x <- data.table(a = 1:2) 
> foo <- function(z) { z[, b:=3:4]  }
> y <- foo(x)
> x[]
   a b
1: 1 3
2: 2 4

If so this is inconsistent with all other R objects, and seems unlikely to be intentional.
Note this is _not_ about the semantics of data.table operators like := , but rather about the semantics of passing a full data.table object as an argument.
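
(For contrast, a minimal sketch of the baseline behaviour in question, not part of the original examples: with a plain data.frame, the modification stays inside the function.)

x <- data.frame(a = 1:2)
foo <- function(z) { z$b <- 3:4 ; z }
y <- foo(x)
names(x)   # still just "a": the caller's object is unchanged
names(y)   # "a" "b": only the returned copy gained the new column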


Tested on R 3.4/Ubuntu 14 and R 3.6.1/Windows 10.
Sample full environment:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1   


All 15 comments

I think it is quite clear after reading the docs. Did you have a chance to look at ?copy?
In fact R does not copy on <- assignment; it just pretends it does, and eventually copies later on, when it thinks it is needed. The problem is that the copy is often not really needed, especially when working with big data that takes more than half of your memory. In such a case a copy is not an option.
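
A minimal sketch of that delayed copying, using base R's tracemem() (the values are illustrative; tracemem needs R built with memory profiling, which the CRAN binaries have):

x <- c(1, 5, 9)
tracemem(x)    # start reporting whenever the memory behind x is duplicated
y <- x         # no copy yet: y and x share the same memory
y[1] <- 0      # tracemem reports a copy here, so x stays untouched
untracemem(x)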

@jangorecki I assume you're talking just about the 2nd half?
Perhaps I'm missing something, but I believe that by the documented R semantics, inside foo() the object z (whether evaluated yet or not) is a _copy_ of the one passed by the caller.
?copy says that the set* functions and := for data.tables operate by reference, but foo in the example isn't either of these.

@OfekShilon , function arguments are promises in R - they are not copied (and not evaluated until forced to) by default. So when your foo function calls setDT on z, you modify the original object which has been passed to foo. The documentation of setDT starts with:

In ‘data.table’ parlance, all ‘set*’ functions change their input
_by reference_. That is, no copy is made at all, other than
temporary working memory, which is as large as one column. The
only other ‘data.table’ operator that modifies input by reference
is ‘:=’. Check out the ‘See Also’ section below for the other
‘set*’ functions ‘data.table’ provides.

Solution: use as.data.table instead of setDT if you do not want to modify z by reference.
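
A quick sketch of that suggestion with the toy data from above:

library(data.table)
x <- data.frame(a = 1:2)
foo <- function(z) { z <- as.data.table(z) ; z[, b := 3:4] ; z[] }
y <- foo(x)
class(x)   # still just "data.frame": the copy made by as.data.table() insulates x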

@tdeenes Let's force the promises to be evaluated and test again:

> x <- data.table(a = 1:2) ; foo <- function(z) { force(z); z[, b:=3:4]  } ; y <- foo(x)
> x[]
   a b
1: 1 3
2: 2 4

Same result.
Something else must be happening, and it makes data.table's behavior surprising in unpleasant ways.

I believe that by the documented R semantics, inside foo() the object z (whether evaluated or not yet) is a _copy_ of the one passed by the caller.

It is not a copy; you can easily verify that using the address function:

library(data.table)
x <- data.frame(a = 1:2)
foo <- function(z) address(z)
address(x)
#[1] "0x55a72500abf0"
foo(x)
#[1] "0x55a72500abf0"

Forcing evaluation doesn't make a difference.

@OfekShilon I submitted a PR to improve the documentation on that matter. You are welcome to provide feedback there as well: #4590

@jangorecki Several topics come into play here:

  1. setDT() indeed operates on its argument by reference.
  2. Function arguments are indeed promises, and are not copied _upon call_.

However:

  3. In R's semantics, function arguments are passed by value. Lazy evaluation and copy-on-modify are implementation details, not semantics.
    For example:
> x <- data.frame(a = 1:2)
> foo <- function(z) {z<-z+1;address(z)}
> address(x)
[1] "000001e964bfdc10"
> foo(x)
[1] "000001e96757c4e0"
  4. Even setDT qualifies as a modification of the argument:
> x <- data.frame(a = 1:2)
> foo <- function(z) {setDT(z);address(z)}
> address(x)
[1] "000001e965af84a8"
> foo(x)
[1] "000001e969219d28"
  5. Reference classes do exist, but data.table isn't one:
> y <- data.table(a=1:2)
> foo <- function(z) { z<-z+1; z }
> foo(y)
   a
1: 2
2: 3
> y
   a
1: 1
2: 2

This is admittedly confusing, but the bottom line is: setDT _on a function argument_ indeed creates a new object, as it should, but the modification of the class leaks to the caller's original object, which it shouldn't.
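
Both halves of that claim can be seen in one sketch (reusing data.table::address from above):

library(data.table)
x <- data.frame(a = 1:2)
foo <- function(z) { setDT(z) ; address(z) }
addr_inside <- foo(x)
identical(addr_inside, address(x))   # FALSE: z was rebound to a new object ...
class(x)                             # ... yet x's class changed to data.table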

Perhaps I should have split this into two issue reports (the discussion zigzags between the two examples), but I suspect the root causes are adjacent.

Point 4's behaviour is caused by how data.frame is implemented in the R C API. I think there is no API to repoint a VECSXP (the internal type behind data.frame and list), so in order to provide no-copy behaviour it was necessary to allocate a new VECSXP. We still don't copy the data; only the pointers to the columns are copied (i.e. newly allocated). So the required functionality, working with bigger data while avoiding data copies, is accomplished. AFAIK this is the most we can do given the R C API. If you have any ideas/suggestions for improvements we would be very happy to hear, evaluate and eventually implement them, although dropping compatibility with data.frame is not an option.
Re 5: data.table is not a reference class, same as data.frame. data.table just provides an API to work with it _by reference_; data.frame doesn't provide such an API. R does a good job with copy-on-modify, but there are use cases where that is not enough. You may not experience the need for it often, but once your data takes more than half of the available memory this feature becomes essential, because you cannot afford even a single copy.
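
A sketch of that pointer-only copy, checking a column's address before and after setDT():

library(data.table)
x <- data.frame(a = c(1, 3))
col_before <- address(x$a)           # address of the column vector itself
setDT(x)
identical(col_before, address(x$a))  # TRUE: the column's data was not copied,
                                     # even though the enclosing list is new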

@jangorecki I'm afraid I'm still failing to explain the gravity of these issues.

I was hoping to gradually migrate to data.table by pinpointing data.frame performance bottlenecks, locally converting them to data.table (without copies!), performing optimized data.table operations, and converting back. We have a ~1M-line R codebase, and this seemed like the only viable option; I suspect we're not alone in this position.

After fiddling with it for a bit, I learnt that I'm breaking _caller_ code in ways that were supposed to be impossible in R:

> x <- data.frame(a = 1:2)
> foo <- function(z) { 
+   setDT(z) 
+   z[, b:=3:4]
+   setDF(z)
+   z 
+   } 
> y <- foo(x) 
> class(x)    # !!!!! x turning into a data.table breaks future operations on x !!!!!
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2
> class(y)   # If setDF, like the earlier setDT, had operated on the caller's copy, maybe this would have worked - but it doesn't.
[1] "data.frame"
> y
  a b
1 1 3
2 2 4

(I realize that in this toy example there is a copy of z at the end of foo).

This is _not_ a documentation issue, not in ?copy and not otherwise. Promises, lazy evaluation and copy-on-modify are _all_ red herrings. Something is amiss here, and if it can be fixed, it is by the maintainers, not the reporters (who sadly have zero knowledge of the implementation internals).

@mattdowle Would you be willing to take a look too?

It's strange that the class of the original object is altered but its value is not. However, what you're doing in this example fundamentally doesn't make sense. The use case of the setDT function is to change the class of a data.frame to data.table without copying it. The purpose of your example function is to return a modified copy of x. Given that your intent is to return a copy, there's no reason to use setDT; you should instead use as.data.table. I'd be curious to see a more realistic example of what you're trying to accomplish and a clearer statement of the performance improvements you expect by switching to data.table.

I'll add that if you're using large datasets, there are potentially greater performance gains to be had if you switch your global objects to class data.table and rewrite your functions with the awareness that data.table arguments are passed by reference unless explicitly copied. It's more work, but it allows you to restrict copying to exactly when you need it and avoid it elsewhere.
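
A sketch of that pattern (hypothetical function name): take the one deliberate copy up front, then update it by reference.

library(data.table)
DT <- data.table(a = 1:2)
add_doubled <- function(dt) {
  out <- copy(dt)      # the single deliberate copy
  out[, b := a * 2]    # update the copy by reference; no further copies
  out[]
}
y <- add_doubled(DT)
names(DT)   # still just "a": the caller's table is unchanged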

@OfekShilon Thanks for your persistence and reproducible example. A few points here. I saw your S.O. comment and question too; https://stackoverflow.com/q/62775221/403310.

For part (1) in the original comment above, regarding setDT() on a data.frame and then using := on that ...

  1. We are partly dealing with changes in R over time, combined with the fact that we do provide update-by-reference capabilities in data.table which is a deliberate break from R semantics. Latest releases of R do not copy as often. [Aside: and if we go back to my S-PLUS days 20 years ago, R was a dream then compared to the 11 copies S-PLUS used to make then, and now R is even better.]
  2. Your example is exclusively adding a new column. Internally in R and data.table, adding a new column is a fundamentally different operation to updating an existing column. data.frame in R is not over-allocated, so it is impossible to add a new column to it by reference unless a future version of R starts to over-allocate data.frame.

To illustrate both these points, consider updating by reference an existing column of a data.frame (not recommended) by using setDT from a function.

x = data.frame(a = 1:2)
foo = function(z) { setDT(z) ; z[2, a:=99] ; z[] }
foo(x)
       a
   <int>
1:     1
2:    99
x
       a
   <int>
1:     1
2:     2

The surprise is that x$a didn't change. It actually followed R behaviour. It used to update by reference in older versions of R, but now it doesn't because 1:2 is an ALTREP sequence. To make this update to this column, data.table converts (expands) the ALTREP to a regular column, hence the apparent copy of that column.
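
One rough way to see the ALTREP distinction for yourself (R >= 3.5.0; .Internal(inspect()) is an unofficial debugging aid, so treat this as a sketch):

x1 <- 1:2         # a compact (ALTREP) integer sequence
x2 <- c(1L, 2L)   # an ordinary, fully materialized integer vector
.Internal(inspect(x1))   # shows a "compact integer sequence" node
.Internal(inspect(x2))   # shows a plain INTSXP holding its own data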

To illustrate without ALTREP, we can create non-sequence example data.

x = data.frame(a = c(1,3))
foo = function(z) { setDT(z) ; z[2, a:=99] ; z[] }
foo(x)
       a
   <num>
1:     1
2:    99
x
       a
   <num>
1:     1
2:    99

This time no copy of that column was made, and indeed an update to an existing data.frame column was made. I'm using R 4.0.1 with data.table 1.12.8.

Note that setDT was added to data.table due to user request in v1.9.2, Feb 2014 (a long time ago). I remember it caused quite some debate at the time, because we knew problems like the one you've pointed out could occur. Updating a data.frame by reference is not really something we ever wanted to allow. But we wanted to solve the problem highlighted then: the copy made by as.data.table() duplicates the entire data.frame, and that could often cause out-of-memory whenever the data.frame approached 50% of RAM. That problem is indeed long solved (by setDT). A common use case of setDT is using data.table query syntax (e.g. grouping) on a data.frame directly. However, implementing setDT also opened up allowing := on a data.frame, which I think we all agree is not desirable: it should only be possible to update a data.table by reference, not a data.frame. At the same time, there is a desire to allow users to do what they ask for, especially more advanced users. Providing a path allowed users to get their work done 6 years ago without out-of-memory, even though it had a cost, as in the cases you've shown.

I was hoping to gradually migrate to data.table by pinpointing data.frame performance bottlenecks, locally converting them to data.table (without copies!), performing optimized data.table operations, and converting back.

If setDT was changed to return a shallow copy, with all columns marked as shallow so that the next := (if any) that updated an existing column (i.e. one shared with the original data.frame) copied that column before the update, would that satisfy everyone? I suspect it would. The only case that springs to mind is 1 million columns: some people do have very, very wide data where even the shallow copy of the column pointers comes into play. There is quite a bit of code that calls low-overhead set() in a loop (or :=), and each of those calls would start to check whether the column it is updating is a shallow copy. For that reason a bool vector attached to the VECSXP might be quicker to fetch (a C array lookup), but fetching that attribute is still an R API call and an attribute lookup. An attribute on each column might be ok, but it seems a bit heavy just to store one bool. R's own reference counting is routinely bumped on data.table columns, and we routinely want to update-by-reference without regard to those bumps, so using R's own reference counts doesn't seem tenable. set() could perhaps use a static variable to know whether it is being passed the same column as the last call, and save the is-shallow lookup. However, if no columns are shallow (a common case, when using data.table from start to finish) then the bool vector need not exist at all; it would only exist when one or more columns are shallow.

Regarding part (2) from the original comment which starts with a data.table not a data.frame ....

a data.table object is passed as an argument by reference??

Yes. But please read the data.table FAQ and search for "reference" and "copy" and read ?copy.

If so this is inconsistent with all other R objects,

Yes

and seems unlikely to be intentional.

Absolutely intentional and referred to in many places. Some of us use data.table like a database where many different functions can update one central big data.table. Those functions can be called by other R processes remotely, too. I'll often have 10 or 20 tables in the global environment and change values in them by referring to the tables by name directly from within the function without passing them in as an argument at all. You can't do that in R because it's a side effect. We embrace side-effects in data.table much like a SQL database's UPDATE statement has the side-effect of updating a table for the very purpose of sharing that update with other functions and other processes.
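
A sketch of that database-style pattern (hypothetical table and function names):

library(data.table)
prices <- data.table(id = 1:3, px = c(10, 20, 30))
mark_stale <- function() {
  # refer to the global table by name; := updates it in place,
  # a deliberate side effect, much like a SQL UPDATE
  prices[id == 2L, px := NA_real_]
  invisible(NULL)
}
mark_stale()
prices   # row 2's px is now NA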

Note this is not about the semantics of data.table operators like := , but rather about the semantics of passing a full data.table object as an argument.

I don't follow what you mean there, because how can it not be about := when that example is all about :=? Merely passing an argument to a function in R is never a copy. R is copy-on-write. data.table's := and set* do not copy-on-write.

This issue is also discussed on SO: R data.table weird value/reference semantics

@mattdowle Thank you very much for taking the time for such a detailed and informative reply!

I don't follow what you mean there, because how can it not be about := when that example is all about :=? Merely passing an argument to a function in R is never a copy. R is copy-on-write. data.table's := and set* do not copy-on-write.

Let me try and clarify. By 'reference semantics' I mean: changes to the argument within a function apply to the argument at the caller's scope. R is said to support "only value semantics" (data.table notwithstanding). In some languages reference semantics is considered an attribute of a _type_; in others it is an attribute of a specific function argument. Let's use the latter, more general, approach.
data.table's := accepts its 1st argument by reference. So does data.table::setnames. But does foo? As in: foo = function(z) { setDT(z) ; z[2, a:=99] ; z[] }?

The answer wasn't entirely clear, certainly not to me but also not to others. AFAIK (after the SO discussion) it is mentioned in passing in only a single vignette. This was, in a way, the crux of this issue report.
From the discussion here I get the impression that the real answer is, let's say, pragmatic. E.g., setDT operates on its argument by reference, and if by a happy coincidence its argument is a not-yet-modified-and-therefore-not-copied reference to a caller's data.frame, we (mostly) get the beneficial side effect of modifying this data.frame by reference. In some scenarios this side effect breaks, and anyway it's definitely not documented (and even heavily dependent on the R version).

I respect all this entirely. What I'm still hoping might be accomplished is to have either value or reference semantics (depending on the exact operation, R version and what have you), _but not a mix of the two_. Recall that this is the state of affairs today:

> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z) ; z[, b:=3:4] ; z } 
> y <- foo(x)
> 
> class(x)
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2

With the help of another SO commenter, I learnt that some time ago you wrote on this very issue:

The idea so far is to use setDT to convert to data.tables before providing it to a function. But I'd like that these cases be resolved (will take a look).

Following the investigation by GKi, it seems exchanging two lines in the R implementation of setDT might solve this:

    setattr(x, "class", .resetclass(x, "data.frame"))
    setalloccol(x)

(the former modifies x by reference; the latter causes the copy of x).
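
That is, the suggestion as I read it is simply to swap the two lines, so that the copy made by setalloccol happens before the class is set, keeping the class change off the caller's object (a sketch of the idea, not a tested patch):

    setalloccol(x)
    setattr(x, "class", .resetclass(x, "data.frame"))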

Perhaps this is the right course of action?

Moving the discussion to 2 separate threads
