Data.table: := does not update by reference existing column if i is missing

Created on 24 Sep 2019 · 6 Comments · Source: Rdatatable/data.table

I found a behavior I was not expecting. When := is used to update an existing column without setting i, the column ends up at a new memory address, as if it had been copied. I have never seen that documented. In all the documentation (examples and vignettes), := is always used either to add a new column or to update using i.

If this behavior is expected, I suggest adding this case to the examples and the vignettes.

library(data.table)
DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7)
DT[, d := 9L]
address(DT$d)
#> [1] "0x56402b48d7b8"
DT[2, d := -8L]
address(DT$d)
#> [1] "0x56402b48d7b8"
DT[2, d := 10L]
address(DT$d)
#> [1] "0x56402b48d7b8"
DT[1:.N, d := d*2L]
address(DT$d)
#> [1] "0x56402b48d7b8"
DT[, d := d*2L] # I was not expecting that one
address(DT$d)
#> [1] "0x56402a0cae98"

Created on 2019-09-24 by the reprex package (v0.3.0)

documentation

Jean-Romain

All 6 comments

What's more, the verbose output explicitly says "no copy":

library(data.table)
DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7)
DT[, d := 9L]
addr <- address(DT$d)
DT[2, d := -8L]
identical(addr, address(DT$d))
# [1] TRUE
DT[2, d := 10L]
identical(addr, address(DT$d))
# [1] TRUE
DT[1:.N, d := d*2L]
identical(addr, address(DT$d))
# [1] TRUE
DT[, d := d*2L, verbose = TRUE]
# Detected that j uses these columns: d 
# Assigning to all 4 rows
# RHS_list_of_columns == false
# Direct plonk of unnamed RHS, no copy. NAMED==1, MAYBE_SHARED==0
identical(addr, address(DT$d))
# [1] FALSE

MichaelChirico on 25 Sep 2019

👍1

The column is not copied -- it's a new column, and instead of copying it to the old address, the pointer is updated.

library(data.table)
x = 1:2
address(x)
x[1] = 3L
address(x) # same
y = x*2
address(y)
x = y
address(x) # now it's pointing to y

However, one could force an overwrite, by copying the vector into the space occupied by the old x vector:

y = x*2
address(x)
x[] = y
address(x) # unchanged

But this is less desirable than the cheaper step of pointing the symbol x at the new vector.

So, I think the plonk behavior and verbose message are correct, though maybe could be clearer.
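For readers who need the column's address to stay fixed (for example because another variable still refers to the same vector), the observations in this thread suggest supplying i explicitly. The following is only a minimal sketch, assuming the behavior reported in this issue holds for the installed data.table version; `x`, `before`, and the result variables are illustrative names, not part of data.table.

```r
library(data.table)

DT = data.table(a = LETTERS[c(3L, 1:3)], b = 4:7, d = 1:4)
x = DT$d                       # x refers to the same vector as column d
before = address(DT$d)

# Explicit i: values are written into the existing vector (sub-assignment)
DT[1:.N, d := d * 2L]
same_after_i = identical(before, address(DT$d))
x_follows = identical(x, DT$d) # x sees the update because it shares memory

# Missing i with a full-length RHS: the new vector replaces the old one (plonk)
DT[, d := d * 2L]
same_after_plonk = identical(before, address(DT$d))

c(same_after_i, x_follows, same_after_plonk)
```

If the behavior matches what is reported above, the first two values should be TRUE and the last FALSE, and x keeps the values from before the plonk.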

franknarf1 on 25 Sep 2019

👍2

@franknarf1 I understand why it works like that. This is indeed a new column and a pointer update. However, I don't think it corresponds to the expected behavior.

When d does not exist, a memory allocation is required, which makes sense. But once the memory is allocated, it should not be reallocated. In fact, the behavior is inconsistent. Try this:

library(data.table)
DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7)

# New columns d
DT[, d := 1:4]
x = DT$d
address(DT$d)
#> [1] "0x56192961a258"
address(x)
#> [1] "0x56192961a258"

# Update d[1] in place should update x (yes)
DT[1, d := 0]
x 
#> [1] 0 2 3 4
address(DT$d)
#> [1] "0x56192961a258"
address(x)
#> [1] "0x56192961a258"

# Update d[1:4] in place should update x (yes)
DT[, d := 1]
x
#> [1] 1 1 1 1
address(DT$d)
#> [1] "0x56192961a258"
address(x)
#> [1] "0x56192961a258"

# Update d[1:4] in place should update x (no)
DT[, d := 2*d]
x
#> [1] 1 1 1 1
address(DT$d)
#> [1] "0x56192902da58"
address(x)
#> [1] "0x56192961a258"

More generally, when an existing column is assigned more than one value without i, it is not modified in place. So DT[, d := 1L] updates d in place, but DT[, d := 2L*d] does not, whereas DT[1:4, d := 2L*d] does modify d in place.

library(data.table)
DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7, d = 1:4)
x = DT$d
address(DT$d)
#> [1] "0x55d3fbf8ffc8"
address(x)
#> [1] "0x55d3fbf8ffc8"

# Update d[1:4] in place should update x (yes)
DT[, d := 1L]
x
#> [1] 1 1 1 1
address(DT$d)
#> [1] "0x55d3fbf8ffc8"
address(x)
#> [1] "0x55d3fbf8ffc8"

# Update d[1:4] in place should update x (no)
DT[, d := 2L*d]
x
#> [1] 1 1 1 1
address(DT$d)
#> [1] "0x55d3fc8e5ef8"
address(x)
#> [1] "0x55d3fbf8ffc8"

Created on 2019-09-25 by the reprex package (v0.3.0)

Jean-Romain on 25 Sep 2019

👍1

But it does the fastest thing in all cases, doesn't it? If you, the user, have already created a whole column that could be plonked (*), then it should be plonked because that's faster, no? It would take longer to copy the values into the current column, especially if that column is a character column.
A column plonk is also the way to change a column's type: you provide a whole column to be plonked, and the column takes on the new type. Otherwise, copying each value into the current column would have to deal with coercion.
(*) The RHS is what you, the user, create first, before := decides what to do with that RHS.
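The type-change point can be illustrated with a small sketch. It assumes, as described in ?':=' and in this thread, that sub-assignment with i coerces the RHS to the column's type while a whole-column RHS without i replaces the column; the variable names are illustrative only.

```r
library(data.table)

DT = data.table(d = 1:4)       # d starts as an integer column

# Sub-assignment with i: the RHS is coerced to the column's type,
# so d stays integer (a truncation warning is expected here)
suppressWarnings(DT[1:2, d := c(1.5, 2.5)])
type_after_subassign = typeof(DT$d)

# Whole-column RHS without i: the new vector replaces the column,
# so the column takes the type of the RHS
DT[, d := d * 1.5]
type_after_plonk = typeof(DT$d)

c(type_after_subassign, type_after_plonk)
```

Under these assumptions the column is still integer after the sub-assignment and becomes double only after the whole-column assignment.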

mattdowle on 8 Oct 2019

Well, I'm not saying it is not the best behavior. I'm just saying that if it is the expected behavior, I think it deserves a small addition to the documentation. If you think about it long enough it makes sense, and I eventually arrived at the same explanation for this behavior. But I'm not sure it is obvious to everybody.

When I reported this issue, I believed it was the cause of my "not working" function, but I had misunderstood my own problem. I have since found another issue, which I haven't reported yet, that was the real cause. I'll report it soon.

Jean-Romain on 8 Oct 2019

OK, I see. There is the following paragraph in ?':='. Can you suggest some specific better wording, and where to put it, please?

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it is clearer to readers of your code that you really do intend to change the column type.

mattdowle on 8 Oct 2019
