Data.table: Do modifications by reference have to be returned invisibly?

Created on 31 Mar 2020 · 10 Comments · Source: Rdatatable/data.table

I've learned that the set* family and := do in-place replacement, which saves time and memory. However, does the result have to be returned invisibly? Could we just print the result, to avoid puzzling many new users? (Of course, we can always append another [] to get the result printed, but why not print it directly?) Is there a special reason for the current design?

Thanks.

print question

hope-data-science

All 10 comments

Good question. I agree that reducing the burden would be nice, but I am not sure your suggestion is the proper way to approach the problem, given that the current behaviour has been in place for so many years. I believe the current design is quite intuitive: think of a C function that is not meant to return any value but only to perform a side effect; such a function is declared void. Printing the result of an operation (:=, set) corresponds to returning a value, while suppressing the print corresponds to void.
I recall there was a related idea, something like suppressing printing with an extra special [!] suffix, as in DT[, a := 1L][!].

jangorecki on 31 Mar 2020

I think there should be some feedback after all. As R is an interactive environment, it is sometimes a little weird that we get no response after an action and do not know what happened or whether we got the correct result. Perhaps the invisible return improves efficiency, though?

hope-data-science on 31 Mar 2020

I don’t agree with that. Modify by reference is like calling an assignment operator, invisibly. You don’t expect a print when you assign variables.
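
The analogy with assignment can be checked in base R using withVisible(), which reports whether a value would be auto-printed at the console. This is only a minimal sketch; the variable and table names are illustrative and not taken from the thread.

```r
library(data.table)

# Plain assignment returns its value invisibly, so nothing is printed.
res <- withVisible(x <- 1:3)
res$visible         # FALSE

# Wrapping the assignment in parentheses makes the value visible again.
res_paren <- withVisible((y <- 1:3))
res_paren$visible   # TRUE

# The data.table counterpart of the parentheses is a trailing [].
DT <- data.table(a = 1:3)
DT[, b := a * 2L]   # column added in place
DT[, c := a + b][]  # column added in place, then the table is printed
```

In both cases the user opts in to printing explicitly, rather than the operation printing by default.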

shrektan on 31 Mar 2020

👍5

What about an optional verbose mode? In a normal assignment, an = or <- appears in the code, even in C, unless it is an object-oriented operation that calls a function to modify the data inside. The set* family might be ambiguous unless someone is very familiar with data.table.

hope-data-science on 31 Mar 2020

I'm not sure what you really mean. If you mean that := should generally print the output everywhere, making a mess of the console, that does not align with my experience. Imagine you have a relatively large table: do you really want to print it over and over again?

I don't think the "unless someone is very familiar with data.table" argument stands, either, as this is what the user usually expects; at least I expect it. In addition, modifying objects by reference is itself surprising to a "plain" R user. Anyone who wants to use data.table has to learn something new.

However, if you are referring to the issue that the := operator sometimes suppresses the next printing, that's something we can discuss.


library(data.table)
fun <- function() {
  x <- data.table(A = 1:3)
  x[, B := 3]
  x
}
dt <- fun()
dt # suppressed
dt # has to call it twice
#>    A B
#> 1: 1 3
#> 2: 2 3
#> 3: 3 3

Created on 2020-03-31 by the reprex package (v0.3.0)
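
For the suppressed first print shown in the reprex, the workaround usually suggested (as far as I recall, it is the one described in the data.table FAQ) is to end the function with a trailing [] after the last :=. A sketch under that assumption; fun_fixed and dt_fixed are hypothetical names:

```r
library(data.table)

# Same function as in the reprex, but returning x[] instead of x.
fun_fixed <- function() {
  x <- data.table(A = 1:3)
  x[, B := 3]
  x[]   # trailing [] so the returned table is not affected by the suppression
}

dt_fixed <- fun_fixed()
dt_fixed   # expected to print on the first call
```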

shrektan on 31 Mar 2020

I agree with @shrektan. In addition, I think returning the object visibly may make it even more ambiguous in a way that makes me suspect that the operation is not done in place but returns a new object with that change.
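
To illustrate the difference between an in-place change and a new object, here is a small sketch; alias and snapshot are made-up names. It relies on the documented behaviour that <- does not copy a data.table, while copy() does.

```r
library(data.table)

DT <- data.table(A = 1:3)

alias <- DT            # no copy: both names refer to the same table
alias[, B := 3]        # modifies the shared table in place
names(DT)              # DT now has column B as well

snapshot <- copy(DT)   # explicit deep copy
snapshot[, C := 0]     # only the copy changes
names(DT)              # still A and B
```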

renkun-ken on 31 Mar 2020

👍1

I agree that having to call it twice should be fixed.

And as I am more familiar with data.table now, I actually do not care that much about set* any more. But it was something that blocked me at the very beginning. I was amazed that an operation could return nothing but still change the variable, without notifying me (of course, when I use it, I should know what I am doing... just not when I was a beginner). After all, I agree that set is designed well, because it still returns the result invisibly and can be used in a pipe. This is convenient.

My point was: while modification by reference is good, do we have to suppress the printing in the end (or, if nothing is printed, could there at least be some friendly feedback)? And I think your answer is that nothing about the assignment should be printed: neither the result (the final data.table) nor a message.

I am not quite sure what the answer to this debate is.

hope-data-science on 31 Mar 2020

I'm going to argue that modifying by reference itself is the root cause of confusion. Printing it may or may not help.

Let's imagine an object of R6 or ReferenceClasses. When you call obj$do_something(), it may change the state of itself. But will it print itself or throw a message to notify the user? No, as it makes the whole process too verbose, especially considering that this verboseness doesn't really help in many cases. The same goes for the environment object in R.
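
The same pattern can be sketched with a plain environment in base R. The increment() helper below is hypothetical and only meant to show a function that changes state by reference and stays silent:

```r
# A counter stored in an environment; environments are not copied when passed.
counter <- new.env()
counter$n <- 0L

# Hypothetical helper: mutates its argument and returns it invisibly.
increment <- function(env) {
  env$n <- env$n + 1L
  invisible(env)
}

increment(counter)   # nothing is printed
increment(counter)
counter$n            # 2
```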

Nobody wants to modify by reference for its own sake. Returning a copy is both much easier to implement and much easier to learn. I guess it's the price we have to pay for performant code and fast execution. At least, I personally can't see a way of avoiding this little confusion.

shrektan on 31 Mar 2020

Here's my opinion: if printing would reduce the speed, I would very much like modification by reference to stay the way it is now. Many extensions can deal with this behaviour (by simply adding a [] at the end), but no package is as fast as data.table.

But this feature is not perfect. Personally, I think the OOP approach is also ambiguous, because every action should get a response in the R environment (omitted when piping, printed when calling directly). R itself is great for its user-friendly design. If this ever causes problems (for the majority), I think it should be and could be solved.

I think there might be novel solutions to improve this feature in the future (both preserving performance and being more intuitive). Therefore I hope this question can stay open, so that someone might come up with a better idea later.

hope-data-science on 31 Mar 2020

Printing results after operations by reference will not reduce the speed.
Base R functions that are known to work by side effect return their results invisibly, and I think it is good to align with base R in that respect as well.
Of course there is room for improvement here in data.table, for example #4331, but in general I think it is better to return invisibly from functions that are meant to work by side effect. If a better idea is proposed, we are open to putting it on the roadmap. For now, the question has been discussed and answered, and no further actions are defined, so I will close this issue.
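
A quick way to see the alignment with base R is to compare visibility with withVisible(). This is a sketch only; assign() and set.seed() are picked here as examples of base functions called for their side effects, and the table is made up.

```r
library(data.table)

# Base R: functions called for their side effects return invisibly.
withVisible(assign("z", 10))$visible   # FALSE
withVisible(set.seed(1))$visible       # FALSE

# data.table: set() and setnames() modify in place and also return invisibly.
DT <- data.table(a = 1:3)
withVisible(set(DT, j = "b", value = 0L))$visible   # FALSE
withVisible(setnames(DT, "b", "c"))$visible         # FALSE
names(DT)   # "a" "c"
```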

jangorecki on 3 Apr 2020

👍1
