Data.table: rowwise DT creation

Created on 4 Mar 2020 · 15 Comments · Source: Rdatatable/data.table

I often find it very convenient to create a data.table row by row, as with tibble::tribble(), because each row is easier to distinguish than with vector-style (column-wise) initialization.

I often use the wrapper below in my own scripts. It's simple and useful, and I think it would be better to integrate it into the current data.table() implementation.

I suggest adding a .rowwise = TRUE argument to data.table(), to be used like this:

data.table(
  ~a, ~b, ~c,
  1, 2, "a",
  2, 3, "b",
  3, 4, "c",
  .rowwise = TRUE
)

If there is no better option or clear objection, I will file a PR.

(Note: creating a data.table this way usually implies a very small amount of data, since otherwise it makes no sense to create it row by row, so performance should be a minor consideration.)

Thanks.

The demonstration implementation

library(data.table)
rowdt <- function(...) {
  args <- list(...)
  # formulas such as ~a form the header; take the symbol on the right-hand side
  header <- Filter(function(x) inherits(x, "formula"), args)
  header <- vapply(header, function(x) as.character(x[[2L]]), FUN.VALUE = "a")
  ncol <- length(header)
  # everything else is a body cell; the cells must fill whole rows
  body <- Filter(function(x) !inherits(x, "formula"), args)
  stopifnot(length(body) %% ncol == 0)
  # group the cells into rows of ncol elements and bind the rows together
  body <- split(body, rep(seq_len(length(body) / ncol), each = ncol))
  out <- rbindlist(body)
  setnames(out, header)
  out
}

rowdt(
  ~a, ~b, ~c,
  1, 2, "a",
  2, 3, "b",
  3, 4, "c"
)
#>    a b c
#> 1: 1 2 a
#> 2: 2 3 b
#> 3: 3 4 c

Created on 2020-03-04 by the reprex package (v0.3.0)

feature request

shrektan
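
The header detection in the demonstration relies on how R stores a one-sided formula. A minimal sketch of that single step in isolation (the variable names f and col_name are illustrative, not from the issue):

```r
# A one-sided formula such as ~a is a call to `~` with a single argument.
f <- ~a
is_header <- inherits(f, "formula")   # TRUE for ~a

# Element 1 of the call is the `~` symbol; element 2 is the right-hand side.
col_name <- as.character(f[[2L]])     # the column name as a string

# Ordinary values are not formulas, so rowdt() treats them as body cells.
is_cell <- !inherits(1, "formula")
```

Because only the symbol is read, the formula is never evaluated, so a does not need to exist as a variable.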

All 15 comments

Usually I do this with fread:

x1 <- fread("
  a, b, c
  1, 2, a
  2, 3, b
  3, 4, c
")

The only difference is that fread will parse columns a and b as integer, which comes in handy anyway.

tdeenes on 4 Mar 2020

👍3
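
To make the type difference concrete, here is a small sketch. It assumes the same three-row table as above, passed through fread's text argument. Forcing double columns with colClasses is an optional extra, not something the comment proposes.

```r
library(data.table)

txt <- "a,b,c
1,2,a
2,3,b
3,4,c"

# Default type detection: whole numbers are read as integer.
x1 <- fread(text = txt)
types_default <- vapply(x1, class, character(1L))

# If double columns are wanted (as rowdt() would produce from 1, 2, 3),
# the classes can be requested explicitly.
x2 <- fread(text = txt, colClasses = c(a = "numeric", b = "numeric"))
types_forced <- vapply(x2, class, character(1L))
```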

I usually do

rbindlist(list(
  list(a=1, b=2, c="a"),
  list(1, 2, "a"),
  list(2, 3, "b")
))

jangorecki on 4 Mar 2020
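
A minimal sketch of this rbindlist() pattern, assuming the column names are given only on the first row. The later rows are unnamed and are bound by position. The variable names rows and dt are illustrative.

```r
library(data.table)

rows <- list(
  list(a = 1, b = 2, c = "a"),  # the first row carries the column names
  list(2, 3, "b"),              # unnamed rows are matched by position
  list(3, 4, "c")
)
dt <- rbindlist(rows)

# Numeric literals stay double here, unlike the fread() approach.
col_types <- vapply(dt, class, character(1L))
```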

I normally use fread. If this is implemented, I think it would be better as a separate function than as part of data.table().

ColeMiller1 on 7 Mar 2020

Agreed, but please, not the R formula hack.

jangorecki on 7 Mar 2020

Well, I've actually played with the interface for a while and it looks good to me so far. Honestly speaking, I prefer to have this functionality inside data.table() itself.

The reason is that it has basically zero impact on data.table's implementation, since we only need to add a single line to data.table() and let it call an internal function rowwiseDT().

if (isTRUE(.rowwise)) return(rowwiseDT(..., key=key))

Integrating this into data.table() makes sense to me, as data.table() is all about creating a data.table object, while as.data.table() / setDT() are all about converting another object to a data.table (by value / by reference).

As for the formula hack, yes, the only simple alternative I see is to use the first character vector as the column names. The user interface would look like this:

library(data.table)
data.table(
  "a, b, c, d",         # or c('a', 'b', 'c', 'd'), they are the same as the key in data.table(...,key)
  1, 2, 'a', (2:3),
  3, 4, 'b', list('e'),
  5, 6, 'c', character(),
  .rowwise = TRUE
)
#>        a     b      c      d
#>    <num> <num> <char> <list>
#> 1:     1     2      a    2,3
#> 2:     3     4      b      e
#> 3:     5     6      c

(The example above also illustrates an advantage over fread(): support for list elements.)

shrektan on 7 Mar 2020
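
The string-header variant described above can be sketched as a standalone helper. This is a hypothetical function (rowwise_chr is not part of data.table and is not the PR's code). It only shows how a comma-delimited header and non-scalar cells could be handled under the assumptions stated in the comments.

```r
library(data.table)

# Hypothetical helper: the first argument is the header, the rest are cells.
rowwise_chr <- function(header, ...) {
  # accept either "a, b, c" or c("a", "b", "c")
  if (length(header) == 1L)
    header <- trimws(strsplit(header, ",", fixed = TRUE)[[1L]])
  cells <- list(...)
  n <- length(header)
  stopifnot(length(cells) > 0L, length(cells) %% n == 0L)
  # wrap non-scalar cells so that they end up in a list column
  cells <- lapply(cells, function(x) if (length(x) != 1L) list(x) else x)
  # group the cells into rows of n elements and bind the rows together
  rows <- split(cells, rep(seq_len(length(cells) / n), each = n))
  out <- rbindlist(rows)
  setnames(out, header)
  out
}

dt <- rowwise_chr(
  "a, b, c, d",
  1, 2, "a", 2:3,
  3, 4, "b", list("e")
)
```

Column d becomes a list column here because its first cell has length 2, which is the case fread() cannot express.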

The main case for not including it within data.table() is that users will see rowwise creation on SO or in other resources and think that's what data.table() is. Additionally, the implementation shows just how little it depends on data.table(): they are completely separate functions that just happen to have the same output type and some arguments in common.

The rest of the implementation looks good. I tested it to try to break it and it is pretty resilient. It is unfortunate that the header could not be either the ~ hack or some other way that's more distinguishable than a comma delimited scalar string or a string vector. The only other single symbol that works is ? and that would be an even worse hack.

ColeMiller1 on 10 Mar 2020

@ColeMiller1 Thanks! Your argument sounds persuasive to me. I'll think about it, and I may turn it into a separate function. Regarding the ~ hack, indeed, I see no better alternative so far...

shrektan on 10 Mar 2020

How about this?

rowwiseDT(
  .(a, b, c),
  1,  2,  'a',
  3,  4,  'b'
)

shrektan on 10 Mar 2020

I do not know whether an alist-like solution has already been suggested:

rowwiseDT(
  a=, b=, c=, 
  1, 2, 'a', 
  3, 4, 'b'
)

And here goes an implementation:

rowwiseDT = function(..., key=NULL) {
  ## look for header
  args = substitute(...())
  argnames = names(args)
  header = argnames[argnames != ""] # note: setdiff(argnames, "") would remove duplicates
  header_empty = length(header) == 0L
  if (header_empty) 
    stop("No named items found which can be interpreted as the header")
  header_ok = all(vapply(
      args[header], 
      function(x) identical(x, substitute()), 
      logical(1L)))
  if (!header_ok) 
    stop("Named items must be empty to be considered as part of the header")
  nrcol = length(header) # do not use a function name (base::ncol)
  ## check body
  body = lapply(args[argnames == ""], eval, envir = parent.frame()) # evaluate cells in the caller's environment
  if (length(body) %% nrcol != 0L) {
    stop(sprintf("There're %d columns but the number of cells is %d, which is not an integer multiple of the columns", nrcol, length(body)))
  }
  ## make all the non-scalar elements to a list
  body = lapply(body, function(x) if (length(x) != 1L) list(x) else x)
  body = split(body, rep(seq_len(length(body) / nrcol), each = nrcol))
  ans = rbindlist(body)
  setnames(ans, header)
  if (!is.null(key)) {
    key = trimws(strsplit(key, split = ",", fixed = TRUE)[[1L]])
    setkeyv(ans, key)
  }
  ans
}

tdeenes on 10 Mar 2020

👍4

If it looks like this, I would prefer the original formula hack.

shrektan on 10 Mar 2020

.(a, b, c) looks nice, and could leverage existing name_dots helper

MichaelChirico on 10 Mar 2020

And what happens if a is already present in the environment before the rowwiseDT call? I really do think that rowwiseDT is a convenience function which is OK in examples and demonstrations but should be avoided in "serious" code. So the focus is on readability; I would not exclude the formula hack either, but using = is just much more self-explanatory.

tdeenes on 10 Mar 2020

👍2

Well, existing variable with the same name in the previous context doesn’t matter as it only uses NSE to capture the symbols anyway.

But after a second thought, a= could be a good alternative, it’s indeed self-explaining... Thanks for the suggestion.

shrektan on 10 Mar 2020
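
The point about NSE can be checked with a small sketch. header_names is a hypothetical helper (not from the PR) that reads a .(a, b, c) header with substitute(), so the symbols are captured as names and never looked up.

```r
# Hypothetical helper: capture the header call without evaluating it.
header_names <- function(header) {
  expr <- substitute(header)                 # the unevaluated call .(a, b, c)
  stopifnot(is.call(expr), identical(expr[[1L]], as.name(".")))
  cols <- as.list(expr)[-1L]                 # drop the `.` itself
  stopifnot(all(vapply(cols, is.symbol, logical(1L))))
  vapply(cols, as.character, character(1L))
}

# A variable called `a` already exists, but it is never consulted.
a <- "something unrelated"
nms <- header_names(.(a, b, c))
```

Here b and c are not defined anywhere and `.` is not a function in scope, yet the call succeeds because nothing in the header is evaluated.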

Well, existing variable with the same name in the previous context doesn’t matter as it only uses NSE to capture the symbols anyway.

Yes, of course. I meant it can confuse the reader of the code.

tdeenes on 10 Mar 2020

@ColeMiller1 @tdeenes Thanks to both of you for your ideas. The latest interface is shown below. It looks good to me so far 😸.

The PR has been updated. It would be great if you could try it. Feedback is welcome.

library(data.table)
rowwiseDT(
  a=,b=,c=,  d=,
  1, 2, "a", (2:3),
  3, 4, "b", list("e"),
  5, 6, "c", ~a+b,
  key="a"
)
#>        a     b      c      d
#>    <num> <num> <char> <list>
#> 1:     1     2      a    2,3
#> 2:     3     4      b      e
#> 3:     5     6      c ~a + b

shrektan on 11 Mar 2020
