Data.table: rowwise DT creation

Created on 4 Mar 2020 · 15 Comments · Source: Rdatatable/data.table

I often find it very convenient to create a data.table row by row, as tibble::tribble() does: each row is easier to distinguish than with column-wise (vector-style) initialization.
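For contrast, here is a minimal sketch of the usual column-wise construction of the same small table. Each argument is a whole column, so the values that belong to one row end up spread across several lines.

```r
library(data.table)

# Column-wise ("vector-style") construction: one argument per column.
dt <- data.table(
  a = c(1, 2, 3),
  b = c(2, 3, 4),
  c = c("a", "b", "c")
)

# To read off the second row, one has to pick the second value of each vector.
dt[2]
```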

I often use the wrapper below in my own scripts. It is simple and useful, and I think it would be better to integrate it into the current data.table() implementation.

I suggest adding a .rowwise = TRUE argument to data.table() as the user interface, so that it would be used like this:

data.table(
  ~a, ~b, ~c,
  1, 2, "a",
  2, 3, "b",
  3, 4, "c",
  .rowwise = TRUE
)

If there is no better option or clear objection, I will start working on a PR.

(Note: creating a data.table this way usually means a very small amount of data, since otherwise creating it row by row makes no sense. Performance should therefore be a minor concern.)

Thanks.

The demonstration implementation

library(data.table)
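rowdt <- function(...) {
  args <- list(...)
  # One-sided formulas such as ~a mark the header; their right-hand side is the column name
  header <- Filter(function(x) inherits(x, "formula"), args)
  header <- vapply(header, function(x) as.character(x[[2L]]), FUN.VALUE = "a")
  ncol <- length(header)
  # Everything else is a cell, listed in row-major order
  body <- Filter(function(x) !inherits(x, "formula"), args)
  stopifnot(length(body) %% ncol == 0)
  # Group the cells into rows of ncol elements, then bind the rows together
  body <- split(body, rep(seq_len(length(body) / ncol), each = ncol))
  out <- rbindlist(body)
  setnames(out, header)
  out
}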

rowdt(
  ~a, ~b, ~c,
  1, 2, "a",
  2, 3, "b",
  3, 4, "c"
)
#>    a b c
#> 1: 1 2 a
#> 2: 2 3 b
#> 3: 3 4 c

Created on 2020-03-04 by the reprex package (v0.3.0)

feature request

All 15 comments

Usually I do this with fread:

x1 <- fread("
  a, b, c
  1, 2, a
  2, 3, b
  3, 4, c
")

The only difference is that fread will parse columns a and b as integer rather than double, which comes in handy anyway.
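A small sketch of that type difference. The input is passed through the text argument as one string per line, and the second call assumes that doubles are wanted and requests them by column name through colClasses.

```r
library(data.table)

lines <- c("a,b,c", "1,2,a", "2,3,b", "3,4,c")

# Default type detection: whole numbers are read as integer.
x1 <- fread(text = lines)
sapply(x1, class)

# If double columns are needed, ask for them explicitly by name.
x2 <- fread(text = lines, colClasses = c(a = "numeric", b = "numeric"))
sapply(x2, class)
```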

I usually do

rbindlist(list(
  list(a=1, b=2, c="a"),
  list(1, 2, "a"),
  list(2, 3, "b")
))

I normally use fread. If this is implemented, I think it would be better as a separate function than as part of data.table().

Agreed, but please, not the R formula hack.

Well, I've actually played with the interface for a while and it looks good to me so far. Honestly, I would prefer to have this functionality inside data.table() itself.

The reason is that it can have basically zero impact on data.table()'s implementation: we only need to add a single line to data.table() and let it call an internal function rowwiseDT().

if (isTRUE(.rowwise)) return(rowwiseDT(..., key=key))

Integrating this into data.table() makes sense to me, as data.table() is all about creating a data.table object, while as.data.table()/setDT() are all about converting another object to a data.table (by value / by reference).

As for the formula hack: yes, the only simple alternative is to use the first character vector as the column names. The user interface would look like this:
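The one-line dispatch can be tried outside the package with a wrapper. This is only a sketch: make_dt() and rowwise_from_header() are hypothetical names, not data.table functions, and the header is passed as a plain character vector to keep the example short.

```r
library(data.table)

# Hypothetical row-wise builder: `header` holds the column names,
# the remaining arguments are the cells in row-major order.
rowwise_from_header <- function(header, ..., key = NULL) {
  cells <- list(...)
  n <- length(header)
  stopifnot(length(cells) %% n == 0L)
  rows <- split(cells, rep(seq_len(length(cells) / n), each = n))
  ans <- rbindlist(rows)
  setnames(ans, header)
  if (!is.null(key)) setkeyv(ans, key)
  ans
}

# Hypothetical front end standing in for data.table(): a single early
# return hands everything over to the row-wise builder.
make_dt <- function(..., key = NULL, .rowwise = FALSE) {
  if (isTRUE(.rowwise)) return(rowwise_from_header(..., key = key))
  data.table(..., key = key)
}

dt_rows <- make_dt(
  c("a", "b", "c"),
  1, 2, "a",
  2, 3, "b",
  .rowwise = TRUE
)
dt_cols <- make_dt(a = c(1, 2), b = c(2, 3), c = c("a", "b"))
```

Both calls go through the same entry point, and the column-wise path is untouched by the extra argument.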

library(data.table)
data.table(
  "a, b, c, d",         # or c('a', 'b', 'c', 'd'), they are the same as the key in data.table(...,key)
  1, 2, 'a', (2:3),
  3, 4, 'b', list('e'),
  5, 6, 'c', character(),
  .rowwise = TRUE
)
#>        a     b      c      d
#>    <num> <num> <char> <list>
#> 1:     1     2      a    2,3
#> 2:     3     4      b      e
#> 3:     5     6      c

(The above example also illustrates an advantage over fread(): support for list elements.)
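One way list cells like these could be handled is sketched below, assuming every cell whose length is not 1 gets wrapped in a list before the rows are bound with rbindlist(). The helper name wrap_cell() is made up for the illustration.

```r
library(data.table)

# Hypothetical helper: wrap non-scalar cells so they fit into one list cell.
wrap_cell <- function(x) if (length(x) != 1L) list(x) else x

rows <- list(
  lapply(list(1, 2:3), wrap_cell),         # 2:3 becomes list(2:3)
  lapply(list(3, list("e")), wrap_cell),   # list("e") has length 1, kept as is
  lapply(list(5, character()), wrap_cell)  # empty vector becomes list(character())
)

dt <- rbindlist(rows)
setnames(dt, c("a", "d"))
sapply(dt, class)
```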

The main case for not including it within data.table() is that users will see rowwise creation at SO or other resources and think that's what data.table() is. Additionally, the implementation shows just how little it depends on data.table() - they are completely separate functions that just happen to have the same output type with some arguments in common.

The rest of the implementation looks good. I tested it to try to break it and it is pretty resilient. It is unfortunate that the header could not be either the ~ hack or some other way that's more distinguishable than a comma delimited scalar string or a string vector. The only other single symbol that works is ? and that would be an even worse hack.

@ColeMiller1 Thanks! Your argument sounds persuasive to me. I'll think about it; I may turn it into a separate function. Regarding the ~ hack, indeed, I see no better alternative so far...

How about this?

rowwiseDT(
  .(a, b, c),
  1,  2,  'a',
  3,  4,  'b'
)

I do not know whether an alist-like solution has already been suggested:

rowwiseDT(
  a=, b=, c=, 
  1, 2, 'a', 
  3, 4, 'b'
)

And here goes an implementation:



rowwiseDT = function(..., key=NULL) {
  ## look for header
  args = substitute(...())
  argnames = names(args)
  header = argnames[argnames != ""] # note: setdiff(argnames, "") would remove duplicates
  header_empty = length(header) == 0L
  if (header_empty) 
    stop("No named items found which can be interpreted as the header")
  header_ok = all(vapply(
      args[header], 
      function(x) identical(x, substitute()), 
      logical(1L)))
  if (!header_ok) 
    stop("Named items must be empty to be considered as part of the header")
  nrcol = length(header) # do not use a function name (base::ncol)
  ## check body
  body = lapply(args[argnames == ""], eval)
  if (length(body) %% nrcol != 0L) {
    stop(sprintf("There're %d columns but the number of cells is %d, which is not an integer multiple of the columns", nrcol, length(body)))
  }
  ## make all the non-scalar elements to a list
  body = lapply(body, function(x) if (length(x) != 1L) list(x) else x)
  body = split(body, rep(seq_len(length(body) / nrcol), each = nrcol))
  ans = rbindlist(body)
  setnames(ans, header)
  if (!is.null(key)) {
    key = trimws(strsplit(key, split = ",", fixed = TRUE)[[1L]])
    setkeyv(ans, key)
  }
  ans
}

If it looks like this, I would prefer the original formula hack.

.(a, b, c) looks nice, and could leverage existing name_dots helper

And what happens if a is already present in the environment before the rowwiseDT call? I really do think that rowwiseDT is a convenience function that is fine in examples and demonstrations but should be avoided in "serious" code. So the focus is on readability; I would not exclude the formula hack either, but using = is just much more self-explanatory.

Well, an existing variable with the same name in the calling context doesn't matter, as it only uses NSE to capture the symbols anyway.

But on second thought, a= could be a good alternative; it's indeed self-explanatory... Thanks for the suggestion.

Well, an existing variable with the same name in the calling context doesn't matter, as it only uses NSE to capture the symbols anyway.

Yes, of course. I meant it can confuse the reader of the code.
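A minimal sketch of the point about NSE: the header names are read from the unevaluated call, so a variable named a in the calling environment is never looked up. The helper header_names() is hypothetical and only covers the name-capturing step.

```r
# Hypothetical helper: collect the names of arguments written as `name=`.
header_names <- function(...) {
  # names() of the unevaluated call; the first slot belongs to `list` itself
  nms <- names(substitute(list(...)))[-1L]
  nms[nzchar(nms)]
}

a <- 100  # a variable that happens to share a column name

hdr <- header_names(
  a=, b=,
  1, 2
)
hdr
```

Here hdr is c("a", "b") regardless of the value of a, although a reader may still wonder whether a refers to the variable.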

@ColeMiller1 @tdeenes Thanks to both of you for your ideas. The latest interface is shown below. It looks good to me so far 😸.

The PR has been updated. It would be great if you could try it. Feedback is welcome.

library(data.table)
rowwiseDT(
  a=,b=,c=,  d=,
  1, 2, "a", (2:3),
  3, 4, "b", list("e"),
  5, 6, "c", ~a+b,
  key="a"
)
#>        a     b      c      d
#>    <num> <num> <char> <list>
#> 1:     1     2      a    2,3
#> 2:     3     4      b      e
#> 3:     5     6      c ~a + b