Data.table: allow keys on type list

Created on 6 Jun 2019 · 11 comments · Source: Rdatatable/data.table

Please allow us to key on lists. See https://stackoverflow.com/a/56469092/4228193 for a convoluted work-around.

# Minimal reproducible example

A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
setkey(B,b)

Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
Column 'b' is type 'list' which is not supported as a key column type, currently.

A[B]

Error in [.data.table(A, B) :
When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

A[CJ(B,by=a)]

Error: Column 1 of by= (1) is type 'list', not yet supported. Please use the by= argument to specify columns with types that are supported. See NEWS item in v1.12.2 for more information.

A[CJ(B,on=a)]

Error: Column 1 of by= (1) is type 'list', not yet supported. Please use the by= argument to specify columns with types that are supported. See NEWS item in v1.12.2 for more information.

A[CJ(B[,b],by=a)]

Error: Column 2 is length 3 which differs from length of column 1 (2)

A[B[,b],,on=.(a)]

Error in forderv(x, by = rightcols) :
Column 1 of by= (1) is type 'list', not yet supported. Please use the by= argument to specify columns with types that are supported. See NEWS item in v1.12.2 for more information.
In addition: Warning message:
In as.data.table.list(i) :
Item 1 is of size 2 but maximum size is 3 (recycled leaving a remainder of 1 items)

# Output of sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.2 lubridate_1.7.4 forcats_0.3.0 stringr_1.4.0 dplyr_0.7.8 purrr_0.2.5 readr_1.3.0 tidyr_0.8.2
[9] tibble_2.1.1 ggplot2_3.1.0 tidyverse_1.2.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.0 plyr_1.8.4 bindr_0.1.1 tools_3.5.0 jsonlite_1.6 nlme_3.1-137
[10] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.2 rlang_0.3.4 cli_1.0.1 rstudioapi_0.10 haven_2.0.0 bindrcpp_0.2.2 withr_2.1.2
[19] xml2_1.2.0 httr_1.4.0 generics_0.0.2 hms_0.4.2 grid_3.5.0 tidyselect_0.2.5 glue_1.3.0 R6_2.3.0 readxl_1.3.1
[28] modelr_0.1.2 magrittr_1.5 backports_1.1.4 scales_1.0.0 rvest_0.3.2 assertthat_0.2.0 colorspace_1.4-1 stringi_1.4.3 lazyeval_0.2.1
[37] munsell_0.5.0 broom_0.5.1 crayon_1.3.4

feature request

MPagel

All 11 comments

To allow keys on a list column we would need to know how to calculate the order of list elements, where every single element can be of a different type. The linked SO answer shows only a single special case, which is not generic enough for a list. I would close this issue unless we figure out how to calculate the order of list elements.

jangorecki on 31 Jul 2019

I'm back and forth on this one... identical(l1, l2) is certainly valid...

setkey per se might not make sense, but I think grouping by list is helpful and often meaningful (just this week I found myself grouping by sapply(l, do.call, what=paste) as a workaround)... and I'm not sure we can get to group by list without basically making setkey(list) available.

MichaelChirico on 31 Jul 2019
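
A minimal sketch of the grouping workaround described above, assuming a list column whose elements are plain character vectors. The table `DT`, the column names, and the `|` separator are illustrative assumptions; the separator is assumed not to occur in the data.

```r
library(data.table)

# Hypothetical example data: rows 1 and 3 hold identical list elements
DT <- data.table(l = list(c("a", "b"), "c", c("a", "b")), v = 1:3)

# Derive a character column from the list column, then group by it
DT[, grp := vapply(l, paste, character(1L), collapse = "|")]
out <- DT[, .(total = sum(v)), by = grp]
```

Grouping happens on the derived string, not on the list itself, so two elements fall into the same group only when their pasted forms are equal.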

Although likely horribly inefficient, https://stackoverflow.com/a/28822665 might be the starting point for a solution that works with disparate data types.
Would it be possible to pipe through e.g. fwrite to an in-memory object, then compare the generated strings?

If this would be too inefficient to maintain the order in case of a keyed column, would it be possible to allow for joining using DT's "on" for list data types using this method, with the caveat that it'll probably be slow to run?

MPagel on 31 Jul 2019
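
A rough sketch of comparing list elements through a string representation. It uses base R's deparse() in place of fwrite or dump, which is a substitution made here for brevity; `as_key` is a hypothetical helper name, not a data.table function.

```r
# Hypothetical helper: one string per list element, built with deparse().
# deparse() may split long objects over several lines, hence the paste().
as_key <- function(x) {
  vapply(x, function(el) paste(deparse(el), collapse = ""), character(1L))
}

a <- list(c("abc", "def", "123"), c("ghi", "zyx"))
b <- list(c("zyx", "ghi"), c("abc", "def", "123"))

# Position in `a` of each element of `b`; NA where no exact match exists
idx <- match(as_key(b), as_key(a))
```

Element order matters under this comparison: c("zyx", "ghi") and c("ghi", "zyx") deparse to different strings and therefore do not match.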

Neither of your cases really orders/groups by the list itself; both order/group by its string representation. A list can store any objects, including those that cannot easily be represented as a string.
It might make sense to have a fast function for pasting, deparsing, etc. and then key on character.

jangorecki on 31 Jul 2019
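
The "paste, then key on character" suggestion could look roughly like this for the tables from the original report. `collapse_key` and the `key_chr` column are hypothetical names, and the `|` separator is assumed not to appear in the data.

```r
library(data.table)

A <- data.table(a = list(c("abc", "def", "123"), c("ghi", "zyx")), d = c(9, 8))
B <- data.table(b = list(c("zyx", "ghi"), c("abc", "def", "123")), z = c(5, 7))

# Hypothetical helper: collapse each character vector into a single string
collapse_key <- function(x) vapply(x, paste, character(1L), collapse = "|")

# Add a character column derived from the list column and key on it
A[, key_chr := collapse_key(a)]
B[, key_chr := collapse_key(b)]
setkey(A, key_chr)
setkey(B, key_chr)

# Keyed inner join on the derived character key
res <- A[B, nomatch = NULL]
```

Only the row holding c("abc", "def", "123") matches, because the other pair differs in element order. Sorting each element before pasting would treat them as equal, if that is the intended semantics.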

I acknowledge, as above, that dump/fwrite would be looking at a string representation, albeit one that'll lead to a more accurate comparison of objects than simply a print/paste. It'd be nice to not have to make a new field in the DT to store the string/character representation of the list object which would be then keyed or compared for the join operation, but to implicitly do this in the "background" such that a user can join on list objects.

An aside: a direct comparison of dput-ed objects doesn't quite work for data.table objects themselves (which I acknowledge proves your point), as an .internal.selfref field is generated that shouldn't matter for the purposes of comparison. A "get" or "parse" function that I use for such dumped objects is as follows:

data.table.parse<-function (file = "", n = NULL, text = NULL, prompt = "?", keep.source = getOption("keep.source"), 
                            srcfile = NULL, encoding = "unknown") { # needed for dput data.tables (rather than data.frames)
  keep.source <- isTRUE(keep.source)
  if (!is.null(text)) {
    if (length(text) == 0L) return(expression())
    if (missing(srcfile)) {
      srcfile <- "<text>"
      if (keep.source)srcfile <- srcfilecopy(srcfile, text)
    }
    file <- stdin()
  }
  else {
    if (is.character(file)) {
      if (file == "") {
        file <- stdin()
        if (missing(srcfile)) srcfile <- "<stdin>"
      }
      else {
        filename <- file
        file <- file(filename, "r")
        if (missing(srcfile)) srcfile <- filename
        if (keep.source) {
          text <- readLines(file, warn = FALSE)
          if (!length(text)) text <- ""
          close(file)
          file <- stdin()
          srcfile <- srcfilecopy(filename, text, file.mtime(filename), isFile = TRUE)
        }
        else {
          text <- readLines(file, warn = FALSE)
          if (!length(text)) text <- ""
          else text <- gsub("(, .internal.selfref = <pointer: 0x[0-9A-Fa-f]+>)","",text,perl=TRUE)
          on.exit(close(file))
        }
      }
    }
  }
  .Internal(parse(file, n, text, prompt, srcfile, encoding))
}
data.table.get <- function(file, keep.source = FALSE)
  eval(data.table.parse(file = file, keep.source = keep.source))
dtget <- data.table.get # alias

MPagel on 31 Jul 2019

If one needs the element-wise string representation of a list consisting of arbitrary objects, hashing can help.

library(fastdigest)      # for memory-efficient and fast hashing
# create a list of length 1e5
listvar <- list(letters, list(LETTERS, letters), runif(10), lm(hp ~ am, data = mtcars))
listvar <- sample(listvar, 1e5, TRUE)
# transform
hashvar <- vapply(listvar, fastdigest, character(1L))

The code above is not terribly slow. On my laptop, calculating the hash vector takes ~1.5 seconds, and can be parallelized.

tdeenes on 31 Jul 2019

👍2
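
As an illustration of how such hashes might be used for a join, here is a sketch on the tables from the original report. It swaps in digest::digest for fastdigest (an assumption made so the example depends only on a widely installed package), and the column name `h` is hypothetical.

```r
library(data.table)
library(digest)  # swapped in for fastdigest; digest() hashes a serialized R object

A <- data.table(a = list(c("abc", "def", "123"), c("ghi", "zyx")), d = c(9, 8))
B <- data.table(b = list(c("zyx", "ghi"), c("abc", "def", "123")), z = c(5, 7))

# One hash string per list element
A[, h := vapply(a, digest, character(1L))]
B[, h := vapply(b, digest, character(1L))]

# Inner join on the hash column; no key is needed when using on=
joined <- A[B, on = "h", nomatch = NULL]
```

Because the hash is computed on the serialized object, elements match only when they are identical in type, content, and element order.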

Jan, I don't know if there's any demand for grouping/sorting on generic objects?

I think it would be good enough to have something that works on "simple" lists of atomic vectors?

Hashing sounds like a decent approach. I also don't expect grouping by list with massive cardinality, so collisions shouldn't be an issue...

MichaelChirico on 1 Aug 2019

👍1

Sorting on a hash would be self-defeating though, as the sort order would look rather arbitrary and "unordered" when displaying the underlying lists. But it would be excellent for joining via on.

MPagel on 1 Aug 2019

Maybe, but OTOH, probably everyone has their own idea of what the "right" sorting is. It would make sense to me to use arbitrary "sorting" in this case, and it would be an improvement over the current (lack of) functionality.

MichaelChirico on 1 Aug 2019

👍1

I do not think there exists a commonly accepted definition of "order" for list-like objects. One can only introduce operationalized definitions for it, e.g.:

if the list is named, the order is based on the lexicographic order of those names
the order is based on the size of the object
the order is based on the lexicographic order of the hashed list elements (see above)
...

So I would suggest treating "order" as a technical requirement that is needed to handle list columns in the same way as columns of atomic vectors.

tdeenes on 1 Aug 2019
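
The operational definitions listed above can be sketched in base R. The example list `x` is made up, "size" is taken to mean the number of elements (one possible reading), and a pasted string stands in for the hash in the third definition.

```r
# Hypothetical named list of character vectors
x <- list(b = c("ghi", "zyx"), a = c("abc", "def", "123"), c = "q")

# 1. Lexicographic order of the names
ord_by_name <- order(names(x))

# 2. Order by size, here read as the number of elements
ord_by_size <- order(lengths(x))

# 3. Lexicographic order of a derived string (a hash could replace it)
ord_by_repr <- order(vapply(x, paste, character(1L), collapse = "|"))
```

The three permutations need not agree, which is the point of the comment: each is a valid but arbitrary choice of "order".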

> It'd be nice to not have to make a new field in the DT to store the string/character representation of the list object which would be then keyed or compared for the join operation, but to implicitly do this in the "background" such that a user can join on list objects.

Implicitly doing something ambiguous in the background doesn't seem to be a good idea.

> I do not think there exists a commonly accepted definition of "order" for list-like objects.

Agreed.

I am closing this issue because list is a data type that is not constrained enough to be used as a key. A list of strings, a list of integers, etc. could possibly be supported, but then it is not a generic list but a list of a specific type. Hashing is a good idea for handling such specific lists, or possibly just pasting.

jangorecki on 30 Nov 2019
