Please allow us to key on lists. See https://stackoverflow.com/a/56469092/4228193 for a convoluted work-around.
# Minimal reproducible example
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
setkey(B,b)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
Column 'b' is type 'list' which is not supported as a key column type, currently.
A[B]
Error in `[.data.table`(A, B) :
When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
A[CJ(B,by=a)]
Error: Column 1 of by= (1) is type 'list', not yet supported. Please use the by= argument to specify columns with types that are supported. See NEWS item in v1.12.2 for more information.
A[CJ(B,on=a)]
Error: Column 1 of by= (1) is type 'list', not yet supported. Please use the by= argument to specify columns with types that are supported. See NEWS item in v1.12.2 for more information.
A[CJ(B[,b],by=a)]
Error: Column 2 is length 3 which differs from length of column 1 (2)
A[B[,b],,on=.(a)]
Error in forderv(x, by = rightcols) :
Column 1 of by= (1) is type 'list', not yet supported. Please use the by= argument to specify columns with types that are supported. See NEWS item in v1.12.2 for more information.
In addition: Warning message:
In as.data.table.list(i) :
Item 1 is of size 2 but maximum size is 3 (recycled leaving a remainder of 1 items)
# Output of sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.2 lubridate_1.7.4 forcats_0.3.0 stringr_1.4.0 dplyr_0.7.8 purrr_0.2.5 readr_1.3.0 tidyr_0.8.2
[9] tibble_2.1.1 ggplot2_3.1.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.0 plyr_1.8.4 bindr_0.1.1 tools_3.5.0 jsonlite_1.6 nlme_3.1-137
[10] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.2 rlang_0.3.4 cli_1.0.1 rstudioapi_0.10 haven_2.0.0 bindrcpp_0.2.2 withr_2.1.2
[19] xml2_1.2.0 httr_1.4.0 generics_0.0.2 hms_0.4.2 grid_3.5.0 tidyselect_0.2.5 glue_1.3.0 R6_2.3.0 readxl_1.3.1
[28] modelr_0.1.2 magrittr_1.5 backports_1.1.4 scales_1.0.0 rvest_0.3.2 assertthat_0.2.0 colorspace_1.4-1 stringi_1.4.3 lazyeval_0.2.1
[37] munsell_0.5.0 broom_0.5.1 crayon_1.3.4
To allow keys on a list column we would need to know how to compute an order for list elements, where every single element can be of a different type. The linked SO answer shows only a single special case, which is not generic enough for a list. I would close this issue unless we figure out how to compute an order for list elements.
I'm back and forth on this one... identical(l1, l2) is certainly valid... setkey per se might not make sense, but I think grouping by list is helpful and often meaningful (just this week I found myself grouping by sapply(l, do.call, what=paste) as a workaround)... and I'm not sure we can get to group by list without basically making setkey(list) available.
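A minimal sketch of the kind of workaround described above: grouping by a character stand-in for the list column. It assumes every list element is a character vector and that the separator never occurs in the data; the column names `l`, `v` and `grp` are made up for illustration.

```r
library(data.table)

DT <- data.table(
  l = list(c("a", "b"), c("x", "y", "z"), c("a", "b")),
  v = c(1, 2, 3)
)

# Assumption: elements are character vectors, so pasting them with a
# separator that never appears in the data identifies each element uniquely.
DT[, grp := vapply(l, paste, character(1L), collapse = "\x1f")]

# Group on the character stand-in and keep one copy of the original element.
out <- DT[, .(l = l[1L], total = sum(v)), by = grp]
```

The extra `grp` column is exactly the bookkeeping the request would like data.table to handle; it can be dropped afterwards with `out[, grp := NULL]`.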
Although likely horribly inefficient, https://stackoverflow.com/a/28822665 might be the starting point for a solution that works with disparate data types.
Would it be possible to pipe through e.g. fwrite to an in-memory object, then compare the generated strings?
If this would be too inefficient for maintaining the order of a keyed column, would it be possible to allow joining on list data types via DT's "on" using this method, with the caveat that it'll probably be slow to run?
Neither of your cases is really ordering/grouping by list; both order or group by the list's string representation. A list can store any objects, including those that cannot be easily represented as a string.
It might make sense to have a fast function for pasting, deparsing, etc. and then key on character.
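As a rough sketch of "deparse, then join on character", using the tables from the original report: the helper `as_key` is hypothetical, and sorting each element before deparsing is an assumption that element order inside a vector should be ignored (drop `sort()` if order matters).

```r
library(data.table)

A <- data.table(a = list(c("abc", "def", "123"), c("ghi", "zyx")), d = c(9, 8))
B <- data.table(b = list(c("zyx", "ghi"), c("abc", "def", "123")), z = c(5, 7))

# Hypothetical helper: one string per list element.
# deparse() keeps type information; sort() is an assumption (order-insensitive).
as_key <- function(l) {
  vapply(l, function(x) paste(deparse(sort(x)), collapse = ""), character(1L))
}

A[, a_key := as_key(a)]
B[, b_key := as_key(b)]

# Join on the character columns; the list columns are carried along untouched.
res <- A[B, on = .(a_key = b_key)]
```

Since `a_key` is an ordinary character column, `setkey(A, a_key)` also works if a keyed join is preferred.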
I acknowledge, as above, that dump/fwrite would be looking at a string representation, albeit one that'll lead to a more accurate comparison of objects than simply a print/paste. It'd be nice to not have to make a new field in the DT to store the string/character representation of the list object which would be then keyed or compared for the join operation, but to implicitly do this in the "background" such that a user can join on list objects.
As an aside, a direct comparison of dput output doesn't quite work for data.table objects themselves (which I acknowledge proves your point), because the output contains a generated .internal.selfref field that shouldn't matter for the purposes of comparison. A "get" or "parse" function that I use for such dumped objects is as follows:
data.table.parse <- function(file = "", n = NULL, text = NULL, prompt = "?", keep.source = getOption("keep.source"),
                             srcfile = NULL, encoding = "unknown") { # needed for dput data.tables (rather than data.frames)
  keep.source <- isTRUE(keep.source)
  if (!is.null(text)) {
    if (length(text) == 0L) return(expression())
    if (missing(srcfile)) {
      srcfile <- "<text>"
      if (keep.source) srcfile <- srcfilecopy(srcfile, text)
    }
    file <- stdin()
  }
  else {
    if (is.character(file)) {
      if (file == "") {
        file <- stdin()
        if (missing(srcfile)) srcfile <- "<stdin>"
      }
      else {
        filename <- file
        file <- file(filename, "r")
        if (missing(srcfile)) srcfile <- filename
        if (keep.source) {
          text <- readLines(file, warn = FALSE)
          if (!length(text)) text <- ""
          close(file)
          file <- stdin()
          srcfile <- srcfilecopy(filename, text, file.mtime(filename), isFile = TRUE)
        }
        else {
          text <- readLines(file, warn = FALSE)
          if (!length(text)) text <- ""
          # strip the pointer that dput() writes for data.table objects
          else text <- gsub("(, .internal.selfref = <pointer: 0x[0-9A-Fa-f]+>)", "", text, perl = TRUE)
          on.exit(close(file))
        }
      }
    }
  }
  .Internal(parse(file, n, text, prompt, srcfile, encoding))
}
data.table.get <- function(file, keep.source = FALSE)
  eval(data.table.parse(file = file, keep.source = keep.source))
dtget <- data.table.get # alias
If one needs the element-wise string representation of a list consisting of arbitrary objects, hashing can help.
library(fastdigest) # for memory-efficient and fast hashing
# create a list of length 1e5
listvar <- list(letters, list(LETTERS, letters), runif(10), lm(hp ~ am, data = mtcars))
listvar <- sample(listvar, 1e5, TRUE)
# transform
hashvar <- vapply(listvar, fastdigest, character(1L))
The code above is not terribly slow. On my laptop, calculating the hash vector takes ~1.5 seconds, and can be parallelized.
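For the join case, a hedged sketch of how such a hash vector could stand in for the list column, reusing the tables from the original report. The column name `h` is made up, and the hash is taken on each element as-is, so only elements that are exactly identical (same values, same order, same attributes) are expected to match.

```r
library(data.table)
library(fastdigest)

A <- data.table(a = list(c("abc", "def", "123"), c("ghi", "zyx")), d = c(9, 8))
B <- data.table(b = list(c("zyx", "ghi"), c("abc", "def", "123")), z = c(5, 7))

# One hash string per list element; identical elements get identical hashes.
A[, h := vapply(a, fastdigest, character(1L))]
B[, h := vapply(b, fastdigest, character(1L))]

# Inner join on the hash column. c("ghi", "zyx") and c("zyx", "ghi") differ
# in element order, so they hash differently and do not match here.
res <- A[B, on = "h", nomatch = NULL]
```

The hash is only a lookup key: sorting by `h` would give an arbitrary order relative to the underlying lists.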
Jan, I don't know if there's any demand for grouping/sorting on generic objects? I think it would be good enough to have something that works on "simple" lists of atomic vectors? Hashing sounds like a decent approach. I also don't expect grouping by list with massive cardinality, so collisions shouldn't be an issue...
Sorting on a hash would be self-defeating though, as the sort order would look rather arbitrary and "unordered" when displaying the underlying lists. But it would be excellent for joining via on.
Maybe, but OTOH, probably everyone has their own idea of what the "right" sorting is. It would make sense to me to use an arbitrary "sorting" in this case, and it would be an improvement over the current (lack of) functionality.
I do not think there exists a commonly accepted definition of "order" for list-like objects. One can only introduce operationalized definitions for it, e.g.:
So I would suggest to treat "order" as a technical requirement which is needed to handle list columns in the same way as columns of atomic vectors.
It'd be nice to not have to make a new field in the DT to store the string/character representation of the list object which would be then keyed or compared for the join operation, but to implicitly do this in the "background" such that a user can join on list objects.
Implicitly doing something ambiguous in the background doesn't seem to be a good idea.
I do not think there exists a commonly accepted definition of "order" for list-like objects.
agree.
I am closing this issue because a list is a data type that is not constrained enough to be used as a key. A list of strings, a list of integers, etc. could eventually be supported, but then they are not just a list, but a list of .... Hashing is a good idea for addressing such specific lists, or possibly just pasting.