I've been trying to reproduce this error, but I'm having difficulties. Please bear with me. Code to reproduce appears below!
I have a file with a few columns, which gets read in via read_tsv. I can then go on to group_by and mutate, but if I pipe into distinct() it throws an error if and only if I add .keep_all = TRUE (or if this is the case implicitly, as in dplyr >= 0.7.0).
The error I get is:
Error in distinct_impl(dist$data, dist$vars, dist$keep) :
Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'integer'
To make this reproducible, I created a gist from the file. But sometimes the error 'magically' disappears, and sometimes I can reproduce it.
Here's the code that should reproduce it:
library(tidyverse)
gist <- 'https://gist.githubusercontent.com/breichholf/3b2e5eb253a932b8b0e540812811ecb6/raw/2798b2a58e281fcd3867e4dbf4adbe11f8a7b4f3/test.bed'
bed <- read_tsv(gist, col_names = c('chromosome', 'start', 'end', 'gene', 'score', 'strand', 'anno.id', 'interval.id', 'window.id'))
geneBed <-
  bed %>%
  group_by(interval.id) %>%
  mutate(min.start = min(start),
         max.end = max(end),
         dist.to.start = start - min.start,
         exon.len = end - start,
         cds.start = min.start,
         cds.end = max.end,
         all.starts = paste(dist.to.start, collapse = ","),
         all.lens = paste(exon.len, collapse = ","))
> geneBed %>% distinct(interval.id, .keep_all = TRUE)
Error in distinct_impl(dist$data, dist$vars, dist$keep) :
Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'integer'
The reason I figured it might have something to do with the encoding is that write_tsv also throws an error:
> geneBed %>% write_tsv('test.txt')
Error in stream_delim_(df, path, ...) :
'translateCharUTF8' must be called on a CHARSXP
However, as mentioned above, geneBed %>% distinct(interval.id) without .keep_all = TRUE performs as expected. Additionally, and perhaps of note, unique() also throws an error:
> geneBed %>% unique()
Error in paste(chromosome = c("chr1", "chr10", "chr11", "chr12", "chr11", :
'translateChar' must be called on a CHARSXP
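For reference, here is a minimal sketch of what the two distinct() calls discussed above should return when nothing goes wrong. The toy data frame and its column names are made up for illustration and are not taken from the gist; the sketch only assumes that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: two rows per key, plus a character column.
toy <- data.frame(
  id    = c(1L, 1L, 2L, 2L),
  start = c(10L, 20L, 30L, 40L),
  label = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# With .keep_all = TRUE, the first row for each id is kept with all columns.
with_keep <- toy %>% distinct(id, .keep_all = TRUE)

# Without .keep_all, only the column(s) named in distinct() are returned.
without_keep <- toy %>% distinct(id)
```

With .keep_all = TRUE every remaining column, including character columns, has to be copied into the result, whereas the plain call only touches the key column.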
I've tried the same code on another machine (OSX instead of Linux) and can reproduce the error if it's from a fresh R session. I've managed (strangely, only sometimes) to resolve the error by splitting up mutate into several statements, or by piping directly into distinct after mutate, but unfortunately I haven't been able to work out how to reproduce the fix so far.
If there's anything I can do or try on my end please let me know.
Relevant session info:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 7 (wheezy)
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1
[5] tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4
[5] bindr_0.1 forcats_0.2.0 tools_3.4.0 jsonlite_1.5
[9] lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0 lattice_0.20-35
[13] pkgconfig_2.0.1 rlang_0.1.1 psych_1.7.5 curl_2.7
[17] parallel_3.4.0 haven_1.1.0 xml2_1.1.1 stringr_1.2.0
[21] httr_1.2.1 hms_0.3 grid_3.4.0 glue_1.1.1
[25] R6_2.2.2 readxl_1.0.0 foreign_0.8-69 reshape2_1.4.2
[29] modelr_0.1.0 magrittr_1.5 scales_0.4.1 rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
FWIW, downgrading to dplyr == 0.5.0 makes the above code run fine.
Thanks. This is worrisome. Could you please try running with gctorture() or e.g. gctorture2(99), and see if you can replicate the error more reliably? I'll look into it, too.
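A minimal sketch of how such a run could be set up, assuming only base R. The helper name with_gctorture is made up for illustration; gctorture2() is the base R function mentioned above, which forces a garbage collection every `step` allocations so that protection errors surface more reliably.

```r
# Hypothetical helper: run f() with GC torture enabled, restoring the
# previous setting afterwards even if f() fails.
with_gctorture <- function(f, step = 99L) {
  old <- gctorture2(step)              # returns the previous step value
  on.exit(gctorture2(old), add = TRUE) # restore it when the helper exits
  # Return the error message instead of stopping, so runs can be compared.
  tryCatch(f(), error = function(e) conditionMessage(e))
}

# Stand-in workload; in practice this would be the failing pipeline.
out <- with_gctorture(function() {
  paste(as.character(runif(100)), collapse = ",")
})
```

To test the actual problem, the stand-in function would be replaced by one that runs the grouped mutate followed by distinct(). Expect this to be much slower than a normal run.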
Actually, I can replicate the error without gctorture(), so there seems to be no need to use it. Will investigate further.
Have you looked into nested tibbles? Try this:
geneBed <-
  bed %>%
  group_by(interval.id) %>%
  mutate(min.start = min(start),
         max.end = max(end),
         dist.to.start = start - min.start,
         exon.len = end - start,
         cds.start = min.start,
         cds.end = max.end) %>%
  nest(dist.to.start, exon.len)
Your original problem seems to be caused by the grouped mutate that assigns a string. Looks like a protection error to me. Simpler reprex:
library(dplyr)
set.seed(20170715L)
df <-
  data_frame(x = 1:10000) %>%
  group_by(x) %>%
  mutate(y = as.character(runif(1L)),
         z = as.character(runif(1L)))
df %>% distinct(x, .keep_all = TRUE)
#> Error in distinct_impl(dist$data, dist$vars, dist$keep): Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'list'
Can you please try with:
# install.packages("remotes")
remotes::install_github("tidyverse/dplyr#2976")
Sorry for the late reply. #2976 fixed it! 👍
Tried both with my code above, and your reprex.
Do we know when this fix will be in a released version of dplyr?
It is now in the dev version, but a CRAN release is likely to take a while.
So it will be in the next dplyr release? 0.7.3?