Dplyr: Summing across many columns

Created on 29 Aug 2019 · 9Comments · Source: tidyverse/dplyr

It is quite common in social sci to need to add-up many columns, representing questions on a questionnaire, into a single vector. When only summing 2 or 3 columns, this can be solved with a simple +:

QIMU_raw <- structure(
  list(ID = 1:6,
       Sex = structure(c(1L, 1L, 1L, 1L, 2L, 2L),
                       .Label = c("F", "M"),
                       class = "factor"),
       Age = c(25L, 20L, 23L, 29L, 24L, 27L),
       Q1 = c(-0.49, -0.08, -2.01, 1.00, 0.27, -0.45),
       Q2 = c(0.40, 0.71, -0.77, 0.88, -1.92, -0.48),
       Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
       Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
       Q5 = c(-0.44, 0.64, -1.05, 0.80, 0.83, 0.01),
       Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.90)),
  row.names = c(NA, 6L),
  class = "data.frame"
)

library(dplyr)
QIMU_raw %>%
  mutate(Score = Q1 + Q2)
#> Source: local data frame [6 x 10]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 10
#>      ID Sex     Age    Q1    Q2    Q3    Q4    Q5    Q6   Score
#>   <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#> 1     1 F        25 -0.49  0.4   0.53 -0.47 -0.44 -1.02 -0.0900
#> 2     2 F        20 -0.08  0.71  0.71 -0.51  0.64 -0.02  0.63  
#> 3     3 F        23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -2.78  
#> 4     4 F        29  1     0.88 -0.09 -0.26  0.8   0.07  1.88  
#> 5     5 M        24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -1.65  
#> 6     6 M        27 -0.45 -0.48 -0.31 -1.45  0.01 -0.9  -0.93

However, when there is need to sum many columns, this become somewhat impractical, and
rowwise() + mutate() cannot be used, as tidyselect is not respected in sum() and returns bogus results:

QIMU_raw %>%
  rowwise() %>% 
  mutate(Score = sum(Q1:Q2))
#> Source: local data frame [6 x 10]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 10
#>      ID Sex     Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#>   <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 F        25 -0.49  0.4   0.53 -0.47 -0.44 -1.02 -0.49
#> 2     2 F        20 -0.08  0.71  0.71 -0.51  0.64 -0.02 -0.08
#> 3     3 F        23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -3.02
#> 4     4 F        29  1     0.88 -0.09 -0.26  0.8   0.07  1   
#> 5     5 M        24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.19
#> 6     6 M        27 -0.45 -0.48 -0.31 -1.45  0.01 -0.9  -0.45

Unfortunately, even though this seems to be a popular question, there simply doesn't seem to be a way to do this within dplyr that isn't cumbersome or at least somewhat prone to error:

1. `select() + rowSums()` in `mutate()`

QIMU_raw %>%
  mutate(Score = select(., Q1:Q6) %>% rowSums())
#>   ID Sex Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#> 1  1   F  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02 -1.49
#> 2  2   F  20 -0.08  0.71  0.71 -0.51  0.64 -0.02  1.45
#> 3  3   F  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -8.69
#> 4  4   F  29  1.00  0.88 -0.09 -0.26  0.80  0.07  2.40
#> 5  5   M  24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.52
#> 6  6   M  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90 -3.58

This seems to work fine, except when an item is modified in the same call to mutate (also common for score flipping):

QIMU_raw %>%
  mutate(Q1 = -1 * Q1,
         Score = select(., Q1:Q6) %>% rowSums())
#>   ID Sex Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#> 1  1   F  25  0.49  0.40  0.53 -0.47 -0.44 -1.02 -1.49
#> 2  2   F  20  0.08  0.71  0.71 -0.51  0.64 -0.02  1.45
#> 3  3   F  23  2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -8.69
#> 4  4   F  29 -1.00  0.88 -0.09 -0.26  0.80  0.07  2.40
#> 5  5   M  24 -0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.52
#> 6  6   M  27  0.45 -0.48 -0.31 -1.45  0.01 -0.90 -3.58

2. make long and `summarize()`

library(tidyr)
QIMU_raw %>% 
  gather("key","value",Q1:Q6) %>% 
  group_by(ID,Sex,Age) %>% 
  summarise(Score = sum(value))
#> # A tibble: 6 x 4
#> # Groups:   ID, Sex [6]
#>      ID Sex     Age Score
#>   <int> <fct> <int> <dbl>
#> 1     1 F        25 -1.49
#> 2     2 F        20  1.45
#> 3     3 F        23 -8.69
#> 4     4 F        29  2.4 
#> 5     5 M        24 -2.52
#> 6     6 M        27 -3.58

This is prone to errors (and columns to be maintained in the final data frame need to be explicitly named in group_by()).

3. custom function?

mutate_colsum <- function(.data, col_name = "sum", ...){
  tmp <- select_at(.data, vars(...))
  .data[[col_name]] <- rowSums(tmp)
  .data
}

QIMU_raw %>%
  mutate_colsum("Score", Q1:Q6) %>%
  head()
#>   ID Sex Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#> 1  1   F  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02 -1.49
#> 2  2   F  20 -0.08  0.71  0.71 -0.51  0.64 -0.02  1.45
#> 3  3   F  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -8.69
#> 4  4   F  29  1.00  0.88 -0.09 -0.26  0.80  0.07  2.40
#> 5  5   M  24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.52
#> 6  6   M  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90 -3.58

It works, but... wouldn't it be the coolest if it could just be done IN dplyr???

What do we want?

The ideal solution would be some dplyr-esque rowSums() that can work in mutate() that can take tidyselect input. Would look something like this:

QIMU_raw %>%
  mutate(Score = new_and_improved_rowsums(Q1:Q6))
#>   ID Sex Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#> 1  1   F  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02 -1.49
#> 2  2   F  20 -0.08  0.71  0.71 -0.51  0.64 -0.02  1.45
#> 3  3   F  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -8.69
#> 4  4   F  29  1.00  0.88 -0.09 -0.26  0.80  0.07  2.40
#> 5  5   M  24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.52
#> 6  6   M  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90 -3.58

If anything, I hope this question will be found by the lost souls searching for solutions on google...

Thanks!

Source

mattansb

👍1

Most helpful comment

rowSums(across(Q1:Q6)) reads fine to me.

hadley on 27 Nov 2019

👍3

All 9 comments

What don't you like about the last approach (custom function)? The only thing I would change is to have this function do a little less, so something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

QIMU_raw <- structure(
  list(ID = 1:6,
       Sex = structure(c(1L, 1L, 1L, 1L, 2L, 2L),
                       .Label = c("F", "M"),
                       class = "factor"),
       Age = c(25L, 20L, 23L, 29L, 24L, 27L),
       Q1 = c(-0.49, -0.08, -2.01, 1.00, 0.27, -0.45),
       Q2 = c(0.40, 0.71, -0.77, 0.88, -1.92, -0.48),
       Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
       Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
       Q5 = c(-0.44, 0.64, -1.05, 0.80, 0.83, 0.01),
       Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.90)),
  row.names = c(NA, 6L),
  class = "data.frame"
)

add_cols <- function(.data, ..., .f = sum){
  tmp <- dplyr::select_at(.data, dplyr::vars(...))
  purrr::pmap_dbl(tmp, .f = .f)
}

QIMU_raw %>%
  dplyr::mutate(Score = add_cols(., Q1:Q6))
#>   ID Sex Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#> 1  1   F  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02 -1.49
#> 2  2   F  20 -0.08  0.71  0.71 -0.51  0.64 -0.02  1.45
#> 3  3   F  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -8.69
#> 4  4   F  29  1.00  0.88 -0.09 -0.26  0.80  0.07  2.40
#> 5  5   M  24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.52
#> 6  6   M  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90 -3.58

QIMU_raw %>%
  dplyr::mutate(Score = add_cols(., dplyr::starts_with("Q")))
#>   ID Sex Age    Q1    Q2    Q3    Q4    Q5    Q6 Score
#> 1  1   F  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02 -1.49
#> 2  2   F  20 -0.08  0.71  0.71 -0.51  0.64 -0.02  1.45
#> 3  3   F  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 -8.69
#> 4  4   F  29  1.00  0.88 -0.09 -0.26  0.80  0.07  2.40
#> 5  5   M  24  0.27 -1.92  0.17 -1.35  0.83 -0.52 -2.52
#> 6  6   M  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90 -3.58

^{Created on 2019-08-31 by the reprex package (v0.3.0)}
You can even change the summary function on the fly, if you want.

romatik on 31 Aug 2019

The problem with this solution is that when using the . in mutate() you're referring to the original data frame, which is not necessarily the one you mean, if you've used mutate() to change some vars...

e.g.:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

QIMU_raw <- structure(
  list(ID = 1:6,
       Sex = structure(c(1L, 1L, 1L, 1L, 2L, 2L),
                       .Label = c("F", "M"),
                       class = "factor"),
       Age = c(25L, 20L, 23L, 29L, 24L, 27L),
       Q1 = c(-0.49, -0.08, -2.01, 1.00, 0.27, -0.45),
       Q2 = c(0.40, 0.71, -0.77, 0.88, -1.92, -0.48),
       Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
       Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
       Q5 = c(-0.44, 0.64, -1.05, 0.80, 0.83, 0.01),
       Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.90)),
  row.names = c(NA, 6L),
  class = "data.frame"
)

add_cols <- function(.data, ..., .f = sum){
  tmp <- dplyr::select_at(.data, dplyr::vars(...))
  purrr::pmap_dbl(tmp, .f = .f)
}

QIMU_raw %>%
  dplyr::mutate(Q1 = 1, Q2 = 1, Q3 = 1,
                Q4 = 1, Q5 = 1, Q6 = 1,
                Score = add_cols(., Q1:Q6))
#>   ID Sex Age Q1 Q2 Q3 Q4 Q5 Q6 Score
#> 1  1   F  25  1  1  1  1  1  1 -1.49
#> 2  2   F  20  1  1  1  1  1  1  1.45
#> 3  3   F  23  1  1  1  1  1  1 -8.69
#> 4  4   F  29  1  1  1  1  1  1  2.40
#> 5  5   M  24  1  1  1  1  1  1 -2.52
#> 6  6   M  27  1  1  1  1  1  1 -3.58

^{Created on 2019-09-01 by the reprex package (v0.3.0)}

I would want / expect Score to be all 6..

mattansb on 1 Sep 2019

With pick() from #4586, you may do something like this:

library(dplyr, warn.conflicts = FALSE)

QIMU_raw <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 5L, 6L),
  Age = c(25L, 20L, 23L, 29L, 24L, 27L),
  Q1 = c(-0.49, -0.08, -2.01, 1, 0.27, -0.45),
  Q2 = c(0.4, 0.71, -0.77, 0.88, -1.92, -0.48),
  Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
  Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
  Q5 = c(-0.44, 0.64, -1.05, 0.8, 0.83, 0.01),
  Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.9),
  Sex = as.factor(c("F", "F", "F", "F", "M", "M"))
)

QIMU_raw %>% 
  mutate(Score = rowSums(pick(Q1:Q6)))
#>   ID Age    Q1    Q2    Q3    Q4    Q5    Q6 Sex Score
#> 1  1  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02   F -1.49
#> 2  2  20 -0.08  0.71  0.71 -0.51  0.64 -0.02   F  1.45
#> 3  3  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57   F -8.69
#> 4  4  29  1.00  0.88 -0.09 -0.26  0.80  0.07   F  2.40
#> 5  5  24  0.27 -1.92  0.17 -1.35  0.83 -0.52   M -2.52
#> 6  6  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90   M -3.58

^{Created on 2019-11-26 by the reprex package (v0.3.0.9000)}

romainfrancois on 26 Nov 2019

or:

library(dplyr, warn.conflicts = FALSE)

QIMU_raw <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 5L, 6L),
  Age = c(25L, 20L, 23L, 29L, 24L, 27L),
  Q1 = c(-0.49, -0.08, -2.01, 1, 0.27, -0.45),
  Q2 = c(0.4, 0.71, -0.77, 0.88, -1.92, -0.48),
  Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
  Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
  Q5 = c(-0.44, 0.64, -1.05, 0.8, 0.83, 0.01),
  Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.9),
  Sex = as.factor(c("F", "F", "F", "F", "M", "M"))
)

p_sum <- function(...) {
  rowSums(pick(...))
}

QIMU_raw %>% 
  mutate(Score = p_sum(Q1:Q6))
#>   ID Age    Q1    Q2    Q3    Q4    Q5    Q6 Sex Score
#> 1  1  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02   F -1.49
#> 2  2  20 -0.08  0.71  0.71 -0.51  0.64 -0.02   F  1.45
#> 3  3  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57   F -8.69
#> 4  4  29  1.00  0.88 -0.09 -0.26  0.80  0.07   F  2.40
#> 5  5  24  0.27 -1.92  0.17 -1.35  0.83 -0.52   M -2.52
#> 6  6  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90   M -3.58

^{Created on 2019-11-26 by the reprex package (v0.3.0.9000)}

@hadley, @lionel- I think this would be in favor of keeping pick() because using across() here looks weird ? Or maybe not.

p_sum <- function(...) {
  rowSums(across(...))
}

romainfrancois on 26 Nov 2019

@romainfrancois This is exactly it! Hope one/both of these will be stable/available soon!
Thanks!

mattansb on 26 Nov 2019

Would this respect changes make in the same mutate() call?

e.g. would Score1 and Score2 be different here?

````R
library(dplyr, warn.conflicts = FALSE)

QIMU_raw <- data.frame(
ID = c(1L, 2L, 3L, 4L, 5L, 6L),
Age = c(25L, 20L, 23L, 29L, 24L, 27L),
Q1 = c(-0.49, -0.08, -2.01, 1, 0.27, -0.45),
Q2 = c(0.4, 0.71, -0.77, 0.88, -1.92, -0.48),
Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
Q5 = c(-0.44, 0.64, -1.05, 0.8, 0.83, 0.01),
Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.9),
Sex = as.factor(c("F", "F", "F", "F", "M", "M"))
)

QIMU_raw %>%
mutate(Score1 = rowSums(pick(Q1:Q6)),
Q1 = Q1 * 3,
Score2 = rowSums(pick(Q1:Q6)))
````

mattansb on 26 Nov 2019

At the moment they are not, but I suppose they would have to.

romainfrancois on 26 Nov 2019

👍1

rowSums(across(Q1:Q6)) reads fine to me.

hadley on 27 Nov 2019

👍3

Hallelujah (:

Thanks @romainfrancois

library(dplyr)

QIMU_raw <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 5L, 6L),
  Age = c(25L, 20L, 23L, 29L, 24L, 27L),
  Q1 = c(-0.49, -0.08, -2.01, 1, 0.27, -0.45),
  Q2 = c(0.4, 0.71, -0.77, 0.88, -1.92, -0.48),
  Q3 = c(0.53, 0.71, -1.02, -0.09, 0.17, -0.31),
  Q4 = c(-0.47, -0.51, -2.27, -0.26, -1.35, -1.45),
  Q5 = c(-0.44, 0.64, -1.05, 0.8, 0.83, 0.01),
  Q6 = c(-1.02, -0.02, -1.57, 0.07, -0.52, -0.9),
  Sex = as.factor(c("F", "F", "F", "F", "M", "M"))
)

## W/ `rowMeans` (`across`)

QIMU_raw %>% 
  mutate(Score = rowMeans(across(Q1:Q6), na.rm = TRUE))
#>   ID Age    Q1    Q2    Q3    Q4    Q5    Q6 Sex      Score
#> 1  1  25 -0.49  0.40  0.53 -0.47 -0.44 -1.02   F -0.2483333
#> 2  2  20 -0.08  0.71  0.71 -0.51  0.64 -0.02   F  0.2416667
#> 3  3  23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57   F -1.4483333
#> 4  4  29  1.00  0.88 -0.09 -0.26  0.80  0.07   F  0.4000000
#> 5  5  24  0.27 -1.92  0.17 -1.35  0.83 -0.52   M -0.4200000
#> 6  6  27 -0.45 -0.48 -0.31 -1.45  0.01 -0.90   M -0.5966667

## W/ `rowwise` and `mean` (`c_across`)
QIMU_raw %>% 
  rowwise() %>% 
  mutate(Score = mean(c_across(Q1:Q6), na.rm = TRUE)) %>% 
  ungroup()
# A tibble: 6 x 10
#>      ID   Age    Q1    Q2    Q3    Q4    Q5    Q6 Sex    Score
#>   <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>  <dbl>
#> 1     1    25 -0.49  0.4   0.53 -0.47 -0.44 -1.02 F     -0.248
#> 2     2    20 -0.08  0.71  0.71 -0.51  0.64 -0.02 F      0.242
#> 3     3    23 -2.01 -0.77 -1.02 -2.27 -1.05 -1.57 F     -1.45 
#> 4     4    29  1     0.88 -0.09 -0.26  0.8   0.07 F      0.4  
#> 5     5    24  0.27 -1.92  0.17 -1.35  0.83 -0.52 M     -0.42 
#> 6     6    27 -0.45 -0.48 -0.31 -1.45  0.01 -0.9  M     -0.597

## Detects changes in same call to mutate? (YES!)

QIMU_raw %>% 
  mutate(Score1 = rowMeans(across(Q1:Q6)),
         Q1 = NA, # to fail Score2
         Score2 = rowMeans(across(Q1:Q6)))
#>   ID Age Q1    Q2    Q3    Q4    Q5    Q6 Sex     Score1 Score2
#> 1  1  25 NA  0.40  0.53 -0.47 -0.44 -1.02   F -0.2483333     NA
#> 2  2  20 NA  0.71  0.71 -0.51  0.64 -0.02   F  0.2416667     NA
#> 3  3  23 NA -0.77 -1.02 -2.27 -1.05 -1.57   F -1.4483333     NA
#> 4  4  29 NA  0.88 -0.09 -0.26  0.80  0.07   F  0.4000000     NA
#> 5  5  24 NA -1.92  0.17 -1.35  0.83 -0.52   M -0.4200000     NA
#> 6  6  27 NA -0.48 -0.31 -1.45  0.01 -0.90   M -0.5966667     NA

mattansb on 21 Mar 2020

Was this page helpful?

0 / 5 - 0 ratings