Dplyr: Bug in conditional assignment using ifelse within mutate

Created on 10 Sep 2016 · 6 Comments · Source: tidyverse/dplyr

I am baffled by this bug. Here's some example code:

library(dplyr)

set.seed(1000)

df <- data.frame(Day = 1:100, 
                 Equity = rbinom(100, 1, 0.5), 
                 Bond = rbinom(100, 1, 0.5), 
                 Gold = rbinom(100, 1, 0.5))
str(df)

# Method 1: Incorrect calculation of weights
dfWeights <- df %>% 
                mutate(Total = (Equity + Bond + Gold)) %>% 
                mutate(EquityWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Equity/Total),
                       BondWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Bond/Total),
                       GoldWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Gold/Total))
dfWeights %>% filter(Total == 3)

# Method 2: Incorrect calculation of weights
dfWeights <- df %>% 
                mutate(Total = (Equity + Bond + Gold)) %>% 
                mutate(EquityWeight = ifelse(identical(Total, 0L), 0, Equity/Total),
                       BondWeight = ifelse(identical(Total, 0L), 0, Bond/Total),
                       GoldWeight = ifelse(identical(Total, 0L), 0, Gold/Total))
dfWeights %>% filter(Total == 3)

# Method 3: Correct calculations
dfWeights <- df %>% 
                mutate(Total = (Equity + Bond + Gold)) %>% 
                mutate(EquityWeight = ifelse(Total == 0, 0, Equity/Total),
                       BondWeight = ifelse(Total == 0, 0, Bond/Total),
                       GoldWeight = ifelse(Total == 0, 0, Gold/Total))
dfWeights %>% filter(Total == 3)

I wish to avoid using Method 3 because Total could be a floating-point number in the real world. But the results from both Methods 1 and 2 are incorrect, even though each condition evaluates correctly when I check individual cases.

All 6 comments

I add my observations here because it seems the same to me, but I'm happy to post separately if anyone thinks it's more appropriate.

First I observed this:

> expand.grid(x=c(FALSE, NA), y=c(TRUE, FALSE)) %>%
  mutate(z = ifelse(any(x, y, na.rm = TRUE), TRUE, FALSE))
      x     y    z
1 FALSE  TRUE TRUE
2    NA  TRUE TRUE
3 FALSE FALSE TRUE
4    NA FALSE TRUE

In the last two rows z should actually be FALSE, and in fact

> ifelse(any(NA, FALSE, na.rm = TRUE), TRUE, FALSE)
[1] FALSE

Then I noticed it's even more basic/pervasive.

> expand.grid(x=c(TRUE, FALSE), y=c(TRUE, FALSE)) %>%
  mutate(z = ifelse(any(x, y), TRUE, FALSE))
        x     y    z
  1  TRUE  TRUE TRUE
  2 FALSE  TRUE TRUE
  3  TRUE FALSE TRUE
  4 FALSE FALSE TRUE

This has been observed on both 0.4 and 0.5.

I don't think this is a bug in dplyr; it is a misunderstanding of how dplyr works.
mutate works on whole columns of the tibble. When you refer to a column by name inside mutate, the function is applied to the entire column. To work per group or per row, you need group_by or rowwise. I try to explain how it works below.

I hope this helps you both.


@ozagordi, any does not work on vectors the way you expect, so the results you got are, to me, the correct ones.

expand.grid(x=c(FALSE, TRUE), y=c(TRUE, FALSE)) %>%
  mutate(z = ifelse(any(x, y, na.rm = TRUE), TRUE, FALSE))

corresponds, in plain vector terms, to

x <- c(T, F, T, F)
y <- c(T, T, F, F)

z <- ifelse(any(x, y), TRUE, FALSE)
z
#> [1] TRUE

So z is a vector of length one. Inside mutate, that single value is recycled to the length of the table, which is why every row of the z column is TRUE.

I understand that you want to create z by applying any to x and y row by row, like this:

library(dplyr)
expand.grid(x=c(TRUE, FALSE), y=c(TRUE, FALSE)) %>%
  rowwise() %>%
  mutate(z = ifelse(any(x, y), TRUE, FALSE))
#> Source: local data frame [4 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 4 × 3
#>       x     y     z
#>   <lgl> <lgl> <lgl>
#> 1  TRUE  TRUE  TRUE
#> 2 FALSE  TRUE  TRUE
#> 3  TRUE FALSE  TRUE
#> 4 FALSE FALSE FALSE

or, without dplyr:

z <- mapply(function(x,y) ifelse(any(x, y), TRUE, FALSE),
            x = c(T, F, T, F), 
            y = c(T, T, F, F)) 

z
#> [1]  TRUE  TRUE  TRUE FALSE
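
For this particular case there is also a vectorized route that needs neither rowwise nor mapply: the element-wise operator | compares x and y position by position. The sketch below is only an illustration; the x_na and y_na vectors are made up to mirror the NA example from the earlier comment, and using %in% TRUE to treat NA as FALSE is my own way of imitating na.rm = TRUE, not something from dplyr.

```r
x <- c(TRUE, FALSE, TRUE, FALSE)
y <- c(TRUE, TRUE, FALSE, FALSE)

# Element-wise OR: one result per position, no recycling of a single value
z <- x | y
z
#> [1]  TRUE  TRUE  TRUE FALSE

# Illustrative vectors containing NA (assumed values, mirroring the earlier example)
x_na <- c(FALSE, NA, FALSE, NA)
y_na <- c(TRUE, TRUE, FALSE, FALSE)

# `%in% TRUE` maps NA to FALSE, which imitates any(..., na.rm = TRUE) per row
z_na <- (x_na %in% TRUE) | (y_na %in% TRUE)
z_na
#> [1]  TRUE  TRUE FALSE FALSE
```

Inside mutate this would simply be z = x | y, which returns one value per row.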

@nitingupta2, your problem is the same. You call all.equal on the whole vector _Total_ and compare it to 0, which gives a single answer for the entire column; ifelse then returns a single result, and mutate recycles it to every row. I understand from your problem that you want to do the calculation not on the whole table with the whole vector _Total_, but on each _Day_ in your table.
You need a group_by somewhere so that each comparison sees only one value of _Total_.
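
To see why the ungrouped version goes wrong, here is a small base R sketch. The Total and Equity values are made up for illustration and are not taken from the data in the question.

```r
# Illustrative values only (not the data from the question)
Total  <- c(1L, 0L, 3L)
Equity <- c(1L, 0L, 1L)

# all.equal() compares the whole vector with 0, so the test is one single value
test <- isTRUE(all.equal(Total, 0))
test
#> [1] FALSE

# ifelse() returns a result shaped like its test, so only one value comes back
res <- ifelse(test, 0, Equity / Total)
length(res)
#> [1] 1
```

mutate then repeats that one value down the whole column, so every row gets the weight computed for the first row.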

Using your Method 1:

library(dplyr)

set.seed(1000)

df <- data.frame(Day = 1:100, 
                 Equity = rbinom(100, 1, 0.5), 
                 Bond = rbinom(100, 1, 0.5), 
                 Gold = rbinom(100, 1, 0.5))
str(df)
#> 'data.frame':    100 obs. of  4 variables:
#>  $ Day   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Equity: int  0 1 0 1 1 0 1 1 0 0 ...
#>  $ Bond  : int  1 0 0 1 1 1 0 1 1 0 ...
#>  $ Gold  : int  0 0 1 0 0 0 1 1 1 1 ...

dfWeights <- df %>% 
  mutate(Total = (Equity + Bond + Gold)) %>% 
  group_by(Day) %>%
  mutate(EquityWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Equity/Total),
         BondWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Bond/Total),
         GoldWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Gold/Total))
dfWeights %>% filter(Total == 3)
#> Source: local data frame [17 x 8]
#> Groups: Day [17]
#> 
#>      Day Equity  Bond  Gold Total EquityWeight BondWeight GoldWeight
#>    <int>  <int> <int> <int> <int>        <dbl>      <dbl>      <dbl>
#> 1      8      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 2     12      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 3     14      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 4     26      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 5     28      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 6     39      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 7     42      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 8     45      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 9     47      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 10    48      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 11    50      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 12    52      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 13    57      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 14    63      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 15    68      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 16    93      1     1     1     3    0.3333333  0.3333333  0.3333333
#> 17    95      1     1     1     3    0.3333333  0.3333333  0.3333333

This looks correct to me.
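
If the worry is that _Total_ may be a floating-point number, another option is to keep the comparison vectorized and test against a tolerance, so no grouping is needed. This is only a sketch under assumptions: the data frame and the tolerance tol below are illustrative choices of mine, not from the question. dplyr also provides near() for this kind of element-wise comparison.

```r
# Illustrative floating-point data (assumed values, not from the question)
df <- data.frame(Equity = c(0.2, 0, 0.1),
                 Bond   = c(0.3, 0, 0.1),
                 Gold   = c(0.5, 0, 0.1))
df$Total <- df$Equity + df$Bond + df$Gold

# Assumed tolerance; choose one that suits the scale of your data
tol <- sqrt(.Machine$double.eps)

# One logical per row, so ifelse() also returns one value per row
is_zero <- abs(df$Total) < tol

df$EquityWeight <- ifelse(is_zero, 0, df$Equity / df$Total)
df$BondWeight   <- ifelse(is_zero, 0, df$Bond / df$Total)
df$GoldWeight   <- ifelse(is_zero, 0, df$Gold / df$Total)
df
```

The same idea inside mutate would be ifelse(abs(Total) < tol, 0, Equity / Total), which stays vectorized and therefore works without group_by.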


Conclusion: It is not a bug in dplyr.
Advice: Try asking on Stack Overflow when you have a problem like this. I find it useful, and I think you would get help more quickly.

Thanks a lot, very clear now. My bad that I did not check the behaviour of any first. I'll keep in mind to use rowwise whenever an R function does not return a vector. I'm now wondering if a warning would be possible/advisable. Thanks again.

Keep in mind:

  • use group_by when you want to apply a function to several groups in your data;
  • rowwise is a special case that groups by row, useful if your data has no key column;
  • look at the purrr package if you want to do more advanced operations; it could help too.

I don't think a warning is possible, because dplyr cannot know what you intend. Many functions, like any, take vectors and return a single value, so warning for each of them is not feasible. The best way is to practice and understand the underlying mechanism of what you use. Read the docs, they are pretty clear.

Glad it helped!

Thanks. Can we close this issue?

Thanks cderv for the explanation.
