I am baffled by this bug. Here's some example code:
library(dplyr)
set.seed(1000)
df <- data.frame(Day = 1:100,
                 Equity = rbinom(100, 1, 0.5),
                 Bond = rbinom(100, 1, 0.5),
                 Gold = rbinom(100, 1, 0.5))
str(df)
# Method 1: Incorrect calculation of weights
dfWeights <- df %>%
mutate(Total = (Equity + Bond + Gold)) %>%
mutate(EquityWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Equity/Total),
BondWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Bond/Total),
GoldWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Gold/Total))
dfWeights %>% filter(Total == 3)
# Method 2: Incorrect calculation of weights
dfWeights <- df %>%
mutate(Total = (Equity + Bond + Gold)) %>%
mutate(EquityWeight = ifelse(identical(Total, 0L), 0, Equity/Total),
BondWeight = ifelse(identical(Total, 0L), 0, Bond/Total),
GoldWeight = ifelse(identical(Total, 0L), 0, Gold/Total))
dfWeights %>% filter(Total == 3)
# Method 3: Correct calculations
dfWeights <- df %>%
mutate(Total = (Equity + Bond + Gold)) %>%
mutate(EquityWeight = ifelse(Total == 0, 0, Equity/Total),
BondWeight = ifelse(Total == 0, 0, Bond/Total),
GoldWeight = ifelse(Total == 0, 0, Gold/Total))
dfWeights %>% filter(Total == 3)
I want to avoid Method 3 because, in real-world data, Total could be a floating-point number, so an exact == comparison is unsafe. But Methods 1 and 2 both give incorrect weights inside mutate, even though the same expressions give the correct result when I check individual cases by hand.
I'm adding my observations here because they look like the same problem to me, but I'm happy to post separately if anyone thinks that's more appropriate.
First I observed this:
> expand.grid(x=c(FALSE, NA), y=c(TRUE, FALSE)) %>%
mutate(z = ifelse(any(x, y, na.rm = TRUE), TRUE, FALSE))
x y z
1 FALSE TRUE TRUE
2 NA TRUE TRUE
3 FALSE FALSE TRUE
4 NA FALSE TRUE
In the last two rows z should actually be FALSE, and in fact
> ifelse(any(NA, FALSE, na.rm = TRUE), TRUE, FALSE)
[1] FALSE
Then I noticed it's even more basic/pervasive.
> expand.grid(x=c(TRUE, FALSE), y=c(TRUE, FALSE)) %>%
mutate(z = ifelse(any(x, y), TRUE, FALSE))
x y z
1 TRUE TRUE TRUE
2 FALSE TRUE TRUE
3 TRUE FALSE TRUE
4 FALSE FALSE TRUE
This has been observed on both dplyr 0.4 and 0.5.
I don't think this is a bug in dplyr; it is a misunderstanding of how dplyr evaluates expressions.
mutate works on the columns of the tibble. When you refer to a column by name inside mutate, the function receives the whole column as a vector. If you want it applied to each row or group separately, you need group_by or rowwise. I'll try to explain how it works below.
I hope this helps you both.
@ozagordi, any does not work on vectors the way you expect: it collapses all of its arguments into a single TRUE or FALSE. So the results you got are, to me, the correct ones.
expand.grid(x=c(FALSE, TRUE), y=c(TRUE, FALSE)) %>%
mutate(z = ifelse(any(x, y, na.rm = TRUE), TRUE, FALSE))
corresponds, in plain vector terms, to
x <- c(T, F, T, F)
y <- c(T, T, F, F)
z <- ifelse(any(x, y), TRUE, FALSE)
z
#> [1] TRUE
So z is a vector of length one. Inside mutate, that single value is recycled to the length of the table, which is why every row of the z column is TRUE.
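The collapse can be checked in base R alone. A minimal sketch (the names collapsed, elementwise and d are only for illustration):

```r
x <- c(TRUE, FALSE, TRUE, FALSE)
y <- c(TRUE, TRUE, FALSE, FALSE)

# any() pools all of its arguments and returns one value
collapsed <- any(x, y)
length(collapsed)
#> [1] 1

# ifelse() returns a result shaped like its test, so it is length one too
length(ifelse(collapsed, TRUE, FALSE))
#> [1] 1

# assigning a length-one value to a column recycles it to every row
d <- data.frame(x = x, y = y)
d$z <- collapsed
d$z
#> [1] TRUE TRUE TRUE TRUE

# the element-wise operator keeps one result per row
elementwise <- x | y
elementwise
#> [1]  TRUE  TRUE  TRUE FALSE
```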
I understand that what you want is to create z by applying any to x and y row by row, like this:
library(dplyr)
expand.grid(x=c(TRUE, FALSE), y=c(TRUE, FALSE)) %>%
rowwise() %>%
mutate(z = ifelse(any(x, y), TRUE, FALSE))
#> Source: local data frame [4 x 3]
#> Groups: <by row>
#>
#> # A tibble: 4 × 3
#> x y z
#> <lgl> <lgl> <lgl>
#> 1 TRUE TRUE TRUE
#> 2 FALSE TRUE TRUE
#> 3 TRUE FALSE TRUE
#> 4 FALSE FALSE FALSE
or like this, without dplyr:
z <- mapply(function(x,y) ifelse(any(x, y), TRUE, FALSE),
x = c(T, F, T, F),
y = c(T, T, F, F))
z
#> [1] TRUE TRUE TRUE FALSE
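For this particular case, grouping is arguably not needed at all, because | is already vectorised. A sketch, assuming the intent is a row-by-row "x or y" in which NA counts as FALSE (that is my reading of na.rm = TRUE in your first example); coalesce is dplyr's helper for replacing NA, and grid_na and result are names made up here:

```r
library(dplyr)

grid_na <- expand.grid(x = c(FALSE, NA), y = c(TRUE, FALSE))

result <- grid_na %>%
  # replace NA with FALSE, then combine the two columns element by element
  mutate(z = coalesce(x, FALSE) | coalesce(y, FALSE))

result$z
#> [1]  TRUE  TRUE FALSE FALSE
```

This avoids the per-row overhead of rowwise, but it only works when an element-wise operator exists for what you want to compute.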
@nitingupta2, your problem is the same. You are calling all.equal on the whole vector _Total_ and comparing it to 0, which yields a single value for the entire column. I understand from your problem that you want the calculation done not on the whole table with the whole vector _Total_, but on each _Day_ in your table.
It needs a group_by somewhere so that each call sees one Day at a time.
Using your Method 1:
library(dplyr)
set.seed(1000)
df <- data.frame(Day = 1:100,
Equity = rbinom(100, 1, 0.5),
Bond = rbinom(100, 1, 0.5),
Gold = rbinom(100, 1, 0.5))
str(df)
#> 'data.frame': 100 obs. of 4 variables:
#> $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Equity: int 0 1 0 1 1 0 1 1 0 0 ...
#> $ Bond : int 1 0 0 1 1 1 0 1 1 0 ...
#> $ Gold : int 0 0 1 0 0 0 1 1 1 1 ...
dfWeights <- df %>%
mutate(Total = (Equity + Bond + Gold)) %>%
group_by(Day) %>%
mutate(EquityWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Equity/Total),
BondWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Bond/Total),
GoldWeight = ifelse(isTRUE(all.equal(Total, 0)), 0, Gold/Total))
dfWeights %>% filter(Total == 3)
#> Source: local data frame [17 x 8]
#> Groups: Day [17]
#>
#> Day Equity Bond Gold Total EquityWeight BondWeight GoldWeight
#> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 8 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 2 12 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 3 14 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 4 26 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 5 28 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 6 39 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 7 42 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 8 45 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 9 47 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 10 48 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 11 50 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 12 52 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 13 57 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 14 63 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 15 68 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 16 93 1 1 1 3 0.3333333 0.3333333 0.3333333
#> 17 95 1 1 1 3 0.3333333 0.3333333 0.3333333
This looks correct to me.
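If Total can be a floating-point number, another option that needs no grouping is a vectorised tolerance comparison. A sketch, assuming dplyr's near() (which compares element by element within a small tolerance) fits your case; the data frame df_float and the helper column IsZero are made up for illustration:

```r
library(dplyr)

df_float <- data.frame(Equity = c(0.1, 0, 0.5),
                       Bond   = c(0.2, 0, 0.5),
                       Gold   = c(-0.3, 0, 0))

weights <- df_float %>%
  mutate(Total = Equity + Bond + Gold,
         # near() returns one TRUE/FALSE per row, unlike all.equal() or identical()
         IsZero = near(Total, 0),
         EquityWeight = ifelse(IsZero, 0, Equity / Total),
         BondWeight   = ifelse(IsZero, 0, Bond / Total),
         GoldWeight   = ifelse(IsZero, 0, Gold / Total))

# Row 1 sums to about 5.6e-17 rather than exactly 0, so Total == 0 misses it
weights$Total == 0
#> [1] FALSE  TRUE FALSE
weights$IsZero
#> [1]  TRUE  TRUE FALSE
weights$EquityWeight
#> [1] 0.0 0.0 0.5
```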
Conclusion: it is not a bug in dplyr.
Advice: try asking on Stack Overflow when you have a problem like this. I find it useful, and I think you could get help more quickly there.
Thanks a lot, very clear now. My bad that I did not check the behaviour of any first. I'll keep in mind to use rowwise whenever an R function does not return a vector. I'm now wondering if a warning would be possible/advisable. Thanks again.
Keep in mind to use:
- group_by when you want to apply a function to several groups in your data;
- rowwise, a special case that groups by row, if you do not have a key column in your data;
- the purrr package, which could also help if you want to do more advanced operations.

I think a warning is not possible because dplyr cannot know what you intend. Many functions, like any, take vectors and return a single value, and a warning for each of them is not feasible. The best way is to practise and understand the underlying mechanism of what you use. Read the docs, they are pretty clear.
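To illustrate the group_by case with a summarising function such as any, here is a small sketch; the data and column names are invented for the example:

```r
library(dplyr)

grouped <- data.frame(g = c("a", "a", "b", "b"),
                      v = c(1, -1, -2, -3)) %>%
  group_by(g) %>%
  # any() now sees one group at a time; its result is recycled within the group
  mutate(any_positive = any(v > 0)) %>%
  ungroup()

grouped$any_positive
#> [1]  TRUE  TRUE FALSE FALSE
```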
Glad it helped!
Thanks. Can we close this issue?
Thanks cderv for the explanation.