Data.table: Consider zero-length vector for := no recycling

Created on 11 Feb 2019 · 4 Comments · Source: Rdatatable/data.table

I believe it is a reasonable change that := no longer recycles length>1 RHS vectors (#3310).

The following common usages may break as a result:

For example, suppose we have a data.table of quarterly prices for each symbol in each year. We want to create a column holding the fourth-quarter price of each year:

dt[, last_quarter_price := price[quarter == 4L], by = .(symbol, year)]

For stocks that were delisted, there may be no data for the fourth quarter of their final year, so price[quarter == 4L] can return a zero-length numeric vector. With the new, stricter recycling behavior, this results in an error.

To handle this, I have to change the code to the following:

dt[, last_quarter_price := price[quarter == 4L][1L], by = .(symbol, year)]
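A minimal base R sketch of why appending [1L] helps: indexing a zero-length vector at position 1 returns a single NA of the same type, while a length-one vector is left unchanged. The values below are illustrative only.

```r
# A group with no fourth-quarter row (illustrative values)
price   <- c(12, 13)
quarter <- c(1L, 2L)

# Logical subsetting with no TRUE values gives a zero-length vector
q4 <- price[quarter == 4L]
length(q4)

# Out-of-range indexing pads with NA, keeping the vector's type
q4_or_na <- q4[1L]
is.na(q4_or_na)
typeof(q4_or_na)

# A length-one result passes through [1L] unchanged
c(11)[1L]
```

The result therefore always has length one, which := still recycles across the group.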

A similar use case is as follows:

dt[, first_price := first(price[volume > 0]), by = symbol]

For some symbols, price[volume > 0] may be a zero-length numeric vector, and first(<zero-length vector>) also returns a zero-length vector. In this case first_price should be NA, so I have to change the code to the following:

dt[, first_price := price[volume > 0][1L], by = symbol]

In both cases, the length of the resulting vector must be zero or one. Length-one values are still recycled as before, but the zero-length cases are now broken. I'm not sure whether it makes sense for := <zero-length vector> to fill in a missing value automatically; otherwise, I need to rework all such cases so that they get NA as before.
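If such cases do need reworking, one option is a small wrapper that makes the intent explicit. The sketch below assumes a hypothetical helper, first_or_na(), which is not part of data.table; the data is a reduced version of the price/volume scenario described above.

```r
library(data.table)

# Hypothetical helper (not a data.table function): first element of an
# atomic vector, or a typed NA when the vector is empty.
first_or_na <- function(x) x[1L]

dt <- data.table(
  symbol = c("A1", "A1", "A1", "A2", "A2", "A2"),
  price  = c(10, 10, 11, 5, 5, 5),
  volume = c(0, 0, 100, 0, 0, 0)
)

# A2 never trades, so price[volume > 0] is zero-length for that group;
# the helper turns it into a single NA that := can recycle.
dt[, first_price := first_or_na(price[volume > 0]), by = symbol]
dt
```

Naming the helper documents that an empty group is expected to produce NA rather than an error.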

All 4 comments

There is no reproducible example, so I cannot fully test this, but it might have been resolved by https://github.com/Rdatatable/data.table/pull/3393. @renkun-ken, please confirm and close.

@jangorecki It's quite easy to make some reproducible examples.

library(data.table)

quarterly_prices_csv <- "
symbol,year,quarter,price
A1,2017,1,10.0
A1,2017,2,11.0
A1,2017,3,12.0
A1,2017,4,11.0
A1,2018,1,12.0
A1,2018,2,13.0
A2,2017,1,10.0
A2,2017,2,11.0
A2,2017,3,12.0
A2,2017,4,11.0
A2,2018,1,12.0
"

quarterly_prices_dt <- fread(quarterly_prices_csv)
quarterly_prices_dt[, fourth_quarter_price := price[quarter == 4L], by = .(symbol, year)]
#> Error in `[.data.table`(quarterly_prices_dt, , `:=`(fourth_quarter_price, : Supplied 0 items to be assigned to group 2 of size 2 in column 'fourth_quarter_price'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

# workaround
quarterly_prices_dt[, fourth_quarter_price := price[quarter == 4L][1L], by = .(symbol, year)]
quarterly_prices_dt
#>     symbol year quarter price fourth_quarter_price
#>  1:     A1 2017       1    10                   11
#>  2:     A1 2017       2    11                   11
#>  3:     A1 2017       3    12                   11
#>  4:     A1 2017       4    11                   11
#>  5:     A1 2018       1    12                   NA
#>  6:     A1 2018       2    13                   NA
#>  7:     A2 2017       1    10                   11
#>  8:     A2 2017       2    11                   11
#>  9:     A2 2017       3    12                   11
#> 10:     A2 2017       4    11                   11
#> 11:     A2 2018       1    12                   NA

price_volume_csv <- "
symbol,date,price,volume
A1,20180102,10.0,0
A1,20180103,10.0,0
A1,20180104,11.0,100
A2,20180102,5.0,0
A2,20180103,5.0,0
A2,20180104,5.0,0
"

price_volume_dt <- fread(price_volume_csv)
price_volume_dt[, first_price := first(price[volume > 0]), by = symbol]
#> Error in `[.data.table`(price_volume_dt, , `:=`(first_price, first(price[volume > : Supplied 0 items to be assigned to group 2 of size 3 in column 'first_price'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

# workaround
price_volume_dt[, first_price := price[volume > 0][1L], by = symbol]
price_volume_dt
#>    symbol     date price volume first_price
#> 1:     A1 20180102    10      0          11
#> 2:     A1 20180103    10      0          11
#> 3:     A1 20180104    11    100          11
#> 4:     A2 20180102     5      0          NA
#> 5:     A2 20180103     5      0          NA
#> 6:     A2 20180104     5      0          NA

Thanks @renkun-ken! Now fixed in dev and your tests added verbatim.

Thanks, @mattdowle!
