Cudf: [FEA] Month addition or subtraction is inaccurate

Created on 12 Nov 2020 · 6 Comments · Source: rapidsai/cudf

I wish I could use cuDF to do month addition or subtraction accurately, because there could be 30, 31, 28 and 29 days in a month.

The perfect feature would take a datetime column and add or subtract any number of months to produce a new column, in the cleanest and simplest way to code and run this manipulation.

For example,
DF = {'id': ['a','b','c'], 'old_date': ['2019-11-01', '2019-12-01', '2020-01-01']}
month_add = 1
I need DF['new_date'] = DF['old_date'] + month_add
so
DF = {'id': ['a','b','c'], 'old_date': ['2019-11-01', '2019-12-01', '2020-01-01'], 'new_date': ['2019-12-01', '2020-01-01', '2020-02-01']}

To work around this, I have to convert the datetime to a string, handle the year and month separately, and then do the manipulation. It takes a lot of extra time to split the data into single-digit-month and double-digit-month dataframes, process each independently into the correct datetime format, and append the dataframes back together. Also, single-digit and double-digit months cannot be calculated uniformly and concatenated into YYYY-MM-DD format: the result comes out as ‘2020-1-01’ instead of the correct ‘2020-01-01’.

Pain points -

  1. np.timedelta(month=n) does not account for months having 28, 29, 30 or 31 days; it adds a month as an average number of days per month, which is a problem in numpy datetime calculation.
  2. dateutil.relativedelta(months=+n) does not work with RAPIDS, due to an issue broadcasting this specific package/function.
  3. Calculating ‘YYYY’ and ‘MM’ separately and concatenating the strings back to ‘YYYY-MM-01’ turns ‘MM’ into ‘M’ when MM<10, so we had to separate the single-digit ‘M’ dataframes from the double-digit ‘MM’ ones and process them ad hoc to add the ‘0’ back to the single ‘M’.
  4. This approach is extremely slow because of breaking down the df and appending it back together, especially when scaled up or when expanding the cudf based on other columns.
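
For reference, the calendar-aware behaviour being asked for can be sketched in plain Python with only the standard library. The helper name `add_months` is hypothetical, and clamping the day to the last day of the target month (so that 31 January plus one month lands on the last day of February) is an assumption about the desired semantics, not something the issue specifies. It also shows that ISO formatting keeps the month zero-padded without any string splitting. This runs on the CPU one value at a time, so it illustrates the semantics only and is not a substitute for a cuDF column operation.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift d by a whole number of months (negative values subtract).

    Assumption: if the original day does not exist in the target month,
    the day is clamped to that month's last day.
    """
    # Count months from year 0 with a zero-based month index, so divmod
    # handles year rollover in both directions.
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


old_dates = ["2019-11-01", "2019-12-01", "2020-01-01"]
month_add = 1

# isoformat() always emits YYYY-MM-DD with zero-padded month and day.
new_dates = [add_months(date.fromisoformat(s), month_add).isoformat() for s in old_dates]
print(new_dates)
```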

cuDF (Python) feature request

All 6 comments

I believe pandas exposes this functionality as a module level function with pd.DateOffset

I can look into plumbing the libcudf function through here.

I can look into plumbing the libcudf function through here.

Thanks Brandon! So just to be clear, this C++ functionality is not available in Python, right? Our team is a quant team and does not have the skill set to look into C++, so it'd be awesome if Python could have the same functionality!

Right - we'd write a cuDF Python API that'd be close (if not identical) to pandas and produce Cython bindings that call the C++ under the hood. That said, this is at the "seems like will theoretically work" phase of development and I have not at all scoped out what caveats there might be to this.

Hi @roe246, this should be available in the coming nightlies as cudf.DateOffset. Let us know if this works out for you.
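
A minimal sketch of how the example from the issue might look with this API, assuming cudf.DateOffset accepts a `months` keyword and can be added to a datetime column the same way pd.DateOffset can in pandas. The alias `xdf` and the fallback to pandas are illustrative choices so the snippet also runs on a machine without a GPU; they are not part of either library.

```python
# Assumption: cudf.DateOffset mirrors pandas.DateOffset for month offsets.
# Fall back to pandas when cuDF is not installed.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

df = xdf.DataFrame({
    "id": ["a", "b", "c"],
    "old_date": ["2019-11-01", "2019-12-01", "2020-01-01"],
})

# Parse the strings into a real datetime column before doing date arithmetic.
df["old_date"] = xdf.to_datetime(df["old_date"])

month_add = 1
# Calendar-aware shift: each date moves by whole months, not by an
# average number of days.
df["new_date"] = df["old_date"] + xdf.DateOffset(months=month_add)
print(df)
```

To subtract months, the same offset can be subtracted from the column instead of added.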
