I think we should discuss incorporating dask-related tests (or even bringing dask-cudf directly) into our cudf development process. I’ve only explicitly laid out some potential benefits of doing this here, but there are of course negatives, too. I’m interested to hear people’s thoughts on this.
I think the experience of getting multi-column groupbys, string-related groupbys, and cumulative aggregations (to pick a few examples) to work with dask has been rockier than we’d all have liked. One of the reasons for this is that it’s currently possible to make a change to solve a problem in cudf land that causes a different problem in dask land. The end result is that we end up repeatedly going back to some of the same parts of the code in both cudf and dask to solve the new bug that we didn’t see coming. It can be hard to anticipate the failure, and the opportunity cost of this iterative process is high.
Part of this can be addressed by increasing all of our familiarity with the dask codebase (one big dask/cudf happy family to quote @kkraus14 and @mrocklin) so that we can better anticipate, but I think we’ll still run into this problem by and large.
One of many possible approaches for streamlining the development process of making new/improved functionality work with dask could be to explicitly write unit tests against dask dataframes.
This could provide at least two benefits (for developers and users):
We shouldn’t block cuDF features because they don’t succeed in Dask, but we would be able to know if changes we make in cuDF land end up breaking existing functionality in Dask (something we want to avoid). Currently, we have no way of doing this, and both our development teams and users are hurt by this.
Interested to see people’s thoughts on this general topic of more tightly integrating the cuDF and dask testing and/or development process.
Thanks for writing this up @beckernick . I think that co-testing would be in general great.
As an alternative possibility, we might also consider re-merging the codebases. Currently the only non-trivial chunks of code in dask-cudf are merge/sorts (which we're hoping to rip out) and a bit of I/O. I think that if as continue down the path of removing chunks of dask-cudf as we align with dask dataframe then keeping that codebase within cudf becomes more palatable.
Just chiming in, that I'm 100% in support of merging the codebases and if there's no opposition will execute it during the 0.9 development cycle. This will allow us to test dask-cudf much more effectively and work towards refactoring out the code more effectively.
When I last looked through dask-cudf I noticed that the join and reduction
code was around and could probably be removed. That might be nice to
accomplish before the move, but I wouldn't block on it.
On Mon, Jun 3, 2019 at 9:31 AM Keith Kraus notifications@github.com wrote:
Just chiming in, that I'm 100% in support of merging the codebases and if
there's no opposition will execute it during the 0.9 development cycle.
This will allow us to test dask-cudf much more effectively and work towards
refactoring out the code more effectively.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTHUTHFW2KY5MBU4BALPYVBM5A5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ6V5Q#issuecomment-498330358,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AACKZTGV7JC4CHIVWYB4ZHLPYVBM5ANCNFSM4HR6FJIA
.
When I last looked through dask-cudf I noticed that the join and reduction code was around and could probably be removed. That might be nice to accomplish before the move, but I wouldn't block on it.
…
On Mon, Jun 3, 2019 at 9:31 AM Keith Kraus @.*> wrote: Just chiming in, that I'm 100% in support of merging the codebases and if there's no opposition will execute it during the 0.9 development cycle. This will allow us to test dask-cudf much more effectively and work towards refactoring out the code more effectively. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1903?email_source=notifications&email_token=AACKZTHUTHFW2KY5MBU4BALPYVBM5A5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ6V5Q#issuecomment-498330358>, or mute the thread https://github.com/notifications/unsubscribe-auth/AACKZTGV7JC4CHIVWYB4ZHLPYVBM5ANCNFSM4HR6FJIA .
The join code can't be removed quite yet as there's things within the dask shuffle that we don't have support for quite yet (surrounding groupby).
See #1720
There are two join implementations. One is very old and should be removed
regardless.
On Tue, Jun 4, 2019 at 11:23 AM Keith Kraus notifications@github.com
wrote:
When I last looked through dask-cudf I noticed that the join and reduction
code was around and could probably be removed. That might be nice to
accomplish before the move, but I wouldn't block on it.
… <#m_967189383566836751_>
On Mon, Jun 3, 2019 at 9:31 AM Keith Kraus @.*> wrote: Just chiming
in, that I'm 100% in support of merging the codebases and if there's no
opposition will execute it during the 0.9 development cycle. This will
allow us to test dask-cudf much more effectively and work towards
refactoring out the code more effectively. — You are receiving this because
you were mentioned. Reply to this email directly, view it on GitHub <#1903
https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTHUTHFW2KY5MBU4BALPYVBM5A5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ6V5Q#issuecomment-498330358>,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AACKZTGV7JC4CHIVWYB4ZHLPYVBM5ANCNFSM4HR6FJIA
.The join code can't be removed quite yet as there's things within the dask
shuffle that we don't have support for quite yet (surrounding groupby).—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTEKI4GJOO4CZASU42DPY2XITA5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW5OC2A#issuecomment-498786664,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AACKZTGBZWYAKQTXBGHKSXLPY2XITANCNFSM4HR6FJIA
.
Or rather, there are two within dask-cudf, three total. We'd like to get
down to one (just dask dataframe) but short term I would like to get down
to at least two.
On Tue, Jun 4, 2019 at 11:58 AM Matthew Rocklin mrocklin@gmail.com wrote:
There are two join implementations. One is very old and should be removed
regardless.On Tue, Jun 4, 2019 at 11:23 AM Keith Kraus notifications@github.com
wrote:When I last looked through dask-cudf I noticed that the join and
reduction code was around and could probably be removed. That might be nice
to accomplish before the move, but I wouldn't block on it.
… <#m_-5889423002370746000_m_967189383566836751_>
On Mon, Jun 3, 2019 at 9:31 AM Keith Kraus @.*> wrote: Just chiming
in, that I'm 100% in support of merging the codebases and if there's no
opposition will execute it during the 0.9 development cycle. This will
allow us to test dask-cudf much more effectively and work towards
refactoring out the code more effectively. — You are receiving this because
you were mentioned. Reply to this email directly, view it on GitHub <1903 https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTHUTHFW2KY5MBU4BALPYVBM5A5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ6V5Q#issuecomment-498330358>,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AACKZTGV7JC4CHIVWYB4ZHLPYVBM5ANCNFSM4HR6FJIA
.The join code can't be removed quite yet as there's things within the
dask shuffle that we don't have support for quite yet (surrounding groupby).—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTEKI4GJOO4CZASU42DPY2XITA5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW5OC2A#issuecomment-498786664,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AACKZTGBZWYAKQTXBGHKSXLPY2XITANCNFSM4HR6FJIA
.
Or rather, there are two within dask-cudf, three total. We'd like to get down to one (just dask dataframe) but short term I would like to get down to at least two.
…
On Tue, Jun 4, 2019 at 11:58 AM Matthew Rocklin @.> wrote: There are two join implementations. One is very old and should be removed regardless. On Tue, Jun 4, 2019 at 11:23 AM Keith Kraus *@.> wrote: > When I last looked through dask-cudf I noticed that the join and > reduction code was around and could probably be removed. That might be nice > to accomplish before the move, but I wouldn't block on it. > … <#m_-5889423002370746000_m_967189383566836751_> > On Mon, Jun 3, 2019 at 9:31 AM Keith Kraus *@.*> wrote: Just chiming > in, that I'm 100% in support of merging the codebases and if there's no > opposition will execute it during the 0.9 development cycle. This will > allow us to test dask-cudf much more effectively and work towards > refactoring out the code more effectively. — You are receiving this because > you were mentioned. Reply to this email directly, view it on GitHub < > #1903 <#1903>?email_source=notifications&email_token=AACKZTHUTHFW2KY5MBU4BALPYVBM5A5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ6V5Q#issuecomment-498330358>, > or mute the thread > https://github.com/notifications/unsubscribe-auth/AACKZTGV7JC4CHIVWYB4ZHLPYVBM5ANCNFSM4HR6FJIA > . > > The join code can't be removed quite yet as there's things within the > dask shuffle that we don't have support for quite yet (surrounding groupby). > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#1903?email_source=notifications&email_token=AACKZTEKI4GJOO4CZASU42DPY2XITA5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW5OC2A#issuecomment-498786664>, > or mute the thread > https://github.com/notifications/unsubscribe-auth/AACKZTGBZWYAKQTXBGHKSXLPY2XITANCNFSM4HR6FJIA > . >
Could you link me to the two within dask-cudf? I apologize I'm not following where the different joins are within dask-cudf unless you're referring to .join vs .merge.
See https://github.com/rapidsai/dask-cudf/issues/193
On Tue, Jun 4, 2019 at 12:00 PM Keith Kraus notifications@github.com
wrote:
Or rather, there are two within dask-cudf, three total. We'd like to get
down to one (just dask dataframe) but short term I would like to get down
to at least two.
… <#m_-2389660100574970085_>
On Tue, Jun 4, 2019 at 11:58 AM Matthew Rocklin @.> wrote: There are
two join implementations. One is very old and should be removed regardless.
On Tue, Jun 4, 2019 at 11:23 AM Keith Kraus @.> wrote: > When I last
looked through dask-cudf I noticed that the join and > reduction code was
around and could probably be removed. That might be nice > to accomplish
before the move, but I wouldn't block on it. > …
<#m_-5889423002370746000_m_967189383566836751_> > On Mon, Jun 3, 2019 at
9:31 AM Keith Kraus @.*> wrote: Just chiming > in, that I'm 100% in
support of merging the codebases and if there's no > opposition will
execute it during the 0.9 development cycle. This will > allow us to test
dask-cudf much more effectively and work towards > refactoring out the code
more effectively. — You are receiving this because > you were mentioned.
Reply to this email directly, view it on GitHub < > #1903
https://github.com/rapidsai/cudf/issues/1903 <#1903
https://github.com/rapidsai/cudf/issues/1903>?email_source=notifications&email_token=AACKZTHUTHFW2KY5MBU4BALPYVBM5A5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZ6V5Q#issuecomment-498330358>,or mute the thread >
https://github.com/notifications/unsubscribe-auth/AACKZTGV7JC4CHIVWYB4ZHLPYVBM5ANCNFSM4HR6FJIA
. > > The join code can't be removed quite yet as there's things within
the > dask shuffle that we don't have support for quite yet (surrounding
groupby). > > — > You are receiving this because you were mentioned. >
Reply to this email directly, view it on GitHub > <#1903
https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTEKI4GJOO4CZASU42DPY2XITA5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW5OC2A#issuecomment-498786664>,
or mute the thread >
https://github.com/notifications/unsubscribe-auth/AACKZTGBZWYAKQTXBGHKSXLPY2XITANCNFSM4HR6FJIA
. >Could you link me to the two within dask-cudf? I apologize I'm not
following where the different joins are within dask-cudf unless you're
referring to .join vs .merge.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/rapidsai/cudf/issues/1903?email_source=notifications&email_token=AACKZTGH43LRND634KN7VZTPY23UZA5CNFSM4HR6FJIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW5RL4Y#issuecomment-498800115,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AACKZTDSCWWSCGB36XNMHBLPY23UZANCNFSM4HR6FJIA
.
I agree with @beckernick concerns and I am for merging the codebases but I have two main concerns:
This issue, https://github.com/rapidsai/cudf/issues/1945, is an example of when cudf changes and dask-cudf tests have not caught up
Fixed by #2160!
Most helpful comment
Just chiming in, that I'm 100% in support of merging the codebases and if there's no opposition will execute it during the 0.9 development cycle. This will allow us to test dask-cudf much more effectively and work towards refactoring out the code more effectively.