Xgboost: [Roadmap] XGBoost 1.0.0 Roadmap

Created on 18 Jul 2019 · 52 Comments · Source: dmlc/xgboost

@dmlc/xgboost-committer please add your items here by editing this post. Let's ensure that

each item is associated with a ticket
major design/refactoring changes are associated with an RFC before the code is committed
blocking issues are marked as blocking
breaking changes are marked as breaking

For other contributors who have no permission to edit the post, please comment here about what you think should be in 1.0.0.

I have created three new label types: 1.0.0, Blocking, and Breaking.

[x] Improve installation experience on Mac OSX (#4477)
[x] Remove old GPU objectives.
[x] Remove gpu_exact updater (deprecated) #4527
[x] Remove multi threaded multi gpu support (deprecated) #4531
[x] External memory for gpu and associated dmatrix refactoring #4357 #4354
[ ] Spark Checkpoint Performance Improvement (https://github.com/dmlc/xgboost/issues/3946)
[x] [BLOCKING] the sync mechanism in hist method in master branch is broken due to the inconsistent shape of tree in different workers (https://github.com/dmlc/xgboost/pull/4716, https://github.com/dmlc/xgboost/issues/4679)
[x] Per-node sync slows down distributed training with 'hist' (#4679)
[x] Regression tests including binary IO compatibility, output stability, performance regressions.

CodingCat

Not a committer, but can we please target PySpark API for 1.0?
Issue: #3370
Current PR: #4656

thesuperzapper on 19 Jul 2019

👍6

for other contributors who have no permission to edit the post, please comment here about what you think should be in 1.0.0

CodingCat on 19 Jul 2019

Also, should we target moving exclusively to the Scala based Rabit tracker (for Spark) in 1.0?

thesuperzapper on 19 Jul 2019

I am also not a committer, but the company I work for and I are very interested in fixing (or at least mitigating) the performance issue with checkpointing, #3946.

trams on 20 Jul 2019

@trams @thesuperzapper I think this is an overview for everyone to have a feeling for what's coming next. It would be difficult to list everything coming since XGBoost is a community driven project. Just open a PR when it's ready.

Not a committer, but can we please target PySpark API for 1.0?

@thesuperzapper Let's track the progress. I certainly hope that I can start testing it. :-)

trivialfis on 20 Jul 2019

There is also a secondary consideration: we might not be ready for 1.0 and the API guarantees that come with it. For example, could we do 0.10.0 next instead?

thesuperzapper on 21 Jul 2019

@thesuperzapper 1.0 is not going to be a final version. It's just that we are trying to adopt semantic versioning.

trivialfis on 21 Jul 2019

Added some gpu related items.

RAMitchell on 23 Jul 2019

would like to get native xgb fix included.
https://github.com/dmlc/xgboost/issues/4753

chenqin on 8 Aug 2019

JSON is removed from the list. See https://github.com/dmlc/xgboost/pull/4683#issuecomment-520485615

trivialfis on 12 Aug 2019

I raised an issue for my above suggestion: #4781 (To remove the python Rabit tracker)

thesuperzapper on 16 Aug 2019

Feature importance in the Spark version would be great as well (i.e. an easy way to get the feature importance).
https://github.com/dmlc/xgboost/pull/988

Daniel8hen on 18 Aug 2019

Added regression test.

trivialfis on 21 Aug 2019

@chenqin I'd like to hear from you about regression tests, since you have experience with managing ML in production. Any suggestions?

hcho3 on 21 Aug 2019

@chenqin I'd like to hear from you about regression tests, since you have experience with managing ML in production. Any suggestions?

I think we should run regression tests on a variety of workloads and benchmark prediction accuracy and stability (equal to or better than the previous version) within approximately the same time. Two candidates off the top of my head are:

https://archive.ics.uci.edu/ml/datasets/HIGGS

Sparse DMatrix:
https://www.kaggle.com/c/ClaimPredictionChallenge

We can try various tree methods and configurations to ensure good coverage:

tree_method, configurations / dataset / standalone or cluster

Disclaimer:
I think it is worth clarifying a bit.

Release regression testing is not something we had already done at the company where I worked.
The datasets I proposed are arbitrary and should not be used as benchmarks to claim that one framework is better than another. (This concerns me most, as I see biased benchmarks from time to time.)
In fact, tuning and uncovering the proper features/settings has always been more important. Unfortunately, we may not be able to cover this in regression tests.

Maybe a more organized plan is to build an automation tool that users can take and use to benchmark various settings against their private datasets and models in their own data centers.
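
The coverage idea above (tree method / dataset / standalone or cluster, with accuracy "equal or better" than the previous version) could be sketched roughly as follows. This is only an illustration, not an existing XGBoost test harness: the function names, the tolerance, and the scores are all hypothetical placeholders, and the list of tree methods is an assumption about what one might want to cover.

```python
import itertools

# Assumed axes of the test matrix; adjust to whatever should be covered.
TREE_METHODS = ["exact", "approx", "hist", "gpu_hist"]
DATASETS = ["HIGGS", "ClaimPredictionChallenge"]
MODES = ["standalone", "cluster"]


def build_matrix():
    """Enumerate every (tree_method, dataset, mode) combination to cover."""
    return list(itertools.product(TREE_METHODS, DATASETS, MODES))


def find_regressions(baseline, candidate, tolerance=1e-3):
    """Return configurations whose score dropped by more than `tolerance`.

    Both arguments map a configuration tuple to a higher-is-better score
    (for example AUC). A configuration missing from `candidate` is also
    reported, since it could not be verified.
    """
    regressed = []
    for config, old_score in baseline.items():
        new_score = candidate.get(config)
        if new_score is None or new_score < old_score - tolerance:
            regressed.append(config)
    return regressed


matrix = build_matrix()
# Placeholder scores for illustration only; these are not measured results.
baseline = {config: 0.80 for config in matrix}
candidate = dict(baseline)
candidate[("hist", "HIGGS", "cluster")] = 0.75  # simulate a drop in one cell

print(len(matrix))  # prints 16
print(find_regressions(baseline, candidate))  # prints [('hist', 'HIGGS', 'cluster')]
```

In a real setup the two dictionaries would be filled by training with the previous and the candidate release on each configuration; that part is deliberately left out here.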

chenqin on 22 Aug 2019

👍2

We should add fixing #4779 as a requirement to ship 1.0

thesuperzapper on 17 Sep 2019

I added #4899 as a cleanup step.

codingforfun on 26 Sep 2019

@dmlc/xgboost-committer Since we have quite a few tasks left for 1.0, maybe we should make an interim release 0.91?

hcho3 on 5 Oct 2019

👍3

@hcho3 Or perhaps 0.10.0

thesuperzapper on 5 Oct 2019

@thesuperzapper That will confuse the version system. I don't mind a 0.91 release, but I still want to see proper procedures for regression tests.
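
To illustrate the ordering concern: a minimal sketch, assuming version strings are compared component by component as integers, which is how packaging tools generally order plain numeric releases. The helper `parse_version` is hypothetical and does not handle pre-release tags such as `preview`.

```python
def parse_version(text):
    """Split a dotted version into a tuple of ints, e.g. '0.90' -> (0, 90)."""
    return tuple(int(part) for part in text.split("."))


# Version numbers mentioned in this thread.
candidates = ["0.90", "0.91", "0.10.0", "0.100.0", "1.0.0"]

# Python compares tuples element by element, so 10 < 90 < 91 < 100.
ordered = sorted(candidates, key=parse_version)
print(ordered)
# prints ['0.10.0', '0.90', '0.91', '0.100.0', '1.0.0']
```

Under this comparison 0.10.0 sorts before the already released 0.90, so tools would treat it as an older version, while 0.91 and 0.100.0 both sort after 0.90.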

trivialfis on 5 Oct 2019

@trivialfis If master has API changes, shouldn't we bump a major version, which I guess would look like 0.100.0?

thesuperzapper on 5 Oct 2019

@thesuperzapper The 1.0.0 version is the first version for which we would adopt the semantic versioning scheme, so no, semantic versioning won't apply to the interim release. It's a bit tricky, since we have quite a lot to do until 1.0.0 is released.

hcho3 on 5 Oct 2019

If we want a 0.91, we should review all changes and ensure that 0.91 is an incremental update based on 0.90, so that we don't hurt our roadmap for 1.0.0 by shifting several features to 0.9x or any other version.

My suggestion would be to release 1.0.0.preview.1; some other projects also do this before a major release.

CodingCat on 5 Oct 2019

@CodingCat 1.0.0.preview.1 is an interesting suggestion. Does Maven accept this version?

hcho3 on 5 Oct 2019

Yes, you can have non-numeric characters in a version number.

CodingCat on 5 Oct 2019

An interim release is a good idea; there have been a lot of improvements since 0.90.

RAMitchell on 5 Oct 2019

👍3

Got it, I will do some plumbing in the CI system in the next few days, and then prepare 1.0.0.preview.1 release.

hcho3 on 6 Oct 2019

@CodingCat How about 0.100 or 0.95? "Preview" sounds like the 1.0.0 release is just around the corner, but we have quite a few major features (PySpark) on the line.

hcho3 on 8 Oct 2019

Does it support weight xgboost?

douglasren on 9 Oct 2019

I am not worried about the impression that 1.0.0 gives users.

The Spark 3.0 preview is being released this month, but the formal release may be next April (around Spark Summit).

CodingCat on 9 Oct 2019

@CodingCat At least from the point of view of xgboost4j-spark, that 1.0.0 preview won't be useful for most people, as almost no one is running Spark on Scala 2.12. Additionally, you can't easily get a compiled binary, as https://spark.apache.org/downloads.html doesn't distribute compiled versions of Spark for 2.12 with the Hadoop binaries included.

thesuperzapper on 11 Oct 2019

Then we should release nothing?

CodingCat on 11 Oct 2019

@CodingCat @thesuperzapper I thought #4574 would allow for compiling XGBoost with both Scala 2.11 and 2.12? In that case, we should compile XGBoost with 2.11 and upload JAR to Maven.

hcho3 on 11 Oct 2019

Removed:

[ ] Release GPU memory after training #4668

I don't think we can get there right now.

trivialfis on 11 Oct 2019

@thesuperzapper It will become easier to develop against the Apache Spark master (3.0) branch and Scala 2.12 after Spark releases a 3.0 preview (targeted pretty soon this fall). I'd expect a much bigger shift to Scala 2.12 in the Spark community after the final 3.0 release (targeted early 2020), but you're right that there isn't a ton of 2.12 usage now. I created https://github.com/dmlc/xgboost/issues/4926 to solicit discussion around the upcoming Spark release.

jkbradley on 11 Oct 2019

@CodingCat @thesuperzapper I thought #4574 would allow for compiling XGBoost with both Scala 2.11 and 2.12? In that case, we should compile XGBoost with 2.11 and upload JAR to Maven.

#4574 does not allow cross-compiling.

What it allows is for someone to check out the code, manually override the Scala version, and recompile.

So someone may compile a JAR with 2.11 and upload it to Maven.
I had a pull request with a migration to SBT, which would allow cross-compiling.
I also know a trick for supporting cross-compilation in Maven (we used it at our company). I can share it if you are interested.

trams on 11 Oct 2019

@hcho3 Is it possible to use CPack for easing the installation for OSX? Please ignore this comment if it's not possible.

trivialfis on 16 Oct 2019

Does it support Multi objective learning?

douglasren on 22 Oct 2019

@douglasren Sadly no. Could you start a new issue so we can discuss it? The term "multi objective" can vary depending on contexts, like one objective function for multiple outputs, multiple objectives with one output or multiple objectives with multiple outputs?

trivialfis on 22 Oct 2019

I would like to cast my vote towards an interim release as well.

EricSpeidel on 29 Nov 2019

👍3

#5146 fixes #4477.

hcho3 on 23 Dec 2019

Removed:

[ ] PySpark API support (https://github.com/dmlc/xgboost/issues/3370) (https://github.com/dmlc/xgboost/pull/4656).

trivialfis on 23 Dec 2019

An interim release would be great as the macOS installation is still a pain right now

TylerADavis on 8 Jan 2020

👍2

Can we get documented support for learning to rank (pairwise) with XGBoost4J-Spark? Currently, there is no concrete guidance on how to specify training data. There's some confusion around partitioning by group ID and the training data needing to follow the same partitioning strategy, but it's quite vague.
An example or clear documentation would be really helpful!

dubeyrahul on 16 Jan 2020

I'd like to cast my vote to an interim release as well. We're looking forward to the next version mostly for the missing value fix by @cpfarrell (see https://github.com/dmlc/xgboost/pull/4805).

Is there a time estimate related to the next release (major or interim)?

PS: @thesuperzapper we're using 2.11 and 2.12 and an interim release would be extremely helpful for us

lucagiovagnoli on 24 Jan 2020

@hcho3 Can we create a release branch and have a week or so for testing?

trivialfis on 30 Jan 2020

👍1

Yes!

hcho3 on 30 Jan 2020

👍1

@hcho3 In addition to a branch, we can also make an official release candidate on GitHub Releases so that the community can have more confidence to test it as well.