Xgboost: [Roadmap] XGBoost 1.0.0 Roadmap

Created on 18 Jul 2019 · 52 Comments · Source: dmlc/xgboost

@dmlc/xgboost-committer please add your items here by editing this post. Let's ensure that

  • each item is associated with a ticket

  • major design changes and refactorings are associated with an RFC before the code is committed

  • blocking issues are marked as blocking

  • breaking changes are marked as breaking

For other contributors who have no permission to edit the post, please comment here about what you think should be in 1.0.0.

I have created three new labels: 1.0.0, Blocking, Breaking.

  • [x] Improve installation experience on Mac OSX (#4477)
  • [x] Remove old GPU objectives.
  • [x] Remove gpu_exact updater (deprecated) #4527
  • [x] Remove multi threaded multi gpu support (deprecated) #4531
  • [x] External memory for GPU and associated DMatrix refactoring #4357 #4354
  • [ ] Spark Checkpoint Performance Improvement (https://github.com/dmlc/xgboost/issues/3946)
  • [x] [BLOCKING] the sync mechanism in hist method in master branch is broken due to the inconsistent shape of tree in different workers (https://github.com/dmlc/xgboost/pull/4716, https://github.com/dmlc/xgboost/issues/4679)
  • [x] Per-node sync slows down distributed training with 'hist' (#4679)
  • [x] Regression tests including binary IO compatibility, output stability, performance regressions.

All 52 comments

Not a committer, but can we please target PySpark API for 1.0?
Issue: #3370
Current PR: #4656

for other contributors who have no permission to edit the post, please comment here about what you think should be in 1.0.0

Also, should we target moving exclusively to the Scala based Rabit tracker (for Spark) in 1.0?

I am also not a committer, but my company and I are very interested in fixing (or at least mitigating) the performance issue with checkpointing: #3946

@trams @thesuperzapper I think this is an overview for everyone to have a feeling for what's coming next. It would be difficult to list everything coming since XGBoost is a community driven project. Just open a PR when it's ready.

Not a committer, but can we please target PySpark API for 1.0?

@thesuperzapper Let's track the progress. I certainly hope that I can start testing it. :-)

There is also a secondary consideration: we might not be ready for 1.0 and the API guarantees that come with it. For example, could we do 0.10.0 next instead?

@thesuperzapper 1.0 is not going to be a final version. We are just trying to adopt semantic versioning.

Added some gpu related items.

would like to get native xgb fix included.
https://github.com/dmlc/xgboost/issues/4753

JSON is removed from the list. See https://github.com/dmlc/xgboost/pull/4683#issuecomment-520485615

I raised an issue for my above suggestion: #4781 (To remove the python Rabit tracker)

Feature importance in the Spark version would be great as well (i.e. an easy way to get the feature importance).
https://github.com/dmlc/xgboost/pull/988

Added regression test.

@chenqin I'd like to hear from you about regression tests, since you have experience with managing ML in production. Any suggestions?

I think we should run regression tests over a variety of workloads and benchmark prediction accuracy and stability (equal to or better than the previous version) in approximately the same time. Two candidates off the top of my head are

https://archive.ics.uci.edu/ml/datasets/HIGGS

sparse DMatrix
https://www.kaggle.com/c/ClaimPredictionChallenge

We can try various tree methods and configurations to ensure good coverage

tree_method, configurations / dataset / standalone or cluster

Disclaimer:
I think it is worth clarifying a bit.

  • Release regression testing is not something we have already done at the company where I worked.
  • The data sets I proposed are arbitrary and should not be used as benchmarks to claim that one framework is better than another. (This concerns me most, as I see biased benchmarks from time to time.)

  • In fact, tuning and uncovering the proper features and settings has always been more important. Unfortunately, we may not be able to cover this in regression tests.

Maybe a more organized plan is to build an automation tool that users can take and use to benchmark various settings against their private datasets and models in their own data centers.
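
The test matrix proposed above (tree method / dataset / standalone or cluster) could be sketched roughly as follows. This is only an illustration, not part of XGBoost: the dataset labels, the `build_matrix` and `find_regressions` helpers, and the tolerance value are all hypothetical, and the metric is assumed to be higher-is-better (e.g. AUC).

```python
from itertools import product

# Hypothetical axes of the regression matrix; the values are illustrative only.
TREE_METHODS = ["exact", "approx", "hist"]
DATASETS = ["higgs", "claim_prediction"]
MODES = ["standalone", "cluster"]


def build_matrix():
    """Return every (tree_method, dataset, mode) combination to test."""
    return list(product(TREE_METHODS, DATASETS, MODES))


def find_regressions(baseline, candidate, tolerance=1e-3):
    """Return the configs where the candidate release scores worse than the baseline.

    Both arguments map a config tuple to a higher-is-better metric.
    A config that is missing from `candidate` is reported as a regression.
    """
    regressions = []
    for config, old_score in baseline.items():
        new_score = candidate.get(config)
        if new_score is None or new_score < old_score - tolerance:
            regressions.append(config)
    return regressions
```

In practice the two dictionaries would be filled by training each configuration on the previous and the candidate release; run time could be tracked the same way with a second, lower-is-better metric.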

We should add fixing #4779 as a requirement to ship 1.0

I add #4899 as a cleanup step.

@dmlc/xgboost-committer Since we have quite a few tasks left for 1.0, maybe we should make an interim release 0.91?

@hcho3 Or perhaps 0.10.0

@thesuperzapper That would confuse the versioning system. I don't mind a 0.91 release, but I still want to see proper procedures for regression tests.

@trivialfis If master has API changes, shouldn't we bump a major version, which I guess would look like 0.100.0?

@thesuperzapper The 1.0.0 version is the first version for which we would adopt the semantic versioning scheme, so no, semantic versioning won't apply to the interim release. It's a bit tricky, since we have quite a lot to do until 1.0.0 is released.
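
To make the version numbers floated in this exchange concrete, here is a minimal sketch of how purely numeric dotted versions order when compared component by component. It is a simplification, assuming numeric components only; it is not how pip or Maven resolve versions, and `parse_version` and `is_newer` are hypothetical helpers.

```python
def parse_version(text):
    """Split a purely numeric dotted version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_newer(candidate, current):
    """Return True if `candidate` sorts after `current`, component by component."""
    a, b = parse_version(candidate), parse_version(current)
    # Pad the shorter tuple with zeros so that "0.90" and "0.90.0" compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b
```

Under this ordering, `is_newer("0.10.0", "0.90")` is false because 10 is less than 90, whereas `is_newer("0.91", "0.90")` and `is_newer("0.100.0", "0.90")` are both true.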

If we want a 0.91, we should review all changes and ensure that 0.91 is an incremental update based on 0.90, so that we don't hurt our roadmap for 1.0.0 by shifting several features to 0.9x or any other version.

My suggestion would be to release 1.0.0.preview.1; some other projects also do this before a major release.

@CodingCat 1.0.0.preview.1 is an interesting suggestion. Does Maven accept this version?

Yes, you can have non-numeric characters in the version number.

An interim release is a good idea; there have been a lot of improvements since 0.90.

Got it, I will do some plumbing in the CI system in the next few days, and then prepare 1.0.0.preview.1 release.

@CodingCat How about 0.100 or 0.95? "Preview" sounds like the 1.0.0 release is just around the corner, but we have quite a few major features (PySpark) on the line.

Does it support weighted XGBoost?

I am not worried about the impression 1.0.0 gives to users.

The Spark 3.0 preview is being released this month, but the formal release may not come until next April (around Spark Summit).

@CodingCat At least from the point of view of xgboost4j-spark, that 1.0.0 preview won't be useful for most people, as almost no one is running Spark on Scala 2.12. Additionally, you can't easily get a compiled binary, as https://spark.apache.org/downloads.html doesn't distribute compiled versions of Spark for 2.12 with the Hadoop binaries included.

Then we should release nothing?

@CodingCat @thesuperzapper I thought #4574 would allow for compiling XGBoost with both Scala 2.11 and 2.12? In that case, we should compile XGBoost with 2.11 and upload JAR to Maven.

Removed:

  • [ ] Release GPU memory after training #4668

I don't think we can get there right now.

@thesuperzapper It will become easier to develop against the Apache Spark master (3.0) branch and Scala 2.12 after Spark releases a 3.0 preview (targeted pretty soon this fall). I'd expect a much bigger shift to Scala 2.12 in the Spark community after the final 3.0 release (targeted early 2020), but you're right that there isn't a ton of 2.12 usage now. I created https://github.com/dmlc/xgboost/issues/4926 to solicit discussion around the upcoming Spark release.

@CodingCat @thesuperzapper I thought #4574 would allow for compiling XGBoost with both Scala 2.11 and 2.12? In that case, we should compile XGBoost with 2.11 and upload JAR to Maven.

#4574 does not allow cross-compilation.

What it allows is for someone to check out the code, manually override the Scala version, and recompile.

So someone could compile a JAR with 2.11 and upload it to Maven.
I had a pull request migrating the build to SBT, which would allow cross-compilation.
I also know a trick for supporting cross-compilation in Maven (we used it at our company). I can share it if you are interested.

@hcho3 Is it possible to use CPack for easing the installation for OSX? Please ignore this comment if it's not possible.

Does it support multi-objective learning?

@douglasren Sadly no. Could you start a new issue so we can discuss it? The term "multi objective" can vary depending on context: one objective function for multiple outputs, multiple objectives with one output, or multiple objectives with multiple outputs.

I would like to cast my vote towards an interim release as well.

#5146 fixes #4477.

Removed:

  • [ ] PySpark API support (https://github.com/dmlc/xgboost/issues/3370) (https://github.com/dmlc/xgboost/pull/4656)

An interim release would be great, as the macOS installation is still a pain right now.

Can we get documented support for learning to rank (pairwise) with XGBoost4J-Spark? Currently, there is no concrete solution to how to specify training data. There's some confusion around partitioning by groupID and training data needing to follow same partition strategy, but it's quite vague.
An example or clear documentation would be really helpful!

I'd like to cast my vote to an interim release as well. We're looking forward to the next version mostly for the missing value fix by @cpfarrell (see https://github.com/dmlc/xgboost/pull/4805).

Is there a time estimate related to the next release (major or interim)?

PS: @thesuperzapper we're using 2.11 and 2.12 and an interim release would be extremely helpful for us

@hcho3 Can we create a release branch and have a week or so for testing?

Yes!

@hcho3 In addition to a branch, we can also make an official release candidate on GitHub Releases so that the community can have more confidence to test it as well.

This sounds awesome! Really looking forward to the next release. Let me know if we can help. We're definitely going to test it out at Yelp.

I will cut a new branch release_1.0.0 after https://github.com/dmlc/xgboost/pull/5248 is merged. Thanks everyone for your patience.

Release candidate is now available for Python: https://github.com/dmlc/xgboost/issues/5253. You can try it today by running

pip3 install xgboost==1.0.0rc1

1.0.0 is now out:

pip3 install xgboost==1.0.0