Lightgbm: RFC: version changes + more frequent releases

Created on 7 Jul 2020 · 10 Comments · Source: microsoft/LightGBM

I'd like to open this request for comment to discuss a proposal.

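After releasing v3.0.0 (#3071), I'd like to propose that we use 4-part version numbers for language wrappers: the first three parts are the version of LightGBM itself, and the fourth part counts releases of the wrapper for that LightGBM version.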

So, for example, if you see version 3.1.0.8 of the R package, that means "the 8th released version of the R package which wraps LightGBM version 3.1.0".
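A minimal sketch of how such a version string could be split into its two meanings. The names `WrapperVersion` and `parse_wrapper_version` are hypothetical and are not part of LightGBM or any of its wrappers.

```python
import re
from typing import NamedTuple

# Exactly four dot-separated ASCII integers, e.g. "3.1.0.8".
_FOUR_PART = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", re.ASCII)


class WrapperVersion(NamedTuple):
    """Hypothetical container for a 4-part wrapper version."""
    major: int
    minor: int
    patch: int
    wrapper_release: int  # release counter of the wrapper itself

    @property
    def core(self) -> str:
        """Version of LightGBM that this wrapper release wraps."""
        return f"{self.major}.{self.minor}.{self.patch}"


def parse_wrapper_version(text: str) -> WrapperVersion:
    """Parse 'MAJOR.MINOR.PATCH.WRAPPER' or raise ValueError."""
    match = _FOUR_PART.fullmatch(text)
    if match is None:
        raise ValueError(f"not a 4-part version: {text!r}")
    return WrapperVersion(*(int(part) for part in match.groups()))


v = parse_wrapper_version("3.1.0.8")
print(v.core, v.wrapper_release)  # prints "3.1.0 8"
```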

Example

The examples below don't propose that every new merge to master becomes a release, but the changes below are examples used to show what might cause different components of a 4-part version number to change.

Event 1: 3.0.0 is released

LightGBM version set to 3.0.0
lightgbm (Python) 3.0.0.0 released to PyPI
{lightgbm} (R) 3.0.0.0 released to CRAN
LightGBM (lib for .NET extensions) 3.0.0.0 released to NuGet

Event 2: bug fix to LightGBM, like fixing #3209

LightGBM version set to 3.0.1
lightgbm (Python) 3.0.1.0 released to PyPI
{lightgbm} (R) 3.0.1.0 released to CRAN
LightGBM (lib for .NET extensions) 3.0.1.0 released to NuGet

Event 3: bug fix in {lightgbm} (R), like #3117

{lightgbm} (R) 3.0.1.1 released to CRAN

Event 4: LightGBM adds a new type of boosting, like #2644

LightGBM version set to 3.1.0
lightgbm (Python) 3.1.0.0 released to PyPI
{lightgbm} (R) 3.1.0.0 released to CRAN
LightGBM (lib for .NET extensions) 3.1.0.0 released to NuGet
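The four events above can be replayed as a small sketch. It assumes, as the example suggests but does not state outright, that the fourth component resets to 0 whenever the core LightGBM version changes. The helper names are hypothetical.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int, int]  # (major, minor, patch, wrapper_release)


def bump_core(version: Version, part: str) -> Version:
    """New core LightGBM release; the wrapper counter restarts at 0 (assumption)."""
    major, minor, patch, _ = version
    if part == "major":
        return (major + 1, 0, 0, 0)
    if part == "minor":
        return (major, minor + 1, 0, 0)
    if part == "patch":
        return (major, minor, patch + 1, 0)
    raise ValueError(f"unknown part: {part!r}")


def bump_wrapper(version: Version) -> Version:
    """Wrapper-only release; the core version stays the same."""
    major, minor, patch, wrapper_release = version
    return (major, minor, patch, wrapper_release + 1)


def fmt(version: Version) -> str:
    return ".".join(str(part) for part in version)


# Event 1: 3.0.0 is released, every wrapper starts at 3.0.0.0.
wrappers: Dict[str, Version] = {"python": (3, 0, 0, 0), "r": (3, 0, 0, 0)}

# Event 2: bug fix in LightGBM itself, so all wrappers move to 3.0.1.0.
wrappers = {name: bump_core(v, "patch") for name, v in wrappers.items()}

# Event 3: bug fix in the R package only.
wrappers["r"] = bump_wrapper(wrappers["r"])
after_event_3 = dict(wrappers)

# Event 4: new feature in LightGBM, so all wrappers move to 3.1.0.0.
wrappers = {name: bump_core(v, "minor") for name, v in wrappers.items()}

print({name: fmt(v) for name, v in after_event_3.items()})
print({name: fmt(v) for name, v in wrappers.items()})
```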

How this makes LightGBM better

This approach would allow us to release fixes to individual components of LightGBM more frequently.

This would allow us to avoid the current situation, where the PyPI package (for example) has not had an update in 7 months: https://pypi.org/project/lightgbm/#history. More frequent updates allow our users to rely on package managers more, instead of building from GitHub, which I think is a better user experience.

Releasing more frequently would also reduce the gap between the current state of this repo and the documentation at https://lightgbm.readthedocs.io/en/latest/, so that that documentation is more likely to answer a user's questions accurately.

Allowing the version numbers to differ between R and Python (for example) is important, since these two libraries are at very different stages in their development. The R package is still somewhat immature and there is a lot of work ahead for it, while the Python package is fairly mature and stable by comparison. A 4-part version number would allow the R package to be updated more frequently than the Python package, while preserving the use of the first three version components for LightGBM itself.

question

jameslamb

👍2

Most helpful comment

@AlbertoEAF

We're already at 3.0.0.99 after all.

I believe that current 4-part versioning has a bit different semantics. https://github.com/microsoft/LightGBM/pull/3344#issuecomment-684913196

maybe it would be a good time to launch a new one :)

Already is in progress: #3484! 🙂

StrikerRUS on 7 Nov 2020

🎉1 😄1

All 10 comments

will this conflict with semantic versioning? https://semver.org/

guolinke on 7 Jul 2020
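On the semantic versioning question above: semver.org defines a version as MAJOR.MINOR.PATCH, optionally followed by a pre-release tag and build metadata, so a fourth numeric component is not valid there. The sketch below uses a simplified pattern, not the full official semver regex, to show the difference.

```python
import re

# Simplified semver pattern (assumption: looser than the official regex on
# pre-release and build identifiers, but strict about the three numeric parts).
SEMVER_LIKE = re.compile(
    r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
    r"(?:-[0-9A-Za-z.-]+)?"    # optional pre-release, e.g. -rc.1
    r"(?:\+[0-9A-Za-z.-]+)?"   # optional build metadata, e.g. +8
)


def looks_like_semver(version: str) -> bool:
    return SEMVER_LIKE.fullmatch(version) is not None


for candidate in ["3.1.0", "3.1.0.8", "3.1.0+8"]:
    print(candidate, looks_like_semver(candidate))
```

Python's own version scheme (PEP 440) is more permissive and accepts any number of numeric release segments, which is why 4-part versions can still be published to PyPI.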

Good suggestion!
If I'm not mistaken, @imatiach-msft uses something similar for the Java binding in MMLSpark: https://github.com/microsoft/LightGBM/issues/3041#issuecomment-624217161.

However, I vote for a consistent version number across all official LightGBM components. 4-part version numbers would greatly increase the maintenance burden. It would also be very hard to maintain separate changelogs for all components, because you would need to list all commits multiple times and keep track of them per component.

This approach would allow us to release fixes to individual components of LightGBM more frequently.

I'm not sure that we are able to do that, given the lack of time and other resources. Instead, I suggest we go back to bi-monthly releases and try to stick to them. I believe that will be enough for most of our users.

StrikerRUS on 7 Jul 2020

👍1

@StrikerRUS Yes, I do almost exactly this, except that instead of adding a fourth component I extend the third one, e.g. 2.3.150 corresponds to 2.3.1. I like the proposed versioning scheme and can migrate to it for the Java wrapper; I'm open to any new ideas. I can't simply keep 2.3.1 because the Java releases are separate, and I sometimes have blocking issues that span both the Java side (Java JNI + native jar) and the MMLSpark Scala wrapper code. Waiting for the next official LightGBM release to create the jar would be an extra burden, especially since MMLSpark is not as stable as LightGBM and users often hit new blocking issues. I kept it to 3 components (*.*.*) when I originally released because that seems to be the standard for semantic versioning.

imatiach-msft on 7 Jul 2020

👍1

will this conflict with semantic versioning? https://semver.org/

Even if it is permissible under every convention, it is rare enough that most CI/CD systems have not been tested against it. You need a sizeable collection of packages installed in your environment before you encounter the first case of this kind. We have one, so I can confirm that the incidence of 4-part version numbers among Python packages used for data science and machine learning is around 1.5%. Three of these packages are even very well known (at least in the ML community).

Here's the list of such packages (among the 850 we have installed in our largest container, which is heavily influenced by Kaggle Kernels):

dill
ephem                     
gettext                   
h2o                       
lime                      
mkl-random
msgpack-numpy
opencv-python 
pkginfo                   
ppft                      
pystan                    
singledispatch            
typing
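A rough way to reproduce this kind of count in another environment, using only the standard library. This is a sketch under one assumption: "4-part" is taken to mean exactly four numeric components with no suffix, which may not match the criterion used for the list above.

```python
import re
from importlib.metadata import distributions
from typing import Dict

# Exactly four dot-separated integers, e.g. "3.0.0.99" (assumed definition).
FOUR_PART = re.compile(r"\d+\.\d+\.\d+\.\d+", re.ASCII)


def has_four_part_release(version: str) -> bool:
    return FOUR_PART.fullmatch(version) is not None


def four_part_packages() -> Dict[str, str]:
    """Map installed package names to versions that have four numeric parts."""
    found: Dict[str, str] = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        version = dist.version
        # Skip distributions with missing or broken metadata.
        if name and version and has_four_part_release(version):
            found[name] = version
    return found


if __name__ == "__main__":
    total = sum(1 for _ in distributions())
    matches = four_part_packages()
    for name in sorted(matches):
        print(name, matches[name])
    if total:
        print(f"{len(matches)} of {total} packages ({len(matches) / total:.1%})")
```

For the figures quoted above, 13 packages out of 850 works out to roughly 1.5%.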

mirekphd on 7 Aug 2020

Can we please try to separate the red herring of 4-part versions from the urgent bug of no releases having been made for 8 months, which was raised e.g. in #3274?

mirekphd on 7 Aug 2020

@mirekphd
I think the delay of the current release is due to the many new changes in version 3.0.
3.0 provides about a 2x speed-up on CPU, and many new (breaking) features. Some work is still ongoing, so we will publish a pre-release now and continue to work on the remaining items.
This is not the usual case; normally we release monthly or bi-monthly.

BTW, the release process is currently manual. It would be better if we could fully automate it, so that we can release more frequently.

guolinke on 7 Aug 2020

3.0 provides about a 2x speed-up on CPU, and many new (breaking) features. Some work is still ongoing, so we will publish a pre-release now and continue to work on the remaining items.

Excellent news! I did not know that such large improvements were still possible! It means that in v3.0.0 CPU training will most likely overtake GPU training :) The difference in favor of GPU is so small, even for huge datasets and under the new CUDA implementation, as we saw in #3160.

By the way, I happen to know that there is still room for substantial improvement in your CPU implementation for a very frequent use case, but I will wait for your 3.0.0 release to see whether my ideas still work in that version before making them public.

mirekphd on 7 Aug 2020

@mirekphd
The remaining work for 3.0 is mostly new features; the CPU efficiency part is almost done.
You can give it a try: we just released 3.0.0rc1.

guolinke on 7 Aug 2020

🎉1

Hello, just to be sure, are we migrating to the 4-part versioning or not? We're already at 3.0.0.99 after all.

But yes, having more releases would be nice, maybe it would be a good time to launch a new one :)

Should we close this issue?