I am opening this issue to call for a vote on the license of the R-package. The Apache license serves us well and we will keep it for the main xgboost project. However, for the R-package there have been some difficulties, since many popular R packages are actually GPLv2, making it hard to use them in the R-package. Note that Apache is compatible with GPLv3.
Currently, the main training routine of xgboost does not touch these packages, while the add-ons for visualization and analysis use some of them. As an open source project, we want to respect the packages that we depend on. I am opening this issue to call for a vote on what to do.
Both options only affect code in the R-package. They are also unlikely to impact R users of xgboost, as BSD is equally permissive.
This vote will be open for one week. Anyone can reply to express their opinion.
Among the dependencies, ggplot2 and stringr are two GPL-2 packages. ggplot2 appears in the plotting functions, and stringr appears in the plotting functions and some file-parsing related functions, e.g. xgb.dump.
I suggest using stringi to replace stringr, and separating the license for the ggplot2-related functions. These functions are in two separate files, so the license statement is manageable.
I am not a lawyer by any means (cue @pommedeterresautee ), but my understanding is that these specific licensing incompatibilities might only kick in in situations when someone down the road would try to bundle and redistribute these packages together. I.e., I don't think that there is a licensing issue with the xgboost R package itself, since it is not _linking_ to, not redistributing, but it is only _executing_ some GPL-2 packages which remain external. Also, e.g., https://cran.r-project.org/web/packages/h2o is Apache-licensed and it still uses ggplot2, and it doesn't have any separate license for that piece of code (and I might guess it was approved by their lawyers).
Switching to stringi should be fairly straightforward. The 'suggests' dependence on vcd could be fairly easily removed... However, there are more chain-dependencies that lead to GPL-2 packages, e.g., data.table > chron.
I would be ok with the BSD license. However, it's not clear to me whether the xgboost C++ code used by the R package could be under two different licenses.
I am not sure I understand why it would be important to keep the Apache license for the main routine.
Modified BSD for the whole package makes sense to me.
Kind regards,
Michaël
@tqchen In option 2 the BSD licensed R package would not be distributable in binary form because it would link to both Apache 2.0 and GPL-2 licensed code IMO.
Relicensing any Apache 2.0 covered code under BSD requires permission from the copyright owner in both options.
@pommedeterresautee Relicensing the whole package under BSD would make the binary packages covered by GPL 2.0 due to the dependencies, but the binaries would at least be distributable.
The binary code of the library is fine, since it does not explicitly link against any of these dependent libraries in R. The distribution of the R routines that call these libraries can be viewed as source code, which runs on the R interpreter and has nothing to do with binary form.
In short, if a script calls a library under GPL 2, the interpreter is functionally independent of the library it calls, so the script and the interpreter are not automatically contaminated by the licence of the library (no technical coupling risk). You then need to perform a functional analysis of the library: does it provide something very specific to the whole project that can't be replaced by another non-GPL library?
Using an interface and making it easy to replace the GPL 2 code is strong proof of independence and guarantees no contamination (of course this doesn't work if the only purpose of the interface is to avoid GPL 2).
In the present situation:
There seems to be sufficient evidence that XGBoost is independent of the GPL 2 code. Using a generic interface would be a plus.
As a reminder:
https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.
So here is what I can suggest based on the current situation. Write an option_util.R that provides all the features we need from ggplot and stringr. For now it would simply redirect to these libraries, but it should be written so that it can be reimplemented in other ways. This way all the sensitive code is restricted to this one file. @hetong007
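A minimal sketch of what such a file could look like, using only base R plus an optional stringi. The helper names `xgb_has_backend` and `xgb_str_replace` are hypothetical and are not part of xgboost; the only point is that callers go through one thin layer whose implementation can be swapped.

```r
# option_util.R (hypothetical sketch, not the actual xgboost file)

# TRUE if an optional package is installed; nothing is attached.
xgb_has_backend <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Replace every regex match in `x` with `replacement`.
# Uses stringi when it is installed and falls back to base R otherwise.
# Assumption: `replacement` is a plain string without backreferences or `$`,
# because the two backends use different backreference syntax.
xgb_str_replace <- function(x, pattern, replacement) {
  if (xgb_has_backend("stringi")) {
    stringi::stri_replace_all_regex(x, pattern, replacement)
  } else {
    gsub(pattern, replacement, x, perl = TRUE)
  }
}
```

Code elsewhere in the package would then call `xgb_str_replace()` rather than a stringr function directly, so changing the backend only touches this one file.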
Thanks for the nice summary, Michael.
I personally find that it's fairly clear that the current xgboost package has no strict binding / irreplaceable gpl2-dependencies, and that no change is _really_ required to substantiate the current licensing. However, not needing any direct or indirect gpl2 dependencies for the core xgboost R-package functionality to work might be useful in some situations (note that the chron gpl2 dependency was moved from Imports to Suggests in the dev version of data.table: https://github.com/Rdatatable/data.table/issues/1558 ). So I wouldn't mind doing the following:
xgb.plot.importance and xgb.ggplot.importance. For uncomplicated exploratory graphics in this context, the base R functionality is fully sufficient and is even preferable in some situations (e.g., remote connection over ssh).
Switching the R-package to BSD would not be useful, as it wouldn't address the license conflict for someone who would want to bundle and distribute xgboost together with some gpl2 code, since the main xgboost library would remain Apache; plus, it would be an extra nuisance to whoever already got approval from their legal for the current license.
In general I think legal rules should have no impact at all on the product.
stringr, clearly we can replace it with stringi
ggplot if it is installed and the simple plot() function otherwise. ggplot would move to Suggests. Anyway, any serious user will have ggplot installed. Adding a new intermediate function may make sense, but should not be added for legal reasons IMO.
The BSD licence serves no purpose if we can easily prove no gpl2 contamination (which is the case).
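The "ggplot if installed, plot() otherwise" idea can be sketched as follows. This is only an illustration: `plot_importance_sketch` and the assumed input (a data.frame with `Feature` and `Gain` columns) are hypothetical, not the package's actual plotting code.

```r
# Hypothetical helper: draw a feature-importance bar chart with ggplot2 when
# it is installed, otherwise with base graphics.
# Returns (invisibly) the name of the backend that was used.
plot_importance_sketch <- function(importance) {
  # `importance`: data.frame with columns Feature (character), Gain (numeric)
  importance <- importance[order(importance$Gain), ]
  if (requireNamespace("ggplot2", quietly = TRUE)) {
    p <- ggplot2::ggplot(
      importance,
      ggplot2::aes(x = stats::reorder(Feature, Gain), y = Gain)
    ) +
      ggplot2::geom_col() +
      ggplot2::coord_flip() +
      ggplot2::labs(x = "Feature", y = "Gain")
    print(p)
    invisible("ggplot2")
  } else {
    # Base R fallback: horizontal bars, most important feature on top.
    graphics::barplot(importance$Gain, names.arg = importance$Feature,
                      horiz = TRUE, las = 1, xlab = "Gain")
    invisible("base")
  }
}
```

Because ggplot2 is only reached through `requireNamespace()` and `::`, it can sit in Suggests, and the function still works on a machine where it is not installed.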
There is only an issue if there is a GPL v2 contamination on XGBoost. Fair use is not usable there, unless proven (hardly).
Hence the question of contamination. If GPL v2 code is present in the code base, the presumption is that the whole code is GPL v2:
The first legal question to check is whether the two packages use a shared address space to work together and are directly linked together:
Assuming we suppose we use R memory and stay in the shades, we have to look somewhere else.
GPL v2 refers to "substantial parts" when dealing with two licenses. Does the Apache-licensed code require the use of a substantial part of the GPL v2-licensed code? If yes, it requires permission; otherwise you are free to do whatever you want.
A quick look at GPL v2 also reveals its definition of what is an extension of a program and what is not (whether they are not the same licenses, the same usage, etc.): sharing data structures and using function calls. Hence, the current legal status fails at this point.
However, there is a special exception: whether there is a clear separation between the former (Apache code) and the latter (GPL v2 code), both in substantiality and form.
Is there substantiality? No. Therefore, the question about the form is left as is. Thus, the question is more about how dynamically the Apache code and the GPL v2 code are linked, which brings us back to the beginning.
Therefore, the line is more about how the two libraries are linked:
Here, a component is being used (rather than an external module), so the issue remains how dynamic the link is between the Apache-licensed code and the GPL v2-licensed code.
Possible, and easiest, ways to circumvent this:
Therefore, as also @pommedeterresautee pointed out, there is no point in changing the license to BSD. But the provided evidence is not enough to prove non-contamination. The issue with extrapolating to independence is that it does mean "not working together at any specific point in time when executed", but not "segregated enough functionally".
@Laurae2 It seems you are interpreting the FAQ of the GPL 2.
It is important to remember that the FAQ is not legally binding; only the licence itself is. Of course, the FAQ may be used by a judge or lawyer to understand the intent of the GPL author, but it can't create new rights or obligations, or impose a particular reading of the licence.
It means that the only thing we need to guarantee (regarding legal obligations) is that the Apache code is independent from the GPL code.
Regarding your points, only the graphical part is under GPL, meaning there is no real issue regarding the memory address used by the data... Moreover, all the matrices (data) used in R are converted to XGBoost format before being used by the core XGBoost part.
Regarding the Substantiality and Form point, the FAQ states:
However, in many cases you can distribute the GPL-covered software alongside your proprietary system. To do this validly, you must make sure that the free and non-free programs communicate at arms length, that they are not combined in a way that would make them effectively a single program.
The difference between this and “incorporating” the GPL-covered software is partly a matter of substance and partly form. The substantive part is this: if the two programs are combined so that they become effectively two parts of one program, then you can't treat them as two separate programs. So the GPL has to cover the whole thing.
If the two programs remain well separated, like the compiler and the kernel, or like an editor and a shell, then you can treat them as two separate programs—but you have to do it properly. The issue is simply one of form: how you describe what you are doing. Why do we care about this? Because we want to make sure the users clearly understand the free status of the GPL-covered software in the collection.
ggplot lives in its own package. It is only reached through calls to its functions via the R interpreter: data are sent from XGBoost to ggplot, and ggplot draws the requested graph on screen. No variable or data returned by ggplot is sent back to XGBoost for further processing. The job of ggplot is accessory to the main purpose of XGBoost (building ML models).
You write:
Is there substantiality? No. Therefore, the question about the form is left as is. Thus, the question is more about how dynamically are linked the Apache code and the GPL v2 code, which brings us back to the beginning.
Why is there no substantiality?
Another interesting point about ggplot is that it is only a suggested package (DESCRIPTION file):
Suggests:
knitr,
rmarkdown,
ggplot2 (>= 1.0.1),
DiagrammeR (>= 0.8.1),
Ckmeans.1d.dp (>= 3.3.1),
vcd (>= 1.3),
testthat,
igraph (>= 1.0.1)
It means that you can use XGBoost without having to install ggplot, and that it won't be installed with xgboost unless the user specifically asks for it.
With the PR from @khotilov there will be an interface between xgboost and ggplot.
There are lots of elements showing that the free and non-free programs communicate at arms length.
@tqchen @khotilov @hetong007
No more need to switch the licence -> Can we close the issue?
Thanks guys for helpful discussion