Submitting author: @sylvaticus (Antonello Lobianco)
Repository: https://github.com/sylvaticus/BetaML.jl
Version: v0.2.2
Editor: @terrytangyuan
Reviewers: @ablaom, @ppalmes
Managing EiC: Kevin M. Moerman
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
Author instructions
Thanks for submitting your paper to JOSS @sylvaticus. Currently, there isn't a JOSS editor assigned to your paper.
The author's suggestion for the handling editor is @VivianePons.
@sylvaticus if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, the people on this list have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).
Editor instructions
The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:
@whedon commands
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Failed to discover a Statement of need section in paper
Software report (experimental):

```
github.com/AlDanial/cloc v 1.84  T=0.19 s (208.0 files/s, 89373.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Julia                           18            625            632           2452
Markdown                         8            178              0            394
Jupyter Notebook                 4              0          11796            311
TeX                              1             12              7            104
YAML                             5             11              2             63
TOML                             2              3              0             47
Python                           1             35             41             42
-------------------------------------------------------------------------------
SUM:                            39            864          12478           3413
-------------------------------------------------------------------------------

Statistical information for the repository '2512' was gathered on 2020/07/23.
The following historical commit information, by author, was found:

Author                Commits    Insertions    Deletions    % of changes
Antonello Lobianco          4           657          539          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                  Rows    Stability    Age    % in comments
Antonello Lobianco       118         18.0    0.7            18.64
```
@bmcfee @terrytangyuan @kakiac I realize you are handling several submissions already but could one of you edit this one?
I'd like to hold off on taking any more until the ones I'm already on make more progress.
@whedon invite @terrytangyuan as editor
@terrytangyuan has been invited to edit this submission.
@whedon assign @terrytangyuan as editor
OK, the editor is @terrytangyuan
@sylvaticus if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, the people on this list have already agreed to review for JOSS and may be suitable for this submission.
@terrytangyuan Thank you, but I don't have any particular reviewer to suggest... maybe denizyuret, tpapp, dpsanders or h-Klok (all active in various ways on ML in the Julia community)?
👋 Hi @denizyuret @tpapp @dpsanders @h-Klok if you would like to review for this submission, please let us know here! We need at least two reviewers.
Hello, in the meantime I have updated the library, adding a module on Decision Trees/Random Forests.
What is the standard procedure for JOSS submissions? Should I also update the paper, or should it remain the paper for the version of the software at the time of submission?
@sylvaticus You will be able to update both the version of the software and the paper.
Ping @denizyuret @tpapp @dpsanders @h-Klok again in case this has slipped off your radar.
Sorry, ML is not my area of expertise.
@terrytangyuan Just a ping here about this pre-review issue. Looks like we are in need of some reviewers here. Thanks!
It's not my area of expertise either. You could try asking e.g. Mike Innes, Anthony Blaom and Thibaut Lienart, who have worked on other ML tools in the Julia ecosystem.
@dpsanders Thanks! Do you have their GitHub IDs by any chance?
MikeInnes, ablaom, tlienart
@MikeInnes @ablaom @tlienart Please let us know if you'd like to review for this submission. Thanks!
I could do this by mid November, if that is agreeable.
[Paper's author here]
Thank you.
I have updated the paper to include the Decision Trees / Random Forests methods that I have added in the meantime.
@whedon generate pdf
Uh, how do I regenerate the PDF after changing the paper.md file? I thought it was enough to write "@whedon generate pdf" in this issue comment...
@whedon generate pdf
Is the @whedon command run straight away, or does it take some time before it is run?
Commands to @whedon have to be the only or the first thing in a new comment.
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@terrytangyuan - looks like we have one reviewer here (thanks @ablaom!) and just need to find one more before we can move this to review.
@whedon add @ablaom as reviewer
OK, @ablaom is now a reviewer
Hi @shibabrat @dawbarton @mdavezac @StanczakDominik @amritagos, would any of you be interested in reviewing this submission?
Hey, sorry, but I don't feel well-versed enough in classical ML methods to really be of any use here - and I'm kind of swamped atm, so I have to pass :(
Good luck!
@sylvaticus @terrytangyuan
Here is my review:
The package under review provides pure-Julia implementations of two tree-based models, three clustering models, a perceptron model (with three variations) and a basic neural network model. In passing, it should be noted that all or almost all of these algorithms have existing Julia implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl). The package is used in a course on Machine Learning, but the integration between the package and the course is quite loose, as far as I could ascertain (more on this below).
Apart from a library of loss functions, the package provides no other tools. In particular, there are no functions to automate resampling (such as cross-validation), no hyperparameter optimization, and no model composition (pipelining). The quality of the model implementations looks good to me, although the author warns us that "the code is not heavily optimized and GPU [for neural networks] is not supported".
For context, consider the following multi-paradigm ML toolboxes written in Julia, which are relatively mature by Julia standards:

package | number of models | resampling | hyperparameter optimization | composition
------------------|------------------|------------|-----------------------------|------------
ScikitLearn.jl | > 150 | yes | yes | basic
AutoMLPipeline.jl | > 100 | no | no | medium
MLJ.jl | 151 | yes | yes | advanced
In addition to these, there are several excellent and mature packages dedicated to neural networks, the most popular being the AD-driven Flux.jl package. So far, these provide limited meta-functionality, although MLJ now provides an interface to certain classes of Flux models (MLJFlux) and ScikitLearn.jl provides interfaces to Python neural network models sufficient for small datasets and pedagogical use.
Disclaimer: I am a designer/contributor to MLJ.
According to the JOSS requirements, submissions should "Have an obvious research application." In its current state of maturity, BetaML is not a serious competitor to the frameworks above for contributing directly to research. However, the author argues that it has pedagogical advantages over existing tools. I don't think there are many rigorous machine learning courses or texts closely integrated with models and tools implemented in Julia, and it would be useful to have more of these. The degree of integration in this case was difficult for me to ascertain because I couldn't see how to access the course notes without formally registering for the course (which is, however, free). I was also disappointed to find only one link from doc-strings to course materials; from this "back door" to the course notes I could find no reference back to the package. Perhaps there is better integration in the course exercises? I couldn't figure this out.
The remaining argument for BetaML's pedagogical value rests on a
number of perceived drawbacks of existing toolboxes, for the
beginner. Quoting from the JOSS manuscript:
"For example the popular Deep Learning library Flux (Mike Innes,
2018), while extremely performant and flexible, adopts some
designing choices that for a beginner could appear odd, for example
avoiding the neural network object from the training process, or
requiring all parameters to be explicitly defined. In BetaML we
made the choice to allow the user to experiment with the
hyperparameters of the algorithms learning them one step at the
time. Hence for most functions we provide reasonable default
parameters that can be overridden when needed."
"To help beginners, many parameters and functions have pretty
longer but more explicit names than usual. For example the Dense
layer is a DenseLayer, the RBF kernel is radialKernel, etc."
"While avoiding the problem of “reinventing the wheel”, the
wrapping level unin- tentionally introduces some complications for
the end-user, like the need to load the models and learn
MLJ-specific concepts as model or machine. We chose instead to
bundle the main ML algorithms directly within the package. This
offers a complementary approach that we feel is more
beginner-friendly."
Let me respond to these:
1. These criticisms only apply to dedicated neural network packages, such as Flux.jl; all of the toolboxes listed above provide default hyperparameters for every model. In the case of neural networks, user-friendly interaction close to the kind sought here is available either by using the MLJFlux.jl models (available directly through MLJ) or by using the Python models provided through ScikitLearn.jl.
2. Yes, shorter names are obstacles for the beginner, but hardly insurmountable ones. For example, one could provide a cheat sheet summarizing the models and other functionality needed for the machine learning course (and omitting all the rest).
3. Yes, not needing to load in model code is slightly more friendly. On the other hand, in MLJ for example, one can load and instantiate a model with a single macro (see the sketch below). So the main complication is having to ensure the relevant libraries are in your environment. But this could be solved easily with a "BeginnerPackage" which curates all the necessary dependencies. I am not convinced beginners should find the idea of separating hyperparameters and learned parameters (the "machines" in MLJ) that daunting. I suggest the author's criticism may have more to do with their own lack of familiarity than with any difficulty for newcomers, who do not have the same preconceptions from using other frameworks. In any case, the point is moot, as one can interact with MLJ models directly via a "model" interface and ignore machines. To see this, I have translated part of a BetaML notebook into MLJ syntax. There's hardly any difference - if anything the presentation is simpler (less hassle when splitting data horizontally and vertically).
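For concreteness, here is roughly what the single-macro workflow looks like in MLJ. This is a sketch only: the exact `@load` semantics have varied across MLJ versions, and it assumes MLJ and DecisionTree.jl are installed in the active environment.

```julia
using MLJ

# Load the model code and instantiate it with chosen hyperparameters:
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=3)

# Bind the model to data with a machine, fit, and predict:
X, y = @load_iris
mach = machine(tree, X, y)
fit!(mach)
ŷ = predict_mode(mach, X)
```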
In summary, while existing toolboxes might present a course instructor
with a few challenges, these are hardly game-changers. The advantages of
introducing a student to a powerful, mature, professional toolbox ab
initio far outweigh any drawbacks, in my view.
To meet the requirements of JOSS, I think either: (i) The BetaML
package needs to demonstrate tighter integration with easily
accessible course materials; or (ii) BetaML needs very substantial
enhancements to make it competitive with existing toolboxes.
Frankly, I believe a greater service to the Julia open-source software community would be to integrate the author's course materials with one of the mature ML toolboxes. In the case of MLJ, I would be more than happy to provide guidance for such a project.
I didn't have too much trouble installing the package or running the demos, except when running a notebook on top of an existing Julia environment (see comment below).
The README.md should provide links to the toolboxes listed in
the table above, for the student who "graduates" from BetaML.
Some or most intended users will be new to Julia, so I suggest including with the installation instructions something about how to set up a Julia environment that includes BetaML. Something like this, for example.
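For instance, the installation section could show a minimal recipe like the following (standard Pkg workflow; the environment name is arbitrary):

```julia
using Pkg
Pkg.activate("betaml-course")  # create and activate a fresh project environment
Pkg.add("BetaML")              # add BetaML and its dependencies to it
using BetaML
```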
I found it weird that the front-facing demo is an unsupervised model. A more "Hello World" example might be to train a decision tree, along the lines sketched below.
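Something as small as this would do; note that the function names `buildTree` and `predict` here are illustrative and should be checked against the actual BetaML API:

```julia
using BetaML.Trees  # assumes the Trees submodule exposes these functions

# A toy dataset: two numeric features, two classes
xtrain = [1.0 10.0; 2.0 8.0; 3.0 9.0; 10.0 2.0; 11.0 1.0; 12.0 3.0]
ytrain = ["a", "a", "a", "b", "b", "b"]

tree = buildTree(xtrain, ytrain)  # fit a classification tree
ŷ    = predict(tree, [2.5 9.5])   # predict the class of a new point
```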
The way users load the built-in datasets seems pretty awkward. Maybe just define some functions to do this? E.g., `load_bike_sharing()`. It might also be instructive to have examples where data is pulled in using `RDatasets`, `UrlDownload` or similar. A hypothetical loader is sketched below.
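Here is what such a helper could look like; `load_bike_sharing`, the file name, and its location are all assumptions for illustration:

```julia
using DelimitedFiles
import BetaML

# Hypothetical convenience loader for a dataset bundled with the package:
function load_bike_sharing()
    path = joinpath(dirname(pathof(BetaML)), "..",
                    "test", "data", "bike_sharing_day.csv")
    data, header = readdlm(path, ',', header=true)
    return data, vec(header)
end

data, header = load_bike_sharing()
```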
A cheat-sheet summarizing the model fitting functions and the loss functions would be helpful. Or you could have functions `models()` and `loss_functions()` that list these, for example as sketched below.
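For example (a sketch; the names and groupings below are purely illustrative):

```julia
# Hypothetical discovery helpers a beginner could call from the REPL:
models() = Dict(
    :supervised   => [:perceptron, :buildTree, :buildForest, :buildNetwork],
    :unsupervised => [:kmeans, :kmedoids, :em],
)
loss_functions() = [:squaredCost, :crossEntropy]
```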
I found it pretty annoying to split data by hand the way this is done in the notebooks, and even beginners might find it so. One utility function would go a long way to making life easier here (something like the `partition` function in MLJ, which you are welcome to lift; usage sketched below).
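For reference, this is the MLJ usage being suggested (`partition` lives in MLJBase and is re-exported by MLJ):

```julia
using MLJ  # brings partition into scope

y = collect(1:10)
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=123)
# train receives 70% of the indices, test the remaining 30%
```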
The notebooks are not portable as they do not come with a Manifest.toml. One suggestion on how to handle this is here, but you should add a comment in the notebook explaining that the notebook is only valid if it is accompanied by the Manifest.toml (see the sketch below). I think an even better solution is provided by InstantiateFromUrl.jl, but I haven't tried this yet.
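Concretely, the first cell of each notebook could look like this (standard Pkg workflow, assuming a Project.toml and Manifest.toml ship in the same directory as the notebook):

```julia
# NOTE: this notebook is only reproducible together with the Project.toml
# and Manifest.toml files that accompany it.
using Pkg
Pkg.activate(@__DIR__)  # activate the environment next to this notebook
Pkg.instantiate()       # install the exact versions pinned in the Manifest
```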
The name `em` for the expectation-maximization clustering algorithm is very terse and likely to conflict with a user variable. I admit I had to dig up the doc-string to find out what it was.
I would like to thank @ablaom for his thorough review of the software, whose shortcomings I fully acknowledge.
I would like to give the reviewer a detailed reply, but how does this work? Should I just continue on this thread? Or should I wait for a "decision" from the editor?
I'm really sorry for replying late. I'm really swamped and won't be able to review. So sorry.
Suggestions for further reviewers: @ppalmes, @cstjean. These are authors of AutoMLPipeline.jl and ScikitLearn.jl respectively, the toolboxes "competing" with MLJ, with which I am associated. So independent for sure.
Let's wait for another reviewer to join and chime in. Thanks @ablaom for the feedback and suggesting additional reviewers!
👋🏻 @ppalmes and @cstjean would you be interested in reviewing for this submission?
hi all,
thanks @ablaom for suggesting my name. Yes, I'm interested in helping with the review.
@whedon add @ppalmes as reviewer
OK, @ppalmes is now a reviewer
@whedon start review
OK, I've started the review over in https://github.com/openjournals/joss-reviews/issues/2849.
Thanks. @ablaom and @ppalmes let's continue the discussion in #2849.