Joss-reviews: [PRE REVIEW]: BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia

Created on 23 Jul 2020  ·  44 Comments  ·  Source: openjournals/joss-reviews

Submitting author: @sylvaticus (Antonello Lobianco)
Repository: https://github.com/sylvaticus/BetaML.jl
Version: v0.2.2
Editor: @terrytangyuan
Reviewers: @ablaom, @ppalmes
Managing EiC: Kevin M. Moerman

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Author instructions

Thanks for submitting your paper to JOSS @sylvaticus. Currently, there isn't a JOSS editor assigned to your paper.

The author's suggestion for the handling editor is @VivianePons.

@sylvaticus if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, this list of people have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:

@whedon commands

All 44 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

Failed to discover a Statement of need section in paper

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.19 s (208.0 files/s, 89373.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Julia                           18            625            632           2452
Markdown                         8            178              0            394
Jupyter Notebook                 4              0          11796            311
TeX                              1             12              7            104
YAML                             5             11              2             63
TOML                             2              3              0             47
Python                           1             35             41             42
-------------------------------------------------------------------------------
SUM:                            39            864          12478           3413
-------------------------------------------------------------------------------


Statistical information for the repository '2512' was gathered on 2020/07/23.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Antonello Lobianco               4           657            539          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Antonello Lobianco          118           18.0          0.7               18.64

@bmcfee @terrytangyuan @kakiac I realize you are handling several submissions already but could one of you edit this one?

I'd like to hold off on taking any more until the ones I'm already on make more progress.

@whedon invite @terrytangyuan as editor

@terrytangyuan has been invited to edit this submission.

@whedon assign @terrytangyuan as editor

OK, the editor is @terrytangyuan

@sylvaticus if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, this list of people have already agreed to review for JOSS and may be suitable for this submission.

@terrytangyuan Thank you, but I don't have any particular reviewer to suggest... maybe denizyuret, tpapp, dpsanders or h-Klok (all active in various ways on ML in the Julia community)?

👋 Hi @denizyuret @tpapp @dpsanders @h-Klok if you would like to review for this submission, please let us know here! We need at least two reviewers.

Hello, in the meantime I updated the library, adding a module on Decision Trees/Random Forests.
What is the standard procedure for JOSS submissions? Should I also update the paper, or does the paper remain the one for the version of the software at the time of submission?

@sylvaticus You will be able to update both the version of the software and the paper.

Ping @denizyuret @tpapp @dpsanders @h-Klok again, in case this fell off your radar.

Sorry, ML is not my area of expertise.

@terrytangyuan Just a ping here about this pre-review issue. Looks like we are in need of some reviewers here. Thanks!

It's not my area of expertise either. You could try asking e.g. Mike Innes, Anthony Blaom and Thibaut Lienart, who have worked on other ML tools in the Julia ecosystem.

@dpsanders Thanks! Do you have their GitHub IDs by any chance?

MikeInnes, ablaom, tlienart

@MikeInnes @ablaom @tlienart Please let us know if you'd like to review for this submission. Thanks!

I could do this by mid-November, if that is agreeable.

[Paper's author here]
Thank you.
I have updated the paper to include the Decision Trees / Random Forests methods that I have added in the meantime.

@whedon generate pdf

Uh, how do I regenerate the PDF after I changed the paper.md file? I thought it was enough to write "@whedon generate pdf" in this issue comment...

@whedon generate pdf

Is the @whedon command run straight away, or does it take some time before it runs?

Commands to whedon have to be the only or first thing in a new comment.

@whedon generate pdf

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

@terrytangyuan - looks like we have one reviewer here (thanks @ablaom!) and just need to find one more before we can move this to review.

@whedon add @ablaom as reviewer

OK, @ablaom is now a reviewer

Hi @shibabrat @dawbarton @mdavezac @StanczakDominik @amritagos, would any of you be interested in reviewing this submission?

Hey, sorry, but I don't feel well-versed enough in classical ML methods to really be of any use here - and I'm kind of swamped atm, so I have to pass :(

Good luck!

@sylvaticus @terrytangyuan

Here is my review:

What the package provides

The package under review provides pure-Julia implementations of two
tree-based models, three clustering models, a perceptron model (with 3
variations) and a basic neural network model. In passing, it should be
noted that all or almost all of these algorithms have existing Julia
implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl). The package
is used in a course on Machine Learning, but integration between the
package and the course is quite loose, as far as I could ascertain
(more on this below).

Apart from a library of loss functions, the package provides no other
tools. In particular, there are no functions to automate resampling
(such as cross-validation), no hyper-parameter optimization, and no model
composition (pipelining). The quality of the model implementations
looks good to me, although the author warns us that "the code is not
heavily optimized and GPU [for neural networks] is not supported".

Existing machine learning toolboxes in Julia

For context, consider the following multi-paradigm ML
toolboxes written in Julia, which are relatively mature by Julia standards:

package | number of models | resampling | hyper-parameter optimization | composition
-----------------|------------------|-------------|------------------------------|-------------
ScikitLearn.jl | > 150 | yes | yes | basic
AutoMLPipeline.jl| > 100 | no | no | medium
MLJ.jl | 151 | yes | yes | advanced

In addition to these are several excellent and mature packages
dedicated to neural networks, the most popular being the AD-driven
Flux.jl package. So far, these provide limited meta-functionality,
although MLJ now provides an interface to certain classes of Flux
models (MLJFlux) and
ScikitLearn.jl provides interfaces to Python neural network models
sufficient for small datasets and pedagogical use.

Disclaimer: I am a designer/contributor to MLJ.

According to the JOSS requirements, submissions should "Have an
obvious research application." In its current state of maturity,
BetaML is not a serious competitor to the frameworks above for
contributing directly to research. However, the author argues that it
has pedagogical advantages over existing tools.

Value as pedagogical tool

I don't think there are many rigorous machine learning courses
or texts closely integrated with models and tools implemented in Julia,
and it would be useful to have more of these. The degree of integration in this
case was difficult for me to ascertain because I couldn't see how to
access the course notes without formally registering for the course
(which is, however, free). I was also disappointed to find only one
link from doc-strings to course materials; from this "back door" to
the course notes I could find no reference back to the package,
however. Perhaps there is better integration in course exercises? I
couldn't figure this out.

The remaining argument for BetaML's pedagogical value rests on a
number of perceived drawbacks of existing toolboxes, for the
beginner. Quoting from the JOSS manuscript:

  1. "For example the popular Deep Learning library Flux (Mike Innes,
    2018), while extremely performant and flexible, adopts some
    designing choices that for a beginner could appear odd, for example
    avoiding the neural network object from the training process, or
    requiring all parameters to be explicitly defined. In BetaML we
    made the choice to allow the user to experiment with the
    hyperparameters of the algorithms learning them one step at the
    time. Hence for most functions we provide reasonable default
    parameters that can be overridden when needed."

  2. "To help beginners, many parameters and functions have pretty
    longer but more explicit names than usual. For example the Dense
    layer is a DenseLayer, the RBF kernel is radialKernel, etc."

  3. "While avoiding the problem of “reinventing the wheel”, the
    wrapping level unin- tentionally introduces some complications for
    the end-user, like the need to load the models and learn
    MLJ-specific concepts as model or machine. We chose instead to
    bundle the main ML algorithms directly within the package. This
    offers a complementary approach that we feel is more
    beginner-friendly."

Let me respond to these:

  1. These criticisms only apply to dedicated neural network
    packages, such as Flux.jl; all of the toolboxes listed
    above provide default hyper-parameters for every model. In the case
    of neural networks, user-friendly interaction close to the kind
    sought here is available either by using the MLJFlux.jl models
    (available directly through MLJ) or by using the Python models
    provided through ScikitLearn.jl.

  2. Yes, shorter names are obstacles for the beginner but hardly
    insurmountable. For example, one could provide a cheat sheet
    summarizing the models and other functionality needed for the
    machine learning course (and omitting all the rest).

  3. Yes, not needing to load in model code is slightly more
    friendly. On the other hand, in MLJ for example, one can load and
    instantiate a model with a single macro. So the main complication
    is having to ensure the relevant libraries are in your environment. But
    this could be solved easily with a BeginnerPackage which curates
    all the necessary dependencies. I am not convinced beginners should
    find the idea of separating hyper-parameters and learned parameters
    (the "machines" in MLJ) that daunting. I suggest the author's
    criticism may have more to do with their lack of familiarity than a
    difficulty for newcomers, who do not have the same preconceptions
    from using other frameworks. In any case, the point is moot, as one
    can interact with MLJ models directly via a "model" interface and
    ignore machines. To see this, I have translated part of a
    BetaML notebook into MLJ syntax (a minimal sketch of the workflow
    follows this list). There's hardly any difference - if
    anything the presentation is simpler (less hassle when splitting
    data horizontally and vertically).
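
For illustration, here is a minimal sketch of the MLJ workflow being
discussed. It is not the actual notebook translation: the dataset and model
are stand-ins, and the @load syntax varies slightly across MLJ versions.

    # Train and evaluate a decision tree with MLJ (illustrative sketch;
    # assumes MLJ and DecisionTree are installed in the active environment).
    using MLJ

    X, y = @load_iris                          # small built-in demo dataset
    train, test = partition(eachindex(y), 0.7, shuffle=true)

    Tree = @load DecisionTreeClassifier pkg=DecisionTree
    tree = Tree()                              # sensible default hyper-parameters

    mach = machine(tree, X, y)   # a "machine" pairs the model with its learned parameters
    fit!(mach, rows=train)
    yhat = predict_mode(mach, rows=test)       # point predictions on the test rows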

In summary, while existing toolboxes might present a course instructor
with a few challenges, these are hardly game-changers. The advantages of
introducing a student to a powerful, mature, professional toolbox ab
initio far outweigh any drawbacks, in my view.

Conclusions

To meet the requirements of JOSS, I think either: (i) the BetaML
package needs to demonstrate tighter integration with easily
accessible course materials; or (ii) BetaML needs very substantial
enhancements to make it competitive with existing toolboxes.

Frankly, I believe a greater service to the Julia open-source software
community would be to integrate the author's course materials with one
of the mature ML toolboxes. In the case of MLJ, I would be more than
happy to provide guidance for such a project.


Minor suggestions for improvements

I didn't have too much trouble installing the package or running the
demos, except when running a notebook on top of an existing Julia
environment (see comment below).

  • The README.md should provide links to the toolboxes listed in
    the table above, for the student who "graduates" from BetaML.

  • Some or most intended users will be new to Julia, so I suggest
    including with the installation instructions something about how to
    set up a Julia environment that includes BetaML (see the environment
    sketch after this list, for example).

  • I found it weird that the front-facing demo is an unsupervised
    model. A more "Hello World" example might be to train a Decision
    Tree.

  • The way users load the built-in datasets seems pretty awkward. Maybe
    just define some functions to do this? E.g.,
    load_bike_sharing() (see the loader sketch after this list). It might
    be instructive to have examples where data is pulled in using
    RDatasets, UrlDownload or similar.

  • A cheat-sheet summarizing the model fitting functions and the loss
    functions would be helpful. Or you could have functions models() and
    loss_functions() that list these.

  • I found it pretty annoying to split data by hand the way this is
    done in the notebooks, and even beginners might find this
    annoying. A single utility function would go a long way to making
    life easier here (something like the partition function in MLJ,
    which you are welcome to lift; see the sketch after this list).

  • The notebooks are not portable as they do not come with a
    Manifest.toml. One suggestion on how to handle this is here,
    but you should add a comment in the notebook explaining that the
    notebook is only valid if it is accompanied by the Manifest.toml. I
    think an even better solution is provided by InstantiateFromUrl.jl,
    but I haven't tried this yet.

  • The name em for the expectation-maximization clustering algorithm
    is very terse, and likely to conflict with a user variable. I admit, I had
    to dig up the doc-string to find out what it was.
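
As an example of the environment setup suggested above, something along
these lines could go in the installation instructions (the environment name
is illustrative; the Pkg commands are standard):

    # Create a dedicated project environment and install BetaML into it.
    using Pkg
    Pkg.activate("mlcourse")   # creates ./mlcourse with its own Project.toml
    Pkg.add("BetaML")          # records BetaML as a dependency of that environment
    using BetaML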
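
A hypothetical convenience loader of the kind suggested for the built-in
datasets (the function name comes from this review; the file name and
location are illustrative, not BetaML's actual layout):

    # Load a bundled CSV dataset and return (data, column_names).
    using DelimitedFiles

    function load_bike_sharing(dir = joinpath(@__DIR__, "data"))
        data, header = readdlm(joinpath(dir, "bike_sharing_day.csv"), ',', header=true)
        return data, vec(header)
    end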
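
And a sketch of the data-splitting utility suggested above, loosely modeled
on MLJ's partition function (hypothetical; not BetaML's current API):

    # Split a collection of row indices into two disjoint parts.
    using Random

    function partition(rows, fraction; shuffle=false, rng=Random.GLOBAL_RNG)
        rows = collect(rows)
        shuffle && Random.shuffle!(rng, rows)
        n = round(Int, fraction * length(rows))
        return rows[1:n], rows[n+1:end]        # e.g. (train, test) indices
    end

    # Usage: train, test = partition(1:150, 0.7, shuffle=true)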

I would like to thank @ablaom for his thorough review of the software, whose shortcomings I fully acknowledge.

I would like to give the reviewer a detailed reply, but how does it work? Should I just continue on this thread? Or should I wait for a "decision" from the editor?

I'm really sorry for replying late. I'm really swamped and won't be able to review. So sorry.

Suggestions for further reviewers: @ppalmes, @cstjean. These are authors of AutoMLPipeline.jl and ScikitLearn.jl respectively, the "competing" toolboxes to MLJ, with which I am associated. So they are independent for sure.

Let's wait for another reviewer to join and chime in. Thanks @ablaom for the feedback and suggesting additional reviewers!

👋🏻 @ppalmes and @cstjean would you be interested in reviewing for this submission?

Hi all,

Thanks @ablaom for suggesting my name. Yes, I'm interested in helping with the review.

@whedon add @ppalmes as reviewer

OK, @ppalmes is now a reviewer

@whedon start review

OK, I've started the review over in https://github.com/openjournals/joss-reviews/issues/2849.

Thanks. @ablaom and @ppalmes let's continue the discussion in #2849.
