Joss-reviews: [PRE REVIEW]: BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia

Created on 23 Jul 2020  ·  44 Comments  ·  Source: openjournals/joss-reviews

Submitting author: @sylvaticus (Antonello Lobianco)
Repository: https://github.com/sylvaticus/BetaML.jl
Version: v0.2.2
Editor: @terrytangyuan
Reviewers: @ablaom, @ppalmes
Managing EiC: Kevin M. Moerman

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Author instructions

Thanks for submitting your paper to JOSS @sylvaticus. Currently, there isn't a JOSS editor assigned to your paper.

The author's suggestion for the handling editor is @VivianePons.

@sylvaticus if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, this list of people have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:

@whedon commands

All 44 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

Failed to discover a Statement of need section in paper

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.19 s (208.0 files/s, 89373.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Julia                           18            625            632           2452
Markdown                         8            178              0            394
Jupyter Notebook                 4              0          11796            311
TeX                              1             12              7            104
YAML                             5             11              2             63
TOML                             2              3              0             47
Python                           1             35             41             42
-------------------------------------------------------------------------------
SUM:                            39            864          12478           3413
-------------------------------------------------------------------------------


Statistical information for the repository '2512' was gathered on 2020/07/23.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Antonello Lobianco               4           657            539          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Antonello Lobianco          118           18.0          0.7               18.64

@bmcfee @terrytangyuan @kakiac I realize you are handling several submissions already but could one of you edit this one?

I'd like to hold off on taking any more until the ones I'm already on make more progress.

@whedon invite @terrytangyuan as editor

@terrytangyuan has been invited to edit this submission.

@whedon assign @terrytangyuan as editor

OK, the editor is @terrytangyuan

@sylvaticus if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, this list of people have already agreed to review for JOSS and may be suitable for this submission.

@terrytangyuan Thank you, but I don't have any particular reviewer to suggest... maybe denizyuret, tpapp, dpsanders or h-Klok (all active in various ways on ML in the Julia community)?

👋 Hi @denizyuret @tpapp @dpsanders @h-Klok if you would like to review for this submission, please let us know here! We need at least two reviewers.

Hello, in the meantime I updated the library, adding a module on Decision Trees/Random Forests.
What is the standard procedure for JOSS submissions? Should I also update the paper, or does the paper remain the one for the version of the software at the time of submission?

@sylvaticus You will be able to update both the version of the software and the paper.

Ping @denizyuret @tpapp @dpsanders @h-Klok again, in case this fell off your radar.

Sorry, ML is not my area of expertise.

@terrytangyuan Just a ping here about this pre-review issue. Looks like we are in need of some reviewers here. Thanks!

It's not my area of expertise either. You could try asking e.g. Mike Innes, Anthony Blaom and Thibaut Lienart, who have worked on other ML tools in the Julia ecosystem.

@dpsanders Thanks! Do you have their GitHub IDs by any chance?

MikeInnes, ablaom, tlienart

@MikeInnes @ablaom @tlienart Please let us know if you'd like to review for this submission. Thanks!

I could do this by mid-November, if that is agreeable.

[Paper's author here]
Thank you.
I have updated the paper to include the Decision Trees / Random Forests methods that I have added in the meantime.

@whedon generate pdf

Uh, how do I regenerate the PDF after I changed the paper.md file? I thought it was enough to write "@whedon generate pdf" in this issue comment...

@whedon generate pdf

Is the @whedon command run straight away, or does it take some time before it runs?

Commands to whedon have to be the only or first thing in a new comment.

@whedon generate pdf

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

@terrytangyuan - looks like we have one reviewer here (thanks @ablaom!) and just need to find one more before we can move this to review.

@whedon add @ablaom as reviewer

OK, @ablaom is now a reviewer

Hi @shibabrat @dawbarton @mdavezac @StanczakDominik @amritagos, would any of you be interested in reviewing this submission?

Hey, sorry, but I don't feel well-versed enough in classical ML methods to really be of any use here - and I'm kind of swamped atm, so I have to pass :(

Good luck!

@sylvaticus @terrytangyuan

Here is my review:

What the package provides

The package under review provides pure-Julia implementations of two
tree-based models, three clustering models, a perceptron model (with 3
variations) and a basic neural network model. In passing, it should be
noted that all or almost all of these algorithms have existing Julia
implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl). The package
is used in a course on Machine Learning, but integration between the
package and the course is quite loose, as far as I could ascertain
(more on this below).

Apart from a library of loss functions, the package provides no other
tools. In particular, there are no functions to automate resampling
(such as cross-validation), no hyper-parameter optimization, and no model
composition (pipelining). The quality of the model implementations
looks good to me, although the author warns us that "the code is not
heavily optimized and GPU [for neural networks] is not supported".

Existing machine learning toolboxes in Julia

For context, consider the following multi-paradigm ML
toolboxes written in Julia, which are relatively mature by Julia standards:

package | number of models | resampling | hyper-parameter optimization | composition
-----------------|------------------|-------------|------------------------------|-------------
ScikitLearn.jl | > 150 | yes | yes | basic
AutoMLPipeline.jl| > 100 | no | no | medium
MLJ.jl | 151 | yes | yes | advanced

In addition to these are several excellent and mature packages
dedicated to neural networks, the most popular being the AD-driven
Flux.jl package. So far, these provide limited meta-functionality,
although MLJ now provides an interface to certain classes of Flux
models (MLJFlux) and
ScikitLearn.jl provides interfaces to Python neural network models
sufficient for small datasets and pedagogical use.

Disclaimer: I am a designer/contributor to MLJ.

According to the JOSS requirements, submissions should "Have an
obvious research application." In its current state of maturity,
BetaML is not a serious competitor to the frameworks above for
contributing directly to research. However, the author argues that it
has pedagogical advantages over existing tools.

Value as pedagogical tool

I don't think there are many rigorous machine learning courses
or texts closely integrated with models and tools implemented in Julia,
and it would be useful to have more of these. The degree of integration in this
case was difficult for me to ascertain because I couldn't see how to
access the course notes without formally registering for the course
(which is, however, free). I was also disappointed to find only one
link from doc-strings to course materials; from this "back door" to
the course notes I could find no reference back to the package,
however. Perhaps there is better integration in course exercises? I
couldn't figure this out.

The remaining argument for BetaML's pedagogical value rests on a
number of perceived drawbacks of existing toolboxes, for the
beginner. Quoting from the JOSS manuscript:

  1. "For example the popular Deep Learning library Flux (Mike Innes,
    2018), while extremely performant and flexible, adopts some
    designing choices that for a beginner could appear odd, for example
    avoiding the neural network object from the training process, or
    requiring all parameters to be explicitly defined. In BetaML we
    made the choice to allow the user to experiment with the
    hyperparameters of the algorithms learning them one step at the
    time. Hence for most functions we provide reasonable default
    parameters that can be overridden when needed."

  2. "To help beginners, many parameters and functions have pretty
    longer but more explicit names than usual. For example the Dense
    layer is a DenseLayer, the RBF kernel is radialKernel, etc."

  3. "While avoiding the problem of “reinventing the wheel”, the
    wrapping level unin- tentionally introduces some complications for
    the end-user, like the need to load the models and learn
    MLJ-specific concepts as model or machine. We chose instead to
    bundle the main ML algorithms directly within the package. This
    offers a complementary approach that we feel is more
    beginner-friendly."

Let me respond to these:

  1. These criticisms only apply to dedicated neural network
    packages, such as Flux.jl; all of the toolboxes listed
    above provide default hyper-parameters for every model. In the case
    of neural networks, user-friendly interaction close to the kind
    sought here is available either by using the MLJFlux.jl models
    (available directly through MLJ) or by using the Python models
    provided through ScikitLearn.jl.

  2. Yes, shorter names are obstacles for the beginner but hardly
    insurmountable. For example, one could provide a cheat sheet
    summarizing the models and other functionality needed for the
    machine learning course (and omitting all the rest).

  3. Yes, not needing to load in model code is slightly more
    friendly. On the other hand, in MLJ for example, one can load and
    instantiate a model with a single macro. So the main complication
    is having to ensure the relevant libraries are in your environment. But
    this could be solved easily with a BeginnerPackage which curates
    all the necessary dependencies. I am not convinced beginners should
    find the idea of separating hyper-parameters and learned parameters
    (the "machines" in MLJ) that daunting. I suggest the author's
    criticism may have more to do with their lack of familiarity than a
    difficulty for newcomers, who do not have the same preconceptions
    from using other frameworks. In any case, the point is moot, as one
    can interact with MLJ models directly via a "model" interface and
    ignore machines. To see this, I have translated part of a
    BetaML notebook into MLJ syntax (a minimal sketch of the workflow
    follows this list). There's hardly any difference - if
    anything the presentation is simpler (less hassle when splitting
    data horizontally and vertically).
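
For illustration, here is a minimal sketch of the MLJ workflow being
discussed. It is not the actual notebook translation: the dataset and model
are stand-ins, and the @load syntax varies slightly across MLJ versions.

    # Train and evaluate a decision tree with MLJ (illustrative sketch;
    # assumes MLJ and DecisionTree are installed in the active environment).
    using MLJ

    X, y = @load_iris                          # small built-in demo dataset
    train, test = partition(eachindex(y), 0.7, shuffle=true)

    Tree = @load DecisionTreeClassifier pkg=DecisionTree
    tree = Tree()                              # sensible default hyper-parameters

    mach = machine(tree, X, y)   # a "machine" pairs the model with its learned parameters
    fit!(mach, rows=train)
    yhat = predict_mode(mach, rows=test)       # point predictions on the test rows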

In summary, while existing toolboxes might present a course instructor
with a few challenges, these are hardly game-changers. The advantages of
introducing a student to a powerful, mature, professional toolbox ab
initio far outweigh any drawbacks, in my view.

Conclusions

To meet the requirements of JOSS, I think either: (i) the BetaML
package needs to demonstrate tighter integration with easily
accessible course materials; or (ii) BetaML needs very substantial
enhancements to make it competitive with existing toolboxes.

Frankly, I believe a greater service to the Julia open-source software
community would be to integrate the author's course materials with one
of the mature ML toolboxes. In the case of MLJ, I would be more than
happy to provide guidance for such a project.


Minor suggestions for improvements

I didn't have too much trouble installing the package or running the
demos, except when running a notebook on top of an existing Julia
environment (see comment below).

  • The README.md should provide links to the toolboxes listed in
    the table above, for the student who "graduates" from BetaML.

  • Some or most intended users will be new to Julia, so I suggest
    including with the installation instructions something about how to
    set up a Julia environment that includes BetaML (see the environment
    sketch after this list, for example).

  • I found it weird that the front-facing demo is an unsupervised
    model. A more "Hello World" example might be to train a Decision
    Tree.

  • The way users load the built-in datasets seems pretty awkward. Maybe
    just define some functions to do this? E.g.,
    load_bike_sharing() (see the loader sketch after this list). It might
    be instructive to have examples where data is pulled in using
    RDatasets, UrlDownload or similar.

  • A cheat-sheet summarizing the model fitting functions and the loss
    functions would be helpful. Or you could have functions models() and
    loss_functions() that list these.

  • I found it pretty annoying to split data by hand the way this is
    done in the notebooks, and even beginners might find this
    annoying. A single utility function would go a long way to making
    life easier here (something like the partition function in MLJ,
    which you are welcome to lift; see the sketch after this list).

  • The notebooks are not portable as they do not come with a
    Manifest.toml. One suggestion on how to handle this is here,
    but you should add a comment in the notebook explaining that the
    notebook is only valid if it is accompanied by the Manifest.toml. I
    think an even better solution is provided by InstantiateFromUrl.jl,
    but I haven't tried this yet.

  • The name em for the expectation-maximization clustering algorithm
    is very terse, and likely to conflict with a user variable. I admit, I had
    to dig up the doc-string to find out what it was.
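
As an example of the environment setup suggested above, something along
these lines could go in the installation instructions (the environment name
is illustrative; the Pkg commands are standard):

    # Create a dedicated project environment and install BetaML into it.
    using Pkg
    Pkg.activate("mlcourse")   # creates ./mlcourse with its own Project.toml
    Pkg.add("BetaML")          # records BetaML as a dependency of that environment
    using BetaML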
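
A hypothetical convenience loader of the kind suggested for the built-in
datasets (the function name comes from this review; the file name and
location are illustrative, not BetaML's actual layout):

    # Load a bundled CSV dataset and return (data, column_names).
    using DelimitedFiles

    function load_bike_sharing(dir = joinpath(@__DIR__, "data"))
        data, header = readdlm(joinpath(dir, "bike_sharing_day.csv"), ',', header=true)
        return data, vec(header)
    end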
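
And a sketch of the data-splitting utility suggested above, loosely modeled
on MLJ's partition function (hypothetical; not BetaML's current API):

    # Split a collection of row indices into two disjoint parts.
    using Random

    function partition(rows, fraction; shuffle=false, rng=Random.GLOBAL_RNG)
        rows = collect(rows)
        shuffle && Random.shuffle!(rng, rows)
        n = round(Int, fraction * length(rows))
        return rows[1:n], rows[n+1:end]        # e.g. (train, test) indices
    end

    # Usage: train, test = partition(1:150, 0.7, shuffle=true)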

I would like to thank @ablaom for his thorough review of the software, whose shortcomings I fully acknowledge.

I would like to give the reviewer a detailed reply, but how does it work? Should I just continue on this thread? Or should I wait for a "decision" from the editor?

I'm really sorry for replying late. I'm really swamped and won't be able to review. So sorry.

Suggestions for further reviewers: @ppalmes, @cstjean. These are authors of AutoMLPipeline.jl and ScikitLearn.jl respectively, the "competing" toolboxes to MLJ, with which I am associated. So they are independent for sure.

Let's wait for another reviewer to join and chime in. Thanks @ablaom for the feedback and suggesting additional reviewers!

👋🏻 @ppalmes and @cstjean would you be interested in reviewing for this submission?

Hi all,

Thanks @ablaom for suggesting my name. Yes, I'm interested in helping with the review.

@whedon add @ppalmes as reviewer

OK, @ppalmes is now a reviewer

@whedon start review

OK, I've started the review over in https://github.com/openjournals/joss-reviews/issues/2849.

Thanks. @ablaom and @ppalmes let's continue the discussion in #2849.
