Submitting author: @ablaom (Anthony Blaom)
Repository: https://github.com/alan-turing-institute/MLJ.jl
Version: v0.14.1
Editor: @terrytangyuan
Reviewer: @degleris1, @henrykironde
Archive: 10.5281/zenodo.4178917
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
Status badge code:
HTML: <a href="https://joss.theoj.org/papers/b91ac74fd3da4fcc9ac34dc4def63d3c"><img src="https://joss.theoj.org/papers/b91ac74fd3da4fcc9ac34dc4def63d3c/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b91ac74fd3da4fcc9ac34dc4def63d3c/status.svg)](https://joss.theoj.org/papers/b91ac74fd3da4fcc9ac34dc4def63d3c)
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
@degleris1 & @henrykironde, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @terrytangyuan know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @degleris1, @henrykironde it looks like you're currently assigned to review this paper :tada:.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿
To fix this do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@terrytangyuan could you give me permission to edit the checklist?
@arfon Could you take a look at the request above? Thanks.
@henrykironde - please click the link above (https://github.com/openjournals/joss-reviews/invitations) to accept the invite to the repository. Then you will be able to edit the checklist.
Contribution and authorship:
What criteria were used for authorship selection?
License:
Could you edit the licence file and remove '>'
The paper should be between 250-1000 words:
I will need to obtain a recommendation for the paper length from the team.
Installation instructions:
These can be made clearer. Installation instructions shown in the README right after the intro tend to be clearer to users.
List of dependencies?
The only issue with the dependencies is already stated in the README; this lets users know the status when installing.
Example usage:
I expect a user to be able to run a sample example by reading the `README` file in a few minutes.
I recommend that you move the `List of Wrapped Models` to another location in the docs, but provide the link in the `README`, and then focus on one or more sample runs. I have tried to run the code and faced some difficulties:
Functionality documentation:
Functions should have docstrings, which are automatically turned into documentation.
A good example is listed
Community guidelines:
I should be able to see a developer doc section, in case I want to clone this repo and change a few things.
I should be able to install the source from the current directory and thereafter run the code with the changes.
I also expect the user to have guidelines on how to run tests.
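For example, a minimal sketch of what such test instructions could look like, assuming the standard Julia package workflow applies here:

```julia
using Pkg

# run the test suite of the registered package
Pkg.test("MLJ")

# or, for a locally cloned copy, from the repository root:
Pkg.develop(path=".")
Pkg.test("MLJ")
```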
I have provided a sample run here; I hope we can make it smooth.
1) Package fails to install from GitHub
docker run -it --rm julia
Unable to find image 'julia:latest' locally
latest: Pulling from library/julia
d121f8d1c412: Pull complete
a27e46e0dbc6: Pull complete
b6a50acaca53: Pull complete
Digest: sha256:f8a867690b5e341a66b20af1b0c09216367805e0eb87f6e059c13b683804b681
Status: Downloaded newer image for julia:latest
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.2 (2020-09-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using Pkg;
julia> Pkg.develop(PackageSpec(url="https://github.com/tlienart/OpenSpecFun_jll.jl"))
Cloning git-repo `https://github.com/tlienart/OpenSpecFun_jll.jl`
Resolving package versions...
Installing known registries into `~/.julia`
######################################################################## 100.0%
Added registry `General` to `~/.julia/registries/General`
Downloading artifact: OpenSpecFun
Updating `~/.julia/environments/v1.5/Project.toml`
[efe28fd5] + OpenSpecFun_jll v0.5.3+1 `~/.julia/dev/OpenSpecFun_jll`
Updating `~/.julia/environments/v1.5/Manifest.toml`
[efe28fd5] + OpenSpecFun_jll v0.5.3+1 `~/.julia/dev/OpenSpecFun_jll`
[2a0f44e3] + Base64
[ade2ca70] + Dates
.....
[4ec0a83e] + Unicode
julia> using MLJ
ERROR: ArgumentError: Package MLJ not found in current path:
- Run `import Pkg; Pkg.add("MLJ")` to install the MLJ package.
Stacktrace:
[1] require(::Module, ::Symbol) at ./loading.jl:893
2) Package installed from the release but example does not mention Pkg.add("EvoTrees")
julia> import Pkg; Pkg.add("MLJ")
Updating registry at `~/.julia/registries/General`
Resolving package versions...
Installed MLJModelInterface ─────────── v0.3.5
.....
julia> using MLJ
[ Info: Precompiling MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]
[ Info: Model metadata loaded from registry.
julia> X, y = @load_reduced_ames;
julia> booster = @load EvoTreeRegressor
ERROR: LoadError: ArgumentError: Package EvoTrees not found in current path:
- Run `import Pkg; Pkg.add("EvoTrees")` to install the EvoTrees package.
After a while, I managed to set up the example correctly.
julia> using MLJ
[ Info: Precompiling MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]
[ Info: Model metadata loaded from registry.
julia> Pkg.add("EvoTrees")
Resolving package versions...
Installed CEnum ────────── v0.4.1
...........
Installed GPUArrays ────── v5.2.1
Installed CUDA ─────────── v1.3.3
Updating `~/.julia/environments/v1.5/Project.toml`
[f6006082] + EvoTrees v0.5.0
Updating `~/.julia/environments/v1.5/Manifest.toml`
[621f4979] + AbstractFFTs v0.5.0
........
[872c559c] + NNlib v0.7.4
[a759f4b9] + TimerOutputs v0.5.6
julia> X, y = @load_reduced_ames;
julia> booster = @load EvoTreeRegressor
[ Info: Precompiling EvoTrees [f6006082-12f8-11e9-0c9c-0d5d367ab1e5]
EvoTreeRegressor(
loss = EvoTrees.Linear(),
nrounds = 10,
λ = 0.0f0,
γ = 0.0f0,
η = 0.1f0,
max_depth = 5,
min_weight = 1.0f0,
rowsample = 1.0f0,
colsample = 1.0f0,
nbins = 64,
α = 0.5f0,
metric = :mse,
rng = MersenneTwister(UInt32[0x000001bc]) @ 1002) @281
julia> booster.max_depth = 2
2
julia> booster.nrounds=50
50
julia> pipe = @pipeline ContinuousEncoder booster
Pipeline255(
continuous_encoder = ContinuousEncoder(
drop_last = false,
one_hot_ordered_factors = false),
evo_tree_regressor = EvoTreeRegressor(
loss = EvoTrees.Linear(),
nrounds = 50,
λ = 0.0f0,
γ = 0.0f0,
η = 0.1f0,
max_depth = 2,
min_weight = 1.0f0,
rowsample = 1.0f0,
colsample = 1.0f0,
nbins = 64,
α = 0.5f0,
metric = :mse,
rng = MersenneTwister(UInt32[0x000001bc]) @ 1002)) @701
julia> max_depth_range = range(pipe,
:(evo_tree_regressor.max_depth),
lower = 1,
upper = 10)
MLJBase.NumericRange(Int64, :(evo_tree_regressor.max_depth), ... )
julia> self_tuning_pipe = TunedModel(model=pipe,
tuning=RandomSearch(),
ranges = max_depth_range,
resampling=CV(nfolds=3, rng=456),
measure=l1,
acceleration=CPUThreads(),
n=50)
DeterministicTunedModel(
model = Pipeline255(
continuous_encoder = ContinuousEncoder @343,
evo_tree_regressor = EvoTreeRegressor{Float32,…} @281),
tuning = RandomSearch(
bounded = Distributions.Uniform,
positive_unbounded = Distributions.Gamma,
other = Distributions.Normal,
rng = Random._GLOBAL_RNG()),
resampling = CV(
nfolds = 3,
shuffle = true,
rng = MersenneTwister(UInt32[0x000001c8]) @ 1002),
measure = l1(),
weights = nothing,
operation = MLJModelInterface.predict,
range = NumericRange(
field = :(evo_tree_regressor.max_depth),
lower = 1,
upper = 10,
origin = 5.5,
unit = 4.5,
scale = :linear),
selection_heuristic = MLJTuning.NaiveSelection(nothing),
train_best = true,
repeats = 1,
n = 50,
acceleration = CPUThreads{Int64}(1),
acceleration_resampling = CPU1{Nothing}(nothing),
check_measure = true) @976
julia> mach = machine(self_tuning_pipe, X, y)
Machine{DeterministicTunedModel{RandomSearch,…}} @419 trained 0 times.
args:
1: Source @746 ⏎ `Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}, AbstractArray{Multiclass{25},1}, AbstractArray{Multiclass{15},1}, AbstractArray{OrderedFactor{10},1}}}`
2: Source @854 ⏎ `AbstractArray{Continuous,1}`
julia> evaluate!(mach,
measures=[l1, l2],
resampling=CV(nfolds=6, rng=123),
acceleration=CPUProcesses(), verbosity=2)
[ Info: Distributing evaluations among 1 workers.
[ Info: Training Machine{DeterministicTunedModel{RandomSearch,…}} @419.
.........
Evaluating over 50 metamodels: 100%[=========================] Time: 0:00:11
Evaluating over 6 folds: 100%[=========================] Time: 0:02:26
┌────────────┬─────────────────┬─────────────────────────────────────────────────────────┐
│ _.measure  │ _.measurement   │ _.per_fold                                              │
├────────────┼─────────────────┼─────────────────────────────────────────────────────────┤
│ l1         │ 16700.0         │ [17000.0, 16400.0, 14500.0, 16200.0, 16400.0, 19500.0] │
│ l2         │ 6.35e8          │ [6.13e8, 6.81e8, 4.35e8, 5.63e8, 5.98e8, 9.18e8]       │
└────────────┴─────────────────┴─────────────────────────────────────────────────────────┘
_.per_observation = [[[27700.0, 21400.0, ..., 11200.0], [12100.0, 1330.0, ..., 13200.0], [6490.0, 22000.0, ..., 13800.0], [12400.0, 7140.0, ..., 13000.0], [50800.0, 22700.0, ..., 1550.0], [32800.0, 4940.0, ..., 1110.0]], [[7.68e8, 4.59e8, ..., 1.26e8], [1.46e8, 1.77e6, ..., 1.73e8], [4.22e7, 4.86e8, ..., 1.9e8], [1.55e8, 5.09e7, ..., 1.7e8], [2.58e9, 5.13e8, ..., 2.42e6], [1.07e9, 2.44e7, ..., 1.24e6]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
julia>
If we set up the docs well, we can reduce the time it takes users to go from landing on the project to getting a sample run working.
Just finished my review. The package looks fantastic. I love the idea of scientific types. Everything (installation, tests, simple demos, defining a custom model) worked smoothly on my machine.
The paper does seem a bit long for JOSS. Could you say why?
Some really minor comments. These are things that you by no means have to address.
As has been pointed out, the installation instructions are 2 link clicks away from the README. I don't have a problem with it, but it might be nice to have something right in the README, e.g.
Open Julia and press
]
to open the package manager, then run
pkg> add MLJ
to install MLJ.
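A non-interactive equivalent (just a minimal sketch of the same step) would be:

```julia
using Pkg
Pkg.add("MLJ")   # afterwards, `using MLJ` loads the package
```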
In the FAQ, I noticed you used "his code" to refer to the user. It might be nice to change it to "their code" or "the user's code." As far as I can tell, it's actually the only gendered pronoun in entire docs / paper.
Is there a good reason why I have to type `verbosity` to `Int`? It seems a little bit random from a user perspective, but I'm sure there's a good reason for it.
@terrytangyuan, what do you think about the length of the submitted paper?
@henrykironde Thank you for taking the time to review this software and for the
helpful feedback.
I am copying my co-authors in case they wish to make additional comments.
@tlienart @vollmersj @darenasc @fkiraly
Contribution and authorship:
What criteria were used for authorship selection?
Anyone making significant contribution to the design of the software.
License:
Could you edit the licence file and remove '>'
Done.
The paper should be between 250-1000 words:
I will need to obtain a recommendation for the paper length from the team.
It would be good to know if there's some flexibility here. This is a
relatively large collection of packages. (You can get a sense of this
by looking at the size of the documentation, and from the code organisation). Of course, the paper is
not trying to duplicate documentation but is focused on outlining the
high-level design and feature set.
Functionality documentation:
Functions should have docstrings, which are automatically turned into documentation. A good example is listed
Sorry, not sure I follow. Function docstrings are indeed automatically
interpolated into the
documentation
(using Documenter.jl). The only function docstrings missing that I am
aware of are those for "private" methods. Are you suggesting adding
these to the docs?
In any case, an index of all referenced function docstrings has been added to a new section of the docs.
*Installation instructions:*
These can be made clearer. Installation instructions shown in the README right after the intro tend to be clearer to users.
*Example usage:*
I expect a user to be able to run a sample example by reading the `README` file in a few minutes. I recommend that you move the `List of Wrapped Models` to another location in the docs, but provide the link in the `README`, and then focus on one or more sample runs. I have tried to run the code and faced some difficulties:
*Community guidelines:*
I should be able to see a developer doc section, in case I want to clone this repo and change a few things. I should be able to install the source from the current directory and thereafter run the code with the changes. I also expect the user to have guidelines on how to run tests.
Some context: Following a suggestion in
https://github.com/alan-turing-institute/MLJ.jl/issues/530 it was
decided that the landing page for general users of this software
should be the GitHub Pages
site, rather
than the README.md, which is common for larger packages such as this
one.
To address the issues you raise here, we've made the following changes to the README and manual (GitHub Pages) (see the October 12th/13th commits):
The "lightning tour" is now self-contained, in the sense that code required for
installing all needed libraries is included.
A link to the installation instructions for the Julia language has been added to that section.
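For illustration only, a self-contained preamble along these lines is presumably what is meant (the environment name and the extra model package here are hypothetical examples, not the exact manual text):

```julia
using Pkg
Pkg.activate("MLJ_tour", shared=true)   # throwaway environment, so the tour does not touch the default one
Pkg.add(["MLJ", "EvoTrees"])            # MLJ itself plus a model-providing package used in the examples
```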
The Installation section now includes instructions on how to test the MLJ installation.
As suggested, the list of supported models has been moved from the
README.md to a new section of the manual. The remaining information in the README.md is only
for developers of the core MLJ ecosystem, as now explained
there. Developers who specifically want to integrate new third party
models into MLJ are now directed from the README.md to the relevant
section of the manual (which includes a detailed
description of the model API). Conversely, the manual now has a "For Developers" link at the top of the page which directs them to the README/repository. A Customizing Behavior section has been added to the README for those who want to replace a component library with a local fork they have edited.
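By way of illustration, swapping in a local fork would presumably look something like the following sketch (the path is a placeholder, not taken from the actual section):

```julia
using Pkg
Pkg.develop(path="/home/me/MLJBase.jl")   # point the active environment at the edited local fork
# ... experiment with the fork ...
Pkg.free("MLJBase")                        # later, return to the registered version
```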
@degleris1 Thank you for taking the time to review this software and for the
positive feedback.
I am copying my co-authors in case they wish to make additional comments.
@tlienart @vollmersj @darenasc @fkiraly
Some really minor comments. These are things that you by no means have to address.
- As has been pointed out, the installation instructions are 2 link clicks away from the README. I don't have a problem with it, but it might be nice to have something right in the README, e.g.
See my preceding comment about GitHub Pages being the landing page for general users. Also, the lightning tour at the top of the landing page now includes installation instructions.
- In the FAQ, I noticed you used "his code" to refer to the user. It might be nice to change it to "their code" or "the user's code." As far as I can tell, it's actually the only gendered pronoun in entire docs / paper.
Good catch. Corrected thanks.
- Is there a good reason why I have to type `verbosity` to `Int`? It seems a little bit random from a user perspective, but I'm sure there's a good reason for it.
Very observant. There used to be a method ambiguity, which has now been removed. I've updated the manual accordingly.
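To illustrate, a hedged sketch based on the evaluation call from the transcript above: an integer literal should now be accepted directly, with no explicit conversion:

```julia
evaluate!(mach,
          measures=[l1, l2],
          resampling=CV(nfolds=6),
          verbosity=1)   # plain integer literal, no Int(...) conversion needed
```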
@openjournals/joss-eics Is there any requirement on the length of the JOSS paper?
yes, see https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain
we typically are somewhat flexible on the upper bound - but this paper is too long. Detailed content should be in the documentation, not the paper. The paper is not the user guide.
we typically are somewhat flexible on the upper bound - but this paper is too long. Detailed content should be in the documentation, not the paper. The paper is not the user guide.
I don't agree that the paper duplicates the documentation. Rather it discusses the high level design, focusing on those aspects that are novel or different from other machine learning toolboxes.
That said, after more carefully reading the link above, it seems this forum is not intended for design discussions. Contributions are more in the way of "announcements" than "papers". Yes?
Over the past few days I have worked to condense the material: the arXiv version is 7.5 pages (excluding references) - the revised version in the same format is 5.5 pages. If this is an acceptable length - and you are happy with a "design paper" after all - let me know and I will update the markdown version. Otherwise we will withdraw the manuscript and try a more suitable forum for publication.
In any case, my thanks again to the referees and JOSS team for your time.
Right - JOSS papers are not about the process of developing the software, they are about the produced software, aimed at users of the software.
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
I'll turn this back over to the editor and reviewers for further discussion, if needed
@terrytangyuan Although it is still a pretty long paper, I think the length is understandable given the scope of the package. The content is indeed aimed at the user, describing the key functionality of the package. I strongly support accepting the paper.
I am okay with it. @henrykironde Any concerns on this or any other items in the checklist?
Yes, the last concern is on Contribution and authorship; I wanted to get feedback on the criteria for authorship that were used.
@ablaom Could you share your criteria on the authorship of this paper?
This was answered above.
Contribution and authorship:
What criteria were used for authorship selection?
Anyone making significant contribution to the design of the software.
@henrykironde Do you have more specific questions?
Thanks @ablaom, I do think the author list needs to be improved. Comparing these individuals:
https://github.com/alan-turing-institute/MLJ.jl/graphs/contributors
darenasc
https://github.com/alan-turing-institute/MLJ.jl/commits?author=darenasc
OkonSamuel
https://github.com/alan-turing-institute/MLJ.jl/commits?author=OkonSamuel
vollmersj
https://github.com/alan-turing-institute/MLJ.jl/commits?author=vollmersj
Based on code functionality and bug fixes (not including documentation or number of commits), I do think
OkonSamuel should also be among the list of authors, unless the author declines. Let me know what you think.
@henrykironde Do you mean someone else instead of OkonSamuel? Looks like he only contributed typo fixes and link updates in the documentation.
You need, of course, to look at all the repos. There are a dozen or so now. The core code lives at MLJBase.jl, the largest repo. OkonSamuel has indeed made substantial contributions to the project - maintenance, issue resolution, new model interfaces, adding functionality, and a smaller amount of design work. As I said, the criterion for inclusion was around substantial design rather than code (one author contributed very substantially to design but made, I think, 0 commits).
I would be very happy to add OkonSamuel but need to check with the other authors.
As I said, the criterion for inclusion was around substantial design rather than code (one author contributed very substantially to design but made, I think, 0 commits).
Thanks, for making this clear beyond doubt.
At this point, this LGTM. I acknowledge that the folks have done a good job on the software, and the purpose it serves holds a great deal of water.
@terrytangyuan all yours and thanks everyone.
Thanks everyone!
@whedon check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1007/b94608 is OK
- 10.1109/MCSE.2007.53 is OK
- 10.1016/j.cels.2017.10.001 is OK
- 10.1016/J.NEUCOM.2018.03.067 is OK
- 10.5334/jors.148 is OK
- 10.5281/zenodo.59499 is OK
- 10.1016/j.jecp.2010.03.005 is OK
- 10.1145/1656274.1656278 is OK
- 10.1109/ANZIIS.1994.396988 is OK
- 10.1109/MCSE.2011.37 is OK
- 10.1007/b97277 is OK
- 10.1007/978-3-319-29854-2 is OK
- 10.1007/3-540-70659-3_2 is OK
- 10.1145/2786984.2786995 is OK
- 10.1137/141000671 is OK
- 10.21105/joss.00602 is OK
- 10.18637/jss.v028.i05 is OK
- 10.5281/zenodo.3730565 is OK
- 10.5334/jors.151 is OK
MISSING DOIs
- 10.1111/j.1467-9892.1992.tb00102.x may be a valid DOI for title: Data-Dependent Estimation Of Prediction Functions
- 10.1016/j.csda.2017.11.003 may be a valid DOI for title: A note on the validity of cross-validation for evaluating autoregressive time series prediction
- 10.1016/s0304-4076(00)00030-0 may be a valid DOI for title: Consistent cross-validatory model-selection for dependent data: hv-block cross-validation
- 10.1093/biomet/81.2.351 may be a valid DOI for title: A cross-validatory method for dependent data
- 10.1007/978-0-387-79361-0_1 may be a valid DOI for title: Machine learning techniques – reductions between prediction quality metrics
- 10.21105/joss.01169 may be a valid DOI for title: scikit-posthocs: Pairwise multiple comparison tests in Python
- 10.1007/3-540-27752-8 may be a valid DOI for title: New introduction to multiple time series analysis
- 10.1007/978-3-642-24797-2_2 may be a valid DOI for title: Supervised sequence labelling
- 10.1016/j.ijforecast.2006.01.001 may be a valid DOI for title: 25 years of time series forecasting
- 10.1109/jproc.2015.2494118 may be a valid DOI for title: Learning reductions that really work
- 10.1109/dsaa.2015.7344858 may be a valid DOI for title: Deep feature synthesis: Towards automating data science endeavors
- 10.1163/1574-9347_bnp_e612900 may be a valid DOI for title: Keras
- 10.1111/j.1467-9892.2009.00643.x may be a valid DOI for title: Time series analysis: forecasting and control
- 10.1142/9789812565402_0001 may be a valid DOI for title: Segmenting time series: A survey and novel approach
- 10.1016/j.ijforecast.2019.04.014 may be a valid DOI for title: The M4 Competition: 100,000 time series and 61 forecasting methods
- 10.1016/j.ijforecast.2018.06.001 may be a valid DOI for title: The M4 Competition: Results, findings, conclusion and way forward
- 10.1371/journal.pone.0194889 may be a valid DOI for title: Statistical and Machine Learning forecasting methods: Concerns and ways forward
- 10.1109/mcse.2010.118 may be a valid DOI for title: Cython: The best of both worlds
- 10.1007/s10994-008-5058-6 may be a valid DOI for title: Robust reductions from ranking to classification
- 10.1145/1102351.1102358 may be a valid DOI for title: Error limiting reductions between classification tasks
- 10.1007/s10618-019-00617-3 may be a valid DOI for title: Proximity Forest: an effective and scalable distance-based classifier for time series
- 10.1007/s10618-020-00679-8 may be a valid DOI for title: TS-CHIEF: A Scalable and Accurate Forest Algorithm for Time Series Classification
- 10.3233/ida-184333 may be a valid DOI for title: On Time Series Classification with Dictionary-Based Classifiers
- 10.1109/icdm.2018.00119 may be a valid DOI for title: Matrix Profile XII: MPDist: A Novel Time Series Distance Measure to allow Data Mining in more Challenging Scenarios
- 10.1007/s10618-018-0573-y may be a valid DOI for title: Constrained distance based clustering for time-series: a comparative and experimental study
- 10.1109/jas.2019.1911747 may be a valid DOI for title: The UCR Time Series Archive
- 10.1137/1.9781611975321.26 may be a valid DOI for title: Efficient search of the best warping window for Dynamic Time Warping
- 10.1137/1.9781611975321.26 may be a valid DOI for title: Efficient search of the best warping window for Dynamic Time Warping
- 10.1007/s10618-016-0483-9 may be a valid DOI for title: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances
- 10.1109/icdm.2017.8356939 may be a valid DOI for title: Efficient Discovery of Time Series Motifs with Large Length Range in Million Scale Time Series
- 10.1109/icdm.2016.0179 may be a valid DOI for title: Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets
- 10.1016/j.is.2015.04.007 may be a valid DOI for title: Time-series clustering – A decade review
- 10.1109/icdm.2016.0133 may be a valid DOI for title: HIVE-COTE: The Hierarchical Vote Collective of Transformation-based Ensembles for Time Series Classification
- 10.1109/icdm.2014.92 may be a valid DOI for title: Dual-domain hierarchical classification of phonetic time series
- 10.1137/1.9781611972832.64 may be a valid DOI for title: Time Series Classification under More Realistic Assumption
- 10.1007/s10618-007-0064-z may be a valid DOI for title: Experiencing SAX: a novel symbolic representation of time series
- 10.1007/978-3-319-24465-5_10 may be a valid DOI for title: Time Series Classification with Representation Ensembles
- 10.1109/icmla.2016.0010 may be a valid DOI for title: Improved Time Series Classification with Representation Diversity and SVM
- 10.1109/tkde.2015.2492558 may be a valid DOI for title: Classifying Time Series Using Local Descriptors with Hybrid Sampling
- 10.1109/icdm.2005.79 may be a valid DOI for title: HOT SAX: efficiently finding the most unusual time series subsequence
- 10.1109/icdm.2001.989531 may be a valid DOI for title: An online algorithm for segmenting time series
- 10.1145/347090.347109 may be a valid DOI for title: Deformable Markov model templates for time-series pattern matching
- 10.1016/b978-012088469-8.50070-x may be a valid DOI for title: On the marriage of Lp-norms and edit distance
- 10.1109/tpami.2008.76 may be a valid DOI for title: Time Warp Edit Distance with Stiffness Adjustment for Time Series Matching
- 10.1109/tkde.2012.88 may be a valid DOI for title: The Move-Split-Merge Metric for Time Series
- 10.1016/j.knosys.2014.02.011 may be a valid DOI for title: Non-isometric transforms in time series classification using DTW
- 10.1007/s10618-015-0418-x may be a valid DOI for title: Using dynamic time warping distances as features for improved time series classification
- 10.1109/cvpr.2003.1211511 may be a valid DOI for title: Word image matching using dynamic time warping
- 10.1007/s10844-012-0196-5 may be a valid DOI for title: Rotation-invariant similarity in time series using bag-of-patterns representation
- 10.1109/icdm.2013.52 may be a valid DOI for title: SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model
- 10.1007/s10618-014-0377-7 may be a valid DOI for title: The BOSS is concerned with time series classification in the presence of noise
- 10.1007/s10618-010-0179-5 may be a valid DOI for title: Time series shapelets: a novel technique that allows accurate, interpretable and fast classification
- 10.1007/s10618-016-0473-y may be a valid DOI for title: Generalized random shapelet forests
- 10.1007/978-3-319-46307-0_17 may be a valid DOI for title: Early Random Shapelet Forest
- 10.1007/978-3-642-32639-4_58 may be a valid DOI for title: Alternative quality measures for time series shapelets
- 10.1007/s10618-013-0322-1 may be a valid DOI for title: Classification of time series by shapelet transformation
- 10.1007/978-3-319-22729-0_20 may be a valid DOI for title: Binary Shapelet Transform for Multiclass Time Series Classification
- 10.1007/978-3-319-22729-0_20 may be a valid DOI for title: Binary Shapelet Transform for Multiclass Time Series Classification
- 10.1109/dsaa.2015.7344782 may be a valid DOI for title: Random-shapelet: an algorithm for fast shapelet discovery
- 10.1007/s10115-015-0905-9 may be a valid DOI for title: Fast classification of univariate and multivariate time series through shapelet discovery
- 10.1007/s10618-015-0411-4 may be a valid DOI for title: Accelerating the discovery of unsupervised-shapelets
- 10.1007/1-84628-102-4_18 may be a valid DOI for title: Support Vector Machines of Interval-based Features for Time Series Classification
- 10.1109/tpami.2013.72 may be a valid DOI for title: A Bag-of-Features Framework to Classify Time Series
- 10.1007/s10618-015-0425-y may be a valid DOI for title: Time series representation and similarity based on local autopatterns
- 10.1137/1.9781611972825.27 may be a valid DOI for title: Transformation Based Ensembles for Time Series Classification.
- 10.1109/icde.2016.7498418 may be a valid DOI for title: Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles
- 10.1007/s10618-014-0361-2 may be a valid DOI for title: Time Series Classification with Ensembles of Elastic Distance Measures
- 10.1145/2487575.2487700 may be a valid DOI for title: Model-based kernel for efficient time series analysis
- 10.1016/j.csda.2007.06.001 may be a valid DOI for title: Time series clustering and classification by the autoregressive metric
- 10.1007/s00357-013-9135-6 may be a valid DOI for title: A run length transformation for discriminating between auto regressive time series
- 10.1109/tkde.2014.2316504 may be a valid DOI for title: Highly comparative feature-based time-series classification
- 10.1109/icdm.2013.128 may be a valid DOI for title: Time Series Classification Using Compression Distance of Recurrence Plots
- 10.1137/1.9781611972757.50 may be a valid DOI for title: Three Myths about Dynamic Time Warping Data Mining
- 10.1109/icdm.2010.21 may be a valid DOI for title: Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs
- 10.1016/b978-155860869-6/50043-3 may be a valid DOI for title: Exact indexing of dynamic time warping
- 10.1145/376284.375680 may be a valid DOI for title: Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases
- 10.1007/s10618-012-0250-5 may be a valid DOI for title: Experimental comparison of representation methods and distance measures for time series data
- 10.1016/j.bspc.2013.06.004 may be a valid DOI for title: Bag-of-words representation for biomedical time series classification
- 10.1142/s0129065716500374 may be a valid DOI for title: Generalized Models for the classification of abnormal movements in daily life and its applicability to epilepsy convulsions recognition
- 10.1016/j.csda.2005.04.012 may be a valid DOI for title: A periodogram-based metric for time series classification
- 10.15200/winn.153459.98975 may be a valid DOI for title: A Comparison Between Differential Equation Solver Suites In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran
INVALID DOIs
- None
@ablaom Could you fix the issue with missing DOIs above? Once those are fixed, we should be good to move on.
Will do.
If there is no objection, I will also update the markdown to match the shortened version discussed above.
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@whedon check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/ANZIIS.1994.396988 is OK
- 10.1137/141000671 is OK
- 10.21105/joss.00602 is OK
- 10.18637/jss.v028.i05 is OK
- 10.5281/zenodo.3730565 is OK
- 10.5334/jors.151 is OK
MISSING DOIs
- 10.15200/winn.153459.98975 may be a valid DOI for title: A Comparison Between Differential Equation Solver Suites In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran
INVALID DOIs
- None
@terrytangyuan I have:
Note that url hypertext links do not appear in the bibliography. We have replaced the `url=...` entries in paper.bib with `adsurl=...` entries because that is what we see in the JOSS template (but I have no idea what "adsurl" means) and because I get compile errors if I use `url`. Are you able to shed light on this?
I don't have much knowledge in the JOSS template. cc'ing @openjournals/joss-eics to help answer the question above on `adsurl`.
`url` should be used; `adsurl` shouldn't be necessary (this is an astronomy-specific service that the author of the example paper used).
That said, there's something really weird going on with your BibTeX file here @ablaom. When I try and compile yours locally I also get weird errors, but if I reduce the BibTeX file to _just_ this entry:
@misc{Quinn,
author = {J. Quinn},
title = {Tables.jl: {A}n interface for tables in {J}ulia},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/JuliaData/Tables.jl}
}
...then this entry works fine. In getting this to work I had to remove all of the whitespace and reformat the entries, so I'm wondering if there are some weird hidden characters in your BibTeX file somewhere.
I don't have time to dig into this further at this point sorry but can try and take another look over the weekend.
Also, to help you debug locally, this GitHub Action we've been working on may help you: https://github.com/marketplace/actions/open-journals-pdf-generator
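One quick way to hunt for such characters is a small scan of the bib file, sketched here in Julia (the filename paper.bib is assumed):

```julia
# print every line of paper.bib that contains a non-ASCII character
for (i, line) in enumerate(eachline("paper.bib"))
    any(c -> !isascii(c), line) && println(i, ": ", line)
end
```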
@arfon Thanks for investigating!
cc @darenasc
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@terrytangyuan @arfon
Thanks to @darenasc I believe the bib issue is now sorted.
@whedon check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/ANZIIS.1994.396988 is OK
- 10.1137/141000671 is OK
- 10.21105/joss.00602 is OK
- 10.18637/jss.v028.i05 is OK
- 10.5281/zenodo.3730565 is OK
- 10.5334/jors.151 is OK
MISSING DOIs
- 10.15200/winn.153459.98975 may be a valid DOI for title: A Comparison Between Differential Equation Solver Suites In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran
INVALID DOIs
- None
Looks great! There still seems to be one missing DOI though. Could you fix that as well?
@whedon check references
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/ANZIIS.1994.396988 is OK
- 10.1137/141000671 is OK
- 10.21105/joss.00602 is OK
- 10.18637/jss.v028.i05 is OK
- 10.5281/zenodo.3730565 is OK
- 10.15200/winn.153459.98975 is OK
- 10.5334/jors.151 is OK
MISSING DOIs
- None
INVALID DOIs
- None
@terrytangyuan All DOIs are in now.
@darenasc Thanks
At this point, could you make a new release of this software that includes the changes that have resulted from this review? Then, please make an archive of the software in Zenodo/figshare/another service and update this thread with the DOI of the archive. For the Zenodo/figshare archive, please make sure that:
or this one https://doi.org/10.5281/zenodo.4178917
@whedon set archive 10.5281/zenodo.4178917
I'm sorry human, I don't understand that. You can see what commands I support by typing:
@whedon commands
@whedon set version v0.14.1
I'm sorry human, I don't understand that. You can see what commands I support by typing:
@whedon commands
@whedon set v0.14.1 as version
OK. v0.14.1 is the version.
@whedon set 10.5281/zenodo.4178917 as archive
OK. 10.5281/zenodo.4178917 is the archive.
@openjournals/joss-eics This paper looks good to me now. Handing over to you now!
@whedon accept
Attempting dry run of processing paper acceptance...
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1109/ANZIIS.1994.396988 is OK
- 10.1137/141000671 is OK
- 10.21105/joss.00602 is OK
- 10.18637/jss.v028.i05 is OK
- 10.5281/zenodo.3730565 is OK
- 10.15200/winn.153459.98975 is OK
- 10.5334/jors.151 is OK
MISSING DOIs
- None
INVALID DOIs
- None
:wave: @openjournals/joss-eics, this paper is ready to be accepted and published.
Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/1889
If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/1889, then you can now move forward with accepting the submission by compiling again with the flag deposit=true
e.g.
@whedon accept deposit=true
Thanks @terrytangyuan. Leaving this one for @Kevin-Mattheus-Moerman who's on rotation this week I believe.
@ablaom I've read through your paper, which seems on the long side but is in order. Please can you check the comments below:
[x] Please add country information to your affiliations
[x] Your paper is about to be processed for acceptance. We recommend you proofread the paper yourself once more at this point.
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@Kevin-Mattheus-Moerman
I have added the country to "University of Auckland" and "London" to "The Alan Turing Institute". I think adding city or country to any other entry is redundant - even silly. E.g., "University College London" is obviously in London, and an unqualified "London" obviously refers to "London, England", not "London, Kentucky".
Only the first two institutions on the list are not very well-known. The New Zealand eScience institute is a national institute and not located in one NZ city.
If you literally need this information explicit, let me know.
and an unqualified "London" obviously refers to "London England" not "London, Kentucky".
Wait, this is not London, Kentucky where I'm living?
That would explain all the Grenadier Guards in the city, and the lack of banjos.
@ablaom please do complete affiliations to at least include country.
No problem, will do.
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@Kevin-Mattheus-Moerman Done.
@whedon accept deposit=true
Doing it live! Attempting automated processing of paper acceptance...
🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦
🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨
Here's what you must now do:
Party like you just published a paper! 🎉🌈🦄💃🍻🤘
Any issues? Notify your editorial technical team...
Thanks for your review efforts here @degleris1, @henrykironde !!!!!!!! :tada:
Congratulations @ablaom
:tada::tada::tada: Congratulations on your paper acceptance! :tada::tada::tada:
If you would like to include a link to your paper from your README use the following code snippets:
Markdown:
[](https://doi.org/10.21105/joss.02704)
HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.02704">
<img src="https://joss.theoj.org/papers/10.21105/joss.02704/status.svg" alt="DOI badge" >
</a>
reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.02704/status.svg
:target: https://doi.org/10.21105/joss.02704
This is how it will look in your documentation:
We need your help!
Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following: