Submitting author: @cxhernandez (Carlos Xavier Hernández)
Repository: https://github.com/msmbuilder/osprey
Version: v1.1.0
Editor: @arfon
Reviewer: @rasbt
Archive: 10.5281/zenodo.61670
Status badge code:
HTML: <a href="http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406"><img src="http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406/status.svg)](http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406)
[ ] As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).
[ ] Repository: Is the source code for this software available at the repository url?
[ ] Version: Does the release version given match the GitHub release (v1.0.0)?
[ ] Installation: Does installation proceed as outlined in the documentation?
[ ] Performance: Have any performance claims of the software been confirmed?
[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
Paper PDF: 10.21105.joss.00034.pdf
[ ] Authors: Does the paper.md file include a list of authors with their affiliations?
Hi @cxhernandez. Thanks for submitting this package to JOSS. Unfortunately it doesn't look like there's a paper.md file in this repository.
Could you please take a look at the submission guidelines and make sure there's a paper present in the repository? We can then move forward with the review process.
Hey @arfon,
Sorry about that! I've added the paper to the repository now, and it can be found here.
Thanks for your consideration!
/ cc @openjournals/joss-reviewers - would anyone be willing to review this submission?
If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this
will suffice.
Reviewer instructions
Any questions, please ask for help by commenting on this issue! 🚀
@jakevdp - do you think you might be able to find us a reviewer for this submission?
See this interesting thread, where I recruit a potential reviewer for this paper:
https://twitter.com/rasbt/status/756604243465895936
@labarba - do you think you could follow up with this person to see if they're still willing/able to help out here?
I did, four days ago. And he replied, but no indication of timing for the review. I'll ping again.
https://twitter.com/lorenaabarba/status/757185650600861696
@labarba @arfon Thanks for asking me to review the software; it's a really interesting package that may turn out to be useful in some research projects. Being new to JOSS, I just read through the guidelines and posted a summary below. It would be nice if you could give me "permission" to edit the review sheet at the top.
@cxhernandez
Thanks for your submission. Although it seems like I listed a fair number of points below, I must say that I find osprey really appealing! It would be great, though, if you could address these points, and I am happy to discuss them further :). I think the most important thing would be to fix the unit tests so that they work with current scikit-learn versions (or set scikit-learn=0.16 (?) as a hard requirement) so that I can look into the package further, try out some of the examples, and take it for a whirl on some real-world data.
Please add a link to the source code (GitHub repo) to the documentation.
The Read-the-Docs-style documentation at http://msmbuilder.org/osprey/1.0.0/ looks nice to me. It would be nice to link back to the GitHub repository, though. Also, I noticed that this is the documentation for version 1.0 and in "Installation" it says that the release version can be installed via pip install osprey, which is v0.4.
There's currently only 63% coverage; I am not sure what the JOSS requirements are, but I would recommend extending the unit tests to at least 90%.
Also, I am unable to run the unit tests on my system:
>>> from sklearn.utils import check_arrays
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'check_arrays'
I think this is a left-over from a very old scikit-learn version (< 0.16); the check_arrays function was removed a long time ago (there's a different check_array now).
Also, I think the unit tests are currently not running on Travis CI?
There is an Apache 2.0 license template in the repo; however, it needs to be filled out with the necessary information to make it a valid license. E.g., it currently says:
[...] Copyright [yyyy] [name of copyright owner]
Also, the "license badge" in the README links to the PyPI package not to the license file itself.
I would suggest mentioning the license (and adding a link) on the documentation website as well.
The JOSS paper (the PDF associated with this submission) should only include:
Author affiliations are listed; however, I suggest removing the redundancy, i.e., "Stanford University" is currently listed 7 times. It would be sufficient to list it once and change the author "indices" accordingly.
The authors provide a nicely written summary explaining the purpose, functionality, and underlying concepts in a brief, yet informative way including references to relevant literature. Great!
The "Software Repository:" and "Software Archive:" rows are currently blank and the respective links need to be provided.
Nice README file!
The purpose of the software is summarized nicely
The authors provide a list of dependencies in the body of the README file. I would suggest adding these to an additional requirements.txt file so that they can be installed via pip in one swoop.
Only NumPy is listed as dependency, not SciPy -- the latter is marked as "optional". However, both are required by scikit-learn, which is listed as a dependency as well. I suggest either listing both SciPy and NumPy as hard dependencies or neither of them (since they are sklearn dependencies).
Also, it would be nice to mention the version numbers of the packages that are required (e.g., right now, unit tests don't run with scikit-learn > 0.16)
The examples are brief but clear and sufficient :).
Only very few functions have proper docstrings, but I want to give it an "okay" here since the user mostly interacts with osprey via a config file, which is then executed via osprey worker config.yaml; more docstrings would be nice for contributors, though.
Tests exist but currently don't run on Travis (?), and they don't run with current package versions of, e.g., scikit-learn, as mentioned earlier.
There should be clear guidelines for third-parties wishing to:
I couldn't find any guidelines.
Also, I didn't see anything here. Putting a link to the GitHub issue tracker would be sufficient to address this, though.
Didn't find any information on this.
Include here some examples of well-documented software for people to review.
E.g., scikit-learn, Keras, or TensorFlow
See unit test issues I mentioned above.
Thanks @rasbt for reviewing our work! I've tried to address everything point-by-point:
It would be nice to link back to the GitHub repository though.
Thanks for the suggestion! I'll add a link to the GitHub repo on the homepage.
Also, I noticed that this is the documentation for version 1.0 and in "Installation" it says that the release version can be installed via pip install osprey, which is v0.4.
Good catch! We have not updated the PyPI package yet. In the meantime, the recommended way to install the latest version is via our conda repo (conda install -c omnia osprey). We'll update our PyPI package ASAP.
There's currently only 63% coverage; I am not sure what the JOSS requirements are, but I would recommend extending the unit tests to at least 90%.
We're working on improving our test coverage; however, if you look closely, a good deal of the "untested" code (as reported by coveralls) is part of our CLI codebase, for which there are, in fact, unit tests.
Also, I am unable to run the unit tests on my system....
This is probably from running v0.4; this import is no longer in the latest version (v1.0).
Also, I think the unit tests are currently not running on Travis CI?
Unit tests have indeed been running on Travis CI.
Also, the "license badge" in the README links to the PyPI package not to the license file itself.
I would suggest mentioning the license (and adding a link) on the documentation website as well.
I'll make these fixes ASAP and add a link on the homepage as well.
I suggest removing the redundancy, i.e., "Stanford University" is currently listed 7 times
The affiliations listing is taken directly from the JOSS LaTeX template. However, I completely agree that this is redundant information and will update the affiliations accordingly.
The "Software Repository:" and "Software Archive:" rows are currently blank and the respective links need to be provided.
I'll update the "Software Repository" and "Software Archive" ASAP
I would suggest adding these to an additional requirements.txt file so that they can be installed via pip in one swoop.
I'll add this ASAP
Only NumPy is listed as dependency, not SciPy -- the latter is marked as "optional".
Good point! I can update the README to list SciPy as a full dependency.
Also, it would be nice to mention the version numbers of the packages that are required (e.g., right now, unit tests don't run with scikit-learn > 0.16)
Osprey v1.0 should work with latest versions of its dependencies (unless specified). This is also probably a result of using v0.4.
more docstrings would be nice for contributors though
Agreed. The documentation has been tailored more for users at the moment, but we plan to include sections on creating/adding custom classes for dataset_loaders, etc.
Tests exist but currently don't run on Travis (?), and they don't run with current package versions of, e.g., scikit-learn, as mentioned earlier.
They should run on Travis CI. Not sure what evidence there is to the contrary.
This is something we are admittedly lacking; however, I can add a section about contributing to our README and homepage. All development occurs on GitHub, so we rely on the issue tracker for troubleshooting and support, as well as accepting software contributions via pull request.
Wow, that was quick :) Thanks for addressing and commenting on these points so thoroughly!
Osprey v1.0 should work with latest versions of its dependencies (unless specified). This is also probably a result of using v0.4.
Good catch! We have not updated the PyPI package yet. In the meantime, the recommended way to install the latest version is via our conda repo (conda install -c omnia osprey). We'll update our PyPI package ASAP
Hm, I had an older version indeed. However, I did install the conda version; I also suggest updating it to 1.0.
~$ conda install -c omnia osprey
Fetching package metadata .........
Solving package specifications: ..........
Package plan for installation in environment /Users/Sebastian/miniconda3:
The following NEW packages will be INSTALLED:
osprey: 0.4-py35_0 omnia
Proceed ([y]/n)?
We're working on improving our test coverage; however, if you look closely, a good deal of the "untested" code (as reported by coveralls) is part of our CLI codebase, for which there are, in fact, unit tests.
Sounds good! However, I am a bit confused now: if there are unit tests, why is the coverage 0.0 for these? Also, why did you prepend underscores (_) in some cases? E.g.,
def _test_dump_1():
    out = subprocess.check_output(
        [OSPREY_BIN, 'dump', 'config.yaml', '-o', 'json'])
    if sys.version_info >= (3, 0):
        out = out.decode()
    json.loads(out)
Is it due to some complications with Travis and subprocess?
Unit tests have indeed been running on Travis CI.
Ah, sorry about that ... I see it now. I had only looked at the .travis.yml and somehow assumed that it wasn't running nose at all.
The affiliations listing is taken directly from the JOSS LaTeX template. However, I completely agree that this is redundant information and will update the affiliations accordingly.
Now that you mentioned it, I noticed the same problem with other papers as well. @arfon Maybe it'd be worthwhile updating the template to make it easier on the authors in future submissions?
Osprey v1.0 should work with latest versions of its dependencies (unless specified). This is also probably a result of using v0.4.
Nice. One more thing, though :P Could you please add the minimum version numbers to the README (and maybe the requirements.txt file for people who install it via pip) for the dependencies? E.g., something like
scipy>=0.17
numpy>=1.10.4
scikit-learn>=0.17
...
This is something we are admittedly lacking; however, I can add a section about contributing to our README and homepage. All development occurs on GitHub, so we rely on the issue tracker for troubleshooting and support, as well as accepting software contributions via pull request.
That's fine and probably the best solution! I would just add a sentence to the README (and website) like
"In case you encounter any issues with this package, please consider submitting a ticket to the GitHub Issue Tracker. We also welcome feature requests, and please don't hesitate to submit pull requests for bug fixes and improvements."
Or maybe consider setting up a "Contributing" file & website page with some more info on "how-to contribute", which would be even nicer :)
Let me know when you updated the conda installer and PyPI version; I am happy to take the 1.0 to a test drive ;)
@arfon Maybe it'd be worthwhile updating the template to make it easier on the authors in future submissions?
@rasbt - do you know if there's a way to handle this in LaTeX automatically? I agree this is less-than-ideal but I'm not exactly sure how to fix this.
Let me know when you updated the conda installer and PyPI version; I am happy to take the 1.0 to a test drive ;)
@rasbt: It should now be available through conda and pip. Sorry about that!
No worries, @cxhernandez, thanks for updating it; it seems to work fine now, and I will have a look at the remaining points in the review (I will go through the previous points again and check off the list items).
In the README.md (and on PyPI), it says
If you have an Anaconda Python distribution, installation is as easy as:
$ conda install -c omnia osprey
You can also install with pip:
$ pip install git+git://github.com/pandegroup/osprey.git
Alternatively, you can install directly from this GitHub repo:
I would change the line
pip install git+git://github.com/pandegroup/osprey.git
to
pip install osprey
so that people install the latest stable version by default, not the developer version. I would then mention the "developer version" via pip install git+git://github.com/pandegroup/osprey.git on a separate line.
@arfon
I usually do this by assigning the same author index to 2 authors if they have the same affiliation:
\documentclass{article}
\usepackage{authblk}
\title{My paper}
\author[1]{Firstname1 Lastname1}
\author[2]{Firstname2 Lastname2}
\author[1]{Firstname3 Lastname3}
\affil[1]{Department of Math, University A}
\affil[2]{Department of Math, University B}
\begin{document}
\maketitle
\end{document}
I only looked at the template briefly, but I think you're right that it's going to be tricky. I am not sure how to do this directly in the template, and the only spontaneous (ugly) workaround I can think of would be to write a temporary .tex file and modify it with a custom script.
You could use a style like revtex to automatically condense duplicate affiliations, or look at their .sty file to figure out what trickery they're using to do so.
@cxhernandez
please add
to the list of requirements. I wasn't able to run the unit tests without them.
@cxhernandez
@rasbt: We don't currently have the option to set the random state in the configuration file, but that's a great suggestion and I'll look into implementing it!
Please. I think that's very, very important for reproducible research.
@cxhernandez
I am currently having some other issues with alternative search strategies. E.g., the following one works fine:
estimator:
  entry_point: sklearn.svm.SVC
strategy:
  name: random
search_space:
  C:
    min: 0.1
    max: 10
    type: float
  gamma:
    min: 1e-5
    max: 1
    warp: log
    type: float
cv: 5
dataset_loader:
  name: sklearn_dataset
  params:
    method: load_iris
trials:
  uri: sqlite:////Users/Sebastian/Desktop/osprey-trials.db
However, when I switch from
strategy:
  name: random
to
strategy:
  name: grid
I get
Loading trials database: sqlite:////Users/Sebastian/Desktop/osprey-trials.db...
History contains: 10 trials
Choosing next hyperparameters with grid...
Error: GridSearchStrategy is defined only for all-enum search space
When I choose
strategy:
  name: gp
    return [var.point_to_gp(point_dict[var.name]) for var in self]
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/search_space.py", line 112, in <listcomp>
    return [var.point_to_gp(point_dict[var.name]) for var in self]
KeyError: 'C'
Maybe I am doing something wrong ... it would be nice if you could look into that.
Loading trials database: sqlite:////Users/Sebastian/Desktop/osprey-trials.db... History contains: 10 trials Choosing next hyperparameters with grid... Error: GridSearchStrategy is defined only for all-enum search space
In order to use GridSearchStrategy, you would need to convert your search-space into enumerables using a jump variable:
search_space:
  C:
    min: 0.1
    max: 10
    num: 10
    type: jump
    var_type: float
  gamma:
    min: 1e-5
    max: 1
    num: 10
    warp: log
    type: jump
    var_type: float
I realize this is not at all clear from our current documentation, however. Also, somewhat embarrassingly, I just now found a bug in how jump vars are parsed. Working on a PR now for a fix (it won't work otherwise) and additional docs + tests.
When I choose
strategy: name: gp
return [var.point_to_gp(point_dict[var.name]) for var in self] File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/search_space.py", line 112, in <listcomp> return [var.point_to_gp(point_dict[var.name]) for var in self] KeyError: 'C'
Not sure what's going on with gp. I can't reproduce this error locally. I'll add it as a unit test as well.
Thanks again for all your testing, @rasbt!
In order to use GridSearchStrategy, you would need to convert your search-space into enumerables using a jump variable
Oh yeah, that makes total sense; how would it know the increment otherwise (unless it is hard-coded)? Thanks! (But adding it to the docs would be useful, like you said.)
Also, somewhat embarrassingly, I just now found a bug in how jump vars are parsed.
Ah, sorry to hear that this review is causing so much extra work, but I am happy to hear that it's been productive so far and that you discovered that bug :). And please take your time with the fixes; I think it would be worthwhile bumping the version to, e.g., 1.1 and updating the documentation in one swoop before we get back to the review.
I think it would be worthwhile bumping the version to, e.g., 1.1 and updating the documentation in one swoop before we get back to the review.
Agreed! I'll let you know when it's ready to be tested (shouldn't take too long, hopefully). Thanks again!
@arfon, @mpharrigan, @rasbt
I figured out a hack for the affiliations. If you add these lines to the template:
$if(authors)$
$for(authors)$
$if(authors.affiliation)$
\author[$authors.affiliation$]{$authors.name$}
$else$
\author{$authors.name$}
$endif$
$endfor$
$endif$
$if(affiliations)$
$for(affiliations)$
\affil[$affiliations.index$]{$affiliations.name$}
$endfor$
$endif$
and then have this in the paper.md header:
authors:
  - name: Robert T. McGibbon
    affiliation: 1
  - name: Carlos X. Hernández
    orcid: 0000-0002-8146-5904
    affiliation: 1
  - name: Matthew P. Harrigan
    affiliation: 1
  - name: Steven Kearnes
    affiliation: 1
  - name: Mohammad M. Sultan
    affiliation: 1
  - name: Stanislaw Jastrzebski
    affiliation: 2
  - name: Brooke E. Husic
    affiliation: 1
  - name: Vijay S. Pande
    affiliation: 1
affiliations:
  - name: Stanford University
    index: 1
  - name: Jagiellonian University
    index: 2
You get the desired result:
Nice! I think it would be better to have the correct indices on the affiliations as well (currently both "1"s in the example above)
Here, you need to replace the [1]
in
$if(affiliations)$
$for(affiliations)$
\affil[1]{$affiliations$}
$endfor$
$endif$
by a counter variable 1, ..., n_affiliations
Ah! I jumped the gun! Can you even use counters in pandoc templates? I'm googling, but nothing's coming up.
I think you can do something like this:
affiliations:
- {name: Stanford University, index: 1}
- {name: Jagiellonian University, index: 2}
and then
$if(affiliations)$
$for(affiliations)$
\affil[$affiliations.index$]{$affiliations.name$}
$endfor$
$endif$
Ah, okay. I've updated my comment above.
Looks nice!
Alright, I've incorporated the majority of the suggestions listed above into the master branch. I'm holding off on making a new release (v1.1.0) until this review is complete (in case further changes are necessary).
Also, I am curious how I can run osprey on my own dataset. What format should the file be in? Say I have an iris.csv file from the UCI repo, how would I feed it to osprey?
Sorry @rasbt, I must have missed this question before. At the moment, .csv is not supported in our current stable of dataset_loaders (although it should be trivial to add); however, we currently support .npy, hdf5, and joblib files, and pretty much every major MD trajectory format (through mdtraj). For the iris dataset, you might find the SklearnDatasetLoader handy:
dataset_loader:
  name: sklearn_dataset
  params:
    method: load_iris
Sounds great, I will try to give it a look this weekend then!
however, we currently support .npy, hdf5, joblib
Still, I am wondering what the input format looks like? E.g., for classifiers, is it samples per row and features per column, with the class label in the last column (or the first column, or a separate object in joblib and npy files)? I.e., let's say I am a new user and I want to run osprey on my own non-"MD trajectory" dataset (as far as I understand from the paper, it is meant to be a general tool, not a protein-chemistry-specific tool?). How do I do it? I think a little bit more guidance would be required.
For the iris dataset, you might find the SklearnDatasetLoader handy:
Sure, but I asked because I was curious how to prepare a general CSV-format dataset to run it in osprey. Let's pretend it's my private research dataset. How would I get it into osprey?
Still, I am wondering what the input format looks like?
The way this works at the moment is similar to sklearn.datasets.load_files. You would break up your dataset into many different files, each file representing a unique class label (e.g. 'class_1.h5', ..., 'class_n.h5'). The datasets should be organized so that each column represents a feature and each row is an observation. I can add more documentation on this.
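For concreteness, a minimal sketch of that layout might look like the following (illustrative only, not osprey code; the .npy format and the class_*.npy file names are just an example):
# Illustrative only: write one file per class label, with rows as
# observations and columns as features, as described above.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

for label in np.unique(y):
    # e.g. class_0.npy, class_1.npy, class_2.npy (hypothetical file names)
    np.save('class_{}.npy'.format(label), X[y == label])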
It would probably be helpful to add a tutorial/example using a non-biophysics dataset
It would be nice, @cxhernandez, if you could add instructions for preparing a CSV file for osprey. Or maybe not necessarily a CSV file, but a NumPy data array of shape [n_samples, n_features] and, for the supervised methods, a target variable array. Sorry for all the extra work, but in my opinion this would be necessary to make osprey accessible to a general ML audience and to pass the JOSS requirements, such as:
Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
@rasbt: We've added more complete instructions here.
Let me know what you think!
Nice, I will try to give it a whirl over the next couple of days.
Thanks for adding the random_state param. Just skimming through the documentation, I read that
In case you need reproducible Osprey trials, you can also include an optional random seed as seen below:
Example:
random_seed: 42
Please note that this makes parallel trials redundant and, thus, not recommended when scaling across multiple jobs.
So that means that every parallel trial will be run with the same random state? I think a simple workaround would be to increment the random seed for each parallel run so that they are different (and useful) but still reproducible.
I think a simple workaround would be to increment the random seed for each parallel run so that they are different (and useful) but still reproducible.
Exactly. One would have to create a separate configuration file for each parallel worker in this case; I can add a note about this.
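A minimal sketch of that workaround, assuming a base osprey config named config.yaml and PyYAML installed (the file names and seed values are illustrative only):
# Illustrative only: write one config file per parallel worker, each with
# a different but reproducible random_seed.
import yaml

with open('config.yaml') as f:
    base = yaml.safe_load(f)

for worker_id in range(4):
    base['random_seed'] = 42 + worker_id  # increment the seed per worker
    with open('config_worker{}.yaml'.format(worker_id), 'w') as f:
        yaml.safe_dump(base, f, default_flow_style=False)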
@rasbt: I've added the ability to specify the random seed through the CLI, which should make it much more convenient to generate many reproducible workers.
Thanks again for the update, Carlos. I haven't had a chance to look at the updated osprey package -- just got back from yet another conference yesterday. The review is high on my to-do list, though, and I hope I'll be able to give you some feedback sometime this week :)
Hey, I just got the latest release from GitHub and was trying to run the unit tests via nose and bumped into another little problem:
    return _load(spec)
  File "<frozen importlib._bootstrap>", line 693, in _load
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/Users/Sebastian/Desktop/osprey-master/osprey/__init__.py", line 2, in <module>
    from .version import version as _version
ImportError: No module named 'osprey.version'
----------------------------------------------------------------------
Ran 1 test in 0.015s
FAILED (errors=1)
(osprey) ~/Desktop/osprey-master$
Now I am wondering: which Python version(s) does osprey support? 2.7, 3.5, others? I would recommend adding a note about that in the README and on the documentation page itself, in the Installation section.
Another little problem I bumped into when I was trying to run Osprey on a CSV version of Iris; I hope you can help with that :)
I've set up the parameter file as follows:
estimator:
  entry_point: sklearn.linear_model.LogisticRegression
strategy:
  name: grid
search_space:
  logistic__C:
    min: 1e-3
    max: 1e3
    type: float
    warp: log
  logistic__penalty:
    choices:
      - l1
      - l2
    type: enum
dataset_loader:
  name: dsv
  params:
    filenames: /Users/Sebastian/Desktop/iris.csv
    delimiter: ','
    y_col: 4
    usecols: 0, 1, 2, 3
cv: 5
trials:
  uri: sqlite:///osprey-trials.db
random_seed: 42
I also tried adding
skip_header: 0
skip_footer: 0
but got the same error message. Also, I tried usecols: 0, 1, 2, 3, 4 instead of usecols: 0, 1, 2, 3, but that didn't help either; same error message:
(osprey) ~/Desktop$ osprey worker --n-iters 10 osprey-config.yaml
======================================================================
= osprey is a tool for machine learning hyperparameter optimization. =
======================================================================
osprey version: 1.1.0dev0
time: August 30, 2016 10:25 PM
hostname: xxx
cwd: /Users/Sebastian/Desktop
pid: xxx
Loading config file: osprey-config.yaml...
Loading dataset...
An unexpected error has occurred with osprey (version 1.1.0dev0), please
consider sending the following traceback to the osprey GitHub issue tracker at:
https://github.com/pandegroup/osprey/issues
Traceback (most recent call last):
  File "/Users/Sebastian/miniconda3/envs/osprey/bin/osprey", line 9, in <module>
    load_entry_point('osprey==1.1.0.dev0', 'console_scripts', 'osprey')()
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/cli/main.py", line 37, in main
    args_func(args, p)
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/cli/main.py", line 42, in args_func
    args.func(args, p)
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/cli/parser_worker.py", line 8, in func
    execute(args, parser)
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/execute_worker.py", line 45, in execute
    X, y = config.dataset()
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/config.py", line 284, in dataset
    X, y = loader.load()
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/dataset_loaders.py", line 146, in load
    data = self.transform(self.loader(fn))
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/dataset_loaders.py", line 131, in transform
    return X[::self.stride, x_idx], X[::self.stride, y_idx].ravel()
IndexError: index 1 is out of bounds for axis 1 with size 1
The iris.csv is formatted as follows:
5.1,3.5,1.4,0.2,0
4.9,3,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5,3.4,1.5,0.2,0
...
(sepal length, sepal width, petal length, petal width, class label)
I also attached the iris.csv for your reference as a zip, since GitHub doesn't let me upload CSVs for some reason.
This error seems to stem from how numpy.genfromtxt handles the DOS carriage return characters in your CSV file. I replaced them with newline characters and the data loading seems to work just as intended. I'm inclined to report this to the NumPy folks, as this shouldn't be a difficult fix.
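In the meantime, a simple way to work around it is to normalize the line endings before loading (illustrative only, not part of osprey; iris_clean.csv is a hypothetical output name):
# Illustrative workaround: replace DOS/old-Mac carriage returns with
# newlines so numpy.genfromtxt sees one row per line.
import numpy as np

with open('iris.csv', 'rb') as f:
    raw = f.read()

with open('iris_clean.csv', 'wb') as f:
    f.write(raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n'))

data = np.genfromtxt('iris_clean.csv', delimiter=',')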
As far as the configuration file goes, the hyperparameters should be defined as simply C and penalty, as the estimator is not part of a pipeline where LogisticRegression is defined as 'logistic'. Also, since you're using the grid search strategy, C should be defined as a jump variable to convert the float into an enumerable:
C:
  min: 1e-3
  max: 1e3
  num: 10
  type: jump
  var_type: float
  warp: log
Now I am wondering: which Python version(s) does osprey support? 2.7, 3.5, others? I would recommend adding a note about that in the README and on the documentation page itself, in the Installation section.
I can add this to the README and docs. Osprey should support Python 2.7+.
As far as that particular error goes, you need to run python setup.py install (or develop) prior to nose testing so that the version module can be generated. Might be worth adding a testing section to the docs as well.
This error seems to stem from how numpy.genfromtxt handles the DOS carriage return characters in your CSV file.
Ah, you are right; I exported the Iris feature array via NumPy and then noticed that I had forgotten the class labels, which I quickly added via Excel. Excel must have replaced the newline chars with these weird DOS carriage return characters. It looked normal to me in my text editor (Atom), but I had the same one-line problem when using cat. Thanks for the hint!
As far as the configuration file goes, the hyperparameters should be defined as simply C and penalty as the estimator is not part of a pipeline where LogisticRegression is defined as 'logistic'. Also, since you're using the grid search strategy, C should be defined as a jump variable to convert the float into an enumerable:
Sorry about that, I remember you mentioned that earlier; I just didn't pay attention when I copied and pasted the hyperparameter settings from the documentation and changed the strategy to grid.
Works nicely now! :)
Have one more question about the random state though. Where exactly is it used? Only in the cross-validator? Because I noticed that the random_state is set to "null" in the estimator itself.
The param file I was using looks as follows (where I set the random state at the bottom):
estimator:
  entry_point: sklearn.linear_model.LogisticRegression
strategy:
  name: grid
search_space:
  solver:
    choices:
      - newton-cg
    type: enum
  multi_class:
    choices:
      - multinomial
    type: enum
  C:
    min: 1e-3
    max: 1e3
    num: 10
    type: jump
    var_type: float
    warp: log
  penalty:
    choices:
      - l2
    type: enum
dataset_loader:
  name: dsv
  params:
    filenames: /Users/Sebastian/Desktop/iris-2.csv
    delimiter: ','
    y_col: 4
    usecols: 0, 1, 2, 3
cv: 5
trials:
  uri: sqlite:///osprey-trials.db
random_seed: 42
However, adding
random_state:
  choices:
    - 123
  type: enum
seems to do exactly what I want, so it's not necessary to fix anything; I am just curious :)
Ah, I am having trouble with nosetests again :P
So, I ran python setup.py develop and it solved the version issue; another thing, though, is that somehow nosetests is not grabbing the tests.
~/Desktop/osprey-master$ nosetests
----------------------------------------------------------------------
Ran 0 tests in 0.031s
OK
~/Desktop/osprey-master$ ls
LICENSE basesetup.py osprey requirements.txt
README.md devtools osprey.egg-info setup.py
__pycache__ docs paper
~/Desktop/osprey-master$ ls osprey/tests/
__init__.py test_config.py test_search_space.py
__pycache__ test_dataset_loader.py test_strategies.py
test_cli_skeleton.py test_entry_point.py test_trials.py
test_cli_worker_and_dump.py test_fit_estimator.py test_utils.py
~/Desktop/osprey-master$ cd ~/code/siteinterlock
~/code/siteinterlock$ nosetests
....................
----------------------------------------------------------------------
Ran 20 tests in 5.663s
OK
Maybe I am doing something silly or overlooking something ... :/
So, I ran python setup.py develop and it solved the version issue; another thing, though, is that somehow nosetests is not grabbing the tests.
Try this:
(msmb-dev) [cxh@local ~/Developer/osprey]$ cd
(msmb-dev) [cxh@local ~]$ nosetests osprey
test_pylearn2_autoencoder (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_autoencoder: this test requires pylearn2
test_pylearn2_classifier (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_classifier: this test requires pylearn2
test_pylearn2_dataset_loader (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_dataset_loader: this test requires pylearn2
test_pylearn2_dataset_loader_one_hot (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_dataset_loader_one_hot: this test requires pylearn2
test_pylearn2_regressor (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_regressor: this test requires pylearn2
osprey.tests.test_cli_skeleton.test_1 ... create config.yaml
....
For some reason, the tests need to be run outside of the source directory.
Have one more question about the random state though. Where exactly is it used? Only in the cross-validator? Because I noticed that the random_state is set to "null" in the estimator itself
It's used here to seed NumPy's random generator. But you make a good point that this should also be passed to the estimator explicitly.
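To illustrate the distinction (a sketch only, not osprey internals; the seed values are arbitrary):
# Illustrative sketch of the two kinds of seeding being discussed.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(42)                           # what the config's random_seed controls
clf = LogisticRegression(random_state=123)   # what passing it to the estimator would pin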
OK, I'm self-appointing as the review-thread police -- can @cxhernandez and @rasbt carry on this debugging in a separate issue filed on the osprey issue tracker, please?
Okay, I won't go much further into debugging then; I think it's fine in its current state (these are just points you may want to address in future versions). I checked off all the points, and the only thing I would do for version 1.1.0 is add the required Python versions to the dependency list (e.g., mentioning that it works for both Python 2.7.x and 3.5).
There's currently only 65% coverage; I am not sure what the JOSS requirements are; personally, I would recommend extending the unit tests to at least 90%.
The JOSS paper (the PDF associated with this submission) should only include:
Adding the required/supported Python version would be great.
There should be clear guidelines for third-parties wishing to:
Include here some examples of well-documented software for people to review.
E.g., scikit-learn, Keras, or TensorFlow
@arfon, @rasbt: I've added the most recent suggestions and have v1.1.0 ready to be merged, pending anything else.
Thanks @cxhernandez. @rasbt - would you mind taking another look at this and if things look good we can move forward with accepting this into JOSS.
Thanks @cxhernandez. @rasbt - would you mind taking another look at this and if things look good we can move forward with accepting this into JOSS.
@arfon Sure, no problem. This 1.1.0 version is the one I have been reviewing; the pull request (once merged) just changes the labels (as far as I can see based on the file changes). I would suggest changing the version tag (to 1.1.0) in the paper as well though, since this will be the one uploaded to JOSS if I understand correctly.
I would suggest changing the version tag (to 1.1.0) in the paper as well though, since this will be the one uploaded to JOSS if I understand correctly.
Will do. So are we good to accept @rasbt?
Will do. So are we good to accept @rasbt?
Yes, it looks good to me :).
@cxhernandez - last thing:
At this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.
Thanks for the review @rasbt!
@cxhernandez your paper is now accepted into JOSS and your DOI is http://dx.doi.org/10.21105/joss.00034 🚀 🎉 💥
Woo! Thanks @arfon, @labarba, and @rasbt!
Congrats! :). Although the review was maybe a bit tedious, I hope it also helped improve the package, despite the perhaps too many suggestions I brought up ;). In any case, I am really glad to hear that the software package was accepted and that we have another good & useful open-source package out there!
The software (and docs) undeniably benefitted from the review process! I look forward to submitting more stuff 😄
Thanks again!