Submitting author: @cxhernandez (Carlos Xavier Hernández)
Repository: https://github.com/msmbuilder/osprey
Version: v1.1.0
Editor: @arfon
Reviewer: @rasbt
Archive: 10.5281/zenodo.61670
Status badge code:
HTML: <a href="http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406"><img src="http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406/status.svg)](http://joss.theoj.org/papers/659cfc3ec56b30c607f705983e592406)
[ ] As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).
[ ] Repository: Is the source code for this software available at the repository url?
[ ] Version: Does the release version given match the GitHub release (v1.0.0)?
[ ] Installation: Does installation proceed as outlined in the documentation?
[ ] Performance: Have any performance claims of the software been confirmed?
[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
Paper PDF: 10.21105.joss.00034.pdf
[ ] Authors: Does the paper.md file include a list of authors with their affiliations?
Hi @cxhernandez. Thanks for submitting this package to JOSS. Unfortunately it doesn't look like there's a paper.md file in this repository.
Could you please take a look at the submission guidelines and make sure there's a paper present in the repository? We can then move forward with the review process.
Hey @arfon,
Sorry about that! I've added the paper to the repository now, and it can be found here.
Thanks for your consideration!
/ cc @openjournals/joss-reviewers - would anyone be willing to review this submission?
If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this
will suffice.
Reviewer instructions
Any questions, please ask for help by commenting on this issue! 🚀
@jakevdp - do you think you might be able to find us a reviewer for this submission?
See this interesting thread, where I recruit a potential reviewer for this paper:
https://twitter.com/rasbt/status/756604243465895936
@labarba - do you think you could follow up with this person to see if they're still willing/able to help out here?
I did, four days ago. And he replied, but no indication of timing for the review. I'll ping again.
https://twitter.com/lorenaabarba/status/757185650600861696
@labarba @arfon Thanks for asking me to review the software; it's a really interesting package that may turn out to be useful in some research projects. Being new to JOSS, I just read through the guidelines and posted a summary below. It would be nice if you could give me "permission" to edit the review sheet at the top.
@cxhernandez
Thanks for your submission. Although it seems like I listed a fair number of points below, I must say that I find osprey really appealing! It would be great, though, if you could address these points, and I am happy to discuss them further :). I think the most important thing would be to fix the unit tests so that they work with current scikit-learn versions (or set scikit-learn=0.16 (?) as a hard requirement) so that I can look into the package further, try out some of the examples, and take it for a whirl on some real-world data.
Please add a link to the source code (GitHub repo) to the documentation.
The Read-the-Docs-style documentation at http://msmbuilder.org/osprey/1.0.0/ looks nice to me. It would be nice to link back to the GitHub repository, though. Also, I noticed that this is the documentation for version 1.0 and in "Installation" it says that the release version can be installed via pip install osprey, which is v0.4.
There's currently only 63% coverage; I am not sure what the JOSS requirements are, but I would recommend extending the unit tests to at least 90%.
Also, I am unable to run the unit tests on my system:
>>> from sklearn.utils import check_arrays
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'check_arrays'
I think this is a left-over from a very old scikit-learn version (< 0.16); the check_arrays function was removed a long time ago (there's a different check_array now).
Also, I think the unit tests are currently not running on Travis CI?
There is an Apache 2.0 license template in the repo; however, it needs to be filled out with the necessary information to make it a valid license. E.g., it currently says:
[...] Copyright [yyyy] [name of copyright owner]
Also, the "license badge" in the README links to the PyPI package not to the license file itself.
I would suggest mentioning the license (and adding a link) on the documentation website as well.
The JOSS paper (the PDF associated with this submission) should only include:
Author affiliations are listed; however, I suggest removing the redundancy, i.e., "Stanford University" is currently listed 7 times. It would be sufficient to list it once and change the author "indices" accordingly.
The authors provide a nicely written summary explaining the purpose, functionality, and underlying concepts in a brief, yet informative way including references to relevant literature. Great!
The "Software Repository:" and "Software Archive:" rows are currently blank and the respective links need to be provided.
Nice README file!
The purpose of the software is summarized nicely
The authors provide a list of dependencies in the body of the README file. I would suggest adding these to an additional requirements.txt file so that they can be installed via pip in one swoop.
Only NumPy is listed as dependency, not SciPy -- the latter is marked as "optional". However, both are required by scikit-learn, which is listed as a dependency as well. I suggest either listing both SciPy and NumPy as hard dependencies or neither of them (since they are sklearn dependencies).
Also, it would be nice to mention the version numbers of the packages that are required (e.g., right now, unit tests don't run with scikit-learn > 0.16)
The examples are brief but clear and sufficient :).
Only very few functions have proper docstrings, but I want to give it an "okay" here since the user mostly interacts with osprey via a config file, which is then executed via osprey worker config.yaml; more docstrings would be nice for contributors, though.
Tests exist but currently don't run on Travis (?), and they don't run with current package versions of, e.g., scikit-learn, as mentioned earlier.
There should be clear guidelines for third-parties wishing to:
I couldn't find any guidelines.
Also, I didn't see anything here. Putting a link to the GitHub issue tracker would be sufficient to address this, though.
Didn't find any information on this.
Include here some examples of well-documented software for people to review.
E.g., scikit-learn, Keras, or TensorFlow
See unit test issues I mentioned above.
Thanks @rasbt for reviewing our work! I've tried to address everything point-by-point:
It would be nice to link back to the GitHub repository though.
Thanks for the suggestion! I'll add a link to the GitHub repo on the homepage.
Also, I noticed that this is the documentation for version 1.0 and in "Installation" it says that the release version can be installed via pip install osprey, which is v0.4.
Good catch! We have not updated the PyPI package yet. In the meantime, the recommended way to install the latest version is via our conda repo (conda install -c omnia osprey). We'll update our PyPI package ASAP.
There's currently only 63% coverage; I am not sure what the JOSS requirements are, but I would recommend extending the unit tests to at least 90%.
We're working on improving our test coverage; however, if you look closely, a good deal of the "untested" code (as reported by coveralls) is part of our CLI codebase, for which there are, in fact, unit tests.
Also, I am unable to run the unit tests on my system....
This is probably from running v0.4; this import is no longer in the latest version (v1.0).
Also, I think the unit tests are currently not running on Travis CI?
Unit tests have indeed been running on Travis CI.
Also, the "license badge" in the README links to the PyPI package not to the license file itself.
I would suggest mentioning the license (and adding a link) on the documentation website as well.
I'll make these fixes ASAP and add a link on the homepage as well.
I suggest removing the redundancy, i.e., "Stanford University" is currently listed 7 times
The affiliations listing is taken directly from the JOSS LaTeX template. However, I completely agree that this is redundant information and will update the affiliations accordingly.
The "Software Repository:" and "Software Archive:" rows are currently blank and the respective links need to be provided.
I'll update the "Software Repository" and "Software Archive" ASAP
I would suggest adding these to an additional requirements.txt file so that they can be installed via pip in one swoop.
I'll add this ASAP
Only NumPy is listed as dependency, not SciPy -- the latter is marked as "optional".
Good point! I can update the README to list SciPy as a full dependency.
Also, it would be nice to mention the version numbers of the packages that are required (e.g., right now, unit tests don't run with scikit-learn > 0.16)
Osprey v1.0 should work with latest versions of its dependencies (unless specified). This is also probably a result of using v0.4.
more docstrings would be nice for contributors though
Agreed. The documentation has been tailored more for users at the moment, but we plan to include sections on creating/adding custom classes for dataset_loaders, etc.
Tests exist but currently don't run on Travis (?), and they don't run with current package versions of, e.g., scikit-learn, as mentioned earlier.
They should run on Travis CI. Not sure what evidence there is to the contrary.
This is something we are admittedly lacking; however, I can add a section about contributing to our README and homepage. All development occurs on GitHub, so we rely on the issue tracker for troubleshooting and support, as well as accepting software contributions via pull request.
Wow, that was quick :) Thanks for addressing and commenting on these points so thoroughly!
Osprey v1.0 should work with latest versions of its dependencies (unless specified). This is also probably a result of using v0.4.
Good catch! We have not updated the PyPI package yet. In the meantime, the recommended way to install the latest version is via our conda repo (conda install -c omnia osprey). We'll update our PyPI package ASAP
Hm, I had an older version indeed. However, I did install the conda version; I also suggest updating it to 1.0.
~$ conda install -c omnia osprey
Fetching package metadata .........
Solving package specifications: ..........
Package plan for installation in environment /Users/Sebastian/miniconda3:
The following NEW packages will be INSTALLED:
osprey: 0.4-py35_0 omnia
Proceed ([y]/n)?
We're working on improving our test coverage; however, if you look closely, a good deal of the "untested" code (as reported by coveralls) is part of our CLI codebase, for which there are, in fact, unit tests.
Sounds good! However, I am a bit confused now: if there are unit tests, why is the coverage 0.0 for these? Also, why did you prepend underscores (_) in some cases? E.g.,
def _test_dump_1():
    out = subprocess.check_output(
        [OSPREY_BIN, 'dump', 'config.yaml', '-o', 'json'])
    if sys.version_info >= (3, 0):
        out = out.decode()
    json.loads(out)
Is it due to some complications with Travis and subprocess?
Unit tests have indeed been running on Travis CI.
Ah, sorry about that ... I see it now. I had only looked at the .travis.yml and somehow assumed that it wasn't running nose at all.
The affiliations listing is taken directly from the JOSS LaTeX template. However, I completely agree that this is redundant information and will update the affiliations accordingly.
Now that you mentioned it, I noticed the same problem with other papers as well. @arfon Maybe it'd be worthwhile updating the template to make it easier on the authors in future submissions?
Osprey v1.0 should work with latest versions of its dependencies (unless specified). This is also probably a result of using v0.4.
Nice. One more thing, though :P Could you please add the minimum version numbers to the README (and maybe the requirements.txt file for people who install it via pip) for the dependencies? E.g., something like
scipy>=0.17
numpy>=1.10.4
scikit-learn>=0.17
...
This is something we are admittedly lacking; however, I can add a section about contributing to our README and homepage. All development occurs on GitHub, so we rely on the issue tracker for troubleshooting and support, as well as accepting software contributions via pull request.
That's fine and probably the best solution! I would just add a sentence to the README (and website) like
"In case you encounter any issues with this package, please consider submitting a ticket to the GitHub Issue Tracker. We also welcome feature requests, and please don't hesitate to submit pull requests for bug fixes and improvements."
Or maybe consider setting up a "Contributing" file & website page with some more info on "how-to contribute", which would be even nicer :)
Let me know when you updated the conda installer and PyPI version; I am happy to take the 1.0 to a test drive ;)
@arfon Maybe it'd be worthwhile updating the template to make it easier on the authors in future submissions?
@rasbt - do you know if there's a way to handle this in LaTeX automatically? I agree this is less-than-ideal but I'm not exactly sure how to fix this.
Let me know when you updated the conda installer and PyPI version; I am happy to take the 1.0 to a test drive ;)
@rasbt: It should now be available through conda and pip. Sorry about that!
No worries, @cxhernandez, thanks for updating it; it seems to work fine now, and I will have a look at the remaining points in the review (I will go through the previous points again and check off the list items).
In the README.md (and on PyPI), it says
If you have an Anaconda Python distribution, installation is as easy as:
$ conda install -c omnia osprey
You can also install with pip:
$ pip install git+git://github.com/pandegroup/osprey.git
Alternatively, you can install directly from this GitHub repo:
I would change the line
pip install git+git://github.com/pandegroup/osprey.git
to
pip install osprey
so that people install the latest stable version by default, not the developer version. I would then mention the "developer version" via pip install git+git://github.com/pandegroup/osprey.git on a separate line.
@arfon
I usually do this by assigning the same author index to 2 authors if they have the same affiliation:
\documentclass{article}
\usepackage{authblk}
\title{My paper}
\author[1]{Firstname1 Lastname1}
\author[2]{Firstname2 Lastname2}
\author[1]{Firstname3 Lastname3}
\affil[1]{Department of Math, University A}
\affil[2]{Department of Math, University B}
\begin{document}
\maketitle
\end{document}
I only looked at the template briefly, but I think you're right that it's going to be tricky. I am not sure how to do this directly in the template, and the only spontaneous (ugly) workaround I can think of would be to write a temporary .tex file and modify it with a custom script.
You could use a style like revtex to automatically condense duplicate affiliations, or look at their .sty file to figure out what trickery they're using to do so.
@cxhernandez
please add
to the list of requirements. I wasn't able to run the unit tests without them.
@cxhernandez
@rasbt: We don't currently have the option to set the random state in the configuration file, but that's a great suggestion and I'll look into implementing it!
Please. I think that's very, very important for reproducible research.
@cxhernandez
I am currently having some other issues with alternative search strategies. E.g., the following one works fine:
estimator:
  entry_point: sklearn.svm.SVC
strategy:
  name: random
search_space:
  C:
    min: 0.1
    max: 10
    type: float
  gamma:
    min: 1e-5
    max: 1
    warp: log
    type: float
cv: 5
dataset_loader:
  name: sklearn_dataset
  params:
    method: load_iris
trials:
  uri: sqlite:////Users/Sebastian/Desktop/osprey-trials.db
However, when I switch from
strategy:
  name: random
to
strategy:
  name: grid
I get
Loading trials database: sqlite:////Users/Sebastian/Desktop/osprey-trials.db...
History contains: 10 trials
Choosing next hyperparameters with grid...
Error: GridSearchStrategy is defined only for all-enum search space
When I choose
strategy:
  name: gp
    return [var.point_to_gp(point_dict[var.name]) for var in self]
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/search_space.py", line 112, in <listcomp>
    return [var.point_to_gp(point_dict[var.name]) for var in self]
KeyError: 'C'
Maybe I am doing something wrong ... it would be nice if you could look into that.
Loading trials database: sqlite:////Users/Sebastian/Desktop/osprey-trials.db... History contains: 10 trials Choosing next hyperparameters with grid... Error: GridSearchStrategy is defined only for all-enum search space
In order to use GridSearchStrategy, you would need to convert your search-space into enumerables using a jump variable:
search_space:
  C:
    min: 0.1
    max: 10
    num: 10
    type: jump
    var_type: float
  gamma:
    min: 1e-5
    max: 1
    num: 10
    warp: log
    type: jump
    var_type: float
I realize this is not at all clear from our current documentation, however. Also, somewhat embarrassingly, I just now found a bug in how jump vars are parsed. Working on a PR now for a fix (it won't work otherwise) and additional docs + tests.
When I choose
strategy: name: gp
return [var.point_to_gp(point_dict[var.name]) for var in self] File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/search_space.py", line 112, in <listcomp> return [var.point_to_gp(point_dict[var.name]) for var in self] KeyError: 'C'
Not sure what's going on with gp. I can't reproduce this error locally. I'll add it as a unit test as well.
Thanks again for all your testing, @rasbt!
In order to use GridSearchStrategy, you would need to convert your search-space into enumerables using a jump variable
Oh yeah, that makes total sense; how would it know the increment otherwise (unless it is hard-coded)? Thanks! (But adding it to the docs would be useful, like you said.)
Also, somewhat embarrassingly, I just now found a bug in how jump vars are parsed.
Ah, sorry to hear that this review is causing so much extra work, but I am happy to hear that it's been productive so far and that you discovered that bug :). And please take your time with the fixes; I think it would be worthwhile bumping the version to, e.g., 1.1 and updating the documentation in one swoop before we get back to the review.
I think it would be worthwhile bumping the version to, e.g., 1.1 and updating the documentation in one swoop before we get back to the review.
Agreed! I'll let you know when it's ready to be tested (shouldn't take too long, hopefully). Thanks again!
@arfon, @mpharrigan, @rasbt
I figured out a hack for the affiliations. If you add these lines to the template:
$if(authors)$
$for(authors)$
$if(authors.affiliation)$
\author[$authors.affiliation$]{$authors.name$}
$else$
\author{$authors.name$}
$endif$
$endfor$
$endif$
$if(affiliations)$
$for(affiliations)$
\affil[$affiliations.index$]{$affiliations.name$}
$endfor$
$endif$
and then have this in the paper.md header:
authors:
  - name: Robert T. McGibbon
    affiliation: 1
  - name: Carlos X. Hernández
    orcid: 0000-0002-8146-5904
    affiliation: 1
  - name: Matthew P. Harrigan
    affiliation: 1
  - name: Steven Kearnes
    affiliation: 1
  - name: Mohammad M. Sultan
    affiliation: 1
  - name: Stanislaw Jastrzebski
    affiliation: 2
  - name: Brooke E. Husic
    affiliation: 1
  - name: Vijay S. Pande
    affiliation: 1
affiliations:
  - name: Stanford University
    index: 1
  - name: Jagiellonian University
    index: 2
You get the desired result:
Nice! I think it would be better to have the correct indices on the affiliations as well (currently both "1"s in the example above)
Here, you need to replace the [1]
in
$if(affiliations)$
$for(affiliations)$
\affil[1]{$affiliations$}
$endfor$
$endif$
by a counter variable 1, ..., n_affiliations
Ah! I jumped the gun! Can you even use counters in pandoc templates? I'm googling, but nothing's coming up.
I think you can do something like this:
affiliations:
- {name: Stanford University, index: 1}
- {name: Jagiellonian University, index: 2}
and then
$if(affiliations)$
$for(affiliations)$
\affil[$affiliations.index$]{$affiliations.name$}
$endfor$
$endif$
Ah, okay. I've updated my comment above.
Looks nice!
Alright, I've incorporated the majority of the suggestions listed above into the master branch. I'm holding off on making a new release (v1.1.0) until this review is complete (in case further changes are necessary).
Also, I am curious how I can run osprey on my own dataset. What format should the file be in? Say I have an iris.csv file from the UCI repo, how would I feed it to osprey?
Sorry @rasbt, I must have missed this question before. At the moment, .csv is not supported in our current stable of dataset_loaders (although it should be trivial to add); however, we currently support .npy, hdf5, and joblib files, and pretty much every major MD trajectory format (through mdtraj). For the iris dataset, you might find the SklearnDatasetLoader handy:
dataset_loader:
  name: sklearn_dataset
  params:
    method: load_iris
Sounds great, I will try to give it a look this weekend then!
however, we currently support .npy, hdf5, joblib
Still, I am wondering what the input format looks like? E.g., for classifiers, is it samples per row and features per column, with the class label in the last column (or the first column, or a separate object in joblib and npy files)? I.e., let's say I am a new user and I want to run osprey on my own non-"MD trajectory" dataset (as far as I understand from the paper, it is meant to be a general tool, not a protein-chemistry-specific tool?). How do I do it? I think a little bit more guidance would be required.
For the iris dataset, you might find the SklearnDatasetLoader handy:
Sure, but I asked because I was curious how to prepare a general CSV-format dataset to run it in osprey. Let's pretend it's my private research dataset. How would I get it into osprey?
Still, I am wondering what the input format looks like?
The way this works at the moment is similar to sklearn.datasets.load_files. You would break up your dataset into many different files, each file representing a unique class label (e.g. 'class_1.h5', ..., 'class_n.h5'). The datasets should be organized so that each column represents a feature and each row is an observation. I can add more documentation on this.
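For concreteness, a minimal sketch of that layout might look like the following (illustrative only, not osprey code; the .npy format and the class_*.npy file names are just an example):
# Illustrative only: write one file per class label, with rows as
# observations and columns as features, as described above.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

for label in np.unique(y):
    # e.g. class_0.npy, class_1.npy, class_2.npy (hypothetical file names)
    np.save('class_{}.npy'.format(label), X[y == label])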
It would probably be helpful to add a tutorial/example using a non-biophysics dataset
It would be nice, @cxhernandez, if you could add instructions for preparing a CSV file for osprey. Or maybe not necessarily a CSV file, but a NumPy data array of shape [n_samples, n_features] and, for the supervised methods, a target variable array. Sorry for all the extra work, but in my opinion this would be necessary to make osprey accessible to a general ML audience and to pass the JOSS requirements, such as:
Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
@rasbt: We've added more complete instructions here.
Let me know what you think!
Nice, I will try to give it a whirl over the next couple of days.
Thanks for adding the random_state param. Just skimming through the documentation, I read that
In case you need reproducible Osprey trials, you can also include an optional random seed as seen below:
Example:
random_seed: 42
Please note that this makes parallel trials redundant and, thus, not recommended when scaling across multiple jobs.
So that means that every parallel trial will be run with the same random state? I think a simple workaround would be to increment the random seed for each parallel run so that they are different (and useful) but still reproducible.
I think a simple workaround would be to increment the random seed for each parallel run so that they are different (and useful) but still reproducible.
Exactly. One would have to create a separate configuration file for each parallel worker in this case; I can add a note about this.
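A minimal sketch of that workaround, assuming a base osprey config named config.yaml and PyYAML installed (the file names and seed values are illustrative only):
# Illustrative only: write one config file per parallel worker, each with
# a different but reproducible random_seed.
import yaml

with open('config.yaml') as f:
    base = yaml.safe_load(f)

for worker_id in range(4):
    base['random_seed'] = 42 + worker_id  # increment the seed per worker
    with open('config_worker{}.yaml'.format(worker_id), 'w') as f:
        yaml.safe_dump(base, f, default_flow_style=False)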
@rasbt: I've added the ability to specify the random seed through the CLI, which should make it much more convenient to generate many reproducible workers.
Thanks again for the update, Carlos. I haven't had a chance to look at the updated osprey package -- just got back from yet another conference yesterday. The review is high on my to-do list, though, and I hope I'll be able to give you some feedback sometime this week :)
Hey, I just got the latest release from GitHub and was trying to run the unit tests via nose and bumped into another little problem:
    return _load(spec)
  File "<frozen importlib._bootstrap>", line 693, in _load
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/Users/Sebastian/Desktop/osprey-master/osprey/__init__.py", line 2, in <module>
    from .version import version as _version
ImportError: No module named 'osprey.version'
----------------------------------------------------------------------
Ran 1 test in 0.015s
FAILED (errors=1)
(osprey) ~/Desktop/osprey-master$
Now I am wondering: which Python version(s) does osprey support? 2.7, 3.5, others? I would recommend adding a note about that in the README and on the documentation page itself, in the Installation section.
Another little problem I bumped into when I was trying to run Osprey on a CSV version of Iris; I hope you can help with that :)
I've set up the parameter file as follows:
estimator:
  entry_point: sklearn.linear_model.LogisticRegression
strategy:
  name: grid
search_space:
  logistic__C:
    min: 1e-3
    max: 1e3
    type: float
    warp: log
  logistic__penalty:
    choices:
      - l1
      - l2
    type: enum
dataset_loader:
  name: dsv
  params:
    filenames: /Users/Sebastian/Desktop/iris.csv
    delimiter: ','
    y_col: 4
    usecols: 0, 1, 2, 3
cv: 5
trials:
  uri: sqlite:///osprey-trials.db
random_seed: 42
I also tried adding
skip_header: 0
skip_footer: 0
but got the same error message. Also, I tried usecols: 0, 1, 2, 3, 4 instead of usecols: 0, 1, 2, 3, but that didn't help either; same error message:
(osprey) ~/Desktop$ osprey worker --n-iters 10 osprey-config.yaml
======================================================================
= osprey is a tool for machine learning hyperparameter optimization. =
======================================================================
osprey version: 1.1.0dev0
time: August 30, 2016 10:25 PM
hostname: xxx
cwd: /Users/Sebastian/Desktop
pid: xxx
Loading config file: osprey-config.yaml...
Loading dataset...
An unexpected error has occurred with osprey (version 1.1.0dev0), please
consider sending the following traceback to the osprey GitHub issue tracker at:
https://github.com/pandegroup/osprey/issues
Traceback (most recent call last):
  File "/Users/Sebastian/miniconda3/envs/osprey/bin/osprey", line 9, in <module>
    load_entry_point('osprey==1.1.0.dev0', 'console_scripts', 'osprey')()
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/cli/main.py", line 37, in main
    args_func(args, p)
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/cli/main.py", line 42, in args_func
    args.func(args, p)
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/cli/parser_worker.py", line 8, in func
    execute(args, parser)
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/execute_worker.py", line 45, in execute
    X, y = config.dataset()
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/config.py", line 284, in dataset
    X, y = loader.load()
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/dataset_loaders.py", line 146, in load
    data = self.transform(self.loader(fn))
  File "/Users/Sebastian/miniconda3/envs/osprey/lib/python3.5/site-packages/osprey/dataset_loaders.py", line 131, in transform
    return X[::self.stride, x_idx], X[::self.stride, y_idx].ravel()
IndexError: index 1 is out of bounds for axis 1 with size 1
The iris.csv is formatted as follows:
5.1,3.5,1.4,0.2,0
4.9,3,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5,3.4,1.5,0.2,0
...
(sepal length, sepal width, petal length, petal width, class label)
I also attached the iris.csv for your reference as a zip, since GitHub doesn't let me upload CSVs for some reason.
This error seems to stem from how numpy.genfromtxt handles the DOS carriage return characters in your CSV file. I replaced them with newline characters and the data loading seems to work just as intended. I'm inclined to report this to the NumPy folks, as this shouldn't be a difficult fix.
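In the meantime, a simple way to work around it is to normalize the line endings before loading (illustrative only, not part of osprey; iris_clean.csv is a hypothetical output name):
# Illustrative workaround: replace DOS/old-Mac carriage returns with
# newlines so numpy.genfromtxt sees one row per line.
import numpy as np

with open('iris.csv', 'rb') as f:
    raw = f.read()

with open('iris_clean.csv', 'wb') as f:
    f.write(raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n'))

data = np.genfromtxt('iris_clean.csv', delimiter=',')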
As far as the configuration file goes, the hyperparameters should be defined as simply C and penalty, as the estimator is not part of a pipeline where LogisticRegression is defined as 'logistic'. Also, since you're using the grid search strategy, C should be defined as a jump variable to convert the float into an enumerable:
C:
  min: 1e-3
  max: 1e3
  num: 10
  type: jump
  var_type: float
  warp: log
Now I am wondering: which Python version(s) does osprey support? 2.7, 3.5, others? I would recommend adding a note about that in the README and on the documentation page itself, in the Installation section.
I can add this to the README and docs. Osprey should support Python 2.7+.
As far as that particular error goes, you need to run python setup.py install (or develop) prior to nose testing so that the version module can be generated. Might be worth adding a testing section to the docs as well.
This error seems to stem from how numpy.genfromtxt handles the DOS carriage return characters in your CSV file.
Ah, you are right; I exported the Iris feature array via NumPy and then noticed that I had forgotten the class labels, which I quickly added via Excel. Excel must have replaced the newline chars with these weird DOS carriage return characters. It looked normal to me in my text editor (Atom), but I had the same one-line problem when using cat. Thanks for the hint!
As far as the configuration file goes, the hyperparameters should be defined as simply C and penalty as the estimator is not part of a pipeline where LogisticRegression is defined as 'logistic'. Also, since you're using the grid search strategy, C should be defined as a jump variable to convert the float into an enumerable:
Sorry about that, I remember you mentioned that earlier; I just didn't pay attention when I copied and pasted the hyperparameter settings from the documentation and changed the strategy to grid.
Works nicely now! :)
Have one more question about the random state though. Where exactly is it used? Only in the cross-validator? Because I noticed that the random_state is set to "null" in the estimator itself.
The param file I was using looks as follows (where I set the random state at the bottom):
estimator:
  entry_point: sklearn.linear_model.LogisticRegression
strategy:
  name: grid
search_space:
  solver:
    choices:
      - newton-cg
    type: enum
  multi_class:
    choices:
      - multinomial
    type: enum
  C:
    min: 1e-3
    max: 1e3
    num: 10
    type: jump
    var_type: float
    warp: log
  penalty:
    choices:
      - l2
    type: enum
dataset_loader:
  name: dsv
  params:
    filenames: /Users/Sebastian/Desktop/iris-2.csv
    delimiter: ','
    y_col: 4
    usecols: 0, 1, 2, 3
cv: 5
trials:
  uri: sqlite:///osprey-trials.db
random_seed: 42
However, adding
random_state:
  choices:
    - 123
  type: enum
seems to do exactly what I want, so it's not necessary to fix anything; I am just curious :)
Ah, I am having trouble with nosetests again :P
So, I ran python setup.py develop and it solved the version issue; another thing, though, is that somehow nosetests is not grabbing the tests.
~/Desktop/osprey-master$ nosetests
----------------------------------------------------------------------
Ran 0 tests in 0.031s
OK
~/Desktop/osprey-master$ ls
LICENSE basesetup.py osprey requirements.txt
README.md devtools osprey.egg-info setup.py
__pycache__ docs paper
~/Desktop/osprey-master$ ls osprey/tests/
__init__.py test_config.py test_search_space.py
__pycache__ test_dataset_loader.py test_strategies.py
test_cli_skeleton.py test_entry_point.py test_trials.py
test_cli_worker_and_dump.py test_fit_estimator.py test_utils.py
~/Desktop/osprey-master$ cd ~/code/siteinterlock
~/code/siteinterlock$ nosetests
....................
----------------------------------------------------------------------
Ran 20 tests in 5.663s
OK
Maybe I am doing something silly or overlooking something ... :/
So, I ran python setup.py develop and it solved the version issue; another thing, though, is that somehow nosetests is not grabbing the tests.
Try this:
(msmb-dev) [cxh@local ~/Developer/osprey]$ cd
(msmb-dev) [cxh@local ~]$ nosetests osprey
test_pylearn2_autoencoder (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_autoencoder: this test requires pylearn2
test_pylearn2_classifier (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_classifier: this test requires pylearn2
test_pylearn2_dataset_loader (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_dataset_loader: this test requires pylearn2
test_pylearn2_dataset_loader_one_hot (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_dataset_loader_one_hot: this test requires pylearn2
test_pylearn2_regressor (osprey.plugins.tests.test_plugin_pylearn2.TestPluginPylearn2) ... SKIP: Skipping test: test_pylearn2_regressor: this test requires pylearn2
osprey.tests.test_cli_skeleton.test_1 ... create config.yaml
....
For some reason, the tests need to be run outside of the source directory.
Have one more question about the random state though. Where exactly is it used? Only in the cross-validator? Because I noticed that the random_state is set to "null" in the estimator itself
It's used here to seed NumPy's random generator. But you make a good point that this should also be passed to the estimator explicitly.
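To illustrate the distinction (a sketch only, not osprey internals; the seed values are arbitrary):
# Illustrative sketch of the two kinds of seeding being discussed.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(42)                           # what the config's random_seed controls
clf = LogisticRegression(random_state=123)   # what passing it to the estimator would pin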
OK, I'm self-appointing as the review-thread police -- can @cxhernandez and @rasbt carry on this debugging in a separate issue filed on the osprey issue tracker, please?
Okay, I won't go much further into debugging then; I think it's fine in its current state (these are just points you may want to address in future versions). I checked off all the points, and the only thing I would do for version 1.1.0 is add the required Python versions to the dependency list (e.g., mentioning that it works for both Python 2.7.x and 3.5).
There's currently only 65% coverage; I am not sure what the JOSS requirements are; personally, I would recommend extending the unit tests to at least 90%.
The JOSS paper (the PDF associated with this submission) should only include:
Adding the required/supported Python version would be great.
There should be clear guidelines for third-parties wishing to:
Include here some examples of well-documented software for people to review.
E.g., scikit-learn, Keras, or TensorFlow
@arfon, @rasbt: I've added the most recent suggestions and have v1.1.0 ready to be merged, pending anything else.
Thanks @cxhernandez. @rasbt - would you mind taking another look at this and if things look good we can move forward with accepting this into JOSS.
Thanks @cxhernandez. @rasbt - would you mind taking another look at this and if things look good we can move forward with accepting this into JOSS.
@arfon Sure, no problem. This 1.1.0 version is the one I have been reviewing; the pull request (once merged) just changes the labels (as far as I can see based on the file changes). I would suggest changing the version tag (to 1.1.0) in the paper as well though, since this will be the one uploaded to JOSS if I understand correctly.
I would suggest changing the version tag (to 1.1.0) in the paper as well though, since this will be the one uploaded to JOSS if I understand correctly.
Will do. So are we good to accept @rasbt?
Will do. So are we good to accept @rasbt?
Yes, it looks good to me :).
@cxhernandez - last thing:
At this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.
Thanks for the review @rasbt!
@cxhernandez your paper is now accepted into JOSS and your DOI is http://dx.doi.org/10.21105/joss.00034 🚀 🎉 💥
Woo! Thanks @arfon, @labarba, and @rasbt!
Congrats! :). Although the review was maybe a bit tedious, I hope it also helped improve the package, despite the perhaps too many suggestions I brought up ;). In any case, I am really glad to hear that the software package was accepted and that we have another good & useful open-source package out there!
The software (and docs) undeniably benefitted from the review process! I look forward to submitting more stuff 😄
Thanks again!