Joss-reviews: [REVIEW]: Pyradigm: PYthon based data structure to improve Dataset's InteGrity in Machine learning workflows

Created on 28 Aug 2017 · 35 Comments · Source: openjournals/joss-reviews

Submitting author: @raamana (Pradeep Reddy Raamana)
Repository: https://github.com/raamana/pyradigm
Version: v0.3.0.3
Editor: @arokem
Reviewer: @ahwillia
Archive: 10.5281/zenodo.888108

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/c5c231486d699bca982ca7ebd9cf32d2"><img src="http://joss.theoj.org/papers/c5c231486d699bca982ca7ebd9cf32d2/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/c5c231486d699bca982ca7ebd9cf32d2/status.svg)](http://joss.theoj.org/papers/c5c231486d699bca982ca7ebd9cf32d2)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer questions

@ahwillia, please carry out your review in this issue by updating the checklist below (please make sure you're logged in to GitHub). The reviewer guidelines are available here: http://joss.theoj.org/about#reviewer_guidelines. Any questions/concerns please let @arokem know.

Conflict of interest

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: Does the release version given match the GitHub release (v0.3.0.3)?
[x] Authorship: Has the submitting author (@raamana) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: Have any performance claims of the software been confirmed?

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

accepted published recommend-accept review

Source

whedon

All 35 comments

Hello human, I'm @whedon. I'm here to help you with some common editorial tasks for JOSS. @ahwillia it looks like you're currently assigned as the reviewer for this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that under GitHub's default behaviour you will receive notifications (emails) for all JOSS reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

whedon on 28 Aug 2017

@arokem - is this the appropriate place for me to ask any questions to you?

At the moment, I'm wondering if I should limit the review narrowly to the checkboxes above or if more long-form suggestions are welcome. For example, I think it would be nice if this package were compatible with Python 3+, and it shouldn't take that much effort to make that happen. Is that a reasonable request from the standpoint of JOSS? I feel a bit odd opening issues for these sorts of suggestions on the target repo.

Thanks for walking me through this.

ahwillia on 28 Aug 2017

Hi Alex,

Thank you for taking the time to review this, and for your positive support for the ideas and motivation behind this project. Much appreciated. I'm looking forward to seeing your ideas and suggestions for how to improve its value and interoperability with other tools and packages.

Support for python 3+ is certainly on my TODO list, perhaps I should note it on the repo itself in its roadmap. It's just limited by my time or contributions from others interested.

In fact, I already gave it a try a few weeks ago. Either it's not so simple, or I need to up my Python module packaging and version compatibility game. If you have clear suggestions, I'll happily follow them, provided they don't take too long.

It's a great suggestion for improvement, but I'm not sure if that's a requirement for acceptance. Perhaps @arokem can clarify.

raamana on 28 Aug 2017

Yep - I'm not a fan of requirements and reviewer gate keeping, but was trying to figure out the scope of things acceptable to comment on and suggest. I guess my question is whether a repo is "accepted" once I click all the boxes, or if I am expected to also make high-level suggestions and criticisms that the editor decides are either necessary or not. (The latter would be a more conventional review process.)

Shame to hear that Python3 wasn't straightforward to implement. But that's a minor comment anyways.

I'll try to block off an afternoon later this week to do the review in full. I'm just trying to feel out exactly what has to be done so I can finish it in one go.

ahwillia on 28 Aug 2017

To my understanding (@arokem can correct me if I am wrong), if you tick off all the questions (implying the questions therein are answered), I believe it will be accepted. However, the key section for your concern is Functionality, where "Have the functional claims of the software been confirmed?" touches on Python 3 compatibility. If pyradigm were to claim compatibility with Python 3+ (which it doesn't) and failed to satisfy that, then it would need to be addressed.

raamana on 28 Aug 2017

I just updated the README to clarify this.

raamana on 28 Aug 2017

Thanks for submitting this @raamana! Most of this looks good, and +1 for positive and negative tests.

I do have a couple comments after a quick review this morning:

Documentation: I'd like to see better documentation, and have filed https://github.com/raamana/pyradigm/issues/2
Statement of need: could you include a paragraph about how this extends pandas? Would including pandas simplify the implementation?
Community guidelines: CONTRIBUTING.md describes what to contribute but does not describe how and misses the points mentioned in https://github.com/openjournals/joss-reviews/issues/382#issue-253207360. Could you clarify this?

I'll try to put in more time for review later this week.

stsievert on 28 Aug 2017

Thanks Scott. Full docs are definitely at the top of my plan. I will wait for comments from all the reviewers, esp. regarding the API, interoperability, etc., to finalize the API and document it thoroughly.

Let me answer the next two points shortly.

raamana on 28 Aug 2017

So I have updated the README and contribution guidelines as @stsievert suggested, take a look. Perhaps you can tick off the issues for the review as you see them done/to-be-done.

I will wait for @ahwillia's comments (and Scott's if any) before starting to document the API - this is one area all of your comments will be helpful.

raamana on 28 Aug 2017

Hi everyone!

@ahwillia : this is the place to ask me questions, though as you can see, it might take me a little while to answer...

At any rate, thanks for starting the discussion here. Let me see if I can answer the questions that have come up so far: extending support to Python 3 is not a requirement for publication, but I do agree that it would improve the software, so it is worth considering now that you are responding to a lot of external reviews.

On the other hand, documentation is a requirement, so you would have to address https://github.com/raamana/pyradigm/issues/2.

Thanks again, and please keep me in the loop as you go through the requirements. Either @ahwillia or @stsievert can tick the boxes above, but I will ask that you both also explicitly approve the submission at the end of the process.

arokem on 4 Sep 2017

@raamana - I'm okay to sign off on this once you address @stsievert's comments. Fixing up the documentation and giving a clear statement of purpose (with comparison to pandas and cross-validation tools in scikit-learn) are the top priority issues.

You can treat everything below as a suggestion.

- The example notebook is good but a bit long. It would be really helpful to new users if you could compartmentalize the functionality. In particular, the table of contents could have more direct headings, such as:
  - Constructing a dataset
  - Cross-validation
  - Saving/reloading a dataset
  - Exporting to numpy
  - etc.
- Is "dataset" the right name? What about dataframe or datatable? This is what pandas / pytables call them...
- Saving datasets in HDF5 or Feather format would be really helpful. I am always looking for a better way to save and reload datasets, because I'm too impatient to learn h5py. Pickling has well-documented issues, so packages that make these other formats easily accessible are a big plus.
- It seems like the user needs to export their data using the dataset.data_and_labels() command in order to use scikit-learn? It seems like that slightly defeats the purpose of the package. Since scikit-learn has a very strict API, would it make sense to fit models internally within the dataset object? Something like:

```python
from sklearn import svm
dataset.fit_model(svm.SVC(gamma=0.001, C=100.))
Y = dataset.predict()
```

I could see this getting tricky and beyond the scope of the current submission (e.g. what happens if you fit multiple models? Do you just delete the last one?). Anyway, just a thought I had.
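One way the idea above could look, as a minimal sketch rather than anything pyradigm provides: a container that keeps features and labels keyed by sample ID and forwards them to any estimator following the scikit-learn `fit(X, y)` / `predict(X)` convention. `ToyDataset`, `add_sample`, `fit_model`, `predict`, and `MajorityClass` are hypothetical names used only for illustration, and the three-tuple returned by the toy `data_and_labels` is an assumption, not pyradigm's documented signature. Refitting simply replaces the stored model, which is one possible answer to the multiple-models question.

```python
from collections import Counter


class MajorityClass:
    """Stand-in estimator with a scikit-learn-style fit/predict interface."""

    def fit(self, X, y):
        # Remember the most frequent training label.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyDataset:
    """Hypothetical container: sample ID -> (feature vector, label)."""

    def __init__(self):
        self._samples = {}
        self._model = None

    def add_sample(self, sample_id, features, label):
        self._samples[sample_id] = (list(features), label)

    def data_and_labels(self):
        # Sort IDs so rows and labels always stay aligned.
        ids = sorted(self._samples)
        X = [self._samples[i][0] for i in ids]
        y = [self._samples[i][1] for i in ids]
        return X, y, ids

    def fit_model(self, estimator):
        X, y, _ = self.data_and_labels()
        # Refitting overwrites any earlier model.
        self._model = estimator.fit(X, y)
        return self

    def predict(self):
        if self._model is None:
            raise RuntimeError("call fit_model() first")
        X, _, ids = self.data_and_labels()
        return dict(zip(ids, self._model.predict(X)))


dataset = ToyDataset()
dataset.add_sample("s1", [0.1, 0.2], "control")
dataset.add_sample("s2", [0.3, 0.1], "control")
dataset.add_sample("s3", [0.9, 0.8], "patient")
predictions = dataset.fit_model(MajorityClass()).predict()
```

Because the container only relies on `fit` and `predict`, an `svm.SVC(...)` instance could be passed in place of `MajorityClass()`.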

ahwillia on 4 Sep 2017

Thanks Alex - very useful comments.

Except for detailed API documentation, I think I've already addressed @stsievert's comments. I think he wants to take another pass to review this.

- Your suggestions are very good, esp. making the example TOC headings more direct and simpler to read :). I changed the headings for now and will revamp once I get more time, with more examples later (showing how it is used in neuropredict, for example).
- dataset is the right name in my own mind, but I agree with you that it can have other meanings in other domains. Calling it a DataFrame (as in pandas) collapses its meaning to a 2D object (which it currently is), but I'd like to try to convey that it can store arbitrary data with trivial changes.
- Good suggestions for serialization. I will consider them, but from my research 6 months ago, pickle serves the purpose (for all its flaws) for now, if users don't do stupid things. json is certainly a good alternative, but its disk size is higher (I don't know by how much) than pickle's binary format. This can be a problem for some people (like me) who use a large number of datasets for each project. Improving this is certainly on the roadmap.
- Yes, allowing model fitting will open another can of worms :). neuropredict serves that purpose.
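The size trade-off between the two formats can be measured rather than guessed. The sketch below serializes the same toy dataset with the standard-library `pickle` and `json` modules and prints the byte counts. The data layout (sample ID mapped to a dict of features and label) is an assumption for illustration, not pyradigm's on-disk format, and the ratio will vary with the data.

```python
import json
import pickle
import random

random.seed(0)

# Toy dataset: 100 samples with 50 float features each (layout is assumed).
toy = {
    "sample_%03d" % i: {
        "features": [random.random() for _ in range(50)],
        "label": i % 2,
    }
    for i in range(100)
}

pickled = pickle.dumps(toy, protocol=pickle.HIGHEST_PROTOCOL)
as_json = json.dumps(toy).encode("utf-8")

print("pickle bytes:", len(pickled))
print("json bytes:  ", len(as_json))

# Both formats round-trip this particular structure without loss.
restored_pickle = pickle.loads(pickled)
restored_json = json.loads(as_json.decode("utf-8"))
```

Note that JSON only round-trips cleanly here because the keys are strings and the values are plain lists, floats, and ints; richer Python objects would need custom encoding.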

raamana on 4 Sep 2017

One thing I'd like @stsievert and @ahwillia to give serious thought to is the API. How can we improve it to ease your ML workflow on a daily/weekly basis?

I will get back to you soon with improved API docs

raamana on 4 Sep 2017

> How can we improve API to ease your ML workflow on a daily/weekly basis?

I don't think I'm the right person to ask, as I'm not used to dealing with datasets with very cumbersome metadata. Honestly, I don't even use pandas because working directly with numpy arrays serves my purposes. I also end up using a lot of nested dicts, which is maybe suboptimal, but this package doesn't appear to address that issue.

ahwillia on 4 Sep 2017

👍1

No problem, let me know as you get ideas, or when you choose to adopt this in your workflow.

I guess API documentation is the only thing holding the review back now? Can one of you tick off the boxes at the top for me to focus on what needs to be done? Thanks,

raamana on 4 Sep 2017

> How can we improve API to ease your ML workflow on a daily/weekly basis?

Possible API improvements:

- You have implemented __contains__. If i in dataset, I would expect dataset[i] to return the value.
  - could be implemented with __get_subset_from_dict
  - could simplify get_subset
- glance: this is similar to Pandas df.head. Rename?
- summarize_classes is similar to Pandas df.describe. Rename?
- Implementation of __del__ could be used to delete a sample
  - could be implemented with del_sample
- In your implementation of num_features, you raise a warning on set. Make it a function instead? You cannot set a function value.
- You have some functions that could be marshaled to a utils file. The functions below use none of the class features, and I don't know if they're appropriate for the class MLDataset.
  - __take
  - __str_names
  - keys_with_value

These are all first-glance comments. Let me know how any of these comments can be improved.
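Several of the suggestions above map onto Python's container protocol. The following is a minimal sketch, not pyradigm's implementation: `ToyMLDataset` and `add_sample` are hypothetical names showing how `__contains__`, `__getitem__`, item deletion, and a read-only property fit together. Note that the `del dataset[key]` syntax is served by `__delitem__`; `__del__` is the object finalizer and is not invoked by item deletion.

```python
class ToyMLDataset:
    """Hypothetical stand-in: maps sample IDs to feature vectors."""

    def __init__(self):
        self._data = {}

    def add_sample(self, sample_id, features):
        self._data[sample_id] = list(features)

    def __contains__(self, sample_id):
        # Enables: sample_id in dataset
        return sample_id in self._data

    def __getitem__(self, sample_id):
        # Enables: dataset[sample_id]; raises KeyError if absent.
        return self._data[sample_id]

    def __delitem__(self, sample_id):
        # Enables: del dataset[sample_id]
        del self._data[sample_id]

    def __len__(self):
        return len(self._data)

    @property
    def num_features(self):
        # No setter is defined, so assignment raises AttributeError.
        if not self._data:
            return 0
        return len(next(iter(self._data.values())))


dataset = ToyMLDataset()
dataset.add_sample("s1", [1.0, 2.0, 3.0])
dataset.add_sample("s2", [4.0, 5.0, 6.0])

if "s1" in dataset:
    first = dataset["s1"]

del dataset["s2"]

try:
    dataset.num_features = 10
except AttributeError:
    set_rejected = True
else:
    set_rejected = False
```

A property without a setter gives a hard error on assignment instead of a warning, which is one way to address the num_features point while keeping attribute-style access.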

stsievert on 4 Sep 2017

👍1

I've gone through the checklist, and here are my findings (I can't edit the top comment)

General checks

[x] repository
[x] license (MIT)
[x] version. The version on PyPI matches but there ~~are no~~ is a GitHub release.
[x] authorship

Functionality

[x] installation. Issue https://github.com/raamana/pyradigm/issues/6 ~~filed~~ closed and https://github.com/raamana/pyradigm/pull/8 ~~proposed~~ merged.
[x] functionality. I need to test the last cell in [PyradigmExample.ipynb] about sklearn support.

Documentation

[x] statement of need. There is one on the readme
- could this library somehow wrap Pandas?
[x] installation instructions with PyPI.
[x] Example usage with [PyradigmExample.ipynb]
[x] functionality documentation. Needs docs (and now https://github.com/raamana/pyradigm/issues/2 closed)
[x] automated tests.
[x] Community guidelines

Software paper

[x] authors
[x] statement of need. There is one present.
[x] references. I don't believe you should self-reference your paper; other papers don't. Maybe @arokem can chime in?

stsievert on 5 Sep 2017

Glad to report that pyradigm now works on Python 2.7, 3.5, and 3.6: https://travis-ci.org/raamana/pyradigm

raamana on 5 Sep 2017

🎉1

[x] Full docs for API
[x] same version everywhere
[x] installations tests on CI
[x] functionality tests on CI

Thanks again. Let me know if you guys need anything else to close the review.

raamana on 5 Sep 2017

Scott, thanks for the suggestions to improve the API. I will keep them in mind and improve the API in due course. Note the API must cater to non-expert programmers/users from neuro- or related fields (who know of datasets/samples etc, but never heard of pandas). I'm not targeting those who already know of or use pandas :)

raamana on 5 Sep 2017

@stsievert : IIUC - the reference in the paper is to a different, more neuro-specific project. That's fine.

@raamana : Regarding documentation of the software -- I don't think that the current documentation is sufficient. I would recommend adding a sphinx documentation website, including full reference documentation of the package and its API. If you want to automate documentation building on your CI, this is a good option: https://drdoctr.github.io/doctr/

arokem on 6 Sep 2017

Thanks for keeping the API comments in mind!

> the reference in the paper is [...] different

Whoops -- that's what you get for skimming the URL.

stsievert on 6 Sep 2017

I've finally tamed Sphinx to help me produce the docs: http://pyradigm.readthedocs.io/en/latest/

Thanks @arokem for the doctr tip, and I will try to use that too.

Is there anything else I need to do for the acceptance?

raamana on 7 Sep 2017

👍1

An older theme showed a nice table of attributes and methods at the top of the API reference page, using the autosummary directive, which doesn't seem to work with the current theme or the docs hosting service readthedocs. If you can help me get that to work, that'll be appreciated (my hours of googling and hacking haven't worked so far :) ).
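For reference, a minimal Sphinx `conf.py` fragment that enables `autosummary` alongside `autodoc` might look like the sketch below. The extension names and the `autosummary_generate` option are standard Sphinx settings; the project name and theme choice are assumptions, and this is not a confirmed fix for the theme interaction described above.

```python
# conf.py (fragment): illustrative values, not the project's actual config
project = "pyradigm"

extensions = [
    "sphinx.ext.autodoc",      # pull API docs from docstrings
    "sphinx.ext.autosummary",  # build summary tables of classes/methods
]

# Generate stub pages for entries listed in autosummary directives.
autosummary_generate = True

# Read the Docs theme; requires the sphinx_rtd_theme package.
html_theme = "sphinx_rtd_theme"
```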

raamana on 7 Sep 2017

Where does https://github.com/raamana/pyradigm/pull/8 stand?

Is this required for acceptance, @stsievert? Does your 👍 on previous comment mean that it's ready to go from your point of view?

@raamana : for the table, you mean something like this?

http://yeatmanlab.github.io/pyAFQ/reference/AFQ.html

Sorry - I am not exactly sure how that comes about. But feel free to copy the sphinx configuration from that project!

arokem on 8 Sep 2017

I just merged the PR https://github.com/raamana/pyradigm/pull/8

@arokem, yes. I got it working for the older theme, but the shiny new (mobile-friendly) RTD theme doesn't seem to interact well with the autosummary directive. It's minor anyway. I might make a manual table later on, though it would be ideal to have it automated. I will check their conf.

raamana on 8 Sep 2017

> Is this required for acceptance, @stsievert? Does your 👍 on previous comment mean that it's ready to go from your point of view?

:+1: for review here -- I've checked all the boxes and am satisfied! In https://github.com/openjournals/joss-reviews/issues/382#issuecomment-327631871 I was giving a :+1: for documentation, not for review, but I am satisfied now.

stsievert on 8 Sep 2017

Awesome. Thanks Alex, Scott and Ariel 👍, I much appreciate your time and effort.

Let's hope pyradigm will find some users and the users will find it useful :)

raamana on 8 Sep 2017

🎉1

Congratulations!

@raamana : Your next step is to make an archive from the current version (e.g., using Zenodo) and to post the DOI of the archive here.

arokem on 9 Sep 2017

Thank you, here is the DOI: 10.5281/zenodo.888108

raamana on 9 Sep 2017

@whedon set 10.5281/zenodo.888108 as archive

arokem on 11 Sep 2017

OK. 10.5281/zenodo.888108 is the archive.

whedon on 11 Sep 2017

@arfon: I believe this paper is publishable in its current form. Please let me know if we need to do anything else here.

arokem on 11 Sep 2017

@ahwillia - many thanks for your review here and to @arokem for editing this submission ✨

@raamana - your submission is now accepted into JOSS and your DOI is http://dx.doi.org/10.21105/joss.00382 ⚡️ 🚀 💥

arfon on 11 Sep 2017

Thanks Arfon and Ariel :)

Thanks again to Scott @stsievert and Alex, who contributed to the review significantly.

raamana on 11 Sep 2017

🎉1
