Joss-reviews: [REVIEW]: datacleanbot: an automated data cleaning tool

Created on 30 Jul 2019  Â·  44Comments  Â·  Source: openjournals/joss-reviews

Submitting author: @Ji-Zhang (Ji Zhang)
Repository: https://github.com/Ji-Zhang/datacleanbot
Version: v0.4
Editor: @arfon
Reviewers: @kellieotto, @jjmcnelis
Archive: Pending

Status

status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737"><img src="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg)](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@kellieotto and @jjmcnelis , please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @usethedata know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @kellieotto

Conflict of interest

Code of Conduct

General checks

  • [x] Repository: Is the source code for this software available at the repository url?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • [x] Version: Does the release version given match the GitHub release (v0.4)?
  • [x] Authorship: Has the submitting author (@Ji-Zhang) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

  • [x] Installation: Does installation proceed as outlined in the documentation?
  • [ ] Functionality: Have the functional claims of the software been confirmed?
  • [ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • [x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • [ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • [ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • [ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • [ ] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • [x] Authors: Does the paper.md file include a list of authors with their affiliations?
  • [x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @jjmcnelis

Conflict of interest

Code of Conduct

General checks

  • [x] Repository: Is the source code for this software available at the repository url?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • [x] Version: Does the release version given match the GitHub release (v0.4)?
  • [x] Authorship: Has the submitting author (@Ji-Zhang) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

  • [ ] Installation: Does installation proceed as outlined in the documentation?
  • [ ] Functionality: Have the functional claims of the software been confirmed?
  • [ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • [x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • [ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • [ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • [ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • [ ] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • [x] Authors: Does the paper.md file include a list of authors with their affiliations?
  • [x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
pending-major-enhancements rejected review

All 44 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @kellieotto it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
Attempting PDF compilation. Reticulating splines etc...

@kellieotto Just checking in (I've been off-line working a major proposal and getting my house on the market to sell). Any questions?

Thanks for checking in. I plan to get to it this week, no questions!

On Sun, Aug 4, 2019 at 3:40 PM Bruce Wilson notifications@github.com
wrote:

@kellieotto https://github.com/kellieotto Just checking in (I've been
off-line working a major proposal and getting my house on the market to
sell). Any questions?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/openjournals/joss-reviews/issues/1608?email_source=notifications&email_token=AB46QQV6EBRA6UTMAIWEKH3QC5LGPA5CNFSM4IHXWRN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QLC3Y#issuecomment-518041967,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB46QQWTLG5ITOXFYSBE2HLQC5LGPANCNFSM4IHXWRNQ
.

--
Kellie Ottoboni
Ph.D. Statistics '19, University of California, Berkeley
Fellow at Berkeley Institute for Data Science

Mobile: (650) 520-5056
Website: www.kellieottoboni.com

I've created issues in the repo for the problems I encountered. The biggest issue is that I'm not able to get the examples to run. Maybe I'm doing something silly on my end, but I think there's a dependency management issue here.

@whedon generate pdf

Attempting PDF compilation. Reticulating splines etc...

@whedon add @jjmcnelis as reviewer

OK, @jjmcnelis is now a reviewer

I apologize for the delay. I will complete my initial review by this Wednesday.

EDIT (2019-08-28) @Ji-Zhang I don't have experience working with several of datacleanbot's dependencies (I'll ask @usethedata to clarify if this disqualifies me as a reviewer). These hangups could be familiar for Python users in your domain. So I'll wait for your comment before opening issues on the following points:

  1. Getting started in fresh Python 3 environments:
  2. I don't have R installed on my Windows system. Pip exits with an error after it fails to find my R path. Is rpy2 essential to the core functionality of datacleanbot?
  3. I successfully installed with pip in a virtual environment on ubuntu. The dependencies of the bayesian submodule aren't covered in datacleanbot's setup.py. Is this deliberate? I can't work through the examples on the readthedocs page without manually installing.

  4. After applying the multiple imputation approach, the autoclean example (https://datacleanbot.readthedocs.io/en/latest/Example_autoclean.html) attempts to unpickle a remote file located inside your github repository:

metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))

I haven't dug into your code, but this seems like a questionable practice (predict_best_anomaly_algorithm: line 1033 in https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/dataclean.py). Is it common for machine learning software to access remote resources like this?

There's another piece to this that I'd like you to clarify. From my understanding of the pickle module, pickled object(s) must be unpickled in a Python environment with compatible versions of the dependent software. E.g. In the autoclean example, I couldn't advance past the routine above (predict_best_anomaly_algorithm) because my version of sklearn is incompatible:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1344, in autoclean
    Xy_cleaned = handle_outlier(features_new, Xy_filled)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1285, in handle_outlier
    best = predict_best_anomaly_algorithm(X, y)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1049, in predict_best_anomaly_algorithm
    metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 588, in load
    obj = _unpickle(fobj)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python3.6/pickle.py", line 1392, in find_class
    return getattr(sys.modules[module], name)
AttributeError: module 'sklearn.utils.deprecation' has no attribute 'DeprecationDict'

Please correct me if I've misinterpreted the traceback.

  1. setup.py indicates that datacleanbot is intended for use with Python 3. Have you tested or otherwise confirmed incompatibility with Python 2? I don't see any mention this on the readthedocs page.

Overall, I think this will be a good fit for JOSS. Looking forward to your comment.

Hi @Ji-Zhang how are things going on your end for your JOSS submission?

Hi, I am sorry I missed @jjmcnelis 's comments. I am kind of snowed under recently. I will work on that next week and also make the unit test available. Sorry for the delay.

Hi @jjmcnelis I apologize for the delay to answer your questions.

The following are my answers:

  1. Getting started in fresh Python 3 environments:
  2. rpy2 is related to the data type discovery functionality, which is based on the Bayesian model. Without rpy2, the Bayesian model can not run successfully.
  3. Only openml is recommended to be installed manually. The others are not deliberate. Could you please let me know which submodels you need to install manually? Thanks!

  4. The remote file metalearner is a pre-trained machine learning model. The prediction of the optimal algorithm involves this model. However, the pickle file metalearner cannot be packaged together through pip. So I took this way around.
    The sklearn incompatible problem is related to GridSearchCV which I used to train the metalearner model. GridSearch had a DeprecationDict in it, but that doesn't exist in sklearn any more. I tried to retrain the model with the newest sklearn version but it doesn't seem work.

  5. Some dependencies of datacleanbot only support Python3 so I assume Python2 won't work. I will add this note to the document.

Please feel free to let me know if you have more questions. Again, sorry for the delay.

@jjmcnelis How has @Ji-Zhang responded to your concerns?

I've pinged @jjmcnelis out of band.

@kthyng @Ji-Zhang I'm sorry for holding you up. Thanks for the thoughtful responses to my questions. I'm satisfied with points 1 and 2. Point 3 needs addressing.

  1. I raised the point about rpy2 in the interest of usability. I don't use R often enough to justify having it installed on my workstation. I'd look into implementing the functionality provided by rpy2 in Python in a future release so that your software will be usable by the widest audience. But this isn't a JOSS requirement as far as I know.

  2. Since my initial review I've come across a few other examples of machine learning resources retrieved remotely when a model runs, so I guess it's reasonably common.

  3. Dependency issues are many, on my end. I haven't been able to run the examples in three separate attempts, on three separate machines. And I get a number of different tracebacks that bear no resemblance to each other. Did you hear from @kellieotto about their dependency issues (#issuecomment-519314651)?

@Ji-Zhang what are your thoughts about the question in point 3 above?

@usethedata I am working on simplifying the package such that the dependencies can become not this heavy. The light version should have been released before the weekend.

@jjmcnelis @kellieotto I have deprecated the feature of detecting feature data type using the Bayesian method as too many dependencies there. I created a new clean virtual environment to test it and it works well. I hope it can work on your end successfully as well . Thank you.

Hi - As the JOSS Associate Editor-in-Chief on duty this week, I'm going through some submissions with little recent action, and I can't quite tell what is going on in this submission. @usethedata - can you help me understand what is the next action?

@whedon assign @arfon as editor

:wave: all, I'm going to step in here and try and wrap up this submission on behalf of @usethedata.

:wave: @kellieotto, @jjmcnelis, in their most recent comment it sounds like @Ji-Zhang has simplified some of the requirements here.

Do you think you could both take another look at this software again and confirm you're able to install it now? If at all possible, I'd love to get this review wrapped up in the next couple of weeks.

@arfon Sorry for dropping the ball on this! I just went ahead and tried installing again, and ran into another dependency issue (see linked issue)

@arfon Sorry for dropping the ball on this! I just went ahead and tried installing again, and ran into another dependency issue (see linked issue)

Thanks @kellieotto. @Ji-Zhang - can you see if you can figure out what's going on in https://github.com/Ji-Zhang/datacleanbot/issues/10 ?

Hi @arfon and @kellieotto , thank you for spotting the bug. This bug is now fixed. Please update to the latest version of datacleanbot and feel free to let me know if any questions. Thank you in advance.

:wave: @kellieotto - please come and take another look at this submission when you get a chance.

@arfon I followed up with @Ji-Zhang in Ji-Zhang/datacleanbot#10. Still having import issues.

@arfon I followed up with @Ji-Zhang in Ji-Zhang/datacleanbot#10. Still having import issues.

OK, thanks for checking back in on this.

Dear authors and reviewers

We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.

Thanks in advance for your understanding.

_Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team._

:wave: @kellieotto, @jjmcnelis - how are you getting along with your reviews? I realize these are challenging times for many, so please don't take this enquiry as anything other than a friendly enquiry to see how things are going here.

@kellieotto - it looks like @Ji-Zhang responded to your dependency issues in https://github.com/Ji-Zhang/datacleanbot/issues/10 .

@arfon thanks for the reminder. I followed up with another dependency error in that issue.

👋 @Ji-Zhang - I think the next action is on you here to help @kellieotto with their dependency issues.

@Ji-Zhang - I think the next action is on you here to help @kellieotto with their dependency issues.

Thank you for your kind reminder. Working on it.

@arfon I just followed up in the issue with @Ji-Zhang -- still having the same error as before.

@kellieotto Sorry for this inconvenience. Could you try it again, please?

👋 hi everybody!

Looks like this review was progressing, a dependency issue was flagged by @kellieotto, and @Ji-Zhang made some changes, and is asking for the reviewer to try again. Could we have an update from those involved? If you need time, for whatever reason, just let use know—of course we understand that times are tough.

@labarba @Ji-Zhang I think I should step down as reviewer here. I realize my turnaround time has been very slow. I'm not sure whether it's the package or my installation of Python + associated packages, but I just can't get it to install and build correctly. I'm very sorry for the time wasted.

Hi @kellieotto , I think it is because some dependent packages are outdated. I fixed it by updating the setup file. Run pip install datacleanbot in a newly created environment, the example should work. Thanks for all the time you have devoted so far.

Hi @Ji-Zhang. This review has been going on for a very long time (more than 1 year) and both reviewers have struggled with this software. As a result, I've just taken a deeper look at this submission and I have to say that I think this software is still well below the standard that is allowable in JOSS. For example:

Given the age of this submission, and the state of this software currently I'm going to reject this submission. If you are able to substantially improve the quality of this software then we would welcome a resubmission to JOSS at some point in the future.

@kellieotto, @jjmcnelis - thanks for your time here. I'm sorry your work here didn't result in a paper for the author.

@whedon reject

Paper rejected.

Was this page helpful?
0 / 5 - 0 ratings