Joss-reviews: [REVIEW]: datacleanbot: an automated data cleaning tool

Created on 30 Jul 2019 · 44Comments · Source: openjournals/joss-reviews

Submitting author: @Ji-Zhang (Ji Zhang)
Repository: https://github.com/Ji-Zhang/datacleanbot
Version: v0.4
Editor: @arfon
Reviewers: @kellieotto, @jjmcnelis
Archive: Pending

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737"><img src="http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737/status.svg)](http://joss.theoj.org/papers/5c683cc54ee2bc4a626fe530615bb737)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@kellieotto and @jjmcnelis , please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @usethedata know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @kellieotto

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: Does the release version given match the GitHub release (v0.4)?
[x] Authorship: Has the submitting author (@Ji-Zhang) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[ ] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @jjmcnelis

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: Does the release version given match the GitHub release (v0.4)?
[x] Authorship: Has the submitting author (@Ji-Zhang) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[ ] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[ ] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[ ] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

pending-major-enhancements rejected review

Source

whedon

All 44 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @kellieotto it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon on 30 Jul 2019

Attempting PDF compilation. Reticulating splines etc...

whedon on 30 Jul 2019

:point_right: Check article proof :page_facing_up: :point_left:

whedon on 30 Jul 2019

@kellieotto Just checking in (I've been off-line working a major proposal and getting my house on the market to sell). Any questions?

usethedata on 5 Aug 2019

Thanks for checking in. I plan to get to it this week, no questions!

On Sun, Aug 4, 2019 at 3:40 PM Bruce Wilson notifications@github.com
wrote:

@kellieotto https://github.com/kellieotto Just checking in (I've been
off-line working a major proposal and getting my house on the market to
sell). Any questions?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/openjournals/joss-reviews/issues/1608?email_source=notifications&email_token=AB46QQV6EBRA6UTMAIWEKH3QC5LGPA5CNFSM4IHXWRN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QLC3Y#issuecomment-518041967,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AB46QQWTLG5ITOXFYSBE2HLQC5LGPANCNFSM4IHXWRNQ
.

--
Kellie Ottoboni
Ph.D. Statistics '19, University of California, Berkeley
Fellow at Berkeley Institute for Data Science

Mobile: (650) 520-5056
Website: www.kellieottoboni.com

kellieotto on 5 Aug 2019

I've created issues in the repo for the problems I encountered. The biggest issue is that I'm not able to get the examples to run. Maybe I'm doing something silly on my end, but I think there's a dependency management issue here.

kellieotto on 8 Aug 2019

@whedon generate pdf

Ji-Zhang on 8 Aug 2019

Attempting PDF compilation. Reticulating splines etc...

whedon on 8 Aug 2019

:point_right: Check article proof :page_facing_up: :point_left:

whedon on 8 Aug 2019

@whedon add @jjmcnelis as reviewer

usethedata on 12 Aug 2019

OK, @jjmcnelis is now a reviewer

whedon on 12 Aug 2019

I apologize for the delay. I will complete my initial review by this Wednesday.

EDIT (2019-08-28) @Ji-Zhang I don't have experience working with several of datacleanbot's dependencies (I'll ask @usethedata to clarify if this disqualifies me as a reviewer). These hangups could be familiar for Python users in your domain. So I'll wait for your comment before opening issues on the following points:

Getting started in fresh Python 3 environments:
I don't have R installed on my Windows system. Pip exits with an error after it fails to find my R path. Is rpy2 essential to the core functionality of datacleanbot?
I successfully installed with pip in a virtual environment on ubuntu. The dependencies of the bayesian submodule aren't covered in datacleanbot's setup.py. Is this deliberate? I can't work through the examples on the readthedocs page without manually installing.
After applying the multiple imputation approach, the autoclean example (https://datacleanbot.readthedocs.io/en/latest/Example_autoclean.html) attempts to unpickle a remote file located inside your github repository:

metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))

I haven't dug into your code, but this seems like a questionable practice (predict_best_anomaly_algorithm: line 1033 in https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/dataclean.py). Is it common for machine learning software to access remote resources like this?

There's another piece to this that I'd like you to clarify. From my understanding of the pickle module, pickled object(s) must be unpickled in a Python environment with compatible versions of the dependent software. E.g. In the autoclean example, I couldn't advance past the routine above (predict_best_anomaly_algorithm) because my version of sklearn is incompatible:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1344, in autoclean
    Xy_cleaned = handle_outlier(features_new, Xy_filled)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1285, in handle_outlier
    best = predict_best_anomaly_algorithm(X, y)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/datacleanbot/dataclean.py", line 1049, in predict_best_anomaly_algorithm
    metalearner = joblib.load(urlopen("https://github.com/Ji-Zhang/datacleanbot/blob/master/datacleanbot/metalearner_rf.pkl?raw=true"))
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 588, in load
    obj = _unpickle(fobj)
  File "/home/jnd/tmp/datacleanbot/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
  File "/usr/lib/python3.6/pickle.py", line 1338, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python3.6/pickle.py", line 1392, in find_class
    return getattr(sys.modules[module], name)
AttributeError: module 'sklearn.utils.deprecation' has no attribute 'DeprecationDict'

Please correct me if I've misinterpreted the traceback.

setup.py indicates that datacleanbot is intended for use with Python 3. Have you tested or otherwise confirmed incompatibility with Python 2? I don't see any mention this on the readthedocs page.

Overall, I think this will be a good fit for JOSS. Looking forward to your comment.

jjmcnelis on 26 Aug 2019

Hi @Ji-Zhang how are things going on your end for your JOSS submission?

kthyng on 18 Sep 2019

Hi, I am sorry I missed @jjmcnelis 's comments. I am kind of snowed under recently. I will work on that next week and also make the unit test available. Sorry for the delay.

Ji-Zhang on 18 Sep 2019

👍1

Hi @jjmcnelis I apologize for the delay to answer your questions.

The following are my answers:

Getting started in fresh Python 3 environments:
rpy2 is related to the data type discovery functionality, which is based on the Bayesian model. Without rpy2, the Bayesian model can not run successfully.
Only openml is recommended to be installed manually. The others are not deliberate. Could you please let me know which submodels you need to install manually? Thanks!
The remote file metalearner is a pre-trained machine learning model. The prediction of the optimal algorithm involves this model. However, the pickle file metalearner cannot be packaged together through pip. So I took this way around.
The sklearn incompatible problem is related to GridSearchCV which I used to train the metalearner model. GridSearch had a DeprecationDict in it, but that doesn't exist in sklearn any more. I tried to retrain the model with the newest sklearn version but it doesn't seem work.
Some dependencies of datacleanbot only support Python3 so I assume Python2 won't work. I will add this note to the document.

Please feel free to let me know if you have more questions. Again, sorry for the delay.

Ji-Zhang on 25 Sep 2019

@jjmcnelis How has @Ji-Zhang responded to your concerns?

kthyng on 22 Oct 2019

I've pinged @jjmcnelis out of band.

usethedata on 3 Nov 2019

@kthyng @Ji-Zhang I'm sorry for holding you up. Thanks for the thoughtful responses to my questions. I'm satisfied with points 1 and 2. Point 3 needs addressing.

I raised the point about rpy2 in the interest of usability. I don't use R often enough to justify having it installed on my workstation. I'd look into implementing the functionality provided by rpy2 in Python in a future release so that your software will be usable by the widest audience. But this isn't a JOSS requirement as far as I know.
Since my initial review I've come across a few other examples of machine learning resources retrieved remotely when a model runs, so I guess it's reasonably common.
Dependency issues are many, on my end. I haven't been able to run the examples in three separate attempts, on three separate machines. And I get a number of different tracebacks that bear no resemblance to each other. Did you hear from @kellieotto about their dependency issues (#issuecomment-519314651)?

jjmcnelis on 3 Nov 2019

@Ji-Zhang what are your thoughts about the question in point 3 above?

usethedata on 24 Nov 2019

@usethedata I am working on simplifying the package such that the dependencies can become not this heavy. The light version should have been released before the weekend.

Ji-Zhang on 27 Nov 2019

@jjmcnelis @kellieotto I have deprecated the feature of detecting feature data type using the Bayesian method as too many dependencies there. I created a new clean virtual environment to test it and it works well. I hope it can work on your end successfully as well . Thank you.

Ji-Zhang on 29 Nov 2019

Hi - As the JOSS Associate Editor-in-Chief on duty this week, I'm going through some submissions with little recent action, and I can't quite tell what is going on in this submission. @usethedata - can you help me understand what is the next action?

danielskatz on 7 Jan 2020

@whedon assign @arfon as editor

arfon on 1 Feb 2020

:wave: all, I'm going to step in here and try and wrap up this submission on behalf of @usethedata.

arfon on 1 Feb 2020

:wave: @kellieotto, @jjmcnelis, in their most recent comment it sounds like @Ji-Zhang has simplified some of the requirements here.

Do you think you could both take another look at this software again and confirm you're able to install it now? If at all possible, I'd love to get this review wrapped up in the next couple of weeks.

arfon on 1 Feb 2020

@arfon Sorry for dropping the ball on this! I just went ahead and tried installing again, and ran into another dependency issue (see linked issue)

kellieotto on 6 Feb 2020

@arfon Sorry for dropping the ball on this! I just went ahead and tried installing again, and ran into another dependency issue (see linked issue)

Thanks @kellieotto. @Ji-Zhang - can you see if you can figure out what's going on in https://github.com/Ji-Zhang/datacleanbot/issues/10 ?

arfon on 6 Feb 2020

Hi @arfon and @kellieotto , thank you for spotting the bug. This bug is now fixed. Please update to the latest version of datacleanbot and feel free to let me know if any questions. Thank you in advance.

Ji-Zhang on 16 Feb 2020

:wave: @kellieotto - please come and take another look at this submission when you get a chance.

arfon on 3 Mar 2020

@arfon I followed up with @Ji-Zhang in Ji-Zhang/datacleanbot#10. Still having import issues.

kellieotto on 3 Mar 2020

👍1

@arfon I followed up with @Ji-Zhang in Ji-Zhang/datacleanbot#10. Still having import issues.

OK, thanks for checking back in on this.

arfon on 3 Mar 2020

Dear authors and reviewers

We wanted to notify you that in light of the current COVID-19 pandemic, JOSS has decided to suspend submission of new manuscripts and to handle existing manuscripts (such as this one) on a "best efforts basis". We understand that you may need to attend to more pressing issues than completing a review or updating a repository in response to a review. If this is the case, a quick note indicating that you need to put a "pause" on your involvement with a review would be appreciated but is not required.

Thanks in advance for your understanding.

_Arfon Smith, Editor in Chief, on behalf of the JOSS editorial team._

arfon on 14 Mar 2020

:wave: @kellieotto, @jjmcnelis - how are you getting along with your reviews? I realize these are challenging times for many, so please don't take this enquiry as anything other than a friendly enquiry to see how things are going here.

@kellieotto - it looks like @Ji-Zhang responded to your dependency issues in https://github.com/Ji-Zhang/datacleanbot/issues/10 .

arfon on 10 May 2020

@arfon thanks for the reminder. I followed up with another dependency error in that issue.

kellieotto on 10 May 2020

👍1

👋 @Ji-Zhang - I think the next action is on you here to help @kellieotto with their dependency issues.

arfon on 5 Jun 2020

@Ji-Zhang - I think the next action is on you here to help @kellieotto with their dependency issues.

Thank you for your kind reminder. Working on it.

Ji-Zhang on 7 Jun 2020

👍1

@arfon I just followed up in the issue with @Ji-Zhang -- still having the same error as before.

kellieotto on 19 Jun 2020

@kellieotto Sorry for this inconvenience. Could you try it again, please?

Ji-Zhang on 28 Jun 2020

👋 hi everybody!

Looks like this review was progressing, a dependency issue was flagged by @kellieotto, and @Ji-Zhang made some changes, and is asking for the reviewer to try again. Could we have an update from those involved? If you need time, for whatever reason, just let use know—of course we understand that times are tough.

labarba on 23 Aug 2020

@labarba @Ji-Zhang I think I should step down as reviewer here. I realize my turnaround time has been very slow. I'm not sure whether it's the package or my installation of Python + associated packages, but I just can't get it to install and build correctly. I'm very sorry for the time wasted.

kellieotto on 30 Aug 2020

❤1

Hi @kellieotto , I think it is because some dependent packages are outdated. I fixed it by updating the setup file. Run pip install datacleanbot in a newly created environment, the example should work. Thanks for all the time you have devoted so far.

Ji-Zhang on 30 Aug 2020

Hi @Ji-Zhang. This review has been going on for a very long time (more than 1 year) and both reviewers have struggled with this software. As a result, I've just taken a deeper look at this submission and I have to say that I think this software is still well below the standard that is allowable in JOSS. For example:

Multiple blocks of commented out code: here • here • here • here
No tests.
Multiple unresolved issues

Given the age of this submission, and the state of this software currently I'm going to reject this submission. If you are able to substantially improve the quality of this software then we would welcome a resubmission to JOSS at some point in the future.

@kellieotto, @jjmcnelis - thanks for your time here. I'm sorry your work here didn't result in a paper for the author.

arfon on 31 Aug 2020

@whedon reject

arfon on 31 Aug 2020

Paper rejected.

whedon on 31 Aug 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[REVIEW]: Virtual Bumblebees

whedon · 12Comments

[REVIEW]: gwdegree: A Shiny App to Aid Interpretation of Geometrically-Weighted Degree Estimates in Exponential Random Graph Models

whedon · 8Comments

[PRE REVIEW]: Mordecai: Full Text Geoparsing and Event Geocoding

whedon · 13Comments

[PRE REVIEW]: MSMExplorer: Data Visualizations for Biomolecular Dynamics

whedon · 12Comments

[REVIEW]: GIMS: Graphical Interface for Materials Simulations

whedon · 9Comments