Joss-reviews: [REVIEW]: REGENS: an open source Python package for simulating realistic autosomal genotypes

Created on 13 Oct 2020 · 43Comments · Source: openjournals/joss-reviews

Submitting author: @greggj2016 (John Gregg)
Repository: https://github.com/EpistasisLab/regens
Version: v0.1.0
Editor: @fboehm
Reviewer: @raivivek, @dwinter
Archive: Pending

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6"><img src="https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6/status.svg)](https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@raivivek & @dwinter, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fboehm know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @raivivek

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@greggj2016) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @dwinter

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@greggj2016) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Python Shell TeX review

Source

whedon

Most helpful comment

Thanks, @dwinter, @raivivek and @greggj2016 ! The discussions around the reviews seem to be highly constructive and helpful, I hope, to @greggj2016 and continued development of REGENS. The next steps, I think, are for @greggj2016 to make changes to the software to address the issues raised in the reviews. Ultimately, we need both reviewers to be satisfied with the submission to the point where they're comfortable checking all of the boxes in the above checklists.

Unlike many journals, our review process at JOSS is highly iterative. It's best, I think, when there's a great deal of dialogue between reviewers and authors, like you all have done above.

@greggj2016 - do you feel that you have sufficient feedback via reviews and Github issues to begin refining the software? When you feel that you've resolved the issues, please comment again on this thread so that @raivivek and @dwinter may examine the fixes. They'll then provide additional feedback, stating that they're satisfied or unsatisfied with the fixes. If they're unsatisfied, they'll also explain what remains to be done. Does that make sense to everyone? Thanks again for your work on this!!

fboehm on 3 Nov 2020

👍4

All 43 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @raivivek, @dwinter it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon on 13 Oct 2020

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1186/s12859-017-2004-2 is OK
- 10.1534/genetics.109.108431 is OK
- 10.1038/ejhg.2017.50 is OK
- 10.1371/journal.pgen.0020142 is OK
- 10.1186/1756-0381-5-16 is OK
- 10.1007/978-3-540-78757-0_3 is OK
- doi:10.1038/nature15393 is OK
- 10.1126/sciadv.aaw9206 is OK
- 10.1038/nrg2361 is OK
- 10.1186/s13040-017-0139-3 is OK
- 10.1038/s41576-018-0018-x is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon on 13 Oct 2020

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon on 13 Oct 2020

@dwinter @raivivek - I just wanted to check in to see how the reviews are going. As you complete items from the checklist, please check them off. And please be sure to reach out to me if you have questions or concerns. I'm here to assist you. Thanks again!

fboehm on 23 Oct 2020

👍1

@fboehm I'm unable to edit the checklist — likely because my invitation expired before I could accept it. Would you be able to look into it? Thanks!

raivivek on 23 Oct 2020

@fboehm I am still unable to edit the checklist. I noticed that something similar happened in another thread and whedon was used with the following command: @whedon re-invite @libertyh as reviewer — perhaps this could help? Sorry for the trouble.

raivivek on 24 Oct 2020

@whedon re-invite @raivivek as reviewer

fboehm on 25 Oct 2020

OK, the reviewer has been re-invited.

@raivivek please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations

whedon on 25 Oct 2020

Sorry about that, @raivivek. I was away from email for much of the weekend.

Thanks again for your work on this! Please let me know how I can assist you.

fboehm on 25 Oct 2020

Thanks @fboehm! No worries at all. It's working as expected now. On the same note, I have completed my first-round of reviews and have a few revisions for @greggj2016 to consider.

Overall, I think the software paper and documentation is well done. The code for the software can use some refactoring but it doesn't prohibit use in its current form. I have highlighted some of those specific suggestions below.

General checks

[ ] Keep only one LICENSE file (LICENSE with the copyright holder and year information within that file)
[ ] Please create a new GitHub release with the latest code

Functionality

[ ] Create an installation package (should be available to install through pip) or at the very least add a requirements.txt file with dependencies and minimum versions of the packages required
[ ] Community guidelines - CONTRIBUTING.md is missing
[ ] Provide supplementary details on the benchmark comparison made in the paper and the code to reproduce that in the manuscript

Documentation & Software paper

Both are adequate and I've updated corresponding items in the reviewer checklist

Code

[ ] Specify Python version support — Python 3 only or works across Python versions? What specific versions of Python supported, if any
[ ] Please add package version information in the tool. For example, regens.py --version should tell the current version
[ ] Code/additional files for the manuscript is currently bundled in the software repository. I'd recommend separating those. Creating an installation package should help (please see the end of this review)
[ ] Python code should be run through a code formatter such as black to remove redundancies, code duplication, and other issues
[ ] Unit testing module provides no feedback on what tests it's running and their status

I strongly recommend creating an installation package such that REGENS can be installed through pip (even if the authors do not publish it on PyPI). Currently, the core of it is in two files regens.py and regens_library.py, and it creates opportunities for the tool to break if the library file is unknowingly moved by the user. This will also allow one to separate necessary data for the software from the manuscript code/docs contained in the repository using package_data or package_dir directives
- More specifically, this would involve something like:
  - Testing related code to test/
  - References and other supplementary data to data/
  - Manuscript and other related docs to docs/ or manuscript/

Also, @fboehm Let me know if you'd like me to open a new issue in the target repository with these suggestions. I'm still learning my way around JOSS reviews, so please let me know if I've made mistakes! :-) Thanks.

raivivek on 26 Oct 2020

👍1

@raivivek - Thanks so much for your hard work on this! In future reviews, you can feel free to either list concerns in the review (like you've done here) and/or to open issues in the package submission repository. What you've done here is perfectly sufficient. Should you wish to open issues in the package repository, please be sure to link to them in the review.

I expect that the authors will appreciate your formulation of the edits as checklists. Thank you for being so thorough!

fboehm on 26 Oct 2020

@greggj2016 - please reply to each of the comments from @raivivek as you revise the package. Please use this review thread to discuss and clarify any points that need elaboration. Thanks again!

fboehm on 26 Oct 2020

@dwinter @raivivek - I just wanted to check in to see how the reviews are going. As you complete items from the checklist, please check them off. And please be sure to reach out to me if you have questions or concerns. I'm here to assist you. Thanks again!

Hi @fboehm, apologies for being out of the loop. We've had a bit of a difficult run with a few illnesses in our family. Nothing major, and all back to normal now. I have made a start on reviewing the package, and should be able to tick some boxes and provide some comments later in the week.

dwinter on 27 Oct 2020

@dwinter - No problem at all. I'm relieved that things are back to normal now. Please let me know if you encounter any difficulties or have any questions. Thanks for your work on this!!

fboehm on 27 Oct 2020

@raivivek - Thank you for your thorough review! We will work on incorporating your suggestions. Could you please elaborate on your comment “Unit testing module provides no feedback on what tests it’s running and their status”? I believe the tests print statements about which tests run and whether they pass.

greggj2016 on 27 Oct 2020

@greggj2016 Great that you brought this up. So the way I tested this is that I simply executed the file regens_testers.py. This gave no output and no feedback. Because there are no instructions as to how to run the tests, I presumed this would be the way to go. So I'd say as a first-step, you could revise the README to include unit-testing instructions. I'd also like to highlight that this would be easier and more can be done in more integrated fashion if you were to create a Python package. Let me know if I've missed anything.

raivivek on 27 Oct 2020

Sorry for the confusion. The file regens_testers.py only contains the testing functions. It does not implement them, which is why you received no output. Each test is its own folder and should be run after changing your working directory to that folder.

The folders named correctness_testing_ACB contains, correctness_testing_GBR, runtime_testing, and unit_testing_files each contain a README.md file explaining what to do. Testing output also exists in those folders for you to examine if you'd prefer. Keep in mind that, for each shell script containing "source activate PyTriadsim", you will want to replace that line with a line that activates your local environment".

Please let me know if this helps and if you have any further questions!

One more thing: If you would rather not run the shell script from the testing folder, you can also just paste the command in the shell script into your command line. If you do this, your working directory should be in the regens folder, not the unit testing folder. For runtime testing, if you want to run the commands without the shell script, you will have to run the commands in regens1.sh through regens10.sh and triadsim1.sh through triadsim10.sh directly.

greggj2016 on 27 Oct 2020

Hi @greggj2016 — thanks for the explanation. It's great to know that there are amazing tests bundled with REGENS. If possible, I'd encourage you to create a wrapper script that runs all of these tests for the end-user and reports results in one output. For example, the output exit status of this script can be used to ascertain if the software is working as expected or not in pipelines. Also, you may wish to avoid any kind of hardcoded values such as source activate.. unless your install script creates such an environment. Thanks! Looking forward to all the changes.

raivivek on 29 Oct 2020

Hi @greggj2016 and @raivivek,

I am working my way through this review and will have more detailed comments soon. Regens looks like a useful and well-written piece of software, and as far as I can tell it works as it says it does. I think think the major room for improvement will be in software engineering approaches that will make it easier for others ti work with or contribute to regens (and for you to maintain in ti future!).To that end, I wanted to reiterate a few points @raivivek has made.

It is not clear to someone coming to the repo for the first time what test are available and how they might be run. The unitests don't use a testing module, rather they run a script that prints the resuls of each test to screen. Ideally, the existing tests could be moved to an automated testing environment (unittest, pytest.nose etc). At the least, it will be important to document the correct ways to run the tets and provide a high-level script that calls the tests without bouncing around between subdirectories.
Ideally, the code should be provided as an installation package (https://packaging.python.org/tutorials/packaging-projects/), which wuold make it easier to install across platforms and handle requirements. If not, then there should certainly be a requirements.txt file to allow users to easily install the required packages via pip or conda.

Will provide a more detailed review next week.

dwinter on 30 Oct 2020

👍2

It turns out my more complete review is not so different from the comments provide above. happy to continute discussions here or on the regens repo.

Summary

regens is a well-written piece of software that will help researchers simulate datasets for GWAS. This includes realistic genotypes (including the LD structure of a given population) and phenotypic data (matching a given inheritance pattern and/or epistatic structure). The manuscript clearly states the need for such programs in the development of GWAS algorithms. Though my own research is not directly related to GWAS (so my insight might not be directly relevant) it does seem that regens will prove helpful for researchers producing simulated datasets to compare methods, or addressing the feasibility of GWAS in novel populations (at least where the LD structure is known). So I think it meets the JOSS requirements of "Substantial scholarly effort".

Overall, the program is well-written. The core functions are all included in a single library file. The functions are well-documented and appropriately modular. High-level documentation is given via the README for the repository and an examples directory that includes scripts to run particular analyses and the required input files.

The package contains a number of tests. These are not implemented in a standard unit testing environment. Rather tests are spread across subdirectories, and partly written in the library functions and partly in another file. Although this is an unusual arrangement, the tests do seem to cover a good portion of the functionality and test of the package for the correctness of results from particular inputs.

The accompanying paper clearly states the need for the paper. The cited paper by Shi et al (2018) suggests the approach taken here simulates realistic data and the 'Differentiating attributes' section clearly compares this package with others that perform similar tasks.

Major suggestions

Although my assessment is positive overall, I think the code could be improved by adopting software engineering approaches that make is easy for the current authors and future contributors to further develop the code and other users to incoprorate regens in their own workflows. I have started issues at the regens repo, which we can hopefully use to focus discussion on these points:

As noted in EpistasisLab/regens#11, installation instructions could be improved a little. Ideally the code could be provided as an installation package.

As noted EpistasisLab/regens#12, the tests provided with code are spread around a number of directories and it is not clear to someone new to the code exactly how they should be run. The key purpose of tests is to make sure future changes to the code don't introduce errors in output, so making the test-running as frictionless as possible is important.

Minor comments

It is a bit difficult to asses the `contribution and authorship' checkbox on the review, as it appears the code was only moved under version control recently (meaning commits only reflect a few weeks of work on the project). I am sure @greggj2016 was the major contributor to this code prior to it coming under version control, but it might be useful to have this confirmed.
Is there a reason to have a default value for the '--in' argument 'empty', rather than making it a required argument without a default? (ArgParse has the nice property of printing out a usage if a required argument is not included, which can be a helpful error message).
In the Input subsection of the README, it ways in recombination map is given as 'a folder with one gzipped tab separated dataframe per chromosome. Each dataframe contains genomic positions of intervals' boundaries and the intervals' recombination rates.'. I think the files are actually recombination maps with the pysical (bp) v genetic (cM) positions of sites? And each interval is defined by a single position. Are these the centres or right-edges of genomic intervals? It would be useful to do two things (also listed in EpistasisLab/regens#13)
- Make it clear that relevant files are provided for a lot of human populations (so many users needn't worry about this)
- Clearly define the required format in case anyone has non-human data they would like to use in this way
I have also opened an issue for the missing contributing.md file (EpistasisLab/regens#14)

dwinter on 2 Nov 2020

👍1

David Winters, thank you for your detailed review! I'll start addressing both of your suggestions tomorrow.

Just a note about issue EpistasisLab/regens#13: your first comment is addressed under the README file's "Input" section, and your second comment is addressed under the "Simulate genotype data with custom recombination rate dataframes" section. It would help me if you could clarify how I might make those sections more visible or simpler to understand.

greggj2016 on 2 Nov 2020

A'huh. This is partly an issue of a tired reader and partly the fact the README is quite long!

I think it's always good to write teh README in the 'inverted pyramid' style with the most important information at the top. As most readers won't need to know about this format, and the current description confused at least one reader it might be best to have the first section to say words to the effect of
(a) regens requires a recombination map
(b) maps are provided for many human populations
(c) custom maps can be provided, the format required for these detalied in SECTION XXX below"

dwinter on 2 Nov 2020

Another note regarding your minor comment:

First, and this is important to understand the recombination rate map file, centimorgans are _not_ true distances. Not many sources seem to explain it well. The best definition I came across was "One cM is equal to a 1% chance that a marker at one genetic locus will be separated from a marker at another locus [within the interval being examined] due to crossing over, (https://www.medicinenet.com/script/main/art.asp?articlekey=2665). However, this is still misleading. A better definition is that 1 centimorgan is equivalent to saying that the expected number of recombination events in the observed segment is 0.01. This is an important distinction when considering multiple segments: If you have two non-overlapping segments that are 1cM each, then the recombination rate of their union is (1cM + 1cM = 2cM), which corresponds to additivity of expected values, Also note that a (rather long) genomic segment can be more than 100cM, implying an expected value of more than 1 crossover event, which is no longer a probability. You can see how this paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC52322/pdf/pnas01067-0025.pdf) uses it at the bottom of table 4, and how this paper (https://www.nature.com/articles/nature09525) uses it at the bottom of table 1 (i.e. they both add the cM measurements for all chromosomes).

After considering the first point, the second point is that the cM column contains _cumulative_ recombination rates, which means that the ith cM value is the recombination rate of the interval in between the first bp position and the ith bp position. Additionally, genomic bp positions are interval left boundaries, which is why the "cM" element at the first position is 0. The cumulative recombination rate is necessarily 0 at the very first position because an interval cannot be considered without two positions.

Let me know If I can clarify this point further, It has been suggested that most readers won't care about these details, but I can include them if you'd like (technically I do mention them, but I'm a little bit brief).

greggj2016 on 2 Nov 2020

Also, I think I understand and agree with the majority of your comments. Note though that this is my first serious coding project, so I was wondering if you could clarify what you meant when you said "I think the code could be improved by adopting software engineering approaches that make is easy for the current authors and future contributors to further develop the code and other users to incoprorate regens in their own workflows."

What does that mean specifically? Also, I'm wondering if pip is necessarily the right tool because regens reads plink files as input and writes plink files as output. There is no standard output for regens to pass directly to another program, and pip seems to be for installing functions used by the python environment, so i don't understand what purpose this serves.

I could technically give an option to make the simulated genomes standard output, but simulating all chromosomes in the python environment would require roughly 10-20 fold more RAM then regen uses to write each chromosome into its own plink fileset one at a time. I don't think many people would want to use that much ram, though it has still been suggested that I might do this at some point. Still, it might require more work than expected at a first glance to implement efficiently, so I'd like to put that on the back-burner.

greggj2016 on 2 Nov 2020

I can also state what you mentioned above more clearly in the README file.

greggj2016 on 2 Nov 2020

I am the funder and supervisor of the project. Please accept this comment as confirmation that Mr. John Gregg, under my supervision, conceived of the project, planned the project, performed all the programming for the project, performed all the testing and evaluation of the code, wrote the JOSS paper, and created the Github project.

Jason H. Moore, PhD, FACMI
Edward Rose Professor of Informatics
Director, Institute for Biomedical Informatics
Director, Division of Informatics,
Department of Biostatistics, Epidemiology, & Informatics
Senior Associate Dean for Informatics
The Perelman School of Medicine
University of Pennsylvania

moorejh28 on 2 Nov 2020

It is a bit difficult to assess the `contribution and authorship' checkbox on the review, as it appears the code was only moved under version control recently (meaning commits only reflect a few weeks of work on the project). I am sure @greggj2016 was the major contributor to this code prior to it coming under version control, but it might be useful to have this confirmed.

Hi @dwinter I'm also confirming that @greggj2016 is the main author of the project.

trang1618 on 2 Nov 2020

Hi @greggj2016,

Also, I think I understand and agree with the majority of your comments. Note though that this is my first serious coding project, so I was wondering if you could clarify what you meant when you said "I think the code could be improved by adopting software engineering approaches that make is easy for the current authors and future contributors to further develop the code and other users to incoprorate regens in their own workflows."

What does that mean specifically...

This was really a preface dor the two major points I make in my review. I was trying to make it clear that paractices (automated unit testing and packaging) we have both suggested are helpful for development, and not just another hurdle that we are trying to place in your way.

I _think_ that suggesting these approaches is in line with the JOSS guidelines and ethos. Having said that, I've not reviwed for JOSS before. With two reviews in, it might be a good time to get some feedback on the next steps from our editor @fboehm ?

dwinter on 2 Nov 2020

Also, I'm wondering if pip is necessarily the right tool because regens reads plink files as input and writes plink files as output. There is no standard output for regens to pass directly to another program, and pip seems to be for installing functions used by the python environment, so i don't understand what purpose this serves

I don't think you should bundle a plink release with regens, much less implement these steps in python. The idea would be to include all of the python code in a pip package (no necessarily hosted on pip) that handles installing all of the other package/modules that are required. Even if you don't expect others to use your library code, a package will make it was to install the regens.py script into a users $PATH.

Happy to continue this discussion, though it might be best to concentrate it under the EpistasisLab/regens#11?

dwinter on 2 Nov 2020

Unlike many journals, our review process at JOSS is highly iterative. It's best, I think, when there's a great deal of dialogue between reviewers and authors, like you all have done above.

fboehm on 3 Nov 2020

👍4

Hi @dwinter, @raivivek and @greggj2016 - I just wanted to check in to see how the review is going. Is there anything that I can help with at this time?

Thanks again for your work on this!

fboehm on 13 Nov 2020

Yes, there has been a bit of a delay, which I apologize for. It has taken a while to figure out how to properly implement automated installation because this is my first serious project. My current strategy is to upload the code portion of my algorithm to pypi and the recombination rate files to figshare.com. I'll write an installation function that downloads the recombination rate files from figshare.com the first time that somebody calls regens.py. Other project have also eaten into my time. I hope to be done with this by Monday, but I'll give you guys a concrete update by then regardless.

greggj2016 on 13 Nov 2020

👍3

Full disclosure: My laptop that I started making the automated installation on broke down recently. I had a "new" laptop for a while, but I wanted to use the old one for as long as possible, and it stopped working just recently. My new laptop is apparently missing BLAS/LAPACK libraries that my old laptop already had, and when I try to get around that using anaconda, whatever compiler it uses is also missing some piece. Note that this doesn't change the existing project code at all: just my attempt at automated installation. It will take a little longer than expected, but it'll get done eventually.

greggj2016 on 14 Nov 2020

👍1

Hi @greggj2016 That sounds like a plan. Feel free to sketch out your proposed solution before you pour lots of time into coding it up.

dwinter on 15 Nov 2020

👍1

Just want to say so far so good. I'm gonna first make the user manually install miniconda and a specific version of numpy. Then my current setup.py and _init_.py files are configured so that, when I type "python setup.py install", it automatically installs regens in a way that allows one to run it from any directory. The recombination map files have to be in your working directory, and if they're not, then regens.py downloads them to your working directory automatically. It turns out that it's important that people use python version 3.7.x and an earlier version of numpy specifically. Part of the issue I was having turned out to be from an unstable new version of numpy, and the older version was apparently incompatible with python 3.8.x.

I still gotta figure out the best way to upload regens to pypi (my current repository is too large to upload to pypi) , and then it should be just small changes from there. Barring some new unexpected problem, I expect to be done with this soon.

greggj2016 on 16 Nov 2020

Question: To ensure that pypi is connected to my regens repository given pypi's 60MB size limit, I intend to move the majority of non-python file content to figshare.com. There will be new functions in regens.py that automatically download the relevant content to your working directory on first use. Does that sound reasonable?

greggj2016 on 16 Nov 2020

@greggj2016 Looks like good progress. On your question regarding data/recombination files, I think the figshare way sounds reasonable. However, there are other options as well in case you are interested. First option is that you can host the data on GitHub itself and download tarballs/zip from there. Second option is that you can request package size limit on PyPI to be increased so that you can bundle the data with the package itself. The benefit of second option is that it'll reduce friction in terms of writing additional code to download data and keeping the versioning consistent. You can request the package size limit change here — https://github.com/pypa/packaging-problems. For example, see here or here.

raivivek on 17 Nov 2020

👍1

For right now, the larger files have been uploaded and annotated in figshare. It's actually better for me because, when regens is called as a package from the installation directory, the code I wrote still references many of the relevant files from the working directory. It might be a little bit nicer if all files are downloaded to and accessible from the installation directory, but I also feel like that can be an improvement for later.

Right now, the automated installation works, at least for me. First install python version 3.7.x with numpy version 1.16.2 manually (the latest versions have some issues). Preferably do so in anaconda, which supposedly has LAPACK and BLAS pre-installed. Then, do pip install regens==0.0.0, and then try running the following command

python -m regens \
--in input_files/ACB --out examples/ACB_simulated \
--simulate_nbreakpoints 4 --simulate_nsamples 10000 \
--phenotype continuous --mean_phenotype 5.75 \
--population_code ACB --human_genome_version hg19 --noise 0.5 \
--causal_SNP_IDs_path input_files/causal_SNP_IDs2.txt \
--major_minor_assignments_path input_files/major_minor_assignments2.txt \
--SNP_phenotype_map_path input_files/SNP_phenotype_map2.txt \
--betas_path input_files/betas.txt

You should be able to run this from any directory, but for convenience, first cd into an empty folder. If you have windows instead of linux, replace "\" with "^". You will need an internet connection

greggj2016 on 18 Nov 2020

This means I still gotta correct the syntax of my examples in the readme and some other minor things

greggj2016 on 18 Nov 2020

Okay, I think it's almost ready.

1) I would like to know if you think that the new installation instructions and description of the recombination rate files in the README are satisfactory.
2) I will write a detailed document about all of the tests and a shell script that runs the other testing shell scripts later tonight. In the future, I will probably centralize this with github's continuous integration feature, though this is not a priority for me right now.
3) I think there might be one typo in the paper to fix.

After accounting for these three things, I would like to know if you guys think this package is ready for publication.

greggj2016 on 20 Nov 2020

I also have another more specific question: Can I leave the efficiency testing of Triadsim out of the master unit testing script? The problem is that I don't want to assume anything about the number of cores that the user has, and Triadsim takes a really long time to run and a huge amount of RAM. Triadsim also has to be installed separately, so I would rather leave that particular component to the user who wants to test that.

greggj2016 on 20 Nov 2020

@greggj2016 Would be nice to know what are the time and RAM estimates for running that task. I think it's fine to skip that test by default and tell the user that you've skipped it because of so-and-so reasons. Then can force run it with a flag (ideally) if they want to.

raivivek on 20 Nov 2020

👍1

I'll mention that, though it is in the paper too.

"[Triadsim] takes an average CPU-time of 6.8 hours and an average peak RAM of 54.6 GB to simulate 10000 trios (20000 unrelated GWAS samples) with 4 breakpoints."

Regens is much faster, so I can include that by default.

greggj2016 on 20 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[PRE REVIEW]: xphyle: Extraordinarily simple file handling

whedon · 12Comments

[REVIEW]: GRUPO: Gauging Research University Publication Output

whedon · 10Comments

[REVIEW]: cartography: Create and Integrate Maps in your R Workflow

whedon · 12Comments

[PRE REVIEW]: Neuropsydia.py: A Python Module for Creating Experiments, Tasks and Questionnaires

whedon · 8Comments

[REVIEW]: Pyret: A Python package for analysis of neurophysiology data

whedon · 12Comments