Submitting author: @greggj2016 (John Gregg)
Repository: https://github.com/EpistasisLab/regens
Version: v0.1.0
Editor: @fboehm
Reviewer: @raivivek, @dwinter
Archive: Pending
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
Status badge code:
HTML: <a href="https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6"><img src="https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6/status.svg)](https://joss.theoj.org/papers/5d8ab9adbeb2cf591fcfab172ccb3dc6)
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
@raivivek & @dwinter, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist, please make sure you're logged in to your GitHub account and that you have accepted the reviewer invitation at https://github.com/openjournals/joss-reviews/invitations.
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fboehm know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @raivivek, @dwinter it looks like you're currently assigned to review this paper :tada:.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿
To fix this, do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1186/s12859-017-2004-2 is OK
- 10.1534/genetics.109.108431 is OK
- 10.1038/ejhg.2017.50 is OK
- 10.1371/journal.pgen.0020142 is OK
- 10.1186/1756-0381-5-16 is OK
- 10.1007/978-3-540-78757-0_3 is OK
- 10.1038/nature15393 is OK
- 10.1126/sciadv.aaw9206 is OK
- 10.1038/nrg2361 is OK
- 10.1186/s13040-017-0139-3 is OK
- 10.1038/s41576-018-0018-x is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
@dwinter @raivivek - I just wanted to check in to see how the reviews are going. As you complete items from the checklist, please check them off. And please be sure to reach out to me if you have questions or concerns. I'm here to assist you. Thanks again!
@fboehm I'm unable to edit the checklist — likely because my invitation expired before I could accept it. Would you be able to look into it? Thanks!
@fboehm I am still unable to edit the checklist. I noticed that something similar happened in another thread, where whedon was used with the following command: `@whedon re-invite @libertyh as reviewer`. Perhaps this could help? Sorry for the trouble.
@whedon re-invite @raivivek as reviewer
OK, the reviewer has been re-invited.
@raivivek please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations
Sorry about that, @raivivek. I was away from email for much of the weekend.
Thanks again for your work on this! Please let me know how I can assist you.
Thanks @fboehm! No worries at all. It's working as expected now. On the same note, I have completed my first round of review and have a few revisions for @greggj2016 to consider.
Overall, I think the software paper and documentation are well done. The code could use some refactoring, but that doesn't prohibit use in its current form. I have highlighted some specific suggestions below.
**General checks**

**Functionality**
- Make the software installable via `pip`, or at the very least add a `requirements.txt` file with dependencies and minimum versions of the packages required.
- The unit testing module provides no feedback on what tests it's running and their status.
- A `CONTRIBUTING.md` is missing.

**Documentation & Software paper**

**Code**
- `regens.py --version` should tell the current version.
- Package the code so it can be installed with `pip` (even if the authors do not publish it on PyPI). Currently, the core of it is in two files, `regens.py` and `regens_library.py`, and that creates opportunities for the tool to break if the library file is unknowingly moved by the user. This will also allow one to separate necessary data for the software from the manuscript code/docs contained in the repository using `package_data` or `package_dir` directives.
- Reorganize the repository into directories such as `test/`, `data/`, `docs/`, or `manuscript/`.
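On the `requirements.txt` point, even a minimal pinned file would help; for example (the version pins below are illustrative, not tested minimums):

```text
# requirements.txt (illustrative pins, not verified minimums)
numpy>=1.16
pandas>=0.25
```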
Also, @fboehm, let me know if you'd like me to open a new issue in the target repository with these suggestions. I'm still learning my way around JOSS reviews, so please let me know if I've made mistakes! :-) Thanks.
@raivivek - Thanks so much for your hard work on this! In future reviews, you can feel free to either list concerns in the review (like you've done here) and/or to open issues in the package submission repository. What you've done here is perfectly sufficient. Should you wish to open issues in the package repository, please be sure to link to them in the review.
I expect that the authors will appreciate your formulation of the edits as checklists. Thank you for being so thorough!
@greggj2016 - please reply to each of the comments from @raivivek as you revise the package. Please use this review thread to discuss and clarify any points that need elaboration. Thanks again!
@dwinter @raivivek - I just wanted to check in to see how the reviews are going. As you complete items from the checklist, please check them off. And please be sure to reach out to me if you have questions or concerns. I'm here to assist you. Thanks again!
Hi @fboehm, apologies for being out of the loop. We've had a bit of a difficult run with a few illnesses in our family. Nothing major, and all back to normal now. I have made a start on reviewing the package, and should be able to tick some boxes and provide some comments later in the week.
@dwinter - No problem at all. I'm relieved that things are back to normal now. Please let me know if you encounter any difficulties or have any questions. Thanks for your work on this!!
@raivivek - Thank you for your thorough review! We will work on incorporating your suggestions. Could you please elaborate on your comment “Unit testing module provides no feedback on what tests it’s running and their status”? I believe the tests print statements about which tests run and whether they pass.
@greggj2016 Great that you brought this up. The way I tested this was to simply execute the file `regens_testers.py`. This gave no output and no feedback. Because there are no instructions as to how to run the tests, I presumed this would be the way to go. So as a first step, I'd suggest revising the README to include unit-testing instructions. I'd also like to highlight that this would be easier, and could be done in a more integrated fashion, if you were to create a Python package. Let me know if I've missed anything.
Sorry for the confusion. The file regens_testers.py only contains the testing functions; it does not execute them, which is why you received no output. Each test has its own folder and should be run after changing your working directory to that folder.
The folders named correctness_testing_ACB, correctness_testing_GBR, runtime_testing, and unit_testing_files each contain a README.md file explaining what to do. Testing output also exists in those folders for you to examine if you'd prefer. Keep in mind that, for each shell script containing "source activate PyTriadsim", you will want to replace that line with a line that activates your local environment.
Please let me know if this helps and if you have any further questions!
One more thing: if you would rather not run the shell script from the testing folder, you can also just paste the command from the shell script into your command line. If you do this, your working directory should be the regens folder, not the unit testing folder. For runtime testing, if you want to run the commands without the shell script, you will have to run the commands in regens1.sh through regens10.sh and triadsim1.sh through triadsim10.sh directly.
Hi @greggj2016 — thanks for the explanation. It's great to know that there are thorough tests bundled with REGENS. If possible, I'd encourage you to create a wrapper script that runs all of these tests for the end user and reports results in one output. For example, the exit status of this script can be used to ascertain whether the software is working as expected in pipelines. Also, you may wish to avoid hardcoded values such as `source activate ...` unless your install script creates such an environment. Thanks! Looking forward to all the changes.
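Something along these lines would do, for instance (a rough sketch; the folder names are from the repo, but the per-folder driver script name is an assumption):

```python
#!/usr/bin/env python
"""run_all_tests.py: sketch of a single test entry point."""
import subprocess
import sys

# Test-suite folders described in the repository; the run_tests.sh
# driver name inside each folder is an assumption for this sketch.
SUITES = ["correctness_testing_ACB", "correctness_testing_GBR",
          "unit_testing_files"]

failed = []
for suite in SUITES:
    # Run each suite's driver from its own folder so relative paths
    # keep working, and record any nonzero exit status.
    if subprocess.call(["bash", "run_tests.sh"], cwd=suite) != 0:
        failed.append(suite)

print("Failed suites:", ", ".join(failed) if failed else "none")
sys.exit(1 if failed else 0)
```

A pipeline can then rely on the exit status alone to tell whether the software is behaving as expected.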
Hi @greggj2016 and @raivivek,
I am working my way through this review and will have more detailed comments soon. Regens looks like a useful and well-written piece of software, and as far as I can tell it works as it says it does. I think the major room for improvement will be in software engineering approaches that will make it easier for others to work with or contribute to regens (and for you to maintain in the future!). To that end, I wanted to reiterate a few points @raivivek has made.
It is not clear to someone coming to the repo for the first time what tests are available and how they might be run. The unit tests don't use a testing framework; rather, they run a script that prints the results of each test to the screen. Ideally, the existing tests could be moved to an automated testing framework (`unittest`, `pytest`, `nose`, etc.). At the least, it will be important to document the correct ways to run the tests and provide a high-level script that calls the tests without bouncing around between subdirectories.
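As a sketch of how the existing scripts could be surfaced through `pytest` (the driver-script path below is an assumption, not a file I know to exist in the repo):

```python
# test_regens.py: run with `pytest` from the repository root.
import subprocess

def test_unit_suite_passes():
    # Wrap the existing script so pytest reports pass/fail from its
    # exit status instead of requiring a human to read printed output.
    result = subprocess.run(["bash", "unit_testing_files/run_tests.sh"],
                            capture_output=True, text=True)
    assert result.returncode == 0, result.stdout + result.stderr
```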
Ideally, the code should be provided as an installable package (https://packaging.python.org/tutorials/packaging-projects/), which would make it easier to install across platforms and handle requirements. If not, then there should certainly be a `requirements.txt` file to allow users to easily install the required packages via pip or conda.
Will provide a more detailed review next week.
It turns out my more complete review is not so different from the comments provided above. Happy to continue discussions here or on the regens repo.
`regens` is a well-written piece of software that will help researchers simulate datasets for GWAS. This includes realistic genotypes (including the LD structure of a given population) and phenotypic data (matching a given inheritance pattern and/or epistatic structure). The manuscript clearly states the need for such programs in the development of GWAS algorithms. Though my own research is not directly related to GWAS (so my insight might not be directly relevant), it does seem that `regens` will prove helpful for researchers producing simulated datasets to compare methods, or addressing the feasibility of GWAS in novel populations (at least where the LD structure is known). So I think it meets the JOSS requirement of "Substantial scholarly effort".
Overall, the program is well-written. The core functions are all included in a single library file. The functions are well-documented and appropriately modular. High-level documentation is given via the README for the repository and an `examples` directory that includes scripts to run particular analyses and the required input files.
The package contains a number of tests. These are not implemented in a standard unit-testing framework; rather, tests are spread across subdirectories, partly written in the library functions and partly in another file. Although this is an unusual arrangement, the tests do seem to cover a good portion of the functionality and test the package for correctness of results from particular inputs.
The accompanying paper clearly states the need for the software. The cited paper by Shi et al. (2018) suggests the approach taken here simulates realistic data, and the 'Differentiating attributes' section clearly compares this package with others that perform similar tasks.
Although my assessment is positive overall, I think the code could be improved by adopting software engineering approaches that make it easy for the current authors and future contributors to further develop the code, and for other users to incorporate `regens` in their own workflows. I have started issues at the `regens` repo, which we can hopefully use to focus discussion on these points:
As noted in EpistasisLab/regens#11, installation instructions could be improved a little. Ideally, the code could be provided as an installable package.
As noted in EpistasisLab/regens#12, the tests provided with the code are spread around a number of directories, and it is not clear to someone new to the code exactly how they should be run. The key purpose of tests is to make sure future changes to the code don't introduce errors in output, so making test-running as frictionless as possible is important.
It is a bit difficult to assess the "contribution and authorship" checkbox on the review, as it appears the code was only moved under version control recently (meaning commits only reflect a few weeks of work on the project). I am sure @greggj2016 was the major contributor to this code prior to it coming under version control, but it might be useful to have this confirmed.
Is there a reason to have a default value of 'empty' for the '--in' argument, rather than making it a required argument without a default? (`argparse` has the nice property of printing out a usage message if a required argument is not included, which can be a helpful error message.)
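For illustration, a minimal sketch (the `dest` name is mine, not regens' actual code):

```python
import argparse

parser = argparse.ArgumentParser(prog="regens.py")
# required=True makes argparse print a usage message and exit with an
# error when --in is omitted, instead of proceeding with a sentinel
# default like 'empty'. A dest is needed because 'in' is a keyword.
parser.add_argument("--in", dest="input_prefix", required=True,
                    help="prefix of the input plink fileset")
args = parser.parse_args()
print(args.input_prefix)
```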
In the `Input` subsection of the README, it says the recombination map is given as 'a folder with one gzipped tab separated dataframe per chromosome. Each dataframe contains genomic positions of intervals' boundaries and the intervals' recombination rates.' I think the files are actually recombination maps with the physical (bp) vs genetic (cM) positions of sites? And each interval is defined by a single position. Are these the centres or right-edges of genomic intervals? It would be useful to do two things (also listed in EpistasisLab/regens#13)
I have also opened an issue for the missing CONTRIBUTING.md file (EpistasisLab/regens#14)
David Winter, thank you for your detailed review! I'll start addressing both of your suggestions tomorrow.
Just a note about issue EpistasisLab/regens#13: your first comment is addressed under the README file's "Input" section, and your second comment is addressed under the "Simulate genotype data with custom recombination rate dataframes" section. It would help me if you could clarify how I might make those sections more visible or simpler to understand.
Uh-huh. This is partly an issue of a tired reader and partly the fact that the README is quite long!
I think it's always good to write the README in the 'inverted pyramid' style, with the most important information at the top. As most readers won't need to know about this format, and the current description confused at least one reader, it might be best to have the first section say words to the effect of:
(a) regens requires a recombination map,
(b) maps are provided for many human populations, and
(c) custom maps can be provided; the format required for these is detailed in SECTION XXX below.
Another note regarding your minor comment:
First, and this is important for understanding the recombination rate map file: centimorgans are _not_ true distances. Not many sources seem to explain this well. The best definition I came across was "One cM is equal to a 1% chance that a marker at one genetic locus will be separated from a marker at another locus [within the interval being examined] due to crossing over" (https://www.medicinenet.com/script/main/art.asp?articlekey=2665). However, this is still misleading. A better definition is that 1 centimorgan is equivalent to saying that the expected number of recombination events in the observed segment is 0.01. This is an important distinction when considering multiple segments: if you have two non-overlapping segments that are 1 cM each, then the recombination rate of their union is 1 cM + 1 cM = 2 cM, which corresponds to additivity of expected values. Also note that a (rather long) genomic segment can be more than 100 cM, implying an expected value of more than 1 crossover event, which is no longer a probability. You can see how this paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC52322/pdf/pnas01067-0025.pdf) uses it at the bottom of table 4, and how this paper (https://www.nature.com/articles/nature09525) uses it at the bottom of table 1 (i.e., they both add the cM measurements for all chromosomes).
After considering the first point, the second point is that the cM column contains _cumulative_ recombination rates, which means that the ith cM value is the recombination rate of the interval in between the first bp position and the ith bp position. Additionally, genomic bp positions are interval left boundaries, which is why the "cM" element at the first position is 0. The cumulative recombination rate is necessarily 0 at the very first position because an interval cannot be considered without two positions.
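To make the cumulative convention concrete, here is a small sketch (the column names are for illustration only, not necessarily the exact headers in the regens map files):

```python
import numpy as np
import pandas as pd

# Hypothetical map in the cumulative convention described above: each
# bp position is an interval's left boundary, and the ith cM value is
# the cumulative rate from the first position up to the ith position.
rmap = pd.DataFrame({"position_bp": [0, 10_000, 25_000, 60_000],
                     "cM": [0.0, 0.05, 0.12, 0.30]})

# Per-interval rates are differences of consecutive cumulative values,
# and they add back up: 0.05 + 0.07 + 0.18 = 0.30 cM in total.
interval_cM = np.diff(rmap["cM"].to_numpy())
print(interval_cM)  # [0.05 0.07 0.18]
```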
Let me know if I can clarify this point further. It has been suggested that most readers won't care about these details, but I can include them if you'd like (technically I do mention them, but only briefly).
Also, I think I understand and agree with the majority of your comments. Note, though, that this is my first serious coding project, so I was wondering if you could clarify what you meant when you said "I think the code could be improved by adopting software engineering approaches that make it easy for the current authors and future contributors to further develop the code, and for other users to incorporate regens in their own workflows."
What does that mean specifically? Also, I'm wondering if pip is necessarily the right tool, because regens reads plink files as input and writes plink files as output. There is no standard output for regens to pass directly to another program, and pip seems to be for installing functions used by the python environment, so I don't understand what purpose this serves.
I could technically add an option to make the simulated genomes standard output, but simulating all chromosomes in the python environment would require roughly 10-20 fold more RAM than regens uses to write each chromosome into its own plink fileset one at a time. I don't think many people would want to use that much RAM, though it has been suggested that I might do this at some point. Still, it might require more work than expected at first glance to implement efficiently, so I'd like to put that on the back-burner.
I can also state what you mentioned above more clearly in the README file.
I am the funder and supervisor of the project. Please accept this comment as confirmation that Mr. John Gregg, under my supervision, conceived of the project, planned the project, performed all the programming for the project, performed all the testing and evaluation of the code, wrote the JOSS paper, and created the GitHub project.
Jason H. Moore, PhD, FACMI
Edward Rose Professor of Informatics
Director, Institute for Biomedical Informatics
Director, Division of Informatics,
Department of Biostatistics, Epidemiology, & Informatics
Senior Associate Dean for Informatics
The Perelman School of Medicine
University of Pennsylvania
> It is a bit difficult to assess the "contribution and authorship" checkbox on the review, as it appears the code was only moved under version control recently (meaning commits only reflect a few weeks of work on the project). I am sure @greggj2016 was the major contributor to this code prior to it coming under version control, but it might be useful to have this confirmed.
Hi @dwinter, I'm also confirming that @greggj2016 is the main author of the project.
Hi @greggj2016,
> Also, I think I understand and agree with the majority of your comments. Note, though, that this is my first serious coding project, so I was wondering if you could clarify what you meant when you said "I think the code could be improved by adopting software engineering approaches that make it easy for the current authors and future contributors to further develop the code, and for other users to incorporate regens in their own workflows."
> What does that mean specifically...
This was really a preface for the two major points I make in my review. I was trying to make it clear that the practices (automated unit testing and packaging) we have both suggested are helpful for development, and not just another hurdle that we are trying to place in your way.
I _think_ that suggesting these approaches is in line with the JOSS guidelines and ethos. Having said that, I've not reviewed for JOSS before. With two reviews in, it might be a good time to get some feedback on the next steps from our editor @fboehm?
> Also, I'm wondering if pip is necessarily the right tool, because regens reads plink files as input and writes plink files as output. There is no standard output for regens to pass directly to another program, and pip seems to be for installing functions used by the python environment, so I don't understand what purpose this serves.
I don't think you should bundle a plink release with `regens`, much less implement these steps in python. The idea would be to include all of the python code in a pip-installable package (not necessarily hosted on PyPI) that handles installing all of the other packages/modules that are required. Even if you don't expect others to use your library code, a package will make it easy to install the `regens.py` script into a user's `$PATH`.
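For concreteness, a sketch of what the packaging could look like (the dependencies and the entry point here are illustrative, not regens' actual configuration):

```python
from setuptools import setup, find_packages

setup(
    name="regens",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["numpy>=1.16", "pandas"],  # illustrative pins
    entry_points={
        # console_scripts puts a `regens` command on the user's $PATH
        # that calls a main() function inside the package.
        "console_scripts": ["regens=regens.regens:main"],
    },
)
```

With something like this in place, `pip install .` from a clone would pull in the dependencies and install the command, without the user ever touching `regens_library.py` directly.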
Happy to continue this discussion, though it might be best to concentrate it under EpistasisLab/regens#11?
Thanks, @dwinter, @raivivek and @greggj2016 ! The discussions around the reviews seem to be highly constructive and helpful, I hope, to @greggj2016 and continued development of REGENS. The next steps, I think, are for @greggj2016 to make changes to the software to address the issues raised in the reviews. Ultimately, we need both reviewers to be satisfied with the submission to the point where they're comfortable checking all of the boxes in the above checklists.
Unlike many journals, our review process at JOSS is highly iterative. It's best, I think, when there's a great deal of dialogue between reviewers and authors, like you all have done above.
@greggj2016 - do you feel that you have sufficient feedback via reviews and Github issues to begin refining the software? When you feel that you've resolved the issues, please comment again on this thread so that @raivivek and @dwinter may examine the fixes. They'll then provide additional feedback, stating that they're satisfied or unsatisfied with the fixes. If they're unsatisfied, they'll also explain what remains to be done. Does that make sense to everyone? Thanks again for your work on this!!
Hi @dwinter, @raivivek and @greggj2016 - I just wanted to check in to see how the review is going. Is there anything that I can help with at this time?
Thanks again for your work on this!
Yes, there has been a bit of a delay, which I apologize for. It has taken a while to figure out how to properly implement automated installation, because this is my first serious project. My current strategy is to upload the code portion of my algorithm to PyPI and the recombination rate files to figshare.com. I'll write an installation function that downloads the recombination rate files from figshare.com the first time that somebody calls regens.py. Other projects have also eaten into my time. I hope to be done with this by Monday, but I'll give you guys a concrete update by then regardless.
Full disclosure: the laptop on which I started building the automated installation broke down recently. I had a "new" laptop for a while, but I wanted to use the old one for as long as possible, and it stopped working just recently. My new laptop is apparently missing BLAS/LAPACK libraries that my old laptop already had, and when I try to get around that using anaconda, whatever compiler it uses is also missing some piece. Note that this doesn't change the existing project code at all, just my attempt at automated installation. It will take a little longer than expected, but it'll get done eventually.
Hi @greggj2016 That sounds like a plan. Feel free to sketch out your proposed solution before you pour lots of time into coding it up.
Just want to say: so far, so good. I'm going to first have the user manually install miniconda and a specific version of numpy. My current setup.py and `__init__.py` files are then configured so that, when I type "python setup.py install", regens is automatically installed in a way that allows one to run it from any directory. The recombination map files have to be in your working directory, and if they're not, then regens.py downloads them to your working directory automatically. It turns out that it's important that people use python version 3.7.x and an earlier version of numpy specifically. Part of the issue I was having turned out to be from an unstable new version of numpy, and the older version was apparently incompatible with python 3.8.x.
I still need to figure out the best way to upload regens to PyPI (my current repository is too large to upload to PyPI), and then it should be just small changes from there. Barring some new unexpected problem, I expect to be done with this soon.
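As a sketch of the first-use download logic I have in mind (the figshare URL below is a placeholder, not the real link):

```python
import os
import tarfile
import urllib.request

# Placeholder URL; the real figshare file IDs would go here.
MAPS_URL = "https://ndownloader.figshare.com/files/<FILE_ID>"

def fetch_recomb_maps(dest_dir="hg19_recombination_maps"):
    """Download and unpack the recombination maps on first use only."""
    if os.path.isdir(dest_dir):
        return dest_dir  # already present, nothing to download
    archive = dest_dir + ".tar.gz"
    urllib.request.urlretrieve(MAPS_URL, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir)
    os.remove(archive)
    return dest_dir
```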
Question: to ensure that PyPI is connected to my regens repository given PyPI's 60 MB size limit, I intend to move the majority of non-python file content to figshare.com. There will be new functions in regens.py that automatically download the relevant content to your working directory on first use. Does that sound reasonable?
@greggj2016 Looks like good progress. On your question regarding the data/recombination files, I think the figshare route sounds reasonable. However, there are other options as well, in case you are interested. The first option is that you can host the data on GitHub itself and download tarballs/zips from there. The second option is that you can request that the package size limit on PyPI be increased, so that you can bundle the data with the package itself. The benefit of the second option is that it'll reduce friction in terms of writing additional code to download data and keeping the versioning consistent. You can request the package size limit change here: https://github.com/pypa/packaging-problems.
For right now, the larger files have been uploaded and annotated in figshare. It's actually better for me because, when regens is called as a package from the installation directory, the code I wrote still references many of the relevant files from the working directory. It might be a little bit nicer if all files are downloaded to and accessible from the installation directory, but I also feel like that can be an improvement for later.
Right now, the automated installation works, at least for me. First install python version 3.7.x with numpy version 1.16.2 manually (the latest versions have some issues). Preferably do so in anaconda, which supposedly has LAPACK and BLAS pre-installed. Then do `pip install regens==0.0.0`, and then try running the following command:
python -m regens \
--in input_files/ACB --out examples/ACB_simulated \
--simulate_nbreakpoints 4 --simulate_nsamples 10000 \
--phenotype continuous --mean_phenotype 5.75 \
--population_code ACB --human_genome_version hg19 --noise 0.5 \
--causal_SNP_IDs_path input_files/causal_SNP_IDs2.txt \
--major_minor_assignments_path input_files/major_minor_assignments2.txt \
--SNP_phenotype_map_path input_files/SNP_phenotype_map2.txt \
--betas_path input_files/betas.txt
You should be able to run this from any directory, but for convenience, first cd into an empty folder. If you have Windows instead of Linux, replace "\" with "^". You will need an internet connection.
This means I still need to correct the syntax of my examples in the README and some other minor things.
Okay, I think it's almost ready.
1) I would like to know if you think that the new installation instructions and description of the recombination rate files in the README are satisfactory.
2) I will write a detailed document about all of the tests and a shell script that runs the other testing shell scripts later tonight. In the future, I will probably centralize this with GitHub's continuous integration feature, though this is not a priority for me right now.
3) I think there might be one typo in the paper to fix.
After accounting for these three things, I would like to know if you guys think this package is ready for publication.
I also have another more specific question: Can I leave the efficiency testing of Triadsim out of the master unit testing script? The problem is that I don't want to assume anything about the number of cores that the user has, and Triadsim takes a really long time to run and a huge amount of RAM. Triadsim also has to be installed separately, so I would rather leave that particular component to the user who wants to test that.
@greggj2016 It would be nice to know the time and RAM estimates for running that task. I think it's fine to skip that test by default and tell the user that you've skipped it because of so-and-so reasons. They can then force-run it with a flag (ideally) if they want to.
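For instance (the flag name is just a suggestion; the script path is one of the ones you describe above):

```python
import argparse
import subprocess
import sys

parser = argparse.ArgumentParser(description="master test runner sketch")
parser.add_argument("--run-triadsim", action="store_true",
                    help="also run the expensive Triadsim benchmark")
args = parser.parse_args()

if args.run_triadsim:
    # Only attempted when the user explicitly opts in, since Triadsim
    # must be installed separately and is expensive to run.
    sys.exit(subprocess.call(["bash", "runtime_testing/triadsim1.sh"]))
print("Skipped Triadsim benchmark; pass --run-triadsim to include it.")
```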
I'll mention that, though it is in the paper too.
"[Triadsim] takes an average CPU-time of 6.8 hours and an average peak RAM of 54.6 GB to simulate 10000 trios (20000 unrelated GWAS samples) with 4 breakpoints."
Regens is much faster, so I can include that by default.