Joss-reviews: [PRE REVIEW]: Synthia: multi-dimensional synthetic data generation in Python

Created on 24 Oct 2020 · 37 Comments · Source: openjournals/joss-reviews

Submitting author: @dmey (D. Meyer)
Repository: https://github.com/dmey/synthia
Version: 1.0.0
Editor: @oliviaguest
Reviewer: Pending
Managing EiC: Kyle Niemeyer

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Author instructions

Thanks for submitting your paper to JOSS @dmey. Currently, there isn't a JOSS editor assigned to your paper.

The author's suggestion for the handling editor is @arfon.

@dmey if you have any suggestions for potential reviewers then please mention them here in this thread (without tagging them with an @). In addition, this list of people have already agreed to review for JOSS and may be suitable for this submission (please start at the bottom of the list).

Editor instructions

The JOSS submission bot @whedon is here to help you find and assign reviewers and start the main review. To find out what @whedon can do for you type:

@whedon commands
Python pre-review

All 37 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.12 s (354.6 files/s, 60453.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
SVG                              1              0              0           4607
Python                          20            316            361            839
Markdown                         7             89              0            115
Jupyter Notebook                 4              0            390             81
YAML                             3             11              5             71
CSS                              1              7              7             61
TeX                              1              5              0             47
reStructuredText                 3             38             66             41
INI                              1              0              0              2
HTML                             1              0              0              2
-------------------------------------------------------------------------------
SUM:                            42            466            829           5866
-------------------------------------------------------------------------------


Statistical information for the repository '40a53db89b90e75a2c9bfb3d' was
gathered on 2020/10/24.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Thomas Nagler                    2           137             34            7.47
dmey                             9          1766            353           92.53

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Thomas Nagler                95           69.3          0.0               28.42
dmey                       1421           80.5          0.0                9.85

PDF failed to compile for issue #2779 with the following error:

Can't find any papers to compile :-(

@whedon generate pdf from branch joss-paper

Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

@whedon query scope

Submission flagged for editorial review.

Hi @dmey, thanks for your submission to JOSS. Due to the relatively small size of your software package, the editorial board is going to take a closer look at whether it falls within our scope.

Hi @kyleniemeyer, many thanks for letting me know. In case this is of relevance to the board, this package has been used in two papers (currently in preparation) which I am planning to submit in the next few weeks. Furthermore, the tool is novel in its approach, well written, and likely to be cited by future machine learning (ML) groups.

@kyleniemeyer just as clarification to my previous message -- as I am going to upload the scientific papers that make use of/cite Synthia on arXiv in a couple of weeks while their peer-review takes place, I can update this thread with links to those respective papers. Originally, I thought that this was going to be discussed during review but I am more than happy to wait here if that will make it easier to show the novelty and contribution of this tool to the community.

I'm having a look at the paper regarding the scope query requested by @kyleniemeyer. Is it normal that the paper is extremely short? The GitHub PDF contains only "Summary" and "Acknowledgments" sections; it seems rather incomplete, and I wonder whether this is an unintentional mistake.

@VivianePons thanks for looking into this. My understanding is that the summary paper needs to be very short -- abstract-like -- as it is meant only as a summary of the motivation and purpose of the tool, and because the purpose of the review is to review the software rather than the paper, as is done in more traditional journals. I have checked again at https://joss.readthedocs.io/en/latest/submitting.html and it says that the summary paper should be between 250 and 1000 words, but I am more than happy to extend this, especially given that my first draft was much longer and I cut it down considerably at submission to make it more to the point.

Indeed, papers are rather short but they are still a bit more furnished. Look at our example paper: https://joss.readthedocs.io/en/latest/submitting.html#example-paper-and-bibliography

In particular, papers should contain a "Statement of need", which is missing in your case. You can also have other sections such as "Features" or "Examples".

You can browse through our recent publications to give you an idea.

@VivianePons many thanks for clarifying this, please allow me to make the necessary changes as advised.

@whedon check references from branch joss-paper

Attempting to check references... from custom branch joss-paper
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1201/b17116 is OK
- 10.1109/DSAA.2016.49 is OK

MISSING DOIs

- None

INVALID DOIs

- None

In particular, I would like to understand what your software adds specifically in terms of implementation. Considering the small amount of code, we might fear that it is mainly a python wrapper to some other tools like vinecopulib. Could you give us some information regarding this aspect?

Could you give us some information regarding this aspect?

@dmey - could you elaborate? This will help us make our editorial scope decision.

@arfon -- may I give you my response by early next week?

No problem!

@VivianePons apologies for the delay but I have had no time to look at this yet -- could I get back sometime in the next week? Thanks.

@whedon generate pdf from branch joss-paper

Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

@arfon @VivianePons @kyleniemeyer and @danielskatz many thanks for allowing me to get back to you this week. We have recently extended the documentation, added more examples, and reworded the paper to address what I think were your main concerns. In the last two releases we have also added a couple of new features, namely the handling of discrete and categorical data, which brings the number of lines of pure Python code to 1097 (please see cloc output below).

With regards to your individual questions --

In particular, I would like to understand what your software adds specifically in terms of implementation.

@VivianePons thanks for raising this -- looking at the paper and repository with fresh eyes, I can see how this was unclear. I have now made changes to the repository, paper and website and hope that the changes make the purpose clearer. With regards to your specific question, Synthia can currently be used to model univariate and multivariate data, parameterize marginals with empirical and parametric methods and apply manipulations such as stretching and uniformization (I have added a summary at https://dmey.github.io/synthia/features.html). For multivariate data we support three different types of methods: fPCA, parametric (Gaussian) copula, and vine copula models and provide a pure Python implementation for the former two and rely on vinecopulib for the latter. Recently we have also added the capability to handle discrete and categorical data when using vine copulas.
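
For readers unfamiliar with the parametric (Gaussian) copula approach mentioned above, the general idea can be sketched in a few lines. This is only an illustrative sketch of the textbook technique with empirical marginals; it is not Synthia's code and does not use Synthia's API. The function name `gaussian_copula_sample` and its parameters are hypothetical, and the sketch assumes a 2-D array with at least two columns.

```python
from statistics import NormalDist

import numpy as np

# Element-wise standard normal quantile function and CDF (stdlib only).
_STD_NORMAL = NormalDist()
_to_normal = np.vectorize(_STD_NORMAL.inv_cdf)  # uniform in (0, 1) -> normal score
_to_uniform = np.vectorize(_STD_NORMAL.cdf)     # normal score -> uniform in [0, 1]


def gaussian_copula_sample(data, n_samples, seed=None):
    """Draw synthetic rows from a Gaussian copula fitted to `data`.

    Hypothetical helper for illustration only (not part of Synthia).
    `data` is an (n_rows, n_cols) array with n_cols >= 2.
    """
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape

    # 1. Uniformize each column: ranks -> pseudo-observations inside (0, 1).
    ranks = np.argsort(np.argsort(data, axis=0), axis=0) + 1
    uniforms = ranks / (n_rows + 1)

    # 2. Map to normal scores and estimate their correlation matrix.
    scores = _to_normal(uniforms)
    corr = np.corrcoef(scores, rowvar=False)

    # 3. Sample new normal scores with the same dependence structure.
    rng = np.random.default_rng(seed)
    new_scores = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)

    # 4. Map back through the empirical quantile function of each column.
    new_uniforms = _to_uniform(new_scores)
    return np.column_stack(
        [np.quantile(data[:, j], new_uniforms[:, j]) for j in range(n_cols)]
    )
```

Because step 4 uses empirical quantiles, the generated values stay within the observed range of each column; a parametric marginal would be needed to extrapolate beyond it.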

Considering the small amount of code, we might fear that it is mainly a python wrapper to some other tools like vinecopulib. Could you give us some information regarding this aspect?

We have tried to write Synthia succinctly, and the current count of pure Python lines, according to the cloc tool, is 1097 (see below). vinecopulib is important but not required: it is marked as an optional dependency in our installation instructions (see https://dmey.github.io/synthia/installation.html). The amount of code that corresponds to the integration with vinecopulib is very small, about 20-30 lines. Furthermore, although vinecopulib does play an important role in Synthia, its purpose is limited to the generation of vines, not data generation in general.
As Synthia presents a new method for generating synthetic multidimensional data in Python using fPCA together with Gaussian and vine copula models, natively handles multidimensional arrays and datasets (essential in computational sciences), and provides the parameterization and manipulation of univariate distributions in a single tool, I believe the paper is within scope.

The scope of the journal (https://joss.readthedocs.io/en/latest/submitting.html) indicates that [our bold]:

JOSS publishes articles about research software. This definition includes software that: solves complex modeling problems in a scientific context (physics, mathematics, biology, medicine, social science, neuroscience, engineering); supports the functioning of research instruments or the execution of research experiments; extracts knowledge from large data sets; offers a mathematical library, or similar.

JOSS publishes articles about software that represent substantial scholarly effort on the part of the authors. Your software should be a significant contribution to the available open source software that either enables some new research challenges to be addressed or makes addressing research challenges significantly better (e.g., faster, easier, simpler)

I cited a paper which is going to be submitted in the next 10 days; I will let you know as soon as it has been deposited so that I can update the reference. Apologies for the long text, but I thought it would be best to address everything in one long comment.

As a side note, I think there is a small issue with the typesetting in the paper (Table 1). Would it be possible to reduce the text size or change the width a little so that the code blocks display as one-liners? Otherwise I could move them to a different layout.

Output from the cloc command (local run, commit id: 0da044afc3c6d7bad0b60f54dcf21ba2fb6374be).

      54 text files.
      54 unique files.
      21 files ignored.

github.com/AlDanial/cloc v 1.74  T=0.52 s (69.8 files/s, 4497.8 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          21            364            369           1097
Markdown                         9            117              0            206
YAML                             3             11              5             71
CSS                              1              7              7             61
INI                              1              0              0              2
HTML                             1              0              0              2
-------------------------------------------------------------------------------
SUM:                            36            499            381           1439
-------------------------------------------------------------------------------

@whedon check repository

Software report (experimental):

github.com/AlDanial/cloc v 1.84  T=0.10 s (498.5 files/s, 83480.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
SVG                              1              0              0           4607
Python                          21            365            370           1102
Markdown                        10            112              0            198
Jupyter Notebook                 7              0            911            177
YAML                             3             11              5             71
CSS                              1              7              7             61
TeX                              1              5              0             47
reStructuredText                 3             37             68             40
INI                              1              0              0              2
HTML                             1              0              0              2
-------------------------------------------------------------------------------
SUM:                            49            537           1361           6307
-------------------------------------------------------------------------------


Statistical information for the repository '1f383df63cf604807d3377a9' was
gathered on 2020/11/16.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Maik Riechert                    2           242             37           10.05
Thomas Nagler                    2           137             34            6.16
dmey                            19          1927            398           83.78

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Maik Riechert               241           99.6          0.1                2.90
Thomas Nagler                87           63.5          0.8               24.14
dmey                       1509           78.3          0.0                8.88

@openjournals/dev - any comments on this question from the author:

As a side note, I think there is a small issue with the typesetting in the paper (Table 1). Would it be possible to reduce the text size or change the width a little so that the code blocks display as one-liners? Otherwise I could move them to a different layout.

👋 @oliviaguest - would you be willing to edit this for JOSS?

@whedon invite @oliviaguest as editor

@oliviaguest has been invited to edit this submission.

I am really inundated with work at the moment, so on the proviso I can start (looking for reviewers, etc.) next week, sure. ☺️

Sure, that's fine!

@whedon assign @oliviaguest as editor

OK, the editor is @oliviaguest
