Joss-reviews: [REVIEW]: STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining

Created on 13 Jun 2019 · 45 Comments · Source: openjournals/joss-reviews

Submitting author: @seanmylaw (Sean Law)
Repository: https://github.com/TDAmeritrade/stumpy
Version: 1.0.0
Editor: @mbobra
Reviewer: @ejolly, @hooman650
Archive: 10.5281/zenodo.3340125

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/eb91faaf9219d46c9acd373cfee8ac29"><img src="http://joss.theoj.org/papers/eb91faaf9219d46c9acd373cfee8ac29/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/eb91faaf9219d46c9acd373cfee8ac29/status.svg)](http://joss.theoj.org/papers/eb91faaf9219d46c9acd373cfee8ac29)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@ejolly & @hooman650, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mbobra know.

✨ Please try and complete your review in the next two weeks ✨

Review checklist for @ejolly

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Version: 1.0.0
[x] Authorship: Has the submitting author (@seanmylaw) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @hooman650

Conflict of interest

[x] As the reviewer I confirm that I have read the JOSS conflict of interest policy and that there are no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[ ] Version: 1.0.0
[x] Authorship: Has the submitting author (@seanmylaw) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[ ] Functionality: Have the functional claims of the software been confirmed?
[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Authors: Does the paper.md file include a list of authors with their affiliations?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

accepted published recommend-accept review

All 45 comments

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @ejolly, @hooman650 it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

whedon on 13 Jun 2019

Attempting PDF compilation. Reticulating splines etc...

whedon on 13 Jun 2019

:point_right: Check article proof :page_facing_up: :point_left:

whedon on 13 Jun 2019

@ejolly @hooman650 Thank you for agreeing to review this submission! Whedon generated a checklist and linked a reviewer guide above -- let me know if you have any questions.

mbobra on 13 Jun 2019

👍2

👋 @ejolly @hooman650 How is it going? Would you like more time to review? Do you have any questions? Please let me know!

mbobra on 2 Jul 2019

Sorry @mbobra, I’ve been traveling but I can have this done by the end of this week if that’s ok.

ejolly on 6 Jul 2019

👍1

Ok, I have done a preliminary review of STUMPY.

Summary of work from my perspective:

STUMPY computes the Euclidean distance between a segment of length `window` and a sequence of data. Although the operation is simple, it is expensive in both computation time and space. STUMPY builds on ideas published in several papers that use the FFT and algebra to reduce the computation time. Space complexity is handled by storing only the smallest value for each comparison. In general, the work sounds interesting and can be handy for finding patterns in large time series.
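To make the summary above concrete, here is a minimal brute-force sketch of a matrix profile in plain NumPy. It is not STUMPY's implementation: it skips the FFT and algebraic shortcuts entirely, and the function name `naive_matrix_profile` and the exclusion-zone width of `m / 4` are assumptions chosen for illustration. The sketch uses the z-normalized Euclidean distance and assumes no window is constant (a zero standard deviation would divide by zero).

```python
import numpy as np


def naive_matrix_profile(ts, m):
    """Brute-force sketch: for every length-m window, find the distance to
    (and index of) its nearest neighbor outside an exclusion zone."""
    ts = np.asarray(ts, dtype=float)
    n_sub = len(ts) - m + 1
    # All sliding windows as rows, shape (n_sub, m)
    windows = np.lib.stride_tricks.sliding_window_view(ts, m)
    # z-normalize each window (assumes no window has zero variance)
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    z = (windows - mu) / sigma
    excl = int(np.ceil(m / 4))  # assumed exclusion zone around each window
    profile = np.full(n_sub, np.inf)
    index = np.full(n_sub, -1)
    for i in range(n_sub):
        # Distance from window i to every window
        d = np.linalg.norm(z - z[i], axis=1)
        # Ignore trivial matches with itself and close neighbors
        d[max(0, i - excl):min(n_sub, i + excl + 1)] = np.inf
        j = int(np.argmin(d))
        # Keep only the smallest distance and where it occurred
        profile[i], index[i] = d[j], j
    return profile, index


# Usage: plant an exact repeat in random noise and recover it
rng = np.random.default_rng(0)
ts = rng.random(200)
ts[150:170] = ts[20:40]
profile, index = naive_matrix_profile(ts, m=20)
```

The loop is quadratic in the number of windows, which is the cost that the FFT-based approaches mentioned in the summary are meant to reduce.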

Comments:

1. I installed STUMPY from PyPI and everything went well on Python 3.6 running on 64-bit Windows.

2. The author has prepared a nice documentation page as well as contributing guidelines and examples. However, I just ran into an exception as I tried to run the following example from the documentation:

```python
import numpy as np
import stumpy

your_time_series = np.random.rand(10000)
window_size = 50  # Approximately, how many data points might be found in a pattern

matrix_profile = stumpy.stump(your_time_series, m=window_size)

left_matrix_profile_index = matrix_profile[2]
right_matrix_profile_index = matrix_profile[3]
idx = 10  # Subsequence index for which to retrieve the anchored time series chain for

anchored_chain = stumpy.atsc(left_matrix_profile_index, right_matrix_profile_index, idx)

all_chain_set, longest_unanchored_chain = stumpy.allc(left_matrix_profile_index, right_matrix_profile_index)
```

The exception is:

index 10 is out of bounds for axis 0 with size 4

The arrays have size 4 while the index asks for element 10. Please fix.

3. Regarding the performance comparison graph mentioned here: how did the authors of GPU-STOMP implement their algorithm? In TensorFlow? I feel that a good implementation is entirely feasible in TensorFlow. Given that almost every hard computation now performs far better on GPUs than on CPUs, I have difficulty believing that a good GPU implementation would be beaten by CPUs. But I might be wrong.

4. My final comment concerns the "window size" input of STUMPY. This could be compared to kernel size in CNNs. It will always be challenging to know prior to analysis. Any suggestions for determining it?

5. There is no version release number in the repository.

Overview:

In general, I feel that the author has done a good job and STUMPY can be a good contribution to the time-series analysis toolchain. The documentation looks good, but more work should be done to make sure all the examples run smoothly and correctly.

hooman650 on 13 Jul 2019

@hooman650 Thank you for your thorough review.

Regarding 2:

> The author has prepared a nice documentation page as well as contributing guidelines and examples. However, I just ran into an exception as I tried to run the following example from the documentation:

Indeed, this was a typo and we have submitted an issue and fixed it accordingly in both our README and ReadTheDocs. Thank you for pointing this out!
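For readers following along, the error message is consistent with indexing a row where a column was intended. The sketch below uses a zero-filled NumPy array as a stand-in for the result of `stumpy.stump`; the shape (one row per subsequence, four columns) and the column order are assumptions inferred from the snippet and the error above, not a statement of the exact fix that was committed.

```python
import numpy as np

# Stand-in for the stump output: one row per subsequence, four columns
# (assumed order: profile, index, left index, right index).
n_subsequences = 10000 - 50 + 1
matrix_profile = np.zeros((n_subsequences, 4))

row = matrix_profile[2]              # third ROW: only 4 entries
left_index = matrix_profile[:, 2]    # third COLUMN: one entry per subsequence
right_index = matrix_profile[:, 3]   # fourth COLUMN

idx = 10
try:
    row[idx]  # fails: the row has only 4 entries
except IndexError as err:
    message = str(err)

value = left_index[idx]  # works: the column is long enough
```

Selecting columns with `[:, k]` yields arrays with one entry per subsequence, so an index such as `idx = 10` stays in bounds.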

Regarding 3:

> Regarding the performance comparison graph mentioned here: how did the authors of GPU-STOMP implement their algorithm? In TensorFlow? I feel that a good implementation is entirely feasible in TensorFlow. Given that almost every hard computation now performs far better on GPUs than on CPUs, I have difficulty believing that a good GPU implementation would be beaten by CPUs. But I might be wrong.

Unfortunately, I am not the original author of the GPU-STOMP publication/code and the numbers shown were simply extracted from their published paper for comparison. Currently, I only have access to CPUs but it is a top priority in our project roadmap to port this work over to GPUs. We are in the process of looking for assistance and resources from folks at NVIDIA. The initial goal of our scalable CPU implementation was to allow non-tech-savvy scientists to be able to get up and running quickly without needing access to any specialized hardware. We believe that we have achieved this goal.

Regarding 4:

> My final comment concerns the "window size" input of STUMPY. This could be compared to kernel size in CNNs. It will always be challenging to know prior to analysis. Any suggestions for determining it?

This is an excellent question and has been discussed by the original authors (not me) in _Section D_ of their paper, Matrix Profile II. In summary, the window size is certainly a _user input_ that requires some level of domain expertise. However, the original authors have demonstrated that the matrix profile is robust to varying window sizes and that being "in the ballpark" is often enough to find motifs.

Regarding 5:

> There is no version release number in the repository.

We currently provide a version number in the standard setup.py file and the version number is also accessible within Python via stumpy.__version__. Perhaps there is a better or more standard place to specify the version release number? Any guidance would be greatly appreciated.

seanlaw on 15 Jul 2019

👋 @seanlaw

GitHub is showing that there aren't any releases in the repository. Could you please go ahead and create a release?

Could you also include your responses to points 3 and 4 above in the text of the paper along with the appropriate references?

mbobra on 16 Jul 2019

@mbobra Thank you for pointing me to the helpful resources. I've gone back and tagged the commit that coincides with the first upload to PyPI (May 3rd) as v1.0.0. Let me know if that is sufficient.

Regarding points 3 and 4, both references (Matrix Profile I and Matrix Profile II) were already included in the original article proof. Note that the references I mention above are just the preprints available directly on the original authors' group website, whereas the references in the original article proof point to the published IEEE manuscripts.

seanlaw on 16 Jul 2019

Hi @mbobra sorry for the delay on my review!

First of all, I'd like to note that Stumpy is a great addition to the time-series analyst's toolkit and is very well-documented, explained, and referenced. I also rather enjoyed the talk. Really nice @seanlaw!

My testing was done using Python 3.6 on macOS 10.14.2 and everything installed without issues. I updated my installation given the most recent changes made as per @hooman650's review and can attest that the fix for comment 2 now works.

I was unable to test the performance claims given my limited access to a distributed compute system at this time, but I was able to at least test the functionality by running a local Dask server without issues.

Just a few minor suggestions:

1. Since the tutorial examples are provided as Jupyter notebooks and the prerendered file for Tutorial_1 includes inline plots, it would be good to add the %matplotlib inline command to the top code cell of that notebook so that anyone just downloading and running the notebooks immediately reproduces the prerendered files on GitHub.
2. In its present form, Tutorial_2 is a bit lacking in context and explanation. While it's nice to see how to use some additional functionality and be provided links to the original paper, I might suggest at least a bit more of an explanation as to what time-series chains are and why one might use them. Tutorial_1, for example, does a great job of providing a high-level overview of the matrix profile and elucidating the impact of the window size free parameter.
3. More of a minor suggestion, but it might be nice to have the _text_ link to the documentation a bit higher up on the GitHub README page, for example in the website field of the repo description, or to edit the text in the first paragraph to clearly mention the documentation site. While the badge and "matrix normal" links work, the actual documentation site itself isn't discussed until halfway through the README. Feel free to ignore this if you prefer.

With those minor changes I think this would make a great addition to JOSS.

ejolly on 17 Jul 2019

@ejolly Thank you for your constructive and useful feedback. Please see my responses below:

Regarding 1:

> Since the tutorial examples are provided as Jupyter notebooks and the prerendered file for Tutorial_1 includes inline plots, it would be good to add the %matplotlib inline command to the top code cell of that notebook so that anyone just downloading and running the notebooks immediately reproduces the prerendered files on GitHub.

This is an excellent suggestion and we've filed/fixed/closed this issue as per your recommendation. For completeness, we have also provided interactive Binder notebooks in addition to the pre-rendered notebooks so that the user can "try before installing".

Regarding 2:

> In its present form, Tutorial_2 is a bit lacking in context and explanation. While it's nice to see how to use some additional functionality and be provided links to the original paper, I might suggest at least a bit more of an explanation as to what time-series chains are and why one might use them. Tutorial_1, for example, does a great job of providing a high-level overview of the matrix profile and elucidating the impact of the window size free parameter.

We completely agree. This is one of our older issues, dating back to May 18th, and we are hoping to get some help identifying a good example dataset and writing up a more complete tutorial like Tutorial 1. Currently, the tutorial only demonstrates the time series chains API and we'd really like to provide some more intuitive insight with a better data set than the current Taxi data set.

In all fairness, the goal of the STUMPY software is to faithfully implement the algorithms in the published papers (not written by us), so we strongly recommend that users read the papers (clearly referenced), as they provide far more detail and insight than STUMPY can. Keep in mind that without STUMPY there is really no scalable, performant, and easy-to-install implementation for computing the matrix profile. Our current focus is therefore to provide a suite of tools based on the published papers and to save users the time and headache of implementing those papers themselves (the papers are not without errors and omit important implementation details). Eventually, once we've built a community/user base and developed most of the published features, we will certainly spend more time improving the tutorials. It's probably important to point out that STUMPY was created and is currently maintained by a single person (me). For better or for worse, it is mostly done on my personal time and, without additional assistance, one person can only do so much.

While I completely and wholeheartedly agree that the tutorials could be better (and they will be once the feature set stabilizes), I would respectfully argue that the JOSS requirements make no mention of tutorials, so they are a "nice to have" but should not be used as a criterion to judge the completeness of the software. From an API documentation, unit testing/code coverage, installation instructions, and example usage standpoint, we humbly believe that this open source software meets the JOSS requirements.

Regarding 3:

> More of a minor suggestion, but it might be nice to have the text link to the documentation a bit higher up on the GitHub README page, for example in the website field of the repo description, or to edit the text in the first paragraph to clearly mention the documentation site. While the badge and "matrix normal" links work, the actual documentation site itself isn't discussed until halfway through the README. Feel free to ignore this if you prefer.

This is good feedback. We've filed/fixed/closed the following new issue and added a clearer link in the opening paragraph of the README.

seanlaw on 17 Jul 2019

@seanlaw the binder addition is a great one and I think will be great for new users.

Regarding my point 2:

I apologize as I should have been more clear. I completely agree with your response that publication in JOSS should _not_ be contingent on you adding a more comprehensive Tutorial 2. My comment was more of a suggestion for something that would improve the tutorials, i.e. a "nice to have". From my review, you have already done a fantastic job of documenting, testing, and providing the requisite high-level explanation of the package functionality as per JOSS requirements.

I completely understand that creating and maintaining a solo project is a huge demand on your time and it will be great to see how tutorials and functionality grow with the community base!

ejolly on 17 Jul 2019

👍1

@ejolly No need to apologize as I assumed no ill intent. Thank you (as well as to @hooman650 and @mbobra) for taking the time to review! I really appreciate the valuable feedback.

seanlaw on 17 Jul 2019

@hooman650 and @ejolly Thank you so much for reviewing! We really appreciate your time and effort ☀️

@seanlaw We're almost there! Can you please archive your release on Zenodo to obtain a DOI and then put that in your README.rst file? After that I think we're done 🎉

mbobra on 17 Jul 2019

@mbobra I've added the DOI as a badge at the top of the README.rst. Is that what you mean?

seanlaw on 18 Jul 2019

👍1

@whedon set 10.5281/zenodo.3340125 as archive

mbobra on 18 Jul 2019

OK. 10.5281/zenodo.3340125 is the archive.

whedon on 18 Jul 2019

@whedon set 1.0.0 as version

mbobra on 18 Jul 2019

OK. 1.0.0 is the version.

whedon on 18 Jul 2019

@whedon check references

mbobra on 18 Jul 2019

Attempting to check references...

whedon on 18 Jul 2019

```
Reference check summary:

OK DOIs

10.1109/ICDM.2016.0179 is OK
10.1109/ICDM.2016.0085 is OK
10.1109/ICDM.2017.66 is OK
10.1109/ICDM.2017.79 is OK

MISSING DOIs

None

INVALID DOIs

None
```

whedon on 18 Jul 2019

@whedon generate pdf

mbobra on 18 Jul 2019

Attempting PDF compilation. Reticulating splines etc...

whedon on 18 Jul 2019

:point_right: Check article proof :page_facing_up: :point_left:

whedon on 18 Jul 2019

@openjournals/joss-eics This paper is ready for acceptance! Nice work @seanlaw 🎉

mbobra on 18 Jul 2019

Thanks @mbobra, @hooman650, and @ejolly! This was a wonderful and pleasant submission experience. Hopefully, I will run into you at a conference one day!

seanlaw on 18 Jul 2019

@whedon accept

danielskatz on 18 Jul 2019

🎉1

Attempting dry run of processing paper acceptance...

whedon on 18 Jul 2019

Check final proof :point_right: https://github.com/openjournals/joss-papers/pull/842

If the paper PDF and Crossref deposit XML look good in https://github.com/openjournals/joss-papers/pull/842, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.
@whedon accept deposit=true

whedon on 18 Jul 2019

@danielskatz The final proof looks good. Is there anything else that I need to do or is whedon’s command for you to handle? I just want to make sure I am not holding up the process.

seanlaw on 18 Jul 2019

It's fine - between being on a plane where I couldn't see the final PDF to check it, and then driving and sleeping, I've just gotten back to it :)

danielskatz on 18 Jul 2019

👍1

Thanks to @ejolly & @hooman650 for reviewing and to @mbobra for editing

danielskatz on 18 Jul 2019

👍1

@whedon accept deposit=true

danielskatz on 18 Jul 2019

Doing it live! Attempting automated processing of paper acceptance...

whedon on 18 Jul 2019

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

whedon on 18 Jul 2019

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

Check final PDF and Crossref metadata that was deposited :point_right: https://github.com/openjournals/joss-papers/pull/843
Wait a couple of minutes to verify that the paper DOI resolves https://doi.org/10.21105/joss.01504
If everything looks good, then close this review issue.
Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? notify your editorial technical team...

whedon on 18 Jul 2019

🎉1

:tada::tada::tada: Congratulations on your paper acceptance! :tada::tada::tada:

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](http://joss.theoj.org/papers/10.21105/joss.01504/status.svg)](https://doi.org/10.21105/joss.01504)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.01504">
  <img src="http://joss.theoj.org/papers/10.21105/joss.01504/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: http://joss.theoj.org/papers/10.21105/joss.01504/status.svg
   :target: https://doi.org/10.21105/joss.01504

We need your help!

Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following:

Volunteering to review for us sometime in the future. You can add your name to the reviewer list here: http://joss.theoj.org/reviewer-signup.html
Making a small donation to support our running costs here: https://numfocus.salsalabs.org/donate-to-joss

whedon on 18 Jul 2019

🎉1

@danielskatz I just spotted a minor typo in the PDF. Is there some way that I can fix it?

seanlaw on 16 Aug 2019

@whedon generate pdf

seanlaw on 16 Aug 2019

Attempting PDF compilation. Reticulating splines etc...

whedon on 16 Aug 2019

@arfon I have fixed the typo in the original source repository. Can you please take a look?

seanlaw on 24 Aug 2019

> @arfon I have fixed the typo in the original source repository. Can you please take a look?

Done. It could take a few hours to show up as fixed on the JOSS site as there's caching in place for the PDFs.

arfon on 24 Aug 2019

❤1

Thanks, @arfon! It looks good now

seanlaw on 24 Aug 2019
