Tsfresh: STUMPY, Matrix Profiles, and Motif Discovery

Created on 6 Apr 2020 · 21Comments · Source: blue-yonder/tsfresh

Hello, tsfresh devs/users! First off, I wanted to say thank you for this wonderful and thoughtfully created package. I am a big fan of the work that y'all are doing here!

I had noticed an older issue (and PR) between @Ezekiel-Kruglick, @GillesVandewiele, @MaxBenChrist, and @nils-braun regarding the earlier work from Eamonn Keogh's group on motif discovery (and shapelets too). If I understand correctly, this discussion happened right before Keogh published his wonderful papers on matrix profiles during the fall of 2017. I was wondering if it is of any interest to the group to re-explore the idea of motif extraction in light of these papers. The STUMPY Python package is focused on providing a fast and user friendly interface for computing the matrix profile and, more importantly, faithfully reproduces Keogh's work. It is Python 3 only and has support for parallel CPU computation via Numba, distributed computations via Dask, multi-GPU support, and maintains 100% code coverage. Depending on the data size, it may fit well with some of the tsfresh use cases.

Full disclosure, I am the creator of STUMPY so let me know if you see an opportunity to collaborate here!

new feature calculator

Source

seanlaw

Most helpful comment

Hi @seanlaw,

This is indeed a great remark. Even better, a few days ago, Keogh and others published a paper with all possible features that can easily be extracted through the matrix profile ("The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code").

One of those includes shapelet discovery, which is really easy to implement and fast:

GillesVandewiele on 7 Apr 2020

❤2

All 21 comments

Hi @seanlaw,

One of those includes shapelet discovery, which is really easy to implement and fast:

GillesVandewiele on 7 Apr 2020

❤2

A pre-print of that paper can be found here and STUMPY currently has an open issue to add tutorials to demonstrate these 10 capabilities (PR welcome!).

As stated in their paper, the most time consuming part isn't shapelet discovery but the time that it takes to compute the matrix profile for large times series. Let me take a look at the paper and see if I can implement the shapelet component quickly (since motif discovery is already available).

seanlaw on 7 Apr 2020

I already did it (hence the picture). I didn't use stumpy though, and of the major disadvantages is that it extracts very similar shapelets that do not complement each other well...

I'll share code here later today.

GillesVandewiele on 8 Apr 2020

👍1

Ahh, I was wondering why your red circles were filled and were different from the paper. 😄

What data did you use (it doesn't look like the GunPoint data from their paper)? How did you compute the matrix profile?

seanlaw on 8 Apr 2020

👍1

My apologies, those were indeed not from the GunPoint dataset. Below are those for the GunPoint dataset:

I used this library since it was developed in my own research group, but I have used stumpy before that and it worked well.

Here's the code:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
np.random.seed(1337)  # Random seed for reproducibility
from tslearn.datasets import UCR_UEA_datasets
from distancematrix.generator.filter_generator import FilterGenerator
from distancematrix.generator.znorm_euclidean import ZNormEuclidean
from distancematrix.generator.euclidean import Euclidean
from distancematrix.consumer.matrix_profile_lr import MatrixProfileLR
from distancematrix.calculator import AnytimeCalculator

# Load ItalyPowerDemand dataset
X_train, y_train, X_test, y_test = UCR_UEA_datasets().load_dataset('GunPoint')
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1]))

def mp_one_vs_all(X_train, y_train, m, K):
    X = []
    lens = []
    for label in set(y_train):
        lens.append(sum(y_train == label))
        all_ts = []
        for ts in X_train[y_train == label]:
            all_ts.extend(ts)
            all_ts.append(np.NaN)
        X.append(all_ts)

    top_shaps = []
    for i in range(len(X)):
        # Extract K shapelets C - 1 times (with C = # unique classes)
        other_ts = []
        for j in range(len(X)):
            if i != j:
                other_ts.extend(X[j])

        calc = AnytimeCalculator(m, X[i])  # One series passed => self-join
        gen_0 = calc.add_generator(0, FilterGenerator(ZNormEuclidean()))
        cons_0 = calc.add_consumer([0], MatrixProfileLR())  # Consumer 0 works on generator 0
        calc.calculate_columns(print_progress=False)
        mp_own = list(cons_0.matrix_profile())

        calc = AnytimeCalculator(m, X[i], other_ts)  # One series passed => self-join
        gen_0 = calc.add_generator(0, FilterGenerator(ZNormEuclidean()))
        cons_0 = calc.add_consumer([0], MatrixProfileLR())  # Consumer 0 works on generator 0
        calc.calculate_columns(print_progress=False)
        mp_other = list(cons_0.matrix_profile())

        diffs = np.array(mp_other) - np.array(mp_own)
        diffs[~np.isfinite(diffs)] = float('-inf')

        peaks, properties = find_peaks(diffs, height=0, distance=m//2)
        top_ix = peaks[np.argsort(properties['peak_heights'])[-K:]]

        plt.figure(figsize=(15, 3))
        plt.plot(diffs)
        plt.scatter(top_ix, diffs[top_ix], c='r', s=50)
        plt.title(i)
        plt.show()

        for k in range(len(top_ix)):
            shap = np.array(X[i][top_ix[k]:top_ix[k] + m])
            shap = shap[~np.isnan(shap)]
            top_shaps.append(shap)

    return top_shaps

top_shaps = mp_one_vs_all(X_train, y_train, 38, 10)

GillesVandewiele on 8 Apr 2020

Small note is that this dataset feels somewhat cherry-picked. Performance on other datasets is a lot worse than that achieved with techniques such as Learning Shapelets or Shapelet Transform. This is due to the disadvantage I mentioned: the extracted shapelets are all very similar to each other. Any ideas on how we could go about to fix this?

GillesVandewiele on 8 Apr 2020

I used this library since it was developed in my own research group, but I have used stumpy before that and it worked well.

Unfortunately, I have no prior experience with Shapelets so I'm probably of little help but I've engaged with Keogh's team in the past and they might respond. I'm guessing that Shapelets derived from a matrix profile are "free" but not necessarily guaranteed to be the best or, at least, they may point you in the "right direction". 🤷‍♂So, @GillesVandewiele, I am not entirely surprised as Shapelet discovery seems to be a hard problem. Thank you for sharing the work from your research group and for following up with your code! I appreciate it.

My original post may have been misleading since it referred to a past Shapelet issue (sorry about that) but the primary focus of STUMPY is to efficiently compute the matrix profile (especially for large datasets) and I was curious if the output (distances, location of nearest match, index distance to nearest match, left and right nearest neighbor distance, etc) could/should be used as time series features or, afterward, derive other new features from the matrix profile. Obviously, these are not mutually exclusive but STUMPY is well maintained with 100% test coverage and may be of interest to tsfresh.

seanlaw on 8 Apr 2020

Small note is that this dataset feels somewhat cherry-picked. Performance on other datasets is a lot worse than that achieved with techniques such as Learning Shapelets or Shapelet Transform. This is due to the disadvantage I mentioned: the extracted shapelets are all very similar to each other. Any ideas on how we could go about to fix this?

Okay, so I quickly skimmed the original Shapelets paper and also the Shapelet section of the Top Ten Paper. I apologize in advance if I missed anything or am misunderstanding the core concepts due to my haste.

So, I don't think that there should be any expectation that the extracted shapelets should be different from each other. Say you have two classes, A and B, if class A is fairly homogeneous then you would also expect the discovered shapelets to be highly similar. In contrast, assuming that class A and class B are very different then the discovered shapelets from class A can be expected to be different from class B. So, I actually don't think this is a "disadvantage" as it is built into the assumptions. That is, the extracted shapelets will be similar to each other if class A have samples that are also very similar.

However, I think that in order to find more "unique" shapelets, one might need to try out different window sizes (so as to capture different shapes/sizes of shapelets) OR you may need to do some pre-work to ensure that the samples from class A are as dissimilar as possible to start with (but, again, is it really a useful "class" if the samples are that dissimilar to begin with?).

seanlaw on 8 Apr 2020

Well it is indeed to be expected when using the naive vanilla implementations. But the ultimate goal would be to have a set of K shapelets such that if that set is used to transform the timeseries into distances, then the performance of a classifier on those distances is optimal. Of course, having a lot of similar shapelets in that set of K shapelets will not help much.

On the GunPoint dataset, they are able to achieve an accuracy of 0.9133-ish with one shapelet (i.e. the dataset becomes 1-dimensional after transformation), but often a lot more shapelets are required to get good performance.

Therefore, we could perhaps look for some smarter heuristics/tricks to extract better shapelets (when K > 1).

To better illustrate my point, assume we have the following toy dataset and we want to extract 2 shapelets:
Screenshot from 2020-04-08 17-15-42

Now we can either extract the two shapelets with the highest information gain (but those two will be very similar to each other), or we could extract one shapelet with the highest information gain and then look for one that complements that shapelet well. As can be seen below, a good performance can only be achieved by using the latter approach.
Screenshot from 2020-04-08 17-15-55

GillesVandewiele on 8 Apr 2020

Concerning the other applications of MP: I totally agree that these would be a very interesting additions to tsfresh as most of those features are very agnostic to the task (which is perfect for tsfresh)

GillesVandewiele on 8 Apr 2020

👍1

@GillesVandewiele I understand. It is very similar in spirit to, say, principal component analysis where each component is orthogonal to one another (and therefore can be used to optimally represent any datapoint) or choosing uncorrelated features as input for linear regression. Features that are highly correlated may be split the model weights and reduce their variable importance.

Given a set of potential shapelets, it sounds like what you want is to 1) group the ones that are similar and 2) choose a "snippet" (sometimes people refer to this as an "exemplar" in the clustering world) that best represents the group. Keogh has a paper on "snippets" that may be useful for consideration

seanlaw on 8 Apr 2020

👍1

Concerning the other applications of MP: I totally agree that these would be a very interesting additions to tsfresh as most of those features are very agnostic to the task (which is perfect for tsfresh)

Awesome! Are you aware of any feature extraction functions within tsfresh that also take a long time (hours-days) for exceptionally long time series (say, 100 M points)?

Do you know if the matrix profile implementation from your research group can handle datasets as large as 100 M+ points?

seanlaw on 8 Apr 2020

Let me take a look at the paper and see if I can implement the shapelet component quickly (since motif discovery is already available).

I found some time to reproduce the GunPoint example with STUMPY. For completeness, a Jupyter notebook can be found here

seanlaw on 9 Apr 2020

Concerning the other applications of MP: I totally agree that these would be a very interesting additions to tsfresh as most of those features are very agnostic to the task (which is perfect for tsfresh)

Awesome! Are you aware of any feature extraction functions within tsfresh that also take a long time (hours-days) for exceptionally long time series (say, 100 M points)?

Do you know if the matrix profile implementation from your research group can handle datasets as large as 100 M+ points?

There's definitely a few slow ones in the ComprehensiveParameters, but I can't recall them atm to be honest.

I'm also not that familiar with the MP library I'm using since it is written by a colleague, but it seems to scale really well. It has no GPU support though...

Nice notebook btw!

GillesVandewiele on 9 Apr 2020

👍1

@nils-braun Based on #785, it sounds like you are moving in a different direction. Please feel free to close this issue

seanlaw on 8 Jan 2021

Thanks for the message @seanlaw - the other issue should not be a stop sign for this one. I would be more than happy to see multiple features based on matrix profiles in tsfresh. If you would like, we can also add STUMPY as a "matrix profiles feature provider" or maybe even let the user choose (or maybe - if there are certain known differences, it might even make sense to have multiple implementations in parallel, see below).
Sorry for never having answered your issue! I think @kempa-liehr and @MaxBenChrist were busy at that time and then it was probably forgotten...

I am, honestly, not an expert in matrix profile computation. I see that both packages (Matrixprofile and stumpy) have put some effort into speed optimization. If the final result of the two libraries is similar (which I guess it is), I think it might make sense to have a configuration option to choose (because I guess there might be some cases where one of the two libraries is faster). What do you think?

nils-braun on 10 Jan 2021

the other issue should not be a stop sign for this one. I would be more than happy to see multiple features based on matrix profiles in tsfresh. If you would like, we can also add STUMPY as a "matrix profiles feature provider" or maybe even let the user choose (or maybe - if there are certain known differences, it might even make sense to have multiple implementations in parallel, see below).

Excellent! I'm not too familiar with Matrixprofile but I can tell you that STUMPY works very hard to faithfully reproduce the original published work by Keogh et al and so we try not to make any assumptions or take any liberties that we think would allow the user to "shoot themselves in the foot". I am probably biased but it certainly makes sense to provide users with at least a basic STUMPY implementation.

Sorry for never having answered your issue! I think @kempa-liehr and @MaxBenChrist were busy at that time and then it was probably forgotten...

Please, no apologies needed! As an open source maintainer myself, I completely understand. I am here to help and contribute!

I am, honestly, not an expert in matrix profile computation. I see that both packages (Matrixprofile and stumpy) have put some effort into speed optimization. If the final result of the two libraries is similar (which I guess it is), I think it might make sense to have a configuration option to choose (because I guess there might be some cases where one of the two libraries is faster).

I can only speak to STUMPY but, aside from the basic matrix profile output, here are some obvious benefits just to name a few:

We pride ourselves in always having 100% code coverage. Not only are we testing specific inputs/outputs but we actually implement slow/naive versions of the fast published matrix profile algorithms to compare against. Users can be confident that nothing gets released without full test coverage.
Our code base only depends on NumPy, SciPy, and Numba and nothing else (i.e., no Cython, no raw CUDA). We offer a distributed version of our code that has a (soft) dependency for Dask Distributed and was tested with over 32 servers (256 CPUs) so this should play nicely with tsfresh since it also uses Dask Distributed (a future consideration). Also, thanks toNumba, STUMPY comes with automatic multi-GPU support without any additional dependencies. We easily computed a large matrix profile with a 16-GPU DGX-2 local cluster. So, whether it's multi-CPU/multi-GPU, we offer a highly scalable solution for computing matrix profiles. Of course, we might appear "slower" in some performance tests but that's usually due to the cost of having to JIT-compile some of our functions (which are fast thereafter). The benefits of STUMPY's scalability really shines through when you have time series that are longer than 10K+ data points in length (note that our performance plots need to be updated as STUMPY is even faster now).
We also offer a scalable version for computing multidimensional matrix profiles (note that this is different from computing many 1-D matrix profiles and stacking them on top of each other!).
STUMPY is already available on conda-forge (and PyPI) so we can easily add it as a dependency

What do you think?

@nils-braun @kempa-liehr @MaxBenChrist I'll work on putting together a similar "combiner" PR following your draft and maybe we can work together? I'll start with something simple so as to limit the initial scope and then, over time, we can increase the functionality with additional PRs. How does that sound?

seanlaw on 11 Jan 2021

I'll provide my 2 cents here:

Code coverage - As we all know, being software developers to some extent, code coverage does not mean much beyond that your tests touch every line of code generated. It DOES NOT guarantee reliability. The matrixprofile library provides plotting within it making it pretty difficult to test all code base and some I/O functionality using protobuf. Protobuf auto-generates a decent amount of code that does not get tested.
STUMPY chose to use open-source code optimization libraries, such as NUMBA at the expense of consistent results. NUMBA uses "fastmath" under the covers that makes some assumptions about the data being used. I experimented with this feature in compiling Cython code.
A Cython dependency is not much of an issue for us as we distribute binary wheels just like MANY popular libraries (numpy, pandas, scipy, etc.)
STUMPY does provide value in that it offers horizontal scalability as an option. However, I would argue that a lot of use cases start out small and do not require GPU and multi-cpu usage.

All in all, we both provide useful features to the community at large. Do you not agree here @seanlaw ? Many of your implementations are VERY similar to ours.

tylerwmarrs on 11 Jan 2021

I don't understand the quibble with the use of Cython or line coverage. I haven't looked through all of it, but the biggest point I would generally test is divergence of carried co-moments, since lookahead methods tend to only be quasi-stable. Such tests aren't really determined by code coverage at all.

On the topic of Cython, Cython generally makes things easier to distribute, due to its origins. It was used by the Sage developers. Numba has conventionally been the more difficult one, which makes this claim sound a bit weird to me. Since it's possible to distribute cythonized source files, it's arguably a trivially removable dependency anyway.

Numba does sometimes produce faster binaries in cases with significant opportunities for inner loop vectorization and instruction level parallelism, again through the innermost loop of a nest. Most of these methods don't exhibit that, and Numba does not admit unroll and jam type optimizations or in place outer loop vectorization (nor does LLVM as it stands on master).

kavj on 11 Jan 2021

@kavj I am glad that you caught wind of this since you are a fundamental researcher within Eamonn Keogh's group. To make things a little easier to understand, are you stating the following about Numba and cython?

Cython is usually faster than Numba.
Numba is trying to reach performance of optimized Cython.

@kavj Have you used fastmath compilation option? What experiences do you exhibit with it? I find that it throws away accuracy for performance gains. When I initially created the Cython code, I tested fastmath as a compilation option and would always get unexpected results.

tylerwmarrs on 11 Jan 2021

Both packages have their merits and this was not intended to be a debate. We can agree to disagree. I was merely seeking clarification and wanted to close out an existing issue that I had created in April 2020. I may revisit this in the future as my time and energy is probably better spent elsewhere. Best of luck.

seanlaw on 11 Jan 2021

Was this page helpful?

0 / 5 - 0 ratings