Hello, tsfresh devs/users! First off, I wanted to say thank you for this wonderful and thoughtfully created package. I am a big fan of the work that y'all are doing here!
I had noticed an older issue (and PR) between @Ezekiel-Kruglick, @GillesVandewiele, @MaxBenChrist, and @nils-braun regarding the earlier work from Eamonn Keogh's group on motif discovery (and shapelets too). If I understand correctly, this discussion happened right before Keogh published his wonderful papers on matrix profiles during the fall of 2017. I was wondering if it is of any interest to the group to re-explore the idea of motif extraction in light of these papers. The STUMPY Python package is focused on providing a fast and user friendly interface for computing the matrix profile and, more importantly, faithfully reproduces Keogh's work. It is Python 3 only and has support for parallel CPU computation via Numba, distributed computations via Dask, multi-GPU support, and maintains 100% code coverage. Depending on the data size, it may fit well with some of the tsfresh use cases.
Full disclosure, I am the creator of STUMPY so let me know if you see an opportunity to collaborate here!
Hi @seanlaw,
This is indeed a great remark. Even better, a few days ago, Keogh and others published a paper with all possible features that can easily be extracted through the matrix profile ("The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code").
One of those includes shapelet discovery, which is really easy to implement and fast:
A pre-print of that paper can be found here and STUMPY currently has an open issue to add tutorials to demonstrate these 10 capabilities (PR welcome!).
As stated in their paper, the most time consuming part isn't shapelet discovery but the time that it takes to compute the matrix profile for large times series. Let me take a look at the paper and see if I can implement the shapelet component quickly (since motif discovery is already available).
I already did it (hence the picture). I didn't use stumpy though, and of the major disadvantages is that it extracts very similar shapelets that do not complement each other well...
I'll share code here later today.
Ahh, I was wondering why your red circles were filled and were different from the paper. 馃槃
What data did you use (it doesn't look like the GunPoint data from their paper)? How did you compute the matrix profile?
My apologies, those were indeed not from the GunPoint dataset. Below are those for the GunPoint dataset:
I used this library since it was developed in my own research group, but I have used stumpy before that and it worked well.
Here's the code:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
np.random.seed(1337) # Random seed for reproducibility
from tslearn.datasets import UCR_UEA_datasets
from distancematrix.generator.filter_generator import FilterGenerator
from distancematrix.generator.znorm_euclidean import ZNormEuclidean
from distancematrix.generator.euclidean import Euclidean
from distancematrix.consumer.matrix_profile_lr import MatrixProfileLR
from distancematrix.calculator import AnytimeCalculator
# Load ItalyPowerDemand dataset
X_train, y_train, X_test, y_test = UCR_UEA_datasets().load_dataset('GunPoint')
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1]))
def mp_one_vs_all(X_train, y_train, m, K):
X = []
lens = []
for label in set(y_train):
lens.append(sum(y_train == label))
all_ts = []
for ts in X_train[y_train == label]:
all_ts.extend(ts)
all_ts.append(np.NaN)
X.append(all_ts)
top_shaps = []
for i in range(len(X)):
# Extract K shapelets C - 1 times (with C = # unique classes)
other_ts = []
for j in range(len(X)):
if i != j:
other_ts.extend(X[j])
calc = AnytimeCalculator(m, X[i]) # One series passed => self-join
gen_0 = calc.add_generator(0, FilterGenerator(ZNormEuclidean()))
cons_0 = calc.add_consumer([0], MatrixProfileLR()) # Consumer 0 works on generator 0
calc.calculate_columns(print_progress=False)
mp_own = list(cons_0.matrix_profile())
calc = AnytimeCalculator(m, X[i], other_ts) # One series passed => self-join
gen_0 = calc.add_generator(0, FilterGenerator(ZNormEuclidean()))
cons_0 = calc.add_consumer([0], MatrixProfileLR()) # Consumer 0 works on generator 0
calc.calculate_columns(print_progress=False)
mp_other = list(cons_0.matrix_profile())
diffs = np.array(mp_other) - np.array(mp_own)
diffs[~np.isfinite(diffs)] = float('-inf')
peaks, properties = find_peaks(diffs, height=0, distance=m//2)
top_ix = peaks[np.argsort(properties['peak_heights'])[-K:]]
plt.figure(figsize=(15, 3))
plt.plot(diffs)
plt.scatter(top_ix, diffs[top_ix], c='r', s=50)
plt.title(i)
plt.show()
for k in range(len(top_ix)):
shap = np.array(X[i][top_ix[k]:top_ix[k] + m])
shap = shap[~np.isnan(shap)]
top_shaps.append(shap)
return top_shaps
top_shaps = mp_one_vs_all(X_train, y_train, 38, 10)
Small note is that this dataset feels somewhat cherry-picked. Performance on other datasets is a lot worse than that achieved with techniques such as Learning Shapelets or Shapelet Transform. This is due to the disadvantage I mentioned: the extracted shapelets are all very similar to each other. Any ideas on how we could go about to fix this?
I used this library since it was developed in my own research group, but I have used stumpy before that and it worked well.
Unfortunately, I have no prior experience with Shapelets so I'm probably of little help but I've engaged with Keogh's team in the past and they might respond. I'm guessing that Shapelets derived from a matrix profile are "free" but not necessarily guaranteed to be the best or, at least, they may point you in the "right direction". 馃し鈥嶁檪So, @GillesVandewiele, I am not entirely surprised as Shapelet discovery seems to be a hard problem. Thank you for sharing the work from your research group and for following up with your code! I appreciate it.
My original post may have been misleading since it referred to a past Shapelet issue (sorry about that) but the primary focus of STUMPY is to efficiently compute the matrix profile (especially for large datasets) and I was curious if the output (distances, location of nearest match, index distance to nearest match, left and right nearest neighbor distance, etc) could/should be used as time series features or, afterward, derive other new features from the matrix profile. Obviously, these are not mutually exclusive but STUMPY is well maintained with 100% test coverage and may be of interest to tsfresh.
Small note is that this dataset feels somewhat cherry-picked. Performance on other datasets is a lot worse than that achieved with techniques such as Learning Shapelets or Shapelet Transform. This is due to the disadvantage I mentioned: the extracted shapelets are all very similar to each other. Any ideas on how we could go about to fix this?
Okay, so I quickly skimmed the original Shapelets paper and also the Shapelet section of the Top Ten Paper. I apologize in advance if I missed anything or am misunderstanding the core concepts due to my haste.
So, I don't think that there should be any expectation that the extracted shapelets should be different from each other. Say you have two classes, A
and B
, if class A
is fairly homogeneous then you would also expect the discovered shapelets to be highly similar. In contrast, assuming that class A
and class B
are very different then the discovered shapelets from class A
can be expected to be different from class B
. So, I actually don't think this is a "disadvantage" as it is built into the assumptions. That is, the extracted shapelets will be similar to each other if class A
have samples that are also very similar.
However, I think that in order to find more "unique" shapelets, one might need to try out different window sizes (so as to capture different shapes/sizes of shapelets) OR you may need to do some pre-work to ensure that the samples from class A
are as dissimilar as possible to start with (but, again, is it really a useful "class" if the samples are that dissimilar to begin with?).
Well it is indeed to be expected when using the naive vanilla implementations. But the ultimate goal would be to have a set of K shapelets such that if that set is used to transform the timeseries into distances, then the performance of a classifier on those distances is optimal. Of course, having a lot of similar shapelets in that set of K shapelets will not help much.
On the GunPoint dataset, they are able to achieve an accuracy of 0.9133-ish with one shapelet (i.e. the dataset becomes 1-dimensional after transformation), but often a lot more shapelets are required to get good performance.
Therefore, we could perhaps look for some smarter heuristics/tricks to extract better shapelets (when K > 1).
To better illustrate my point, assume we have the following toy dataset and we want to extract 2 shapelets:
Now we can either extract the two shapelets with the highest information gain (but those two will be very similar to each other), or we could extract one shapelet with the highest information gain and then look for one that complements that shapelet well. As can be seen below, a good performance can only be achieved by using the latter approach.
Concerning the other applications of MP: I totally agree that these would be a very interesting additions to tsfresh as most of those features are very agnostic to the task (which is perfect for tsfresh)
@GillesVandewiele I understand. It is very similar in spirit to, say, principal component analysis where each component is orthogonal to one another (and therefore can be used to optimally represent any datapoint) or choosing uncorrelated features as input for linear regression. Features that are highly correlated may be split the model weights and reduce their variable importance.
Given a set of potential shapelets, it sounds like what you want is to 1) group the ones that are similar and 2) choose a "snippet" (sometimes people refer to this as an "exemplar" in the clustering world) that best represents the group. Keogh has a paper on "snippets" that may be useful for consideration
Concerning the other applications of MP: I totally agree that these would be a very interesting additions to tsfresh as most of those features are very agnostic to the task (which is perfect for tsfresh)
Awesome! Are you aware of any feature extraction functions within tsfresh that also take a long time (hours-days) for exceptionally long time series (say, 100 M points)?
Do you know if the matrix profile implementation from your research group can handle datasets as large as 100 M+ points?
Let me take a look at the paper and see if I can implement the shapelet component quickly (since motif discovery is already available).
I found some time to reproduce the GunPoint example with STUMPY. For completeness, a Jupyter notebook can be found here
Concerning the other applications of MP: I totally agree that these would be a very interesting additions to tsfresh as most of those features are very agnostic to the task (which is perfect for tsfresh)
Awesome! Are you aware of any feature extraction functions within tsfresh that also take a long time (hours-days) for exceptionally long time series (say, 100 M points)?
Do you know if the matrix profile implementation from your research group can handle datasets as large as 100 M+ points?
There's definitely a few slow ones in the ComprehensiveParameters
, but I can't recall them atm to be honest.
I'm also not that familiar with the MP library I'm using since it is written by a colleague, but it seems to scale really well. It has no GPU support though...
Nice notebook btw!
@nils-braun Based on #785, it sounds like you are moving in a different direction. Please feel free to close this issue
Thanks for the message @seanlaw - the other issue should not be a stop sign for this one. I would be more than happy to see multiple features based on matrix profiles in tsfresh. If you would like, we can also add STUMPY as a "matrix profiles feature provider" or maybe even let the user choose (or maybe - if there are certain known differences, it might even make sense to have multiple implementations in parallel, see below).
Sorry for never having answered your issue! I think @kempa-liehr and @MaxBenChrist were busy at that time and then it was probably forgotten...
I am, honestly, not an expert in matrix profile computation. I see that both packages (Matrixprofile and stumpy) have put some effort into speed optimization. If the final result of the two libraries is similar (which I guess it is), I think it might make sense to have a configuration option to choose (because I guess there might be some cases where one of the two libraries is faster). What do you think?
the other issue should not be a stop sign for this one. I would be more than happy to see multiple features based on matrix profiles in tsfresh. If you would like, we can also add STUMPY as a "matrix profiles feature provider" or maybe even let the user choose (or maybe - if there are certain known differences, it might even make sense to have multiple implementations in parallel, see below).
Excellent! I'm not too familiar with Matrixprofile but I can tell you that STUMPY works very hard to faithfully reproduce the original published work by Keogh et al and so we try not to make any assumptions or take any liberties that we think would allow the user to "shoot themselves in the foot". I am probably biased but it certainly makes sense to provide users with at least a basic STUMPY implementation.
Sorry for never having answered your issue! I think @kempa-liehr and @MaxBenChrist were busy at that time and then it was probably forgotten...
Please, no apologies needed! As an open source maintainer myself, I completely understand. I am here to help and contribute!
I am, honestly, not an expert in matrix profile computation. I see that both packages (Matrixprofile and stumpy) have put some effort into speed optimization. If the final result of the two libraries is similar (which I guess it is), I think it might make sense to have a configuration option to choose (because I guess there might be some cases where one of the two libraries is faster).
I can only speak to STUMPY but, aside from the basic matrix profile output, here are some obvious benefits just to name a few:
NumPy
, SciPy
, and Numba
and nothing else (i.e., no Cython
, no raw CUDA
). We offer a distributed version of our code that has a (soft) dependency for Dask Distributed
and was tested with over 32 servers (256 CPUs) so this should play nicely with tsfresh
since it also uses Dask Distributed (a future consideration). Also, thanks toNumba
, STUMPY comes with automatic multi-GPU support without any additional dependencies. We easily computed a large matrix profile with a 16-GPU DGX-2 local cluster. So, whether it's multi-CPU/multi-GPU, we offer a highly scalable solution for computing matrix profiles. Of course, we might appear "slower" in some performance tests but that's usually due to the cost of having to JIT-compile some of our functions (which are fast thereafter). The benefits of STUMPY's scalability really shines through when you have time series that are longer than 10K+ data points in length (note that our performance plots need to be updated as STUMPY is even faster now). conda-forge
(and PyPI) so we can easily add it as a dependencyWhat do you think?
@nils-braun @kempa-liehr @MaxBenChrist I'll work on putting together a similar "combiner" PR following your draft and maybe we can work together? I'll start with something simple so as to limit the initial scope and then, over time, we can increase the functionality with additional PRs. How does that sound?
I'll provide my 2 cents here:
All in all, we both provide useful features to the community at large. Do you not agree here @seanlaw ? Many of your implementations are VERY similar to ours.
I don't understand the quibble with the use of Cython or line coverage. I haven't looked through all of it, but the biggest point I would generally test is divergence of carried co-moments, since lookahead methods tend to only be quasi-stable. Such tests aren't really determined by code coverage at all.
On the topic of Cython, Cython generally makes things easier to distribute, due to its origins. It was used by the Sage developers. Numba has conventionally been the more difficult one, which makes this claim sound a bit weird to me. Since it's possible to distribute cythonized source files, it's arguably a trivially removable dependency anyway.
Numba does sometimes produce faster binaries in cases with significant opportunities for inner loop vectorization and instruction level parallelism, again through the innermost loop of a nest. Most of these methods don't exhibit that, and Numba does not admit unroll and jam type optimizations or in place outer loop vectorization (nor does LLVM as it stands on master).
@kavj I am glad that you caught wind of this since you are a fundamental researcher within Eamonn Keogh's group. To make things a little easier to understand, are you stating the following about Numba and cython?
@kavj Have you used fastmath compilation option? What experiences do you exhibit with it? I find that it throws away accuracy for performance gains. When I initially created the Cython code, I tested fastmath as a compilation option and would always get unexpected results.
Both packages have their merits and this was not intended to be a debate. We can agree to disagree. I was merely seeking clarification and wanted to close out an existing issue that I had created in April 2020. I may revisit this in the future as my time and energy is probably better spent elsewhere. Best of luck.
Most helpful comment
Hi @seanlaw,
This is indeed a great remark. Even better, a few days ago, Keogh and others published a paper with all possible features that can easily be extracted through the matrix profile ("The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code").
One of those includes shapelet discovery, which is really easy to implement and fast:
