Tsfresh: speed up calculation by numba

Created on 27 May 2017 · 17 comments · Source: blue-yonder/tsfresh

Hi,
I benchmarked three basic feature extractors (mean, sum of values, and standard deviation) from tsfresh against their pandas equivalents and found major performance differences.

I changed the code in _extraction.py_ at line _345_:

```python
if isDirect:
    # drop the tsfresh implementations and use the pandas built-ins instead
    del column_name_to_aggregate_function[column_prefix + '__sum_values']
    del column_name_to_aggregate_function[column_prefix + '__standard_deviation']
    del column_name_to_aggregate_function[column_prefix + '__mean']
    extracted_features = pd.DataFrame(index=dataframe[column_id].unique())
    grouped = dataframe.groupby(column_id)[column_value]
    extracted_features[column_prefix + '__mean'] = grouped.mean()
    extracted_features[column_prefix + '__sum_values'] = grouped.sum()
    extracted_features[column_prefix + '__standard_deviation'] = grouped.std()
else:
    extracted_features = dataframe.groupby(column_id)[column_value].aggregate(
        column_name_to_aggregate_function)
```

As you can see, I simply called the pandas built-in functions instead of the tsfresh feature extractor implementations.

The following plot compares runtimes for different time series lengths with 1000 ids:
[Plot: extraction runtime vs. time series length]

As you can see, there is a major performance difference. The reason is probably that the pandas groupby aggregations are implemented in optimized (Cython) code.
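The gap is easy to reproduce. Below is a minimal benchmark sketch (the data is synthetic and the sizes are illustrative, not the exact setup from the plot): the cythonized `gp.mean()` fast path versus the generic `gp.apply(...)` path, which calls a Python function once per group.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.repeat(np.arange(1000), 100),     # 1000 ids, 100 samples each
    "value": rng.standard_normal(100_000),
})
gp = df.groupby("id")["value"]

t0 = time.perf_counter()
fast = gp.mean()                               # cythonized groupby aggregation
t1 = time.perf_counter()
slow = gp.apply(np.mean)                       # generic per-group Python call
t2 = time.perf_counter()

# both paths compute the same values, at very different speeds
assert np.allclose(fast.values, slow.values)
print(f"gp.mean(): {t1 - t0:.4f}s, gp.apply(np.mean): {t2 - t1:.4f}s")
```

On typical machines the aggregation fast path is several times faster, which matches the shape of the plot above.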

I think the feature extractors should be optimized using numba or Cython.

enhancement help wanted

All 17 comments

Thank you for this nice study @liorshk!
We will definitely have a look into this. For this we will have to change some internals in the feature calculators, but this should be doable.

@liorshk: Nice observation.

But, I do not see the urge to optimize this part.

Yes, the functions .mean(), .max(), .min(), .std(), .sum() and .median() are optimized on a groupby object (so gp.f() is faster than gp.apply(f)).

However, if we want to exploit that, we would have to implement some annoying routines, making the code unnecessarily complicated. We would probably have to add a fourth kind of feature calculator.

The return is not worth it from my point of view.

Well, we could at least use numba or something similar to speed up the calculation by itself. Let's study this in greater detail.

It seems that @liorshk is right.

Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (self-checked).
Accordingly, it would be great if all types of calculators returned tuples instead.
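The per-call overhead is easy to measure. A small sketch (the calculator functions here are hypothetical stand-ins, not tsfresh's actual calculators) comparing a Series-returning calculator with a tuple-returning one:

```python
import timeit

import pandas as pd

def calc_series(x):
    # returning a pandas Series per call pays index/metadata construction costs
    return pd.Series({"mean": sum(x) / len(x), "max": max(x)})

def calc_tuple(x):
    # plain (name, value) tuples avoid that per-call construction cost
    return (("mean", sum(x) / len(x)), ("max", max(x)))

x = list(range(100))
t_series = timeit.timeit(lambda: calc_series(x), number=2000)
t_tuple = timeit.timeit(lambda: calc_tuple(x), number=2000)
print(f"Series: {t_series:.3f}s, tuple: {t_tuple:.3f}s")
```

Both return the same values; only the container differs, so any speedup measured here is pure per-call overhead.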

BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime...

> Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (self-checked). Accordingly, it would be great if all types of calculators returned tuples instead.

Can you provide some benchmarks for that? If I understand you correctly, you propose to change the return type of all feature calculators from pandas Series to ndarrays, lists or tuples?

> BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime.

Right now there is no strict dependency on pandas DataFrames. Originally, we were aiming for a framework that would allow us to distribute the apply calls over a cluster. Also, the groupby routine was quite convenient because it saved us from writing a lot of code (as always in business, computation time is cheap, but programming time is not).

I like your idea with the id column as keys. Maybe we should benchmark such an internal representation against the current format.
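For reference, the proposed alternative representation could be built once up front so that later calculators never repeat the `groupby('id')` work. A minimal sketch (the `chunks` name and the exact tuple layout are assumptions, not tsfresh internals):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "time": [0, 1, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# hypothetical internal format: id -> (time array, value array),
# computed with a single groupby pass instead of one per feature
chunks = {
    key: (g["time"].to_numpy(), g["value"].to_numpy())
    for key, g in df.groupby("id")
}
print(chunks[1][1])  # value array for id 1
```

Every feature calculator would then receive plain numpy arrays, which is also the input format numba handles best.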

I have now implemented and tested a version where I use numpy arrays internally and return tuples instead of Series. The performance boost is not as high as I expected (10%), but still.

Also, I like the logic better now, as there is no need to distinguish between apply and aggregate, and using numba should be easier.

I will fix up the branch and make a PR.

> I have now implemented and tested a version where I use numpy arrays internally and return tuples instead of Series. The performance boost is not as high as I expected (10%), but still.
> Also, I like the logic better now, as there is no need to distinguish between apply and aggregate, and using numba should be easier.
> I will fix up the branch and make a PR.

A 10% decrease in runtime? Amazing. I am really curious to see the PR. It probably touches many parts of tsfresh?

All right, parts of it are now in tsfresh (head version). There is still more to do (we could still gain from numba etc.), but this requires some more work. I will leave this issue open for later reference.

@nils-braun then we should probably adapt the issue title.

maybe "speed up calculation by numba"?

Go for it, I am currently on my smartphone

I think a great place to test that is the sample_entropy feature calculator.
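For illustration, the general numba pattern being discussed looks like this. The sketch uses a much simpler calculator than sample_entropy (a sum-of-squares loop in the spirit of tsfresh's `abs_energy`), and falls back to plain Python when numba is not installed, so the dependency stays optional:

```python
import numpy as np

try:
    from numba import njit  # numba is an optional dependency here
except ImportError:
    def njit(*args, **kwargs):
        # no-op fallback decorator so the code still runs without numba
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit
def abs_energy(x):
    # explicit loop: exactly the kind of code numba compiles to machine code
    total = 0.0
    for v in x:
        total += v * v
    return total

x = np.arange(5, dtype=np.float64)
print(abs_energy(x))  # 0 + 1 + 4 + 9 + 16 = 30.0
```

Note that the first call pays a one-off JIT compilation cost, which matters for benchmarking and (as discussed below) for multiprocessing, where each worker compiles independently.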

I would like to take a look at this issue.

Do you want a big PR with all feature calculators modified to use numba wherever possible, or can we work with incremental changes (doing the feature calculators one by one)?

I think some feature calculators can be improved really quickly, but we need to benchmark this.

Thank you very much! That would be great.

Incremental PRs are fine if they make sense: if there is some initialization time involved and it only pays off once multiple calculators are converted, a larger PR might be better.
If this is really the case, we can also keep a "numba" branch around and we can do incremental changes against this one.

Would be really interesting to see, if this pays off!

I tried with sample_entropy and got a 20% improvement without multiprocessing, but once multiprocessing is at work, performance gets much worse. I do not think I will be able to improve performance with numba at this point.

Do you have a branch where we can see what you have done, to check whether there is a bottleneck somewhere? I am interested.

https://github.com/thibaultbl/tsfresh/tree/numba

I created an "optimized_sample_entropy" version to benchmark against the existing sample_entropy.
