Tsfresh: extract_features is failing with: "OverflowError: value too large to convert to int"

Created on 26 Aug 2018 · 46 comments · Source: blue-yonder/tsfresh

I am running extract_features on a very large matrix with ~350 million rows and 6 features (as part of a complex data science pipeline). I am using a machine with 64 cores and 2TB of memory, utilizing all 64 cores. I am getting this error: "OverflowError: value too large to convert to int".
Some comments:
i) When I split the matrix vertically into, say, 3 chunks (each chunk having 2 features only) and run them sequentially, everything works fine. So it does not seem like I am having issues with "problematic" values in the matrix.
ii) It does not seem to be a memory-related issue either (as alluded to in https://github.com/blue-yonder/tsfresh/issues/368), because I was babysitting the mentioned run that failed and checking memory usage regularly (using "free -g"); it never got above 400GB.
iii) I also tried running with LocalDaskDistributor and got the same error.
iv) All 6 features in the matrix are floats.
v) pai_tsfresh below is a fork of tsfresh.

  1. Your operating system
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 16.04.3 LTS
    Release: 16.04
    Codename: xenial

  2. The version of tsfresh that you are using
    latest

  3. A minimal code snippet which reproduces the problem/bug
    Here's the call in my code to extract_features:
    extracted_features_df = extract_features(rolled_design_matrix,
                                             column_id='account_date_index',
                                             column_sort='date',
                                             default_fc_parameters=fc_parameters,
                                             n_jobs=64)
    where fc_parameters is:
    {'abs_energy': None,
     'autocorrelation': [{'lag': 1}],
     'binned_entropy': [{'max_bins': 10}],
     'c3': [{'lag': 1}],
     'cid_ce': [{'normalize': True}],
     'fft_aggregated': [{'aggtype': 'centroid'},
                        {'aggtype': 'variance'},
                        {'aggtype': 'skew'},
                        {'aggtype': 'kurtosis'}],
     'fft_coefficient': [{'attr': 'real', 'coeff': 0}],
     'sample_entropy': None,
     'spkt_welch_density': [{'coeff': 2}],
     'time_reversal_asymmetry_statistic': [{'lag': 1}]}

  4. Any reported errors or traceback
    Here's the traceback:
    Traceback (most recent call last):
      File "/home/yuval/pai/projects/ds-feature-engineering-service/feature_engineering_service/src/fe/stateless/time_series_features_enricher/time_series_features_enricher.py", line 175, in do_enrich
        distributor=local_dask_distributor)
      File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 152, in extract_features
        distributor=distributor)
      File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in _do_extraction
        data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
      File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in <listcomp>
        data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
        splitter = self._get_splitter(data, axis=axis)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
        comp_ids, _, ngroups = self.group_info
      File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.__get__
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
        comp_ids, obs_group_ids = self._get_compressed_labels()
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
        all_labels = [ping.labels for ping in self.groupings]
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in <listcomp>
        all_labels = [ping.labels for ping in self.groupings]
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
        self._make_labels()
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
        self.grouper, sort=self.sort)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
        table = hash_klass(size_hint or len(values))
      File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.__init__
    OverflowError: value too large to convert to int

Label: bug

All 46 comments

i) When I split the matrix vertically into, say, 3 chunks (each chunk having 2 features only) and run them sequentially, everything works fine. So it does not seem like I am having issues with "problematic" values in the matrix.

I don't really understand what you describe here. What does it mean to split the matrix into chunks?

v) pai_tsfresh below is a fork of tsfresh.

can you please use the newest master version?

the stack trace is not of help for me. What I need to debug this is a minimal example: a small dataset that causes the error.

~350 million rows and 6 features

so you have 350 million data points of three different types of time series?

my matrix has 8 columns (6 feature columns, say X1, X2, ..., X6, plus column_id and column_sort). In order to refute a possible issue with the values themselves, I ran extract_features on X1, X2 only, then on X3, X4 only, and finally on X5, X6 only. That is, I ran it 3 times, reducing the number of features from 6 to 2. These runs succeeded, and I did not get the error mentioned above.
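Schematically, the vertical split looks like this (a sketch, not my actual pipeline code; rolled_design_matrix and fc_parameters are the objects from the snippet above):

    import pandas as pd
    from tsfresh import extract_features

    results = []
    for pair in [['X1', 'X2'], ['X3', 'X4'], ['X5', 'X6']]:
        # keep the id and sort columns, take only two feature columns per run
        sub_df = rolled_design_matrix[['account_date_index', 'date'] + pair]
        results.append(extract_features(sub_df,
                                        column_id='account_date_index',
                                        column_sort='date',
                                        default_fc_parameters=fc_parameters,
                                        n_jobs=64))
    # the partial results share the same id index, so they can be joined
    extracted_features_df = pd.concat(results, axis=1)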

I think I was forking from the newest master version. I'll double-check that.

About sending a small dataset: the problem is that the issue I am having manifests only when I am using a HUGE matrix (like the one I have, with about 350 million rows).

ok. The thing is: I have got a few bug reports that sound like yours, where tsfresh stalls on big datasets. I would love to fix this, but I could never reproduce the bug. I ran tsfresh on huge datasets myself and never had those problems.

We recently fixed an issue with the index column in a3cb6af5ce41b8a1ffa3b58fb3fe7d766911f227. It would be great if you could check whether this fixes your problem.

How long is tsfresh running until the error occurs?

I'll check whether this fixes my issue.

I don't remember exactly, but I think tsfresh runs for several minutes (without showing the progress bar) until the error occurs.

One other thing to mention (not sure if it's related to the error) is that my column_id is of dtype object and the values are long strings, e.g., 'id04857cd8js8chd8|2018-04-27' (representing a concatenation of some id and some date), and my column_sort is of type datetime64[ns].

Hi @yuval-nardi, could you post some info about the DataFrame: df.info() and a sample of the data from df.head()? You say your column_id is dtype object, and if those are Python objects and not strings, overflowing can be expected. In general it is not recommended to use objects inside a dataframe; base types (floats, ints, datetimes, strings) should be used, as per https://github.com/pandas-dev/pandas/issues/2773

@earthgecko the type of the column_id is object, but the values themselves are strings. If you have a column in a dataframe with string values, then its dtype is automatically object. I'll try to use hash/int instead.
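For instance, the string ids could be mapped to dense ints with pandas.factorize (a sketch; 'account_date_index' and rolled_design_matrix are the names from my original snippet):

    import pandas as pd

    # factorize returns int64 codes plus the array of unique string ids
    codes, uniques = pd.factorize(rolled_design_matrix['account_date_index'])
    rolled_design_matrix['account_date_index'] = codes
    # uniques[code] recovers the original string id later if needed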

@yuval-nardi - I guess I should have specified that was just to ensure that pandas is indeed interpreting that as a string and NOT as a Python object for some reason, maybe an edge case where it was being interpreted as a list due to the pipe separator. However, pandas does interpret it as a string object for me as well.

data = {'time':[1535358128, 1535358129, 1535358130],'id':['id04857cd8js8chd8|2018-04-27', 'id04857cd8js8chd8|2018-04-28', 'id04857cd8js8chd8|2018-04-29'],'value':[1,2,3]}
df = pd.DataFrame(data)
print(df)

                             id        time  value
0  id04857cd8js8chd8|2018-04-27  1535358128      1
1  id04857cd8js8chd8|2018-04-28  1535358129      2
2  id04857cd8js8chd8|2018-04-29  1535358130      3

df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
id       3 non-null object
time     3 non-null int64
value    3 non-null int64
dtypes: int64(2), object(1)
memory usage: 339.0 bytes

type(df.loc[0,'id'])

str

@yuval-nardi would you happen to know, or be able to determine, how many unique id strings there are in the data set? If that info is not too much trouble to provide, then while you are at it, could you also determine how many unique timestamps there are? And another perhaps obvious question: the column names X1, X2, ..., X6 are not strange in any way, just plaintext strings?

I will be very interested to see if representing your string objects as hashed ints resolves this issue. As Max has pointed out, there have been a number of similar issues relating to large data sets and memory issues. I believe that most of these can be attributed to the data set itself and how pandas/Python handle it in terms of the internal representation of the objects, counts and allocations. I am not certain whether pandas uses Python internals in the case of string objects, and it is not easy to find a quick answer on that, it seems.

Having experienced, debugged, profiled and fixed a similar MemoryError issue related to Python internals and pandas/numpy analysis of lots of time series, these tsfresh memory issues smell very similar to me. I do not profess to fully understand Python internals, however I can tell you the number of objects, allocations and counts in there can get very, very large, resulting in a mysterious Python MemoryError.

If my hunch is correct and this is related to Python internals, then with large data sets with strings, perhaps random memory errors are going to be encountered; their nature to date of not breaking consistently at any specific point in the code further suggests that this is an error that Python itself runs into, not specifically tsfresh or pandas.

Also, what seems true in most of these cases is that whenever people reduce the size of the df, it works, as it did in your case.

@MaxBenChrist perhaps it would not be hard for you to test this. Have all those data sets had string objects in them? I know it is hard to say, as people do not always give df outputs or data samples even when they are asked for them, so I do not think it is possible to tell. I am not sure if this context has ever been tested in tsfresh: a big data set with lots of long strings as ids.

@MaxBenChrist you run big data sets and test a lot, and you have not been able to reproduce any of these errors to date. But perhaps that is an opportunity: test with a data set (in the size range in question in these issues) that is KNOWN to work, and replace the numeric ids with long strings. How many unique strings would it take to reproduce, if it is reproducible? Maybe that is another variable.

It just has to be tested somehow :) I shall try to do some testing myself, but in these big-data-set realms it is quite difficult; I can make my machine have the oomkiller go to town on the tsfresh python process, but reproducing something happening at 400GB RAM bounds is not really easy.

@MaxBenChrist if a DataFrame is passed to extract_features as the timeseries_container, it gets processed by _normalize_input_to_internal_representation, which is fine. However, in the very first line of _do_extraction

https://github.com/blue-yonder/tsfresh/blob/master/tsfresh/feature_extraction/extraction.py#L217

>>> data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
>>> type(data_in_chunks)
list

The data_in_chunks assignment converts it from a DataFrame to a normal Python list; could this be where the issue lies?

Each item in the list is a tuple, and in this instance the id is a str object.

>>> type(data_in_chunks[0])
tuple

>>> data_in_chunks[0][0]
'5a2c11dc-d132-4d18-83e3-3eb9bc438cf7'

>>> type(data_in_chunks[0][0])
str

Will look over this later.

@yuval can you post the output of pip list? I am interested in your pandas version.

@MaxBenChrist further to this, n_jobs=64 is being used here. Having looked into the next steps in the code, I believe that the new data_in_chunks list, which is going to be very big, is passed via MultiprocessingDistributor to distributor.map_reduce(_do_extraction_on_chunk, data=data_in_chunks).

At 350,000,000 rows / 64 workers that is a chunk_size of 5,468,750 rows. Although the data representation of id/kind is probably smaller than the raw data, this is still a LOT of list data.

I think the issue is that there seem to be too many duplications of the data.

When generating all the chunk lists in the staticmethod partition, at this point tsfresh has:

  • original timeseries_container dataframe object passed to extract_features
  • the df_melt dataframe object (basically a copy of the timeseries_container) created with _normalize_input_to_internal_representation
  • The data_in_chunks list object (another duplication of the data in a different representation)
  • And the partition generator object (I am not certain how big this object is yet, but it seems to increment and decrement objects with each worker)
  • And 64 partition chunk list objects (basically another duplication of the data in a different representation and in 64 pieces)

So at the point that the MultiprocessingDistributor is running on 64 workers, the original data is multiplied by ~4 (to keep it simple), with the parent process holding 3 representations concurrently: as far as I can determine there is no cleaning up of any of the objects after they have been used, so the timeseries_container dataframe object, the df_melt dataframe object and the data_in_chunks list object all exist for the duration of the run.

I am still going through all this step by step in the code and verifying, but this is how it is looking at the moment. I think that del timeseries_container as soon as the df_melt DataFrame has been created via _normalize_input_to_internal_representation would reduce memory by ~30% as a start, as I cannot see timeseries_container being used thereafter.

More testing.

Interesting analysis.

I had a similar problem in another project: we created a lot of big dataframes and passed them around, and got memory errors at some point. What helped was to delete the dataframes with del df and manually call the garbage collector with gc.collect() afterwards.
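A minimal sketch of that pattern:

    import gc

    del df        # drop the last reference to the big DataFrame
    gc.collect()  # run the collector now so the memory is returned sooner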

I am on holiday this week. Hoping to find some time to look at this next week.

@MaxBenChrist I am digging into it and looking at how a numpy.ndarray could possibly be used instead of the list; that would be more memory efficient immediately. It is perhaps a non-trivial exercise, but it is probably a step in the right direction. I am sure I will have more concrete information next week, as at the moment my debugging and profiling do not show a vast difference in resource.getrusage(resource.RUSAGE_SELF).ru_maxrss between all the steps, but I am only using relatively small data sets. Enjoy your holiday :)

@earthgecko @MaxBenChrist Thanks for all the valuable inputs.
I'll try to give some more info about the specifics of my data frame. To make it clearer, I'll show an example. Suppose I have the following data frame:

df = pd.DataFrame({'id': ['idabf8j', 'idabf8j', 'idabf8j', 'id0kd7a', 'id0kd7a'],
                   'date': ['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-03', '2018-05-04'],
                   'X1': [1, 5, 2, 8, 4]})

I use X1 only here, but in general I have many more (in the example I was trying to run I had X1, ..., X6, but again, in general I have many more).
Now, this data frame represents two time series corresponding to two ids (one with 3 timestamps/days and another with two). In general, I have several thousand ids and (on average) 700 days (some ids have just a handful of days and some have a couple of thousand). The days may or may not overlap between ids.

Here's the catch: my end goal is not to run extract_features on this data frame. What I want is to run extract_features on ALL sub-time-series obtained by moving the rightmost date forward in time while keeping the leftmost date fixed at the first day (for every id). In order to do that, I create the following (rolled) data frame:
df_rolled = pd.DataFrame({'id_date': ['idabf8j|2018-05-01 00:00:00', 'idabf8j|2018-05-02 00:00:00', 'idabf8j|2018-05-02 00:00:00', 'idabf8j|2018-05-03 00:00:00', 'idabf8j|2018-05-03 00:00:00', 'idabf8j|2018-05-03 00:00:00', 'id0kd7a|2018-05-03 00:00:00', 'id0kd7a|2018-05-04 00:00:00', 'id0kd7a|2018-05-04 00:00:00'], 'date': ['2018-05-01', '2018-05-01', '2018-05-02', '2018-05-01', '2018-05-02', '2018-05-03', '2018-05-03', '2018-05-03', '2018-05-04'], 'X1': [1,1,5,1,5,2,8,8,4]})
and then I run extract_features on df_rolled with column_id='id_date' and column_sort='date'.
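For completeness, a sketch of how df_rolled can be derived from df (tsfresh also ships a roll_time_series utility for this; the id_date formatting here is illustrative and my pipeline builds the frame differently):

    import pandas as pd

    pieces = []
    for id_, group in df.groupby('id'):
        group = group.sort_values('date')
        for end_date in group['date']:
            # every window starts at the first day and ends at end_date
            window = group[group['date'] <= end_date].copy()
            window['id_date'] = '%s|%s' % (id_, end_date)
            pieces.append(window)
    df_rolled = pd.concat(pieces, ignore_index=True)[['id_date', 'date', 'X1']]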

The full df has many unique column_id values (around 1 million!). The number of unique column_sort values is much smaller, probably several hundred.
The names of the columns are of type str, representing something meaningful in the context of the problem I am solving, not X1, X2, etc.

I have tried the following:
1) Running with all 6 columns (X1, ..., X6) --> Failed
2a) Running with X1, X2 --> Passed
2b) Running with X3, X4 --> Passed
2c) Running with X5, X6 --> Passed
(same number of rows for each of runs 2a, 2b, 2c),
so it's not about the values in the data frame.
3) Cutting the df horizontally and running every horizontal chunk separately --> Failed
(each horizontal chunk corresponds to, say, 100 ids with all of their days; that is, instead of running all several thousand ids at the same time, run it sequentially on subsets of ids). I also tried 50 ids at a time, but it did not work; maybe 50 was still too much in terms of memory.
4) Replacing the id_date with an int --> Failed (with the same error message),
so it's not about the long str of the form idabf8j|2018-05-01 00:00:00.
5) Running with LocalDaskDistributor instead of MultiprocessingDistributor --> Failed

Hope this can help ..

Hi @yuval-nardi, thank you for the additional information; hopefully it will be useful for pinpointing the problem, especially the example of the dataframe and data. The only additional question I have is this: in your data example all the values are ints, is this true for all values? In terms of type, when tsfresh makes the data_in_chunks Python list there is a difference between ints and floats in the Python internal representation of the list objects. As per the Python documentation:
https://docs.python.org/2/tutorial/floatingpoint.html#floating-point-arithmetic-issues-and-limitations

In native Python lists the stored value of a float is an approximation of the original decimal fraction; the true decimal value of the binary approximation stored for 0.1 is 0.1000000000000000055511151231257827021181583404541015625.

Reviewing your original traceback supports the theory that this occurs in the data_in_chunks list creation. I do not believe this has anything to do with the validity of the data, values or strings in the dataframe itself, other than the interpretation, representation and effect of the data on the resulting list, as your testing has shown.

How to reproduce and prove this is quite difficult. However, perhaps you can use a simple debug version to validate exactly where/when this occurs in the list construction. I have created a gist https://gist.github.com/earthgecko/9e6f2f5c0d48d53ff34284a860a50cde that just prints some debug output and also changes the data_in_chunks list construction from the current:

    data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]

To building the list via iterating through the GroupBy object instead:

    for names, group in df.groupby([column_id, column_kind])[column_value]:
        print_out = False
        count += 1
        id_name = names[0]
        var_name = names[1]
        data_in_chunks.append((id_name, str(var_name), group))
...
...

With some debug output printed. This is not a fix, but it may be useful for determining at what point the list breaks. It would be great if you could run the gist version and report back on the output.

You can copy your current version to a backup; in the example below I use wget to pull down the gist version, using the paths from your original traceback. Please note that the gist version prints once per million items added to the data_in_chunks list, via the print_out_per = 1000000 variable; this may need to be adjusted. I am guessing that at some point the list creation hits a MemoryError, and the last printout should give an indication of where that is.

cp /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py.original.bak
wget -O /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py https://gist.githubusercontent.com/earthgecko/9e6f2f5c0d48d53ff34284a860a50cde/raw/064e8bb4c236ce090e4c51fa03811375fe99a7d8/extraction.py

# Just so you are happy that these changes are acceptable to you
diff /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py.original.bak /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py

# Run the extract_features
# Restore the original version

cat /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py.original.bak > /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py
rm -f /home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py.original.bak

This may be pretty primitive and simple debugging, but at least it may tell us whether the root cause of the problem is indeed in the data_in_chunks construction itself.

The resulting debug output will look like this (printed once per million items):

debug :: data_in_chunks :: length 3, size 104, number of chars 247
debug :: data_in_chunks :: length 6, size 136, number of chars 515
debug :: data_in_chunks :: length 9, size 200, number of chars 793

@earthgecko some of the columns are ints and some are floats.
I'll try later to use the debug tool and send the results. Thanks.

I am guessing that at some point the list creation hits a MemoryError, and the last printout should give an indication of where that is.

wow, nice gist @earthgecko. Excited to see the results

@earthgecko @MaxBenChrist Currently running .. will send the results when done :).

I probably am not going to be around much today. I have been thinking, and I am a little concerned that print_out_per = 1000000 may be too high to give much useful info after the id_date is grouped, but let us see.

Ok. It crashed. Here's the output:

Running _normalize_input_to_internal_representation
_normalize_input_to_internal_representation run OK
Running _do_extraction
Creating data_in_chunks list
[2018-09-05_05:45:34-ERROR-time_series_features_enricher-do_enrich] Enrichment failed. Skipping this step. Error msg:value too large to convert to int

Seems like it did not pass even the first million, right? (The last line is my internal logging.)
If so, this is strange, because in this case the number of unique id_date values is around 1,200,000 (1.2 million) and I have 7 columns (X1, ..., X7), so there should have been 8-9 printouts. Now, when I ran the same number of rows but with X1, X2, X3 only, it succeeded (I did that run a while ago with the old data_in_chunks). That is, it wouldn't surprise me to see it crashing after the 3rd or 4th million, but not on the first!

Anyway, shall I try it with a smaller print_out_per? How much smaller?

Here's the full stack trace:

Traceback (most recent call last):
  File "/shared_directory/yuval/projects/ds-feature-engineering-service/feature_engineering_service/src/fe/stateless/time_series_features_enricher/time_series_features_enricher.py", line 386, in do_enrich
    n_jobs=num_of_cores_to_use)  # ,
  File "/shared_directory/yuval/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 138, in extract_features
    distributor=distributor)
  File "/shared_directory/yuval/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 200, in _do_extraction
    for names, group in df.groupby([column_id, column_kind])[column_value]:
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2226, in get_iterator
    splitter = self._get_splitter(data, axis=axis)
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2232, in _get_splitter
    comp_ids, _, ngroups = self.group_info
  File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2335, in group_info
    comp_ids, obs_group_ids = self._get_compressed_labels()
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2351, in _get_compressed_labels
    all_labels = [ping.labels for ping in self.groupings]
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2351, in <listcomp>
    all_labels = [ping.labels for ping in self.groupings]
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3070, in labels
    self._make_labels()
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3103, in _make_labels
    self.grouper, sort=self.sort)
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/shared_directory/yuval/env/lib/python3.6/site-packages/pandas/core/algorithms.py", line 473, in _factorize_array
    table = hash_klass(size_hint or len(values))
  File "pandas/_libs/hashtable_class_helper.pxi", line 1229, in pandas._libs.hashtable.StringHashTable.__init__
OverflowError: value too large to convert to int

Hi @yuval-nardi, thanks for the feedback. I was a little worried that might be the case once the data was grouped. So how small? Why not try print_out_per = 5000; that should at least determine whether it is actually going to work. Just make sure you have a large shell scrollback, or pipe the output to a log file.

I have updated the gist to the value of 5000 and also added an explicit groupby step, then iterating over the GroupBy object, rather than assuming that the groupby is working. You can fetch and diff the gist as per yesterday.

    # for names, group in df.groupby([column_id, column_kind])[column_value]:
    print('Grouping dataframe')
    grouped = df.groupby([column_id, column_kind])[column_value]
    print('Dataframe grouped OK')
    for names, group in grouped:

https://gist.github.com/earthgecko/9e6f2f5c0d48d53ff34284a860a50cde

Thanks @earthgecko. My two machines are currently busy. I'll run with the updated gist once finished.

@earthgecko I just looked over the gist; if you want to print every print_out_per steps, a modulo does it:

    print('Creating data_in_chunks list')
    # data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
    import sys
    data_in_chunks = []
    print_out_per = 5000
    count = 0

    # for names, group in df.groupby([column_id, column_kind])[column_value]:
    print('Grouping dataframe')
    grouped = df.groupby([column_id, column_kind])[column_value]
    print('Dataframe grouped OK')
    for names, group in grouped:

        id_name = names[0]
        var_name = names[1]
        data_in_chunks.append((id_name, str(var_name), group))

        if count % print_out_per == 0:
            list_length = len(data_in_chunks)
            list_size = sys.getsizeof(data_in_chunks)
            # note: len(str(...)) walks the whole list, which is expensive on big data
            list_chars = len(str(data_in_chunks))
            print('debug :: data_in_chunks :: length %s, size %s, number of chars %s' % (
                str(list_length), str(list_size), str(list_chars)))
        count += 1


@MaxBenChrist yes, a modulo would be less verbose here, agreed. I just never use them in any language for some reason. I have modified the gist as such and tested it again, and I can confirm that the modified gist gives the same result.

Tested for the benefit of yuval, just to be certain, seeing as there was a change and I do not want to waste any of yuval's time by introducing any indent errors or anything; luckily I never added any, so the gist passed the first test ;)

Let us just hope all this at least leads us somewhere closer to an answer.

@earthgecko I started running now with the updated gist. Will report back when it's done.

Hi @earthgecko, I ran it with print_out_per = 5000. Here's (part of) the log (again, the last line is my own try-catch):

Running _normalize_input_to_internal_representation
_normalize_input_to_internal_representation run OK
Running _do_extraction
Creating data_in_chunks list
Grouping dataframe
Dataframe grouped OK
[2018-09-07_10:51:54-ERROR-time_series_features_enricher-do_enrich] Enrichment failed. Skipping this step. Error msg:value too large to convert to int

Surprisingly (or not :)), it crashed even before getting to the first 5000 elements. I'll try with, say, 100 instead.

Hi @yuval-nardi, once again thank you for the feedback. I suspect this is probably because the dataframe is too big to iterate as a list, so it is not even getting past:

    for names, group in grouped:

You could set print_out_per = 1 to verify that, but if it does print something out you will probably want to Ctrl+C immediately.

However, trying to think around and solve that issue (if it is the issue), I have created another gist, which I think will overcome the problem by using the get_group function and iterating over an enumeration of the keys, rather than trying to iterate the entire grouped object:

    for i, key in enumerate(grouped.groups.keys()):

This should in theory break the use of the object into getting slices of data from it, rather than trying to access the entire thing in one operation.
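In miniature, the new construction looks something like this (a sketch, not the full gist; df, column_id, column_kind and column_value are the variables inside _do_extraction):

    grouped = df.groupby([column_id, column_kind])[column_value]
    data_in_chunks = []
    for i, key in enumerate(grouped.groups.keys()):
        group = grouped.get_group(key)  # pull one (id, kind) slice at a time
        data_in_chunks.append((key[0], str(key[1]), group))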

Please note that I think this is going to work, in terms of at least starting to construct the data_in_chunks list. I have therefore set print_out_per = 5000, which may print quite a bit, so adjust the value as you see fit. I am certain that this testing takes you quite a bit of time to load up and get running, so thank you very much for your feedback, time and input in this regard; it is much appreciated.

This is the new gist URL - https://gist.github.com/earthgecko/ec411ffce3b82757c456faedc429a8cc
it is different from the old gist URL.

You can diff as usual.

@earthgecko The print_out_per = 100 run crashed similarly (at the first 100):

Running _normalize_input_to_internal_representation
_normalize_input_to_internal_representation run OK
Running _do_extraction
Creating data_in_chunks list
Grouping dataframe
Dataframe grouped OK
[2018-09-08_03:23:52-ERROR-time_series_features_enricher-do_enrich] Enrichment failed. Skipping this step. Error msg:value too large to convert to int

confirming somewhat your concern above.

Your idea behind the new gist (i.e., using get_group) sounds really promising. I can't wait to give it a shot; will report back.

I ran it with the updated gist (that uses get_group):

Running _normalize_input_to_internal_representation
_normalize_input_to_internal_representation run OK
Running _do_extraction
Creating data_in_chunks list
Dataframe grouped OK
[2018-09-09_05:00:39-ERROR-time_series_features_enricher-do_enrich] Enrichment failed. Skipping this step. Error msg:value too large to convert to int

This is really weird. It passed the line:
print('Dataframe grouped OK')
and crashed before:
print('Dataframe grouped keys length :: %s' % str(len(grouped.groups.keys())))

but there's nothing in between!

Any ideas?

>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.14.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

I am going to open an issue on the pandas bugtracker. Can you send us the output of:

import pandas as pd
pd.show_versions()

Thx @yuval-nardi, I will open an issue on the pandas bugtracker.

Just for the sake of it, can you try to upgrade pandas to maybe 0.23.4 and try again?

@MaxBenChrist wait .. I want to double check the pd.show_versions() .. I thought I was using pandas 0.23.4 ..

@MaxBenChrist Ok. Seems I was using pandas 0.22.0. I'll try to update and run again.

@MaxBenChrist still failing, after upgrading to pandas 0.23.4.

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@yuval-nardi

That is a bummer. Is there any way you can send me a sample of your time series data? I could stack the DataFrame multiple times and hope to reproduce the error. A small script that reproduces it would be of help if we open a bug report on the pandas tracker.

This is really weird. It passed the line:
print('Dataframe grouped OK')
and crashed before:
print('Dataframe grouped keys length :: %s' % str(len(grouped.groups.keys())))
but there's nothing in between!
Any ideas?

I think it crashed at grouped.groups.keys()

@MaxBenChrist Unfortunately, I cannot share the data; the error manifests only on very large data and everything seems to work on smaller samples. I will try to simulate random data from, say, a Gaussian distribution, of the order of the rolled matrix I have, and see if I can reproduce the error.
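Something along these lines (a sketch; the sizes are illustrative and would be scaled toward the real ~350 million rows and ~1.2 million unique id_date values):

    import numpy as np
    import pandas as pd

    n_ids, rows_per_id = 1200000, 300
    ids = ['id%07d|2018-05-01 00:00:00' % i for i in range(n_ids)]

    df_sim = pd.DataFrame({
        'id_date': np.repeat(ids, rows_per_id),  # long string ids, like mine
        'date': np.tile(pd.date_range('2018-05-01', periods=rows_per_id),
                        n_ids),
        'X1': np.random.randn(n_ids * rows_per_id),  # Gaussian values
    })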

I was trying to write scripts to replicate this, but really I do not think it is possible. I can make a string trigger a MemoryError if I limit the amount of memory the Python process is allowed to use, but that is not necessarily the same thing.

@yuval-nardi because it errored out at grouped.groups.keys(), it does suggest that this may be pandas related, as Max says.

I have updated the last gist to add some additional debug info; however, I think it may break again when it gets to

        for i, key in enumerate(grouped.groups.keys()):

if this is a pandas related issue.

I have added some try/except blocks and some additional debug output to print memory information about the df before it is grouped, covering each specialized block type from pandas.core.internals: the ObjectBlock, FloatBlock and int block classes.

I have also removed the str(len(grouped.groups.keys())) call so that the GroupBy object is not operated on, and replaced it with sys.getsizeof(grouped); this should be OK, I think, as it does not operate on the GroupBy object directly per se.

This updated version of the last gist may give us a little more info to supply in a pandas issue. It should also further validate that it is a pandas issue if it does error at the enumeration of the grouped keys.

https://gist.github.com/earthgecko/ec411ffce3b82757c456faedc429a8cc

I will try to simulate random data from, say, a Gaussian distribution, of the order of the rolled matrix I have and see if I can reproduce the error.

Great!

This updated version of the last gist may give us a little more info to supply in a pandas issue.

great idea with the enumerate

really curious if we can nail this bug down. We have had these kinds of reports multiple times, but I was never able to reproduce them

@yuval-nardi I replaced that pandas groupby with a numpy split method; the branch I am working on is https://github.com/blue-yonder/tsfresh/tree/fix_data_in_chunk

if you want, you can install that branch and see if the same error shows up. I have not yet compared the memory consumption / performance in general, but it should be more stable
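Not the branch code itself, just a minimal sketch of the numpy-split idea: sort once by the group columns, find the rows where the (id, kind) key changes, and split the value array at those boundaries instead of calling groupby:

    import numpy as np
    import pandas as pd

    df_sorted = df.sort_values([column_id, column_kind])
    ids = df_sorted[column_id].values
    kinds = df_sorted[column_kind].values
    values = df_sorted[column_value].values

    # True wherever a new (id, kind) group starts
    change = np.empty(len(values), dtype=bool)
    change[0] = True
    change[1:] = (ids[1:] != ids[:-1]) | (kinds[1:] != kinds[:-1])
    starts = np.flatnonzero(change)

    data_in_chunks = [(ids[s], kinds[s], pd.Series(chunk))
                      for s, chunk in zip(starts, np.split(values, starts[1:]))]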

I benchmarked the new implementation and it is orders of magnitude slower for the artificial data that I created; see https://gist.github.com/MaxBenChrist/c56f4012a49b8f3177d355c0c84a5c38

I will not merge it into master.

ok. Looking forward to seeing what the pandas guys will find.

okay, so it seems we are hitting some internal pandas limitation. There are different ways to handle this:

  • check the number of values and, if there are too many, split the data_in_chunk calculation
  • just throw a warning if there are too many values and ask the user to split the data
  • use dask to calculate the data in chunks

I think I prefer the first solution

I added a warning to the feature extraction for the case where the number of id/kind groups is too high; from my point of view this is a corner case (extremely large dataframes). Maybe I will have time in the near future to provide an alternative implementation of the data_in_chunk method that is able to process those huge dataframes.
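A sketch of such a guard (the threshold here is illustrative, not the value used in tsfresh; nunique is used so the check itself does not trigger the groupby that overflows):

    import warnings

    # upper bound on the number of (id, kind) groups
    n_groups = df[column_id].nunique() * df[column_kind].nunique()
    if n_groups > 5000000:
        warnings.warn('The dataframe has roughly %d id/kind groups; pandas '
                      'may overflow while grouping them. Consider splitting '
                      'the data and extracting features in several passes.'
                      % n_groups)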
