Dvc.org: got MemoryError while staging train in get-started

Created on 24 May 2019 · 32 Comments · Source: iterative/dvc.org

Similar to the issue in #333.
While going through getting-started, when running the third stage (training):

$ dvc run -f train.dvc \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl

I got an error like this:
Running command:
python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20110, 3)
Traceback (most recent call last):
File "src/featurization.py", line 75, in <module>
train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed
The error occurs because the script needs more RAM than the machine has available.
As a suggestion, reducing the size of the array may be an option for running it locally.

doc-content enhancement

dnabanita7

Most helpful comment

@jorgeorpinel I believe it is the same error. We never fixed it since it would require rebuilding the project, using a completely different one. At this point we just expect users to have a machine with at least 8 GB of RAM.

shcheklein on 20 May 2020

👍2

All 32 comments

@Naba7 thanks for reporting this! As far as I remember, it requires 2-3 GB of memory to run, which is a reasonable requirement for a local machine. What was your configuration? Were you running it in Docker by chance?

shcheklein on 25 May 2019

I have 4 GB of RAM and am still not able to run it. I used git and the terminal on Ubuntu 18.04 LTS.

dnabanita7 on 25 May 2019

What about using stratified sampling in featurization.py?

dnabanita7 on 25 May 2019

@dmpetrov I tried changing the row dimensions of df_train with df_train.iloc[10000:,:]. But then I get a ValueError because the row counts no longer match, which is to be expected.

Running command:
python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20110, 3)
Traceback (most recent call last):
File "src/featurization.py", line 85, in <module>
save_matrix(df_train, train_words_tfidf_matrix, train_output)
File "src/featurization.py", line 61, in save_matrix
result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')
File "/home/chad7/anaconda3/lib/python3.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "/home/chad7/anaconda3/lib/python3.7/site-packages/scipy/sparse/construct.py", line 585, in bmat
raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 10110, expected 20110.
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed
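
For context, here is a minimal sketch of why the slice fails: sparse.hstack needs every block to have the same number of rows, so the id, label, and feature blocks all have to come from the same row range. The toy frame and the random feature matrix below are stand-ins (assumptions), not the real featurization.py data, and stack() is a hypothetical helper that only mirrors the shape logic of save_matrix.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

# Toy stand-in for df_train (assumption: same column names as the script).
df = pd.DataFrame({"id": range(6), "label": [0, 1, 0, 1, 0, 1]})
# Random stand-in for the TF-IDF matrix: 6 rows, 4 feature columns.
features = sparse.random(6, 4, density=0.5, format="csr", random_state=0)


def stack(frame, matrix):
    """Hypothetical helper: put id and label columns next to the features."""
    id_col = sparse.csr_matrix(frame.id.astype(np.int64).to_numpy()).T
    label_col = sparse.csr_matrix(frame.label.astype(np.int64).to_numpy()).T
    return sparse.hstack([id_col, label_col, matrix], format="csr")


# Mismatch: 6 id/label rows against 3 feature rows raises ValueError.
try:
    stack(df, features[3:])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# Consistent: slice the frame and the matrix with the same row range.
subset = df.iloc[3:]
result = stack(subset, features[3:])
print(mismatch_error)
print(result.shape)
```

In the real script this would mean slicing df_train once, right after reading it, so that everything derived from it shares the reduced row count.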

dnabanita7 on 25 May 2019

I think RAM is the only issue here, @shcheklein. I tried the tutorial and it executed successfully.
My machine has 12 GB RAM.
It's just a beefy RAM requirement, I think!

ryokugyu on 25 May 2019

Did you try different test/train split sizes?

ryokugyu on 25 May 2019

@ryokugyu @Naba7 to be precise, this is about get-started, not the tutorial, right? There is a separate ticket for the tutorial: #333

Most likely, the reason both (get-started and tutorial) consume that much memory is the way CountVectorizer works, not the end result. The matrix itself might be large, but not _that_ large.

Check this link for some tricks on how to optimize it: https://medium.com/@AgenceSkoli/how-to-avoid-memory-overloads-using-scikit-learn-f5eb911ae66c . It might be that HashingVectorizer is the solution to this.
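
As a rough illustration of the HashingVectorizer idea, the sketch below swaps it in for CountVectorizer on three made-up documents. The value n_features=2**12 and the sample texts are arbitrary assumptions; this is not the project's actual featurization code.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Made-up documents standing in for the text column.
docs = [
    "How do I read a file in Python?",
    "Reading a CSV file with pandas",
    "Sorting a list in R",
]

# Stateless: no vocabulary is built, tokens are hashed straight to columns.
# norm=None keeps raw counts so TfidfTransformer can weight them afterwards.
hasher = HashingVectorizer(
    n_features=2**12,      # assumed size, tune for the dataset
    stop_words="english",
    alternate_sign=False,  # keep counts non-negative
    norm=None,
)
counts = hasher.transform(docs)
tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts)

# There is no fitted vocabulary to inspect afterwards.
has_vocabulary = hasattr(hasher, "vocabulary_")
print(tfidf.shape, has_vocabulary)
```

The trade-off is that hashed columns cannot be mapped back to the words that produced them, and hashing only helps if the memory is actually spent inside the vectorizer.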

shcheklein on 25 May 2019

Did you try different test/train split sizes?

The test set and train set are different files, so there is no need to do the splitting.

dnabanita7 on 26 May 2019

Yes, it is about getting-started.

dnabanita7 on 26 May 2019

No, HashingVectorizer didn't help. The first line where the error occurs is assigning values to df_train. Hence, I am checking all the other methods mentioned in the article.

dnabanita7 on 26 May 2019

I think a good approach may be to first divide the large train file into chunks using read_csv's iterator and chunksize parameters (see http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking), then apply CountVectorizer to each chunk, and then append the results to build the document matrix.
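
One caveat with this idea is that a CountVectorizer fitted separately on each chunk learns a different vocabulary each time, so the per-chunk matrices would not line up. A possible variant, sketched below on a tiny in-memory TSV (an assumption standing in for train.tsv), fits once on a generator that streams the text and then transforms chunk by chunk, stacking with scipy.sparse.vstack. The helper names read_chunks and iter_texts are hypothetical.

```python
import io

import pandas as pd
import scipy.sparse as sparse
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in for data/prepared/train.tsv (id, label, text).
TSV = "1\t0\tred apple pie\n2\t1\tgreen apple\n3\t0\tblue sky\n4\t1\tred sky\n"


def read_chunks(chunksize=2):
    """Hypothetical helper: reopen the file and yield DataFrame chunks."""
    return pd.read_csv(io.StringIO(TSV), sep="\t", header=None,
                       names=["id", "label", "text"], chunksize=chunksize)


def iter_texts():
    """Stream lower-cased texts one chunk at a time."""
    for chunk in read_chunks():
        yield from chunk.text.str.lower()


# Pass 1: learn a single shared vocabulary from the streamed texts.
vectorizer = CountVectorizer()
vectorizer.fit(iter_texts())

# Pass 2: transform each chunk with that vocabulary and stack the rows.
parts = [vectorizer.transform(chunk.text.str.lower()) for chunk in read_chunks()]
matrix = sparse.vstack(parts, format="csr")
print(matrix.shape)
```

This reads the file twice in exchange for never holding all of the text in memory at once; whether that is enough to fix the MemoryError here is untested.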

dnabanita7 on 26 May 2019

I am thinking of changing the max_features parameter of bag_of_words from 5000 to 50.

dnabanita7 on 26 May 2019

👀1

@Naba7 have you tried this change? Does it solve the issue? I would expect the problem to be not the number of features (5000) but the size of the dataset and the way CountVectorizer works: it needs to iterate through _all_ combinations before it can keep the 5000 most meaningful.

Can you share the HashingVectorizer experiment?

shcheklein on 27 May 2019

HashingVectorizer didn't work because the error was in setting the values
of df_train. I am trying to implement the above-mentioned method of
dividing into chunks and then appending them. No success so far.

dnabanita7 on 28 May 2019

The problem with HashingVectorizer is that we can't trace hashed features back to their labels, so even if the memory error is solved, we won't get proper labels for the features.

dnabanita7 on 28 May 2019

featurization (copy).pdf
So far, I have tried changing read_csv by adding the iterator and chunksize parameters and writing a for loop around CountVectorizer. The script isn't error-free yet; I am working on it. The problem is that the iterator returns a TextFileReader and I don't know how to convert it to a DataFrame.
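
On the TextFileReader question: iterating over it already yields ordinary DataFrames, one per chunk, and pd.concat takes a list of them. A small sketch on an in-memory TSV (made-up rows, assumed to have the same three columns as train.tsv):

```python
import io

import pandas as pd

# Made-up rows in the same id / label / text layout.
TSV = "1\t0\tfirst post\n2\t1\tsecond post\n3\t0\tthird post\n"

reader = pd.read_csv(io.StringIO(TSV), sep="\t", header=None,
                     names=["id", "label", "text"], chunksize=2)

chunks = []
for chunk in reader:        # each chunk is already a DataFrame
    chunks.append(chunk)

# pd.concat expects a list (or other iterable) of frames, not a single frame.
df_train = pd.concat(chunks, ignore_index=True)
print(type(chunks[0]).__name__, len(chunks), df_train.shape)
```

Note that concatenating every chunk rebuilds the full frame in memory, so this helps only if the per-chunk work is what needs to stay small.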

dnabanita7 on 28 May 2019

The error while changing CountVectorizer to HashingVectorizer:

chad7@superuser:~/exp_dvc$ dvc run -f featurize.dvc \
>           -d src/featurization.py -d data/prepared \
>           -o data/features \
>           python src/featurization.py \
>                  data/prepared data/features
Running command:
    python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20110, 3)
Traceback (most recent call last):
  File "src/featurization.py", line 75, in <module>
    train_words = np.array(df_train.text.str.lower().values.astype('U'))
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed

dnabanita7 on 28 May 2019

@Naba7 I don't quite understand why this error is happening. Line 75 is even before CountVectorizer. Can you share more details, please?

shcheklein on 29 May 2019

@shcheklein The thing here is that the error occurs at line 75, not at CountVectorizer. The array size needs to be fixed.
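
One plausible explanation for the failure at line 75 (not confirmed in this thread) is that astype('U') creates a fixed-width NumPy unicode array: every element is padded to the length of the longest string, at 4 bytes per character. The sketch below shows the effect on two made-up strings; the 30,000-character "longest post" in the estimate is purely a hypothetical figure, not a measurement of train.tsv.

```python
import numpy as np

# Made-up texts: one short, one long.
texts = np.array(["short", "x" * 10_000], dtype=object)

# Same conversion as line 75: every element becomes as wide as the longest.
fixed = texts.astype("U")
print(fixed.dtype, fixed.dtype.itemsize, fixed.nbytes)


def fixed_width_bytes(n_rows, longest):
    """Bytes NumPy needs for n_rows strings padded to `longest` characters."""
    return n_rows * longest * 4  # 4 bytes per character


# 20110 rows (from the log above) with a hypothetical 30,000-character post.
estimate = fixed_width_bytes(20110, 30_000)
print(round(estimate / 1e9, 2), "GB")
```

If this is the cause, passing the vectorizer a plain list or an object-dtype array of strings would avoid the padding, since CountVectorizer accepts any iterable of strings.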

dnabanita7 on 29 May 2019

@shcheklein

import os
import sys
import errno
import pandas as pd
import numpy as np
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

try:
    import cPickle as pickle
except ImportError:
    import pickle

np.set_printoptions(suppress=True)

if len(sys.argv) != 3 and len(sys.argv) != 5:
    sys.stderr.write('Arguments error. Usage:\n')
    sys.stderr.write('\tpython featurization.py data-dir-path features-dir-path$
    sys.exit(1)

train_input = os.path.join(sys.argv[1], 'train.tsv')
test_input = os.path.join(sys.argv[1], 'test.tsv')
train_output = os.path.join(sys.argv[2], 'train.pkl')
test_output = os.path.join(sys.argv[2], 'test.pkl')

try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except NameError:
    pass


def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

def save_matrix(df, matrix, output):
    id_matrix = sparse.csr_matrix(df.id.astype(np.int64)).T
    label_matrix = sparse.csr_matrix(df.label.astype(np.int64)).T

    result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')

    msg = 'The output matrix {} size is {} and data type is {}\n'
    sys.stderr.write(msg.format(output, result.shape, result.dtype))

    with open(output, 'wb') as fd:
        pickle.dump(result, fd, pickle.HIGHEST_PROTOCOL)
    pass

mkdir_p(sys.argv[2])

# Generate train feature matrix
for tp in pd.read_csv(train_input,
                      encoding='utf-8',
                      header=None,
                      delimiter='\t',
                      names=['id', 'label', 'text'],
                      iterator=True,
                      chunksize=50)
   train_words = np.array(tp.text.str.lower().values.astype('U'))
    bag_of_words = CountVectorizer(stop_words='english',
                                  max_features=5000)
    bag_of_words.fit(train_words)
    train_words_binary_matrix = bag_of_words.transform(train_words)
    tfidf = TfidfTransformer(smooth_idf=False)
    tfidf.fit(train_words_binary_matrix)
    tp_tfidf_matrix = tfidf.transform(train_words_binary_matrix)
    df_train = pd.concat(tp, ignore_index=True)
    train_words_tfidf_matrix = pd.concat(tp_tfidf_matrix, ignore_index=True)

save_matrix(df_train, train_words_tfidf_matrix, train_output)
# Generate test feature matrix
for tp in pd.read_csv(test_input,
                      encoding='utf-8',
                      header=None,
                      delimiter='\t',
                      names=['id', 'label', 'text'],
                      iterator=True,
                      chunksize=50)

    test_words = np.array(tp.text.str.lower().values.astype('U'))
    test_words_binary_matrix = bag_of_words.transform(test_words)
    tp_tfidf_matrix = tfidf.transform(test_words_binary_matrix)
    df_test = pd.concat(tp, ignore_index=True)
    test_words_tfidf_matrix = pd.concat(tp_tfidf_matrix, ignore_index=True)

save_matrix(df_test, test_words_tfidf_matrix, test_output)

Can you please check this on your system? This is the code for featurization.py.
The error shows:

SyntaxError: invalid syntax
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed

I can't figure out why it is showing a syntax error.

I followed this https://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize

dnabanita7 on 30 May 2019

@Naba7 it looks like you got the indent wrong:

   train_words = np.array(tp.text.str.lower().values.astype('U'))
    bag_of_words = CountVectorizer(stop_words='english',
                                  max_features=5000)

Also, in those cases (to debug) I would recommend either running it with -v or just as a plain command:

python src/featurization.py data/prepared data/features

so that you can see the full stack trace.

Also, are you using an IDE or something? Some of them highlight syntax errors.
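
Besides the indent, the two for headers in the posted script end at chunksize=50) with no trailing colon, which Python also rejects. A quick way to check a snippet without running the whole stage is the built-in compile(); the two strings below are reduced, made-up versions of those problems, and syntax_problem is a hypothetical helper.

```python
# Reduced, made-up snippets reproducing the two kinds of problem.
MISSING_COLON = "for x in range(3)\n    print(x)\n"
BAD_INDENT = "for x in range(3):\n   a = x\n    b = x\n"
CLEAN = "for x in range(3):\n    a = x\n    b = x\n"


def syntax_problem(source):
    """Return (error type name, line number), or None if the code compiles."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:  # IndentationError is a subclass
        return type(exc).__name__, exc.lineno
    return None


for name, src in [("missing colon", MISSING_COLON),
                  ("bad indent", BAD_INDENT),
                  ("clean", CLEAN)]:
    print(name, syntax_problem(src))
```

The same check is available for a whole file from the shell with python -m py_compile src/featurization.py.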

shcheklein on 1 Jun 2019

No, I used the terminal and the Sublime Text editor. Okay, my bad.

dnabanita7 on 1 Jun 2019

https://gist.github.com/Naba7/4a6697b9295848b0fc24ef9059342d01
@shcheklein Can you check this?

dnabanita7 on 4 Jun 2019

Is this issue still relevant?

dashohoxha on 5 Dec 2019

@dashohoxha I think yes, we haven't changed the get started significantly since then. It still requires quite a substantial amount of memory to run. It affects Katacoda, for example, as you know.

shcheklein on 5 Dec 2019

Haven't heard this report in a long time. Closing for now?

p.s. some of these tips could go in the new troubleshooting guide?

jorgeorpinel on 20 Jan 2020

Facing this. Moving to Google Colab to learn for now.

srishti-nema on 20 May 2020

@srishti-nema You can also try this code if your machine spec is low:
https://github.com/ryokugyu/dvc_tutorial

It's just a beefy RAM requirement issue.

ryokugyu on 20 May 2020

👍1

@srishti-nema ~~can you please post the details of the error you're getting so we can try to reproduce?~~

jorgeorpinel on 20 May 2020

shcheklein on 20 May 2020

👍2

I wonder if this is still happening in the rewritten Get Started repo. Maybe, but no reports from users in a while, should we close it for now?

jorgeorpinel on 13 Jul 2020

@jorgeorpinel yes, it has the same problem still.

shcheklein on 13 Jul 2020
