Similar issue to #333.
While going through Get Started, when running the third stage (training):
$ dvc run -f train.dvc \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl
I got an error like this:
Running command:
python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20110, 3)
Traceback (most recent call last):
File "src/featurization.py", line 75, in <module>
train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed
The error occurs because the script needs more RAM than is available.
As a suggestion, maybe reducing the size of the array is an option for running it locally?
@Naba7 thanks for reporting this! As far as I remember, it requires 2-3GB of memory to run, which is a reasonable requirement for a local machine. What was your configuration? Were you running it in Docker by chance?
I have 4GB of RAM and am still not able to run it. I used git and a terminal on Ubuntu 18.04 LTS.
What about using stratified sampling in featurization.py?
@dmpetrov I tried reducing the number of rows of df_train with df_train.iloc[10000:,:]. But then I get a ValueError because the row counts no longer match, which is to be expected.
Running command:
python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20110, 3)
Traceback (most recent call last):
File "src/featurization.py", line 85, in <module>
save_matrix(df_train, train_words_tfidf_matrix, train_output)
File "src/featurization.py", line 61, in save_matrix
result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')
File "/home/chad7/anaconda3/lib/python3.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "/home/chad7/anaconda3/lib/python3.7/site-packages/scipy/sparse/construct.py", line 585, in bmat
raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 10110, expected 20110.
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed
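The row mismatch in the traceback above is what happens when only one of the inputs to hstack is sliced. Below is a minimal sketch of taking the same row subset from both. It assumes a toy DataFrame and a stand-in feature matrix (both hypothetical, not the Get Started data), and the `stack` helper only mirrors the hstack call in save_matrix for illustration:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

# Toy stand-ins for df_train and the TF-IDF matrix (hypothetical data).
df = pd.DataFrame({"id": [10, 11, 12, 13], "label": [0, 1, 0, 1]})
features = sparse.csr_matrix(np.arange(12, dtype=np.float64).reshape(4, 3))


def stack(df_part, matrix_part):
    """Mirror the hstack in save_matrix: id column, label column, features."""
    id_matrix = sparse.csr_matrix(df_part["id"].to_numpy(dtype=np.int64)).T
    label_matrix = sparse.csr_matrix(df_part["label"].to_numpy(dtype=np.int64)).T
    return sparse.hstack([id_matrix, label_matrix, matrix_part], format="csr")


rows = slice(2, None)

# Slice the frame and the matrix with the same row selection.
subset = stack(df.iloc[rows], features[rows])
print(subset.shape)

# Slicing only the matrix reproduces the "incompatible row dimensions" error.
try:
    stack(df, features[rows])
except ValueError as exc:
    print("ValueError:", exc)
```

Note that subsetting this way only shrinks the output; it does not by itself guarantee that the earlier string conversion fits in memory.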
I think RAM is the only issue here @shcheklein . I tried the tutorial and it executed successfully.
My machine has 12 GB RAM.
It's just a beefy RAM requirement, I think!
Did you try different train/test split sizes?
@ryokugyu @Naba7 to be precise, this is about get-started, not the tutorial, right? There is a separate ticket for the tutorial: #333
Most likely, the reason both (get-started and tutorial) consume that amount of memory is the way CountVectorizer works, not the end result: the matrix itself might be large, but not _that_ large.
Check this link for some tricks on how to optimize it: https://medium.com/@AgenceSkoli/how-to-avoid-memory-overloads-using-scikit-learn-f5eb911ae66c . It might be that HashingVectorizer is the solution to this.
The test set and train set are different files, so there is no need to do the splitting.
Yes, it is about getting-started.
No, HashingVectorizer didn't help. The first line where the error occurs is the one that takes the values of df_train. Hence, I am checking all the other methods mentioned in the article.
I think it may be a good idea to first divide the large train file into chunks using read_csv's iterator and chunksize parameters, as in http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking, then apply CountVectorizer to each chunk, and then append the results to build the document matrix.
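A rough sketch of the chunked approach described above, under a few assumptions. It swaps in HashingVectorizer for CountVectorizer, because a CountVectorizer fitted separately on each chunk would learn a different vocabulary per chunk and the column meanings would not line up. The sample TSV, the `featurize_in_chunks` helper name, and the parameter values are hypothetical, and the sparse pieces are joined with scipy's vstack rather than pd.concat:

```python
import io

import pandas as pd
import scipy.sparse as sparse
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical stand-in for data/prepared/train.tsv (id, label, text).
SAMPLE_TSV = (
    "1\t0\tHow to read a file\n"
    "2\t1\tPython memory error\n"
    "3\t0\tSparse matrix question\n"
)


def featurize_in_chunks(source, chunksize=2, n_features=2 ** 10):
    """Vectorize a TSV chunk by chunk and stack the sparse results."""
    # HashingVectorizer is stateless, so every chunk maps to the same columns.
    vectorizer = HashingVectorizer(
        stop_words="english", n_features=n_features, alternate_sign=False
    )
    frames, blocks = [], []
    reader = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["id", "label", "text"],
        chunksize=chunksize,
    )
    for chunk in reader:
        texts = chunk["text"].fillna("").astype(str)
        blocks.append(vectorizer.transform(texts))
        frames.append(chunk[["id", "label"]])
    meta = pd.concat(frames, ignore_index=True)
    matrix = sparse.vstack(blocks, format="csr")
    return meta, matrix


meta, matrix = featurize_in_chunks(io.StringIO(SAMPLE_TSV))
print(meta.shape, matrix.shape)
```

The trade-off of hashing is that column indices can no longer be mapped back to words, and any IDF weighting would still need a pass over the stacked matrix.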
I am thinking of changing the max_features parameter of bag_of_words from 5000 to 50.
@Naba7 have you tried this change? Does it solve the issue? I would expect the problem is not with the number of features (5000) but with the size of the dataset and the way CountVectorizer works: it needs to iterate through _all_ combinations before it can cut down to the 5000 most meaningful.
Can you share the HashingVectorizer experiment?
HashingVectorizer didn't work because the error was in setting the values of df_train. I am trying to implement the above-mentioned method of dividing the data into chunks and then appending them. No success so far.
The problem with HashingVectorizer is that we can't trace the hashed features back to the original terms, so even if the memory error is solved, we won't get proper labels for the features.
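To illustrate the point about tracing features back, here is a small sketch on two made-up documents (hypothetical data, not the Get Started dataset). A fitted CountVectorizer keeps a `vocabulary_` mapping from term to column, whereas HashingVectorizer stores no such mapping:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Two made-up documents, for illustration only.
docs = ["memory error in python", "python sparse matrix"]

# CountVectorizer learns a term -> column mapping that can be inverted.
count_vec = CountVectorizer().fit(docs)
column_to_term = {col: term for term, col in count_vec.vocabulary_.items()}
print(sorted(column_to_term.values()))

# HashingVectorizer maps terms to columns with a hash function and keeps
# no vocabulary, so there is nothing to invert.
hash_vec = HashingVectorizer(n_features=16)
hashed = hash_vec.transform(docs)
has_vocabulary = hasattr(hash_vec, "vocabulary_")
print(hashed.shape, has_vocabulary)
```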
featurization (copy).pdf
So far, I have tried adding the iterator and chunksize parameters to read_csv and writing a for loop around CountVectorizer. The script isn't error-free yet; I am working on it. The problem is that the iterator returns a TextFileReader and I don't know how to convert it to a DataFrame.
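On the TextFileReader question: the object returned by read_csv with chunksize is an iterator that yields ordinary DataFrames, so it can be looped over or passed to pd.concat. A minimal sketch, assuming a tiny in-memory TSV (hypothetical data) in place of train.tsv:

```python
import io

import pandas as pd

# Hypothetical three-row TSV standing in for train.tsv.
SAMPLE_TSV = "1\t0\tfirst text\n2\t1\tsecond text\n3\t0\tthird text\n"

reader = pd.read_csv(
    io.StringIO(SAMPLE_TSV),
    sep="\t",
    header=None,
    names=["id", "label", "text"],
    chunksize=2,
)

# Each item produced by the reader is a regular DataFrame.
chunks = list(reader)
chunk_sizes = [len(chunk) for chunk in chunks]

# pd.concat joins the chunks back into a single DataFrame.
df = pd.concat(chunks, ignore_index=True)
print(chunk_sizes, df.shape)
```

Concatenating every chunk rebuilds the full frame in memory, so this only helps if the per-chunk work (not the frame itself) is what runs out of RAM.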
The error while changing CountVectorizer to HashingVectorizer:
chad7@superuser:~/exp_dvc$ dvc run -f featurize.dvc \
> -d src/featurization.py -d data/prepared \
> -o data/features \
> python src/featurization.py \
> data/prepared data/features
Running command:
python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20110, 3)
Traceback (most recent call last):
File "src/featurization.py", line 75, in <module>
train_words = np.array(df_train.text.str.lower().values.astype('U'))
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed
@Naba7 I don't quite understand why this error is happening. Line 75 is even before CountVectorizer. Can you share more details, please?
@shcheklein The thing here is that the error occurs at line 75, not at CountVectorizer. The array size needs to be fixed.
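One plausible explanation for the failure at line 75 (an assumption, not confirmed in this thread) is how NumPy stores the result of `astype('U')`: every element gets a fixed width of 4 bytes per character of the longest string in the array, so a single very long text inflates the whole array. A small sketch with made-up strings, where `unicode_array_bytes` is a hypothetical helper:

```python
import numpy as np

# Made-up texts: two short ones and a single long outlier.
texts = ["short question", "x" * 1000, "another one"]

as_objects = np.array(texts, dtype=object)
as_unicode = as_objects.astype("U")

# Every element of a 'U' array is as wide as the LONGEST string,
# at 4 bytes per character.
print(as_unicode.dtype, as_unicode.itemsize, as_unicode.nbytes)


def unicode_array_bytes(strings):
    """Estimate the size of the fixed-width 'U' array without building it."""
    longest = max(len(s) for s in strings)
    return 4 * longest * len(strings)


estimate = unicode_array_bytes(texts)
print(estimate)
```

As a purely hypothetical illustration, 20110 rows whose longest text had 30,000 characters would need about 2.4 GB for this array alone. CountVectorizer accepts any iterable of strings and lowercases by default, so passing the texts as a plain list may avoid the fixed-width conversion; this has not been tested against the Get Started script.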
@shcheklein
import os
import sys
import errno
import pandas as pd
import numpy as np
import scipy.sparse as sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
try:
    import cPickle as pickle
except ImportError:
    import pickle
np.set_printoptions(suppress=True)
if len(sys.argv) != 3 and len(sys.argv) != 5:
    sys.stderr.write('Arguments error. Usage:\n')
    sys.stderr.write('\tpython featurization.py data-dir-path features-dir-path$
    sys.exit(1)
train_input = os.path.join(sys.argv[1], 'train.tsv')
test_input = os.path.join(sys.argv[1], 'test.tsv')
train_output = os.path.join(sys.argv[2], 'train.pkl')
test_output = os.path.join(sys.argv[2], 'test.pkl')
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except NameError:
    pass
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise
def save_matrix(df, matrix, output):
    id_matrix = sparse.csr_matrix(df.id.astype(np.int64)).T
    label_matrix = sparse.csr_matrix(df.label.astype(np.int64)).T
    result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')
    msg = 'The output matrix {} size is {} and data type is {}\n'
    sys.stderr.write(msg.format(output, result.shape, result.dtype))
    with open(output, 'wb') as fd:
        pickle.dump(result, fd, pickle.HIGHEST_PROTOCOL)
    pass
mkdir_p(sys.argv[2])
# Generate train feature matrix
for tp in pd.read_csv(train_input,
encoding='utf-8',
header=None,
delimiter='\t',
names=['id', 'label', 'text'],
iterator=True,
chunksize=50)
train_words = np.array(tp.text.str.lower().values.astype('U'))
bag_of_words = CountVectorizer(stop_words='english',
max_features=5000)
bag_of_words.fit(train_words)
train_words_binary_matrix = bag_of_words.transform(train_words)
tfidf = TfidfTransformer(smooth_idf=False)
tfidf.fit(train_words_binary_matrix)
tp_tfidf_matrix = tfidf.transform(train_words_binary_matrix)
df_train = pd.concat(tp, ignore_index=True)
train_words_tfidf_matrix = pd.concat(tp_tfidf_matrix, ignore_index=True)
save_matrix(df_train, train_words_tfidf_matrix, train_output)
# Generate test feature matrix
for tp in pd.read_csv(test_input,
encoding='utf-8',
header=None,
delimiter='\t',
names=['id', 'label', 'text'],
iterator=True,
chunksize=50)
test_words = np.array(tp.text.str.lower().values.astype('U'))
test_words_binary_matrix = bag_of_words.transform(test_words)
tp_tfidf_matrix = tfidf.transform(test_words_binary_matrix)
df_test = pd.concat(tp, ignore_index=True)
test_words_tfidf_matrix = pd.concat(tp_tfidf_matrix, ignore_index=True)
save_matrix(df_test, test_words_tfidf_matrix, test_output)
Can you please check this on your system? This is the code for featurization.py.
The error shows:
SyntaxError: invalid syntax
ERROR: failed to run command - stage 'featurize.dvc' cmd python src/featurization.py data/prepared data/features failed
I don't get why it is showing a syntax error.
I followed this https://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize
@Naba7 it looks like you got the indent wrong:
train_words = np.array(tp.text.str.lower().values.astype('U'))
bag_of_words = CountVectorizer(stop_words='english',
max_features=5000)
Also, in those cases (to debug) I would recommend either running it with -v or just running the plain command:
python src/featurization.py data/prepared data/features
so that you can see the full stack trace.
Also, are you using an IDE or something? Some of them highlight syntax errors.
No, I used the terminal and the Sublime Text editor. Okay, my bad.
https://gist.github.com/Naba7/4a6697b9295848b0fc24ef9059342d01
@shcheklein Can you check this?
Is this issue still relevant?
@dashohoxha I think yes, we haven't changed the get started significantly since then. It still requires quite a substantial amount of memory to run. It affects Katacoda, for example, as you know.
Haven't heard this report in a long time. Closing for now?
p.s. some of these tips could go in the new troubleshooting guide?
Facing this. Moving to Google Colab to learn for now.
@srishti-nema You can also try this code.
https://github.com/ryokugyu/dvc_tutorial
That is, if your machine spec is low. It's just a beefy RAM requirement issue.
@srishti-nema can you please post the details of the error you're getting so we can try to reproduce?
@jorgeorpinel I believe it is the same error. We never fixed it since it would require rebuilding the project, using a different one completely. We just expect users to have a machine with at least 8GB at this point.
I wonder if this is still happening in the rewritten Get Started repo. Maybe, but no reports from users in a while, should we close it for now?
@jorgeorpinel yes, it has the same problem still.