Dvc.org: tutorials: caught MemoryError when "Running in bulk" in deep/define-ml-pipeline#running-in-bulk

Created on 14 May 2019 · 30 Comments · Source: iterative/dvc.org

Please provide information about your setup
DVC version: 0.40.2 (installed by pip)
OS: Ubuntu 18.04
RAM: 8GB

I am following the tutorial at https://dvc.org/doc/tutorial/define-ml-pipeline.
UPDATE: This refers to http://localhost:3000/doc/tutorials/deep/define-ml-pipeline#running-in-bulk now.

In the "Running in bulk" section, this command failed with an error:

$ dvc run -d code/featurization.py -d code/conf.py \
            -d data/Posts-train.tsv -d data/Posts-test.tsv \
            -o data/matrix-train.p -o data/matrix-test.p \
            python code/featurization.py
Running command:
    python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
Traceback (most recent call last):
  File "code/featurization.py", line 48, in <module>
    train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
bug doc-content

All 30 comments

Hi @mexeniz !

Looks like you are running out of memory :slightly_frowning_face: As opposed to our get-started guide, our tutorial has some beefy requirements on RAM. Have you tried get-started already? https://dvc.org/doc/get-started In essence, it is a simplified tutorial.

Thank you!
I will check that get-started out.

Hi @efiop, I also ran into the same issue. Can you give a precise estimate of what you mean by beefy requirements?

Reopening this. As we discussed privately with @kurianbenoy, we need to find a way to modify it a bit so that we can run it on a smaller machine. Ideas to try: filter the dataset artificially, try fewer features (it's 5000, try 2500 by default), check if there is a way to use some optimized arrays.
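A minimal sketch of the "filter the dataset artificially" idea, assuming the input is a plain tab-separated file with a header row. The function name `sample_tsv`, its parameters, and the column names in the demo are hypothetical and not part of the tutorial code. It streams rows one at a time, so the full file is never loaded into memory.

```python
import csv
import os
import random
import tempfile


def sample_tsv(src_path, dst_path, fraction, seed=0, has_header=True):
    """Copy a random subset of rows from src_path to dst_path.

    Each data row is kept with probability `fraction`. A fixed seed keeps
    the sample reproducible between runs. Returns the number of rows kept.
    """
    rng = random.Random(seed)
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)
                kept += 1
    return kept


# Demo on a synthetic file; the column names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "posts.tsv")
    dst = os.path.join(tmp, "posts-small.tsv")
    with open(src, "w", newline="", encoding="utf-8") as f:
        demo_writer = csv.writer(f, delimiter="\t")
        demo_writer.writerow(["id", "label", "text"])
        for i in range(1000):
            demo_writer.writerow([i, i % 2, f"post body {i}"])
    kept = sample_tsv(src, dst, fraction=0.1)
    print(f"kept {kept} of 1000 rows")
```

Sampling by probability rather than taking the first N rows avoids bias if the file is ordered, but a smaller sample may still change the resulting metrics.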

@shcheklein I modified the script by trying 2500 features:
bag_of_words = CountVectorizer(stop_words='english', max_features=2500)

Yet I couldn't understand the error message that followed:

Arguments error. Usage:
    python featurization.py data-dir-path features-dir-path
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

@kurianbenoy can you run it with -v to see the full log?

@kurianbenoy it looks like you took the featurization.py from the get started, not from the tutorial? It seems you have a mismatch between your code and DVC files (command repro is trying to execute).

A user from Discord is running into the MemoryError on the same step, but now in the get-started guide. Discord context: https://discordapp.com/channels/485586884165107732/563406153334128681/581584115644629012

https://github.com/iterative/dvc.org/issues/380

With 12 GB of RAM, I am still getting a memory error.
@shcheklein

@efiop what kind of beefy RAM requirements is this? Someone mentioned changing the MaxFeature parameter to 50 from 5000.

@ryokugyu Unfortunately I don't know specific minimal RAM requirements for running the tutorial :(

With 12 GB of RAM, it still does not run. I think the problem is not memory but rather some implementation issue.

As @shcheklein suggested, I tried reducing the number of features in CountVectorizer to 2500, 1000, 100, 50, and 1, and all of them gave a memory error.
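One possible reason the feature count made no difference, offered as a sketch rather than a confirmed diagnosis: the traceback in the report points at the `.astype('U')` line, not at CountVectorizer. NumPy's `U` dtype is fixed-width, so every element is stored at the length of the longest string, at 4 bytes per character. The helper name `fixed_width_unicode_bytes` and the sample strings below are illustrative. The 30,000-character longest post is a made-up figure, since the real maximum length in Posts-train.tsv is not given in this thread.

```python
import numpy as np


def fixed_width_unicode_bytes(texts):
    """Estimate the bytes needed by np.array(texts).astype('U').

    Every element is padded to the longest string, 4 bytes per character.
    NumPy uses a width of at least 1 even when all strings are empty.
    """
    longest = max((len(t) for t in texts), default=0)
    return len(texts) * max(longest, 1) * 4


texts = ["short", "a much longer post body " * 10]
arr = np.array(texts, dtype=object).astype("U")

# The estimate matches what NumPy actually allocates for the tiny sample.
print(arr.dtype, arr.nbytes, fixed_width_unicode_bytes(texts))

# Hypothetical scale-up: 66999 rows (from the log above) and an ASSUMED
# longest post of 30,000 characters.
hypothetical_bytes = 66999 * 30000 * 4
print(f"{hypothetical_bytes / 1e9:.2f} GB for one fixed-width array")
```

If this is the cause, a single very long post drives the allocation for every row, which would explain why lowering `max_features` had no effect.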

I tried chunking into smaller dataframes and then appending it to make a training dataframe. Just try the gist code and please help in resolving the errors. #380

@shcheklein @Naba7 I think there is something wrong with stage.py / run.py.
I tried executing the code on Anaconda 3. It executed successfully.

/doc/tutorial one.

But when running it with DVC commands, it gives an error.
That rules out an error in the code itself, so we have to go through the DVC command files.

Yes, you may be correct, because I tried it with Colab and notebooks and it worked fine. On the other hand, I am not so sure, because the error says that the array size can't be handled properly. Maybe the error is in stage.py, or maybe in featurization.py; in my terminal it shows that the error is in featurization.py.

Re "in my terminal it shows error is in featurization.py": yes, it will show that the error is in featurization.py. But check this gist:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dvc/command/run.py", line 53, in run
    outs_persist_no_cache=self.args.outs_persist_no_cache,
  File "/usr/local/lib/python2.7/dist-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/dvc/repo/run.py", line 69, in run
    stage.run(no_commit=no_commit)
  File "/usr/local/lib/python2.7/dist-packages/dvc/stage.py", line 831, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/dvc/stage.py", line 786, in _run
    raise StageCmdFailedError(self)
StageCmdFailedError: stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

Imo, the program is executing fine.

Let me explain what I have understood here. The https://github.com/iterative/dvc/blob/master/dvc/stage.py file runs the stage or, in simple words, executes the command. In this case, it fails to execute featurization.py because there is an error in that script: it is not able to hold such a large ndarray in memory. So the fix belongs in featurization.py, which needs to handle data this large and perform operations on it within the available memory. @shcheklein @efiop @ryokugyu It might be better to switch from pandas to dask, since dask supports parallel computing. I have tried solving this, but without success.

@Naba7 you are correct about the logic behind the stage!

Re the memory error: I still believe this is down to how CountVectorizer works. I would try to replace it (I'm not sure if it makes sense) or read about some possible optimization techniques.

I would also try to run some memory profiler if possible to see precisely where the bottleneck is.
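A minimal sketch of such a measurement using the standard-library `tracemalloc` module, assuming Python 3 (the traceback above is from Python 2.7, where this module is not available). The helper `peak_memory_mb` and the `allocate` stand-in are hypothetical names. In practice, each featurization step would be wrapped in turn to see which one dominates.

```python
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def allocate(n_bytes):
    """Stand-in for one pipeline step; allocates a single large buffer."""
    return bytearray(n_bytes)


_, peak = peak_memory_mb(allocate, 10 * 1024 * 1024)
print(f"peak: {peak:.1f} MiB")
```

`tracemalloc` only sees allocations made through Python's memory allocators, so the numbers are a lower bound on what the process actually uses.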

dask is too advanced for the get started or NLP tutorial. There is room for a separate dask tutorial though.

Alright!

@shcheklein @Naba7 @efiop This is just a matter of RAM requirements.

I ran it on my system which has the following configuration :
config_system

And it executed successfully, using almost 98% of RAM.

IMO, the RAM requirement is just more than 16 GB then.

Yes, it has been discussed on Discord. The problem is that people whose machines don't meet the requirements can't use the tutorial or get-started.

@Naba7 I am working on a new tutorial. It will be up soon. With a smaller dataset and fewer RAM requirements.

Is this issue still relevant?
I remember that in the Katacoda tutorial I had to reduce the size of the input dataset in order to run it successfully (with 1.5G RAM).

Reducing the dataset size makes the metrics look bad, as far as I remember. So it's not a solution for this problem.

I don't think anybody is going to check the real value of the metric in an example or tutorial.

@dashohoxha they will. It's really bad to give an ML example that is broken. For the same reason (and it's an arguable decision) we decided to use a real-life example rather than just a bunch of random text files and scripts: real-life examples resonate far better with hands-on practitioners.

This issue affects me as well.

Paste from my Discord message:

What's wrong?
While running featurization.py I get some kind of buffer overflow. 16 GB of RAM gets consumed in seconds, and execution halts after a couple of seconds of system freeze.

I get the output "The input data frame data/Posts-train.tsv size is (66999, 3)", so the code works up to that point. But the next step most likely goes sideways, because an injected print(test) after train_words does not show up.

My setup includes 16 GB of RAM.
Contrary to the earlier reports, I don't get a MemoryError raised. I think dvc may not be verbose about Python errors.

The tutorial got absorbed into get started. Closing this. For get started we have a separate ticket for this.
