Dvc.org: tutorials: caught MemoryError when "Running in bulk" in deep/define-ml-pipeline#running-in-bulk

Created on 14 May 2019 · 30 Comments · Source: iterative/dvc.org

Please provide information about your setup
DVC version: 0.40.2 (installed by pip)
OS: Ubuntu 18.04
RAM: 8GB

~~I am following a tutorial in https://dvc.org/doc/tutorial/define-ml-pipeline.~~
UPDATE: This refers to http://localhost:3000/doc/tutorials/deep/define-ml-pipeline#running-in-bulk now.

In the "Running in bulk" section, the following command failed with an error:

$ dvc run -d code/featurization.py -d code/conf.py \
          -d data/Posts-train.tsv -d data/Posts-test.tsv \
          -o data/matrix-train.p -o data/matrix-test.p \
          python code/featurization.py
Running command:
    python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
Traceback (most recent call last):
  File "code/featurization.py", line 48, in <module>
    train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

bug doc-content

~~mexeniz~~


All 30 comments

Hi @mexeniz !

Looks like you are running out of memory :slightly_frowning_face: As opposed to our get-started guide, our tutorial has some beefy requirements on RAM. Have you tried get-started already? https://dvc.org/doc/get-started In essence, it is a simplified tutorial.

efiop on 14 May 2019

👍2

Thank you!
I will check that get-started out.

mexeniz on 15 May 2019

Hi @efiop, I also ran into the same issue. Can you give a precise estimate of what "beefy requirements" means?

kurianbenoy on 15 May 2019

Reopening this. As we discussed privately with @kurianbenoy, we need to find a way to modify the tutorial a bit so that it can run on a smaller machine. Ideas to try: filter the dataset artificially, try fewer features (it's 5000, try 2500 by default), check if there is a way to use some optimized arrays.

shcheklein on 17 May 2019

👍2
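For the "filter the dataset artificially" idea, a minimal sketch of two ways to shrink the input with pandas. The inline TSV and its column names are hypothetical stand-ins for data/Posts-train.tsv, and how the tutorial actually loads the file is not shown in this thread, so treat this as an illustration of the pandas calls only.

```python
import io

import pandas as pd

# Hypothetical three-column TSV standing in for data/Posts-train.tsv.
tsv = (
    "id\tlabel\ttext\n"
    "1\t0\tfirst post\n"
    "2\t1\tsecond post\n"
    "3\t0\tthird post\n"
    "4\t1\tfourth post\n"
)

# Option 1: nrows caps how many data rows are parsed at all.
df_small = pd.read_csv(io.StringIO(tsv), sep="\t", nrows=2)
print(df_small.shape)

# Option 2: load everything, then keep a reproducible random subset.
df_full = pd.read_csv(io.StringIO(tsv), sep="\t")
df_sample = df_full.sample(frac=0.5, random_state=0)
print(df_sample.shape)
```

Option 1 never loads the dropped rows, so it also helps when the file itself is too large to read; option 2 keeps the class balance closer to the original but needs the full file in memory first.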

@shcheklein I modified the script by trying 2500 features:
bag_of_words = CountVectorizer(stop_words='english', max_features=2500)

Yet I couldn't understand the error message that followed:

Arguments error. Usage:
    python featurization.py data-dir-path features-dir-path
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

kurianbenoy on 19 May 2019

@kurianbenoy can you run it with -v to see the full log?

shcheklein on 19 May 2019

@shcheklein this is log messages:
https://gist.github.com/kurianbenoy/33d087f910fd0e28f2cc89d000c19052

kurianbenoy on 20 May 2019

@kurianbenoy it looks like you took the featurization.py from the get-started guide, not from the tutorial? It seems you have a mismatch between your code and the DVC files (the command that repro is trying to execute).

shcheklein on 21 May 2019

User from discord is running into the MemoryError on the same step but now in the get-started guide. Discord context: https://discordapp.com/channels/485586884165107732/563406153334128681/581584115644629012

https://github.com/iterative/dvc.org/issues/380

efiop on 24 May 2019

👍2

With 12 GB of RAM, I am still getting a memory error.
@shcheklein

ryokugyu on 26 May 2019

👀1

@efiop what kind of beefy RAM requirements are these? Someone mentioned changing the max_features parameter from 5000 to 50.

ryokugyu on 27 May 2019

@ryokugyu Unfortunately I don't know specific minimal RAM requirements for running the tutorial :(

efiop on 29 May 2019

> @ryokugyu Unfortunately I don't know specific minimal RAM requirements for running the tutorial :(

With 12 GB of RAM, it is still not executing. I think the problem is not memory but rather some implementation issue.

ryokugyu on 30 May 2019

👍2

> > @ryokugyu Unfortunately I don't know specific minimal RAM requirements for running the tutorial :(
>
> With 12 GB of RAM, it is still not executing. I think the problem is not memory but rather some implementation issue.

As @shcheklein suggested, I tried reducing the number of features in CountVectorizer to 2500, 1000, 100, 50, and 1, and every run gave a memory error.

kurianbenoy on 2 Jun 2019

👍1
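A possible reason why lowering `max_features` makes no difference: the traceback in the original report fails on line 48, at the `.astype('U')` conversion, which presumably runs before `CountVectorizer` sees any data. NumPy's `U` dtype is fixed-width, so every element is padded to the length of the longest string at 4 bytes per character. The sketch below uses made-up data (the `texts` list is hypothetical, not the tutorial's dataset) to show how a single long post inflates the whole array.

```python
import numpy as np

# Hypothetical stand-in for the text column: many short posts
# plus one very long one.
texts = ["short post"] * 999 + ["x" * 10_000]

as_object = np.array(texts, dtype=object)  # array of references to Python str
as_unicode = as_object.astype("U")         # fixed-width unicode array

# Each element is padded to the longest string, 4 bytes per character.
longest = max(len(t) for t in texts)
expected_bytes = len(texts) * longest * 4

print("itemsize of unicode array:", as_unicode.dtype.itemsize)
print("bytes used by unicode array:", as_unicode.nbytes)
print("bytes predicted by rows * longest * 4:", expected_bytes)
```

By the same arithmetic, 66999 rows with a hypothetical longest post of 30,000 characters would need about 8 GB for this one array. Whether that is the actual bottleneck in the tutorial has not been verified here; keeping the column as an object array or passing the strings to the vectorizer directly is one option to try.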

I tried chunking into smaller dataframes and then appending it to make a training dataframe. Just try the gist code and please help in resolving the errors. #380

dnabanita7 on 4 Jun 2019

@shcheklein @Naba7 I think there is something wrong with stage.py / run.py.
I tried executing the code on Anaconda 3. It executed successfully.

/doc/tutorial one.

But when running it with DVC commands, it gives an error.
That rules out an error in the code. We have to go through the DVC command files.

ryokugyu on 4 Jun 2019

Yes, you may be correct, because when I tried it with Colab and notebooks it worked fine. But no, I do not think so, because the error shown is that the array size can't be handled properly. Maybe the error is in stage.py or in featurization.py. In my terminal it shows that the error is in featurization.py.

dnabanita7 on 5 Jun 2019

> Because in my terminal it shows error is in featurization.py

It will show that the error is in featurization.py. But check this gist:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dvc/command/run.py", line 53, in run
    outs_persist_no_cache=self.args.outs_persist_no_cache,
  File "/usr/local/lib/python2.7/dist-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/dvc/repo/run.py", line 69, in run
    stage.run(no_commit=no_commit)
  File "/usr/local/lib/python2.7/dist-packages/dvc/stage.py", line 831, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/dvc/stage.py", line 786, in _run
    raise StageCmdFailedError(self)
StageCmdFailedError: stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

Imo, the program is executing fine.

ryokugyu on 5 Jun 2019

Let me explain what I have understood here. The https://github.com/iterative/dvc/blob/master/dvc/stage.py file runs the stage, or in simple words, executes the command. In this case, the command fails because there is an error in featurization.py: it is not able to hold such a large ndarray in memory. So the error is in featurization.py, which needs to be changed so that it can hold such large data in memory and perform operations on it. @shcheklein @efiop @ryokugyu It would be better to switch from pandas to dask, as dask supports parallel computing. I have tried solving this, but without success.

dnabanita7 on 6 Jun 2019
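The stage logic described above can be sketched as follows. This is not DVC's code: `run_command` and `CommandFailedError` are hypothetical names, and the only assumption is that the runner starts the command as a child process and raises when it exits with a non-zero status. It shows why a `StageCmdFailedError` traceback from stage.py is a symptom of the script failing rather than a separate bug.

```python
import subprocess
import sys


class CommandFailedError(Exception):
    """Hypothetical stand-in for an error like StageCmdFailedError."""


def run_command(argv):
    # The child's own traceback goes to its stderr; the runner only
    # sees the exit status.
    completed = subprocess.run(argv)
    if completed.returncode != 0:
        raise CommandFailedError(
            "cmd {} failed with exit code {}".format(
                " ".join(argv), completed.returncode
            )
        )
    return completed.returncode


# A script that dies with an uncaught MemoryError exits non-zero,
# so the runner raises its own error on top of the script's traceback.
message = None
try:
    run_command([sys.executable, "-c", "raise MemoryError"])
except CommandFailedError as exc:
    message = str(exc)

print(message)
```

Two tracebacks therefore appear in the log: the script's `MemoryError` and the runner's wrapper error. The first one is the one to debug.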

@Naba7 you are correct about the logic behind the stage!

Re the memory error. I still believe this is how CountVectorizer is working. I would try to replace it (I'm not sure if it makes sense), read about some possible optimization techniques.

I would also try to run some memory profiler if possible to see precisely where the bottleneck is.

dask is too advanced for the get-started guide or the NLP tutorial. There is room for a separate dask tutorial though.

shcheklein on 6 Jun 2019
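For the memory profiler suggestion, a minimal sketch using the standard library's `tracemalloc`. The helper `peak_memory_mib` and the `build_list` workload are hypothetical; in practice the wrapped function would be one step of featurization.py, and whether a given library's native allocations are visible to `tracemalloc` depends on that library.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def build_list(n):
    # Hypothetical stand-in for a featurization step.
    return [str(i) for i in range(n)]


result, peak = peak_memory_mib(build_list, 100_000)
print("peak traced memory: {:.1f} MiB".format(peak))
```

Wrapping each step separately (reading the TSV, lowercasing, converting to an array, vectorizing) would show which one accounts for the peak.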

Alright!


dnabanita7 on 7 Jun 2019

@shcheklein @Naba7 @efiop This is just a matter of RAM requirements.

I ran it on my system which has the following configuration :

It executed successfully, using almost 98% of RAM.

IMO, the RAM requirement is simply more than 16 GB then.

ryokugyu on 22 Jun 2019

❤2

Yes, it has been discussed on Discord. The problem is that people whose machines don't meet the requirements can't use the tutorial or the get-started guide.

dnabanita7 on 26 Jun 2019

@Naba7 I am working on a new tutorial. It will be up soon. With a smaller dataset and fewer RAM requirements.

ryokugyu on 27 Jun 2019

🎉3 🚀2

Is this issue still relevant?
I remember that in the Katacoda tutorial I had to reduce the size of the input dataset in order to run it successfully (with 1.5G RAM).

dashohoxha on 5 Dec 2019

Reducing the dataset size makes the metrics look bad, as far as I remember. So it's not a solution for this problem.

shcheklein on 5 Dec 2019

I don't think anybody is going to check the real value of the metric in an example or tutorial.

dashohoxha on 5 Dec 2019

@dashohoxha they will. It's really bad to give an ML example that is broken. For the same reason (and it's an arguable decision) we decided to use a real-life example instead of just a bunch of random text files and scripts: real-life examples resonate way better with hands-on practitioners.

shcheklein on 5 Dec 2019

This issue affects me as well.

Paste from my Discord message:

What's wrong?
While running featurization.py I get some kind of buffer overflow. 16 GB of RAM gets consumed in seconds and the execution halts after a couple of seconds of system freeze.

I get the output "The input data frame data/Posts-train.tsv size is (66999, 3)", so the code is valid up to that point. But the next step most likely goes sideways, because an injected print(test) placed after train_words does not show up.

My setup includes 16GB of RAM.
Despite the older statements I don't get a memory error raised. I think dvc may not be verbose about python errors.

depate on 6 Mar 2020

👀3

Tutorial got absorbed with get started. Closing this. For get started we have a separate ticket for this.

shcheklein on 18 Jul 2020

👍1
