Gensim: LDA multicore stuck after a few passes

Created on 16 Sep 2017 · 12 comments · Source: RaRe-Technologies/gensim

After running properly for 10 passes, the process is stuck.
Machine: 48 cores, 194 GB RAM. Corpus: 6.7 million documents, 57,000 tokens.
num_topics = 64
chunksize = 10000
passes = 20
iterations = 100
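
For reference, a minimal sketch of the setup being run (file names and the workers value are assumptions, not from the report; the parameters are the ones listed above):

from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaMulticore

# Hypothetical paths -- the actual files were not shared in this issue.
dictionary = Dictionary.load('docs.dict')   # ~57,000 tokens
corpus = MmCorpus('docs.mm')                # ~6.7M documents, memory-mapped

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=64,
    chunksize=10000,
    passes=20,
    iterations=100,
    workers=47,  # assumption: one worker per core, leaving one for the master
)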
Logs stopped printing at:
2017-09-15 20:10:40,745 INFO 140583487252288 PROGRESS: pass 11, dispatched chunk #147 = documents up to #1480000/6748579, outstanding queue size 109
2017-09-15 20:10:40,932 INFO 140583487252288 PROGRESS: pass 11, dispatched chunk #148 = documents up to #1490000/6748579, outstanding queue size 110
2017-09-15 20:10:41,173 INFO 140583487252288 PROGRESS: pass 11, dispatched chunk #149 = documents up to #1500000/6748579, outstanding queue size 111
2017-09-15 20:10:41,410 DEBUG 140583487252288 9924/10000 documents converged within 100 iterations
2017-09-15 20:10:41,550 INFO 140583487252288 PROGRESS: pass 11, dispatched chunk #150 = documents up to #1510000/6748579, outstanding queue size 112
2017-09-15 20:10:41,680 DEBUG 140583487252288 processed chunk, queuing the result
2017-09-15 20:10:41,955 DEBUG 140583487252288 result put
2017-09-15 20:10:41,956 DEBUG 140583487252288 getting a new job
2017-09-15 20:10:45,384 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:10:45,404 DEBUG 140554940122880 getting a new job
2017-09-15 20:10:49,148 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:10:49,195 DEBUG 140554940122880 getting a new job
2017-09-15 20:10:49,441 DEBUG 140583487252288 9920/10000 documents converged within 100 iterations
2017-09-15 20:10:49,630 DEBUG 140583487252288 processed chunk, queuing the result
2017-09-15 20:10:49,830 DEBUG 140583487252288 result put
2017-09-15 20:10:49,831 DEBUG 140583487252288 getting a new job
2017-09-15 20:10:51,158 DEBUG 140583487252288 9925/10000 documents converged within 100 iterations
2017-09-15 20:10:51,322 DEBUG 140583487252288 processed chunk, queuing the result
2017-09-15 20:10:51,344 DEBUG 140583487252288 result put
2017-09-15 20:10:51,345 DEBUG 140583487252288 getting a new job
2017-09-15 20:10:52,413 DEBUG 140583487252288 9925/10000 documents converged within 100 iterations
2017-09-15 20:10:52,597 DEBUG 140583487252288 processed chunk, queuing the result
2017-09-15 20:10:52,640 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:10:52,642 DEBUG 140554940122880 getting a new job
2017-09-15 20:10:53,889 DEBUG 140583487252288 result put
2017-09-15 20:10:53,891 DEBUG 140583487252288 getting a new job
2017-09-15 20:10:54,895 DEBUG 140583487252288 9934/10000 documents converged within 100 iterations
2017-09-15 20:10:55,026 DEBUG 140583487252288 processed chunk, queuing the result
2017-09-15 20:10:55,054 DEBUG 140583487252288 result put
2017-09-15 20:10:55,201 DEBUG 140583487252288 getting a new job
2017-09-15 20:10:57,169 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:10:57,177 DEBUG 140554940122880 getting a new job
2017-09-15 20:11:01,548 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:11:02,448 DEBUG 140554940122880 getting a new job
2017-09-15 20:11:13,901 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:11:13,905 DEBUG 140554940122880 getting a new job
2017-09-15 20:11:17,107 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:11:17,123 DEBUG 140554940122880 getting a new job
2017-09-15 20:11:21,198 DEBUG 140554940122880 worker process entering E-step loop
2017-09-15 20:11:21,201 DEBUG 140554940122880 getting a new job
2017-09-15 20:12:19,365 DEBUG 140583487252288 processing chunk #49 of 10000 documents
2017-09-15 20:12:19,372 DEBUG 140583487252288 performing inference on a chunk of 10000 documents
2017-09-15 20:13:02,963 DEBUG 140583487252288 9938/10000 documents converged within 100 iterations
2017-09-15 20:13:03,053 DEBUG 140583487252288 processed chunk, queuing the result
2017-09-15 20:13:03,058 DEBUG 140583487252288 result put
2017-09-15 20:13:03,058 DEBUG 140583487252288 getting a new job

The strace of the master process is full of: poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)

with a little activity in between, like below:

poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250003, {1505552174, 730881000}, ffffffff) = 0
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250005, {1505552174, 730998000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250007, {1505552174, 731179000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250009, {1505552174, 731287000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250017, {1505552174, 731764000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250019, {1505552174, 731798000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250025, {1505552174, 732150000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250027, {1505552174, 732184000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250029, {1505552174, 732269000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x8fffc4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3905250031, {1505552174, 732372000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x8fffc4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x8fffc0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x8fff80, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)

The master process is taking 100% CPU on one core; all the other processes are not using any CPU.
There is still 5 GB of free RAM on the system.
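
When this reproduces, a Python-level traceback of the hung master is usually more telling than strace. One way to get one, a sketch using only the standard library (it must be added to the script before the hang occurs, and faulthandler.register is Unix-only):

import faulthandler
import signal

# Register at startup; later, once the process hangs, running
#   kill -USR1 <master_pid>
# dumps the Python traceback of every thread to stderr without
# stopping the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)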

Labels: bug, difficulty hard, performance

All 12 comments

I noticed something like this recently. I think I ended up killing all my running Python processes via ps aux | grep python | awk '{print $2}' | xargs kill and then retrying. I think that fixed it, but I don't know why it's actually happening. I tried playing with chunksize, thinking it might have something to do with edge cases on exact chunk boundaries, but that didn't help.

I have done the same: changed chunksize and ran it again after killing it. It is going on right now, but I would not call that fixing so much as ignoring. Seems like a classic deadlock, probably some corner case where this happens.
I also tried killing only the worker processes; they respawned, but the master did not move forward. Someone with a better understanding of how the internals work might know where to look. If I am able to get more data (if it reoccurs) I will add it here.


So it happened again.
I was able to get the num_topics=64 run to finish with chunksize 6000, but then tried running with 512 topics, and the process is stuck in the first pass itself.

Seems to be repeatable. I will not kill it for the next hour (in case someone wants some details), then will restart with a different chunksize.

Any leads on what might be causing it would help debug. Thanks.

I don't know what might be causing it, but I found an open bug in the multiprocessing module:
https://bugs.python.org/issue29759

I am trying out the fix recommended there.
** I don't know enough about the internals of Python to properly understand whether it is relevant to our implementation.

Update: it did not work. :(
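
For anyone digging into this: the Python docs warn about one classic way to deadlock with multiprocessing.Queue, namely joining a process before the queue it fed has been drained (queued items sit in an OS pipe written by a background feeder thread). A minimal illustration of that documented pitfall follows; this is not gensim's actual code, and I am not claiming it is the exact failure mode here, only the kind of corner case worth ruling out:

from multiprocessing import Process, Queue

def worker(q):
    # A large item overfills the underlying OS pipe; the queue's feeder
    # thread then blocks until someone reads from the other end.
    q.put('x' * 10**7)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    p.join()        # deadlocks: the worker cannot exit until the queue is drained
    print(q.get())  # q.get() must come BEFORE p.join() to avoid this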

Hi @abhishekbuyt,
can you give me additional info:

  • OS/Python/Gensim/Numpy/Scipy versions
  • Concrete code and dataset for reproducing your error

ping @abhishekbuyt

Hi @menshikh-iv, sorry about forgetting this thread, and thanks for the reminder. The issue did not reoccur, so I managed to forget.

Something similar happened yesterday with num_topics = 512 and a dictionary size of 250k (same machine).
While running it with pdb, I found some useful errors being thrown (they might share the cause of this particular bug).

I found the following Stack Overflow question in my research.

It's an issue with the multiprocessing module of Python. I have managed to fix it and will add the fix here for future reference.

Note:
Centos 7
Python 3.6.2
gensim==2.3.0
numpy==1.13.1
scipy==0.19.1

The data set is too huge to share: 6.7 million documents, a 7.2 GB MmCorpus file.
The dictionary has 57k tokens.

The code simply loads the dictionary, loads the MmCorpus, and runs LdaMulticore with the num_topics and chunksize parameters (the rest was left at defaults); see the sketch below.
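
Spelled out as a self-contained script (hypothetical file names, same as the sketch at the top of this issue; DEBUG logging enabled, as the trace quoted above shows):

import logging

from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaMulticore

# The same logging level that produced the DEBUG trace quoted in this issue.
logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s',
                    level=logging.DEBUG)

dictionary = Dictionary.load('docs.dict')  # hypothetical path
corpus = MmCorpus('docs.mm')               # hypothetical path
lda = LdaMulticore(corpus=corpus, id2word=dictionary,
                   num_topics=64, chunksize=10000)  # rest left at defaults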

** Still working on finalizing and testing the multiprocessing module fix pointed to above. Will update here in a few days with the changes needed.

@abhishekbuyt thanks for the info. It would be very nice if you could share your dataset too (via any file-sharing service); I understand the file is very large, but it is needed for reproducing the bug.

Hello @abhishekbuyt, how about the multiprocessing fix?

ping @abhishekbuyt

I'm facing this error right now while running the LDA Multicore. Not sure why the deadlock occurs.

Specs:
OS - Mac OS High Sierra V 10.13.6
Python 3.5.2
gensim==3.5.0
numpy==1.14.5
scipy==1.1.0

The entire code is present in my Colab notebook running on a local server:
colab notebook

Here is the data set:
data set file link

Also attached is my requirements.txt for the Python 3 virtual environment.
requirements_py3.txt

