Deeplearning4j: Deadlock using ParagraphVector

Created on 15 Jun 2017 · 21 Comments · Source: eclipse/deeplearning4j

Hi, I am trying to create a word embedding (using ParagraphVector) with:

0.8.1-SNAPSHOT
Ubuntu 16.04.2 LTS
Hardware RAM 16 GB, Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
The input file contains 513316 lines and is 873.5 MB in size. The training corpus consists of strings (7-bit ASCII) with lengths roughly in the range [10, 50 000].
DefaultTokenizerFactory with custom MyPreProcessor

The log is:

11:35:10.937 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting vocabulary building...
11:35:10.938 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:56.217 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [385141]; Sequences/sec: 2208.58; Words/sec: 962757.81;
11:58:56.372 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [588558]; Sequences/sec: 72.46; Words/sec: 26455.10;
11:59:41.377 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [667109]; Sequences/sec: 2221.98; Words/sec: 902804.93;
11:59:59.203 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
11:59:59.204 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [764577],  NumWords: [142093275], sequences parsed: [366918], counter: [142093265]
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 764577; Words after: 342429;
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [342429],  NumWords: [140970370], sequences parsed: [366918], counter: [142093265]
12:00:07.468 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [366918], Current vocabulary size: [342429]; Sequences/sec: [245.18];
12:00:07.486 [Thread-11] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [391.88 MB]
12:00:07.806 [Thread-11] INFO  o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
12:00:08.038 [Thread-11] INFO  o.d.m.s.SequenceVectors - Building learning algorithms:
12:00:08.038 [Thread-11] INFO  o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
12:00:08.042 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting learning process...
12:24:22.518 [VectorCalculationsThread 5] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [43254792];  Lines vectorized so far: [100000]; Seq/sec: [68.76]; Words/sec: [29740.19]; learningRate: [0.01732922544852691]
13:05:27.656 [VectorCalculationsThread 10] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [79385098];  Lines vectorized so far: [200000]; Seq/sec: [40.57]; Words/sec: [20253.57]; learningRate: [0.01092213763943002]
13:24:03.599 [VectorCalculationsThread 0] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [111279196];  Lines vectorized so far: [300000]; Seq/sec: [89.61]; Words/sec: [22098.93]; learningRate: [0.005265719332773217]

Then nothing happened: CPU usage is 0.3% and memory is 52% (-Xmx10000m in the Java options).

I already used the same classes and hardware with an input file of 17984 rows, and everything worked well.
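To tell a true lock cycle apart from worker threads that are merely waiting (for example on an empty queue), it can help to look at thread states once CPU usage drops. The following is a minimal sketch using only the JDK's `java.lang.management` API; the class name `StallInspector` and its methods are hypothetical and are not part of DL4J. Note that `findDeadlockedThreads()` only reports cycles of monitors or ownable synchronizers, so a result of 0 does not rule out a hang. Running `jstack <pid>` against the stalled JVM gives similar information from outside the process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StallInspector {

    /** Returns one line per live thread: name, state, and top stack frame. */
    public static String describeThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder report = new StringBuilder();
        // (false, false): skip monitor and synchronizer details, which are slower to collect.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            report.append(info.getThreadName())
                  .append(" [").append(info.getThreadState()).append("]");
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                report.append(" at ").append(stack[0]);
            }
            report.append(System.lineSeparator());
        }
        return report.toString();
    }

    /** Number of threads the JVM reports as part of a lock cycle, or 0 if none. */
    public static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadCount());
        System.out.print(describeThreads());
    }
}
```

In a training run, these methods could be called from a separate watchdog thread when no progress has been logged for a while. Threads named `VectorCalculationsThread` sitting in `WAITING` or `BLOCKED` would show where the run is stuck.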

[MyPreProcessor]

public class MyPreprocessor implements TokenPreProcess {
    public MyPreprocessor() { }

    @Override
    public String preProcess(String token) {

        // Clean
        token = StringCleaning.stripPunct(token).toLowerCase();

        // Accents
        token = StringUtils.stripAccents(token);

        // Bad char (only alphanumeric)
        token = token.replaceAll("[^A-Za-z0-9 ]", "");

        // Check if token contains at least one alphabet
        if (!token.matches(".*[a-zA-Z]+.*")) return "";

        // Small "words"
        if (token.length() <= 1) return  "";

        return token;
    }
}

Bug

silvioOlivastri

All 21 comments

As discussed on Gitter, the issue is the preprocessor returning empty strings as tokens, which makes w2v very sad.

raver119 on 15 Jun 2017

👍1
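One way to avoid handing empty tokens to the trainer is to drop them after cleaning instead of returning `""` or a placeholder. The sketch below is plain Java with no DL4J dependency; `TokenCleaner`, `clean`, and `cleanTokens` are hypothetical names. It uses `java.text.Normalizer` in place of `StringUtils.stripAccents`, and it mirrors the rules of the posted preprocessor (alphanumeric only, at least one letter, longer than one character). How such a filter is wired into a DL4J tokenizer is not shown and would be an assumption.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenCleaner {

    /** Normalizes one token; returns "" when nothing usable remains. */
    public static String clean(String token) {
        // Decompose accented characters, then drop the combining marks.
        String t = Normalizer.normalize(token, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        // Lower-case and keep only ASCII letters and digits.
        t = t.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        // Reject tokens with no letter or with at most one character.
        if (!t.matches(".*[a-z].*") || t.length() <= 1) {
            return "";
        }
        return t;
    }

    /** Cleans every token and keeps only the non-empty results. */
    public static List<String> cleanTokens(List<String> rawTokens) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawTokens) {
            String cleaned = clean(raw);
            if (!cleaned.isEmpty()) {
                kept.add(cleaned);
            }
        }
        return kept;
    }
}
```

Filtering at the list level means a rejected token simply disappears from the sequence, so the vocabulary never sees an empty string or an artificial placeholder word.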

With the new preprocessor the result is:

17:12:21.824 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting vocabulary building...
17:12:21.825 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
17:12:21.850 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
17:12:21.850 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
17:12:42.787 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [311509]; Sequences/sec: 4770.76; Words/sec: 1451590.53;
17:13:00.388 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [400668]; Sequences/sec: 5681.50; Words/sec: 1586597.64;
17:13:21.565 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [616500]; Sequences/sec: 4722.10; Words/sec: 1392651.37;
17:13:41.397 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [400000]; Current vocabulary size: [687898]; Sequences/sec: 5042.36; Words/sec: 1493682.84;
17:13:59.696 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [500000]; Current vocabulary size: [744425]; Sequences/sec: 5464.48; Words/sec: 1500757.60;
17:14:01.942 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
17:14:01.942 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [750331],  NumWords: [148352306], sequences parsed: [513316], counter: [148352306]
17:14:02.375 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 750331; Words after: 337117;
17:14:02.376 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [337117],  NumWords: [147245448], sequences parsed: [513316], counter: [148352306]
17:14:10.179 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [513316], Current vocabulary size: [337117]; Sequences/sec: [4737.40];
17:14:10.197 [Thread-11] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [385.80 MB]
17:14:10.507 [Thread-11] INFO  o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
17:14:10.735 [Thread-11] INFO  o.d.m.s.SequenceVectors - Building learning algorithms:
17:14:10.735 [Thread-11] INFO  o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
17:14:10.739 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting learning process...
17:31:13.511 [VectorCalculationsThread 6] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [30173510];  Lines vectorized so far: [100000]; Seq/sec: [97.78]; Words/sec: [29503.49]; learningRate: [0.019877872048037776]
17:46:30.949 [VectorCalculationsThread 0] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [57961694];  Lines vectorized so far: [200000]; Seq/sec: [109.00]; Words/sec: [29874.87]; learningRate: [0.015160038258024791]
18:03:19.589 [VectorCalculationsThread 4] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [87069557];  Lines vectorized so far: [300000]; Seq/sec: [99.14]; Words/sec: [29527.23]; learningRate: [0.010217998046364056]
18:19:33.234 [VectorCalculationsThread 9] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [116527472];  Lines vectorized so far: [400000]; Seq/sec: [102.71]; Words/sec: [29707.95]; learningRate: [0.005215873294772414]
18:34:29.991 [VectorCalculationsThread 1] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [143866928];  Lines vectorized so far: [500001]; Seq/sec: [111.51]; Words/sec: [29852.92]; learningRate: [5.741776818798506E-4]

But then it deadlocks again (CPU usage 0.3%).

[MyPreProcessor]

public class MyPreprocessor implements TokenPreProcess {

    private static final String placeholder = "__placeholder__";

    public MyPreprocessor() { }
    @Override
    public String preProcess(String token) {

        // Clean
        token = StringCleaning.stripPunct(token).toLowerCase();

        // Accents
        token = StringUtils.stripAccents(token);

        // Bad char (only alphanumeric)
        token = token.replaceAll("[^A-Za-z0-9 ]", "");

        // Check if token contains at least one alphabet
        if (!token.matches(".*[a-zA-Z]+.*")) return placeholder;

        // Small "words"
        if (token.length() <= 1) return placeholder;

        return token;
    }
}

silvioOlivastri on 16 Jun 2017

I don't know why, but there are a few empty tokens. I'm going to correct them.

silvioOlivastri on 16 Jun 2017

Which tokenizer are you using there?

raver119 on 16 Jun 2017

I use DefaultTokenizerFactory.

What happens when the text is a sequence of whitespace characters?
I saw that the token array (from the tokenizer) was empty in that case. Do you handle this situation?

silvioOlivastri on 16 Jun 2017
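Whitespace-only lines can be removed from the corpus before it reaches the tokenizer, so that no sequence produces an empty token array. This is a minimal sketch in plain Java; `CorpusFilter` and `dropBlankLines` are hypothetical names, and the thread does not establish whether blank lines are related to the hang.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CorpusFilter {

    /** Keeps only lines that contain at least one non-whitespace character. */
    public static List<String> dropBlankLines(List<String> lines) {
        return lines.stream()
                .filter(line -> line != null && !line.trim().isEmpty())
                .collect(Collectors.toList());
    }
}
```

For a corpus too large to hold in memory, the same predicate could be applied line by line while reading the file.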

Ok,

I disabled parallelTokenizer and cleaned the training data, and with 5 epochs it works well. I'm going to try with 50 epochs.

So, for now, the problem seems to be the parallelTokenizer=true config.

silvioOlivastri on 21 Jun 2017

That's bad :(

Okay, I'll check that part of the code too.

raver119 on 22 Jun 2017

👍1

Ok perfect,

One more piece of information that may help you: to read the training data from .csv I use

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

With these classes I created MyCSVIterator, wrapped in a SourceIterator class.

silvioOlivastri on 22 Jun 2017

Updates here?

agibsonccc on 1 Jul 2017

👍1

Unfortunately, no.

I just use word2vec with parallelTokenizer=false :/

silvioOlivastri on 3 Jul 2017

Hmm, so after you got rid of the empty tokens, the problem is still there?

Can I have your corpus?

raver119 on 3 Jul 2017

@raver119 we will see whether we can provide you with a sample that has this issue.

loretoparisi on 4 Jul 2017

That would be very helpful.

raver119 on 4 Jul 2017

The deadlock happens sporadically (on a random epoch) when I train ParaVec. I noticed that making the batch size smaller makes it less frequent (with big batches it is almost inevitable). Could it be a memory allocation issue?

tezer on 18 Jul 2017

Nah. I suspect it's just an issue with bad tokens caught somewhere. But if you can share a corpus that has this issue, share it with me and I'll find what's wrong.

@loretoparisi @silvioOlivastri any chance for at least part of your corpus?

raver119 on 18 Jul 2017

I'm having a similar issue with both word2vec and ParagraphVectors. Initially I thought it was my custom tokenizer; however, I've had the same thing happen with the StemmingPreprocessor. It looks like the issue may be some sort of deadlock, despite the preprocess method being synchronized. I tried to get things working using a ReentrantLock but still had issues. The problem is random and hard to reproduce, but it occurs across both the word2vec and ParagraphVectors training algorithms, with both custom and standard DL4J token preprocessors. I've also had it happen on two different datasets. I'm planning to see if I can get the issue to reproduce on a different open corpus, as my datasets are not publicly available.

Update: the problem occurs even with the CommonPreprocessor and DefaultTokenizerFactory on a dataset containing just under 3,000 labelled sentences (around 70 labels in total). As the dataset is small, each epoch takes only around 1 second on a 4-core CPU. Setting allowParallelTokenization(false) works around the problem; however, it causes a large performance slowdown.

nzv8fan on 8 Aug 2017

To reproduce with dl4j-examples, open the file Word2VecRawTextExample, change iterations(1) to epochs(100), recompile, and run the example.

dl4j-examples/src/main/java/org/deeplearning4j/examples/nlp/word2vec/Word2VecRawTextExample.java
@@ -45,7 +45,7 @@ public class Word2VecRawTextExample {
         log.info("Building model....");
         Word2Vec vec = new Word2Vec.Builder()
                 .minWordFrequency(5)
-                .iterations(1)
+                .epochs(100)
                 .layerSize(100)
                 .seed(42)
                 .windowSize(5)

Deadlock will look something like:

o.d.e.n.w.Word2VecRawTextExample - Load & Vectorize Sentences....
o.d.e.n.w.Word2VecRawTextExample - Building model....
o.n.l.f.Nd4jBackend - Loaded [CpuBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for NativeOps: 4
o.n.n.Nd4jBlas - Number of threads used for BLAS: 4
o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [7.0GB];
o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
o.d.e.n.w.Word2VecRawTextExample - Fitting Word2Vec model....
o.d.m.s.SequenceVectors - Starting vocabulary building...
o.d.m.w.w.VocabConstructor - Sequences checked: [97162], Current vocabulary size: [242]; Sequences/sec: [20369.39];
o.d.m.e.l.WordVectorSerializer - Projected memory use for model: [0.18 MB]
o.d.m.e.i.InMemoryLookupTable - Initializing syn1...
o.d.m.s.SequenceVectors - Building learning algorithms:
o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
o.d.m.s.SequenceVectors - Starting learning process...
o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [2]; Words vectorized so far: [634299];  Lines vectorized so far: [97161]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [3]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [4]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [5]; Words vectorized so far: [634296];  Lines vectorized so far: [97161]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [6]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [7]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [8]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [9]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [10]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [11]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [12]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [13]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]

nzv8fan on 8 Aug 2017

👍2

O_o

raver119 on 8 Aug 2017

Why would you want to use this with multiple epochs or iterations though?

EDIT: To clarify, won't that typically overfit the data with word2vec-type training? AFAIK, the epoch only ends once things converge, right?

rothn on 3 Apr 2018

1) Closing this since the issue was fixed long ago.
2) Not necessarily; it depends on corpus size etc. And no, an epoch ends once the training corpus ends. 1 epoch = 1 pass over the training corpus.

raver119 on 3 Apr 2018

👍1

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.