Deeplearning4j: Deadlock using ParagraphVector

Created on 15 Jun 2017  ·  21 Comments  ·  Source: eclipse/deeplearning4j

Hi, I'm trying to create a word embedding (using ParagraphVector) with:

  • 0.8.1-SNAPSHOT
  • Ubuntu 16.04.2 LTS
  • Hardware RAM 16 GB, Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  • Input file contains 513316 lines with a size of 873.5 MB. Each training line is a 7-bit ASCII string with a length roughly in the range [10, 50 000].
  • DefaultTokenizerFactory with custom MyPreProcessor

The log is:

11:35:10.937 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting vocabulary building...
11:35:10.938 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:56.217 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [385141]; Sequences/sec: 2208.58; Words/sec: 962757.81;
11:58:56.372 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [588558]; Sequences/sec: 72.46; Words/sec: 26455.10;
11:59:41.377 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [667109]; Sequences/sec: 2221.98; Words/sec: 902804.93;
11:59:59.203 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
11:59:59.204 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [764577],  NumWords: [142093275], sequences parsed: [366918], counter: [142093265]
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 764577; Words after: 342429;
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [342429],  NumWords: [140970370], sequences parsed: [366918], counter: [142093265]
12:00:07.468 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [366918], Current vocabulary size: [342429]; Sequences/sec: [245.18];
12:00:07.486 [Thread-11] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [391.88 MB]
12:00:07.806 [Thread-11] INFO  o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
12:00:08.038 [Thread-11] INFO  o.d.m.s.SequenceVectors - Building learning algorithms:
12:00:08.038 [Thread-11] INFO  o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
12:00:08.042 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting learning process...
12:24:22.518 [VectorCalculationsThread 5] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [43254792];  Lines vectorized so far: [100000]; Seq/sec: [68.76]; Words/sec: [29740.19]; learningRate: [0.01732922544852691]
13:05:27.656 [VectorCalculationsThread 10] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [79385098];  Lines vectorized so far: [200000]; Seq/sec: [40.57]; Words/sec: [20253.57]; learningRate: [0.01092213763943002]
13:24:03.599 [VectorCalculationsThread 0] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [111279196];  Lines vectorized so far: [300000]; Seq/sec: [89.61]; Words/sec: [22098.93]; learningRate: [0.005265719332773217]

Then nothing happens: CPU usage is 0.3% and memory is at 52% (-Xmx10000m in the Java options).
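
When a JVM goes idle like this, a thread dump usually shows where the worker threads are parked. The following is a minimal sketch using only the standard `java.lang.management` API; the class name `HangDiagnostics` is hypothetical and not part of DL4J. Note that `findDeadlockedThreads()` only reports lock cycles, so threads that are merely waiting on an empty queue will show up in the dump as WAITING but not as deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of DL4J: prints what every live thread is doing.
final class HangDiagnostics {

    /** Returns one line per live thread: name, state, and the top stack frame (if any). */
    static String dumpThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // (false, false): skip locked-monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            StackTraceElement[] stack = info.getStackTrace();
            sb.append(info.getThreadName())
              .append(" [").append(info.getThreadState()).append("] ")
              .append(stack.length > 0 ? stack[0].toString() : "(no stack)")
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    /** Returns the number of threads caught in a lock cycle, or 0 if the JVM finds none. */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
        System.out.println("Threads in a lock cycle: " + deadlockedThreadCount());
    }
}
```

Calling `HangDiagnostics.dumpThreads()` from a watchdog thread inside the training process, or running `jstack <pid>` from outside, should show whether the VectorCalculationsThread workers are blocked and on what.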

I already used the same classes and hardware with an input file of 17984 rows, and everything worked well.

[MyPreProcessor]

public class MyPreprocessor implements TokenPreProcess {
    public MyPreprocessor() { }

    @Override
    public String preProcess(String token) {

        // Clean
        token = StringCleaning.stripPunct(token).toLowerCase();

        // Accents
        token = StringUtils.stripAccents(token);

        // Bad char (only alphanumeric)
        token = token.replaceAll("[^A-Za-z0-9 ]", "");

        // Check if token contains at least one alphabet
        if (!token.matches(".*[a-zA-Z]+.*")) return "";

        // Small "words"
        if (token.length() <= 1) return  "";

        return token;
    }
}
Bug


All 21 comments

As discussed on Gitter, the issue is the preprocessor returning empty strings as tokens, which makes w2v very sad.
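
One way to keep empty strings out of the pipeline is to drop them right after cleaning instead of returning them as tokens. The sketch below is plain Java and does not use the DL4J tokenizer API: `EmptyTokenFilter` is a hypothetical class, whitespace splitting via `java.util.StringTokenizer` is an assumption, and the `stripPunct` and `stripAccents` steps from MyPreprocessor above are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper, not part of DL4J.
final class EmptyTokenFilter {

    /** Applies the regex rules from MyPreprocessor; returns "" for tokens that should be dropped. */
    static String clean(String token) {
        // Keep only lowercase ASCII letters, digits, and spaces.
        String t = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", "");
        boolean hasLetter = t.matches(".*[a-z]+.*");
        // Require at least one letter and more than one character.
        return (hasLetter && t.length() > 1) ? t : "";
    }

    /** Splits on whitespace, cleans each token, and keeps only the non-empty results. */
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            String cleaned = clean(st.nextToken());
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "42", "a", and "---" are dropped; the rest are cleaned and kept.
        System.out.println(tokenize("Hello, world! 42 a --- x1"));
    }
}
```

With this shape the training code never sees an empty token, so there is no need for a placeholder value.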

With the new preprocessor the result is:

17:12:21.824 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting vocabulary building...
17:12:21.825 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
17:12:21.850 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
17:12:21.850 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
17:12:42.787 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [311509]; Sequences/sec: 4770.76; Words/sec: 1451590.53;
17:13:00.388 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [400668]; Sequences/sec: 5681.50; Words/sec: 1586597.64;
17:13:21.565 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [616500]; Sequences/sec: 4722.10; Words/sec: 1392651.37;
17:13:41.397 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [400000]; Current vocabulary size: [687898]; Sequences/sec: 5042.36; Words/sec: 1493682.84;
17:13:59.696 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [500000]; Current vocabulary size: [744425]; Sequences/sec: 5464.48; Words/sec: 1500757.60;
17:14:01.942 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
17:14:01.942 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [750331],  NumWords: [148352306], sequences parsed: [513316], counter: [148352306]
17:14:02.375 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 750331; Words after: 337117;
17:14:02.376 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [337117],  NumWords: [147245448], sequences parsed: [513316], counter: [148352306]
17:14:10.179 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [513316], Current vocabulary size: [337117]; Sequences/sec: [4737.40];
17:14:10.197 [Thread-11] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [385.80 MB]
17:14:10.507 [Thread-11] INFO  o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
17:14:10.735 [Thread-11] INFO  o.d.m.s.SequenceVectors - Building learning algorithms:
17:14:10.735 [Thread-11] INFO  o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
17:14:10.739 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting learning process...
17:31:13.511 [VectorCalculationsThread 6] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [30173510];  Lines vectorized so far: [100000]; Seq/sec: [97.78]; Words/sec: [29503.49]; learningRate: [0.019877872048037776]
17:46:30.949 [VectorCalculationsThread 0] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [57961694];  Lines vectorized so far: [200000]; Seq/sec: [109.00]; Words/sec: [29874.87]; learningRate: [0.015160038258024791]
18:03:19.589 [VectorCalculationsThread 4] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [87069557];  Lines vectorized so far: [300000]; Seq/sec: [99.14]; Words/sec: [29527.23]; learningRate: [0.010217998046364056]
18:19:33.234 [VectorCalculationsThread 9] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [116527472];  Lines vectorized so far: [400000]; Seq/sec: [102.71]; Words/sec: [29707.95]; learningRate: [0.005215873294772414]
18:34:29.991 [VectorCalculationsThread 1] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [143866928];  Lines vectorized so far: [500001]; Seq/sec: [111.51]; Words/sec: [29852.92]; learningRate: [5.741776818798506E-4]

But then it deadlocks again (CPU usage 0.3%).

[MyPreProcessor]

public class MyPreprocessor implements TokenPreProcess {

    private static final String placeholder = "__placeholder__";

    public MyPreprocessor() { }
    @Override
    public String preProcess(String token) {

        // Clean
        token = StringCleaning.stripPunct(token).toLowerCase();

        // Accents
        token = StringUtils.stripAccents(token);

        // Bad char (only alphanumeric)
        token = token.replaceAll("[^A-Za-z0-9 ]", "");

        // Check if token contains at least one alphabet
        if (!token.matches(".*[a-zA-Z]+.*")) return placeholder;

        // Small "words"
        if (token.length() <= 1) return placeholder;

        return token;
    }
}

I don't know why, but there are still a few empty tokens. I'm going to fix them.

What tokenizer are you using there?

I use DefaultTokenizerFactory.

What happens when the text is a sequence of whitespace characters only?
I saw that the token array (from the tokenizer) was empty in that case. Do you handle this situation?
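
A whitespace-only line can be detected before it ever reaches training. This is a minimal sketch, assuming the tokenizer splits on whitespace the way `java.util.StringTokenizer` does; `BlankLineGuard` and its methods are hypothetical names, not DL4J API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of DL4J.
final class BlankLineGuard {

    /** Number of whitespace-delimited tokens in the line; 0 for null, empty, or blank input. */
    static int countTokens(String line) {
        return line == null ? 0 : new StringTokenizer(line).countTokens();
    }

    /** Returns a copy of the input with every line that yields no tokens removed. */
    static List<String> dropTokenlessLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (countTokens(line) > 0) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>();
        corpus.add("first sentence");
        corpus.add("   \t  ");
        corpus.add("second sentence");
        // Only the two real sentences survive.
        System.out.println(dropTokenlessLines(corpus));
    }
}
```

Filtering the corpus once up front keeps sequences with zero tokens away from the vocabulary builder and the training threads.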

OK,

I disabled parallelTokenizer and cleaned the training data, and with 5 epochs it works well. I'm going to try with 50 epochs.

So, for now, the problem seems to be the parallelTokenizer=true config.

That's bad :(

Okay, I'll check that part of the code too.

Ok perfect,

One more piece of information that may help: to read the training data from .csv I use

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

With these classes I created MyCSVIterator, wrapped in a SourceIterator class.

Updates here?

Unfortunately, no.

I just use word2vec with parallelTokenizer=false :/

Hmm, so after you got rid of the empty tokens, the problem is still there?

Can I have your corpus?

@raver119 we will look if we can provide you a sample that has this issue.

That would be very helpful.

The deadlock happens sporadically (on a random epoch) when I train ParaVec. I noticed that it becomes less frequent if I make the batch size smaller (with big batches it is almost inevitable). Could it be a memory allocation issue?

Nah. I suspect it's just an issue with bad tokens caught somewhere. But if you can share a corpus that has this issue, share it with me and I'll find what's wrong.

@loretoparisi @silvioOlivastri any chance for at least part of your corpus?

I'm having a similar issue with both word2vec and ParagraphVectors. Initially I thought it was my custom tokenizer; however, I've had the same thing happen with the StemmingPreprocessor. It looks like the issue may be some sort of deadlock, despite the preprocess method being synchronized. I tried to get things working using a ReentrantLock but still had issues. The problem is random and hard to reproduce, but it occurs across both the word2vec and ParagraphVectors training algorithms, and with both custom and standard DL4J token preprocessors. I've also had it happen on two different datasets. I'm planning to see if I can get the issue to reproduce on a different, open corpus, as my datasets are not publicly available.

Update: the problem occurs even with the CommonPreprocessor and DefaultTokenizerFactory on a dataset containing just under 3,000 labelled sentences (around 70 labels in total). As the dataset is small, each epoch takes only around 1 second on a 4-core CPU. Setting allowParallelTokenization(false) works around the problem, but it causes a large performance slowdown.

To reproduce with dl4j-examples, open the file Word2VecRawTextExample, change iterations(1) to epochs(100), recompile, and run the example.

dl4j-examples/src/main/java/org/deeplearning4j/examples/nlp/word2vec/Word2VecRawTextExample.java
@@ -45,7 +45,7 @@ public class Word2VecRawTextExample {
         log.info("Building model....");
         Word2Vec vec = new Word2Vec.Builder()
                 .minWordFrequency(5)
-                .iterations(1)
+                .epochs(100)
                 .layerSize(100)
                 .seed(42)
                 .windowSize(5)

Deadlock will look something like:

o.d.e.n.w.Word2VecRawTextExample - Load & Vectorize Sentences....
o.d.e.n.w.Word2VecRawTextExample - Building model....
o.n.l.f.Nd4jBackend - Loaded [CpuBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for NativeOps: 4
o.n.n.Nd4jBlas - Number of threads used for BLAS: 4
o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [7.0GB];
o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
o.d.e.n.w.Word2VecRawTextExample - Fitting Word2Vec model....
o.d.m.s.SequenceVectors - Starting vocabulary building...
o.d.m.w.w.VocabConstructor - Sequences checked: [97162], Current vocabulary size: [242]; Sequences/sec: [20369.39];
o.d.m.e.l.WordVectorSerializer - Projected memory use for model: [0.18 MB]
o.d.m.e.i.InMemoryLookupTable - Initializing syn1...
o.d.m.s.SequenceVectors - Building learning algorithms:
o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
o.d.m.s.SequenceVectors - Starting learning process...
o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [2]; Words vectorized so far: [634299];  Lines vectorized so far: [97161]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [3]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [4]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [5]; Words vectorized so far: [634296];  Lines vectorized so far: [97161]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [6]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [7]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [8]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [9]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [10]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [11]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [12]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [13]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]

O_o

Why would you want to use this with multiple epochs or iterations though?

EDIT: To clarify, won't that typically overfit the data with word2vec-type training? AFAIK, the epoch only ends once things converge, right?

1) Closing this since the issue was fixed long ago.
2) Not necessarily, it depends on corpus size etc. And no, an epoch ends once the training corpus ends. 1 epoch = 1 pass over the training corpus.
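
The "1 epoch = 1 pass" definition can be sketched as a pair of nested loops. This is an illustration only, not DL4J's training loop: `EpochCounter` is a hypothetical class and the vector update is replaced by a counter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not DL4J code.
final class EpochCounter {

    /** Visits every sequence once per epoch and returns the total number of visits. */
    static long train(List<String> corpus, int epochs) {
        long visited = 0;
        for (int epoch = 1; epoch <= epochs; epoch++) {
            for (String sequence : corpus) {
                // A real trainer would update the word vectors for this sequence here.
                visited++;
            }
            // The epoch ends here because the corpus is exhausted, not because of convergence.
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("first line", "second line", "third line");
        System.out.println(train(corpus, 100));
    }
}
```

This matches the reproduction log above, where each epoch reports roughly the same number of lines vectorized (about 97162) before the next epoch starts.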

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
