Deeplearning4j: VocabConstructor: Lock after Vocabulary is built

Created on 26 Jul 2017 · 11 Comments · Source: eclipse/deeplearning4j

There is a lock after VocabConstructor has finished tokenizing, which prevents fit() from returning immediately after the vocabulary has been built.

TfidfVectorizer tfidfVectorizer = new TfidfVectorizer.Builder()
        .setIterator(new CollectionSentenceIterator(sentence))
        .setTokenizerFactory(this.tokenizer)
        .setMinWordFrequency(1)
        .allowParallelTokenization(false)
        .build();
tfidfVectorizer.fit();

You can see it in the log here:

12:55:05.809 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...

The full logs:

12:55:05.736 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
12:55:05.738 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
12:55:05.738 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
12:55:05.809 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
12:55:05.810 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [727],  NumWords: [16323], sequences parsed: [41], counter: [16323]
12:55:05.812 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 727; Words after: 727;
12:55:05.812 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [727],  NumWords: [16323], sequences parsed: [41], counter: [16323]
12:55:08.208 [main] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [41], Current vocabulary size: [727]; Sequences/sec: [16.59];


loretoparisi


Most helpful comment

Ah... and I was investigating deadlocks...

Okay, 2 sleeps removed. Will merge later.

raver119 on 15 Sep 2017
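
For readers wondering how a couple of sleeps can look like a lock: a fixed sleep placed after the work has finished adds a constant delay to every call, however small the input is. The sketch below is a generic illustration in plain Java, not DL4J's actual VocabConstructor code; the class name, the method names and the 200 ms figure are all made up. It contrasts a fixed sleep with waiting on a CountDownLatch, which returns as soon as the worker signals that it is done.

```java
import java.util.concurrent.CountDownLatch;

public class SleepVsLatch {
    // Hypothetical pattern: run the work, then sleep a fixed time "to be safe".
    static long withFixedSleep(Runnable work, long sleepMillis) throws InterruptedException {
        long start = System.nanoTime();
        Thread worker = new Thread(work);
        worker.start();
        worker.join();              // the work is already finished here
        Thread.sleep(sleepMillis);  // constant extra delay on every call
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Alternative: block only until the worker signals completion.
    static long withLatch(Runnable work) throws InterruptedException {
        long start = System.nanoTime();
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                work.run();
            } finally {
                done.countDown();   // signal completion even if work throws
            }
        });
        worker.start();
        done.await();               // returns as soon as the work is finished
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable tinyJob = () -> { };
        System.out.println("fixed sleep: " + withFixedSleep(tinyJob, 200) + " ms");
        System.out.println("latch:       " + withLatch(tinyJob) + " ms");
    }
}
```

With a trivial job, the first variant should take at least roughly the sleep duration while the second returns almost immediately, which is the same shape as a small corpus being fit quickly but fit() still taking seconds to return.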


All 11 comments

@raver119 any update on this issue? Thank you!

loretoparisi on 14 Sep 2017

I've missed this issue somehow, sorry. Looks like an easy fix though.

raver119 on 14 Sep 2017


1) Can you confirm you're running 0.9.1?
2) Can I have the corpus (or a subset of it) that reproduces this behavior?

raver119 on 15 Sep 2017

On current master I can't reproduce your issue. With or without exceptions thrown inside VocabConstructor, the vocabulary is built and TfIdf is fit.

raver119 on 15 Sep 2017

https://github.com/deeplearning4j/deeplearning4j/blob/6e05f3289741d316f2e237c53b83db9826728c63/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/test/java/org/deeplearning4j/bagofwords/vectorizer/TfidfVectorizerTest.java

Here are a few more tests I've added; they throw an exception either in the tokenizer or in the tokenizer factory. Everything still works fine, nothing gets stuck.

So yes, I need to see a corpus that reproduces this issue.

raver119 on 15 Sep 2017

Hello @raver119, we have just updated DL4J to the latest version. We had been using the snapshot until today (we hit the problem a few weeks ago), so we are going to test it again. Please wait.

loretoparisi on 15 Sep 2017


Sure.

P.S. If you want, you can send me the corpus in a PM; I'll keep it private.

raver119 on 15 Sep 2017

DL4J version: 0.9.1 (not 0.9.2-SNAPSHOT)

Code:

        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new TokenPreProcess() {
            @Override
            public String preProcess(String s) {
                return s;
            }
        });

        List<String> sentence = new ArrayList<>();

        sentence.add("Lorem ipsum dolor sit amet, consectetur adipiscing elit");
        sentence.add("consectetur adipiscing elit, sed do eiusmod tempor");

        TfidfVectorizer tfidfVectorizer = new TfidfVectorizer.Builder()
                .setIterator(new CollectionSentenceIterator(sentence))
                .setTokenizerFactory(tokenizer)
                .setMinWordFrequency(1)
                .allowParallelTokenization(false)
                .build();
        tfidfVectorizer.fit();

Output:

15:45:37.578 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
15:45:37.592 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
15:45:37.593 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
15:45:37.608 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
15:45:37.609 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [13],  NumWords: [15], sequences parsed: [2], counter: [15]
15:45:37.610 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 13; Words after: 13;
15:45:37.613 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [13],  NumWords: [15], sequences parsed: [2], counter: [15]
15:45:39.676 [main] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [2], Current vocabulary size: [13]; Sequences/sec: [0.96];

Here the process waits about 2 seconds (from 15:45:37.613 to 15:45:39.676). That is the "issue".
Our problem is: if we build and fit a TfidfVectorizer on a new sentence, say, 100 times, we have to wait 100*2 seconds.
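
The pause can be read straight off the timestamps in the output above. A minimal sketch in plain Java (no DL4J needed; the class and method names are made up for illustration) that computes the gap between the last two log lines and extrapolates it to 100 fits, assuming the pause is roughly constant per fit() call:

```java
import java.time.Duration;
import java.time.LocalTime;

public class LogGap {
    // Gap in milliseconds between two HH:mm:ss.SSS timestamps from the same day.
    static long gapMillis(String earlier, String later) {
        return Duration.between(LocalTime.parse(earlier), LocalTime.parse(later)).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps of the "Vocab size after truncation" and "Sequences checked" lines.
        long gap = gapMillis("15:45:37.613", "15:45:39.676");
        System.out.println("Pause per fit(): " + gap + " ms");            // 2063 ms
        // Assumption: the same pause is paid on every fit() call.
        System.out.println("Over 100 fits: " + (gap * 100) / 1000.0 + " s"); // 206.3 s
    }
}
```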