There is a lock after VocabConstructor has finished tokenizing that prevents the call from returning immediately once the vocabulary has been built.
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer.Builder()
.setIterator(new CollectionSentenceIterator(sentence))
.setTokenizerFactory(this.tokenizer)
.setMinWordFrequency(1)
.allowParallelTokenization(false)
.build();
tfidfVectorizer.fit();
You can see it at this log line:
12:55:05.809 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
The full logs:
12:55:05.736 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
12:55:05.738 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
12:55:05.738 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
12:55:05.809 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
12:55:05.810 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [727], NumWords: [16323], sequences parsed: [41], counter: [16323]
12:55:05.812 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 727; Words after: 727;
12:55:05.812 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [727], NumWords: [16323], sequences parsed: [41], counter: [16323]
12:55:08.208 [main] INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [41], Current vocabulary size: [727]; Sequences/sec: [16.59];
@raver119 any update on this issue? Thank you!
I've missed this issue somehow, sorry. Looks like an easy fix though.
1) Can you confirm you're running 0.9.1?
2) Can I have a corpus (or a subset of the corpus) that reproduces this behavior?
On current master I can't reproduce your issue. With or without exceptions within VocabConstructor, the vocabulary is built and TfIdf is fit.
Here are a few more tests I've added; they throw an exception either in the tokenizer or in the tokenizer factory. It still works fine, nothing gets stuck.
So yeah, I need to see a corpus that reproduces this issue.
Hello @raver119, we have just updated DL4J to the latest version; we were using the snapshot until today (we hit the bug a few weeks ago). We are going to test it again, please wait.
Sure.
P.S. If you want, you can send me the corpus in a PM; I'll keep it private.
DL4J version: 0.9.1 (not 0.9.2-SNAPSHOT)
Code:
TokenizerFactory tokenizer = new DefaultTokenizerFactory();
tokenizer.setTokenPreProcessor(new TokenPreProcess() {
@Override
public String preProcess(String s) {
return s;
}
});
List<String> sentence = new ArrayList<>();
sentence.add("Lorem ipsum dolor sit amet, consectetur adipiscing elit");
sentence.add("consectetur adipiscing elit, sed do eiusmod tempor");
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer.Builder()
.setIterator(new CollectionSentenceIterator(sentence))
.setTokenizerFactory(tokenizer)
.setMinWordFrequency(1)
.allowParallelTokenization(false)
.build();
tfidfVectorizer.fit();
Output:
15:45:37.578 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
15:45:37.592 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
15:45:37.593 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
15:45:37.608 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
15:45:37.609 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [13], NumWords: [15], sequences parsed: [2], counter: [15]
15:45:37.610 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 13; Words after: 13;
15:45:37.613 [main] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [13], NumWords: [15], sequences parsed: [2], counter: [15]
15:45:39.676 [main] INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [2], Current vocabulary size: [13]; Sequences/sec: [0.96];
Here the process waits about 2 seconds (between the last two log lines). That is the "issue".
Our problem is: if we build and fit a TfidfVectorizer with a new sentence, say, 100 times, we have to wait 100 * 2 seconds.
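The arithmetic above can be checked with a small timing harness. This is only a sketch in plain Java, not DL4J code: `fitWithFixedDelay` is a hypothetical stand-in for a call that pauses for a fixed time no matter how small the corpus is, and the pause is scaled down to 20 ms so the example runs quickly.

```java
import java.util.concurrent.TimeUnit;

public class FixedDelayCost {

    // Hypothetical stand-in for a fit() call that pauses for a fixed time
    // regardless of corpus size. This is an assumption for illustration,
    // not the DL4J implementation.
    static void fitWithFixedDelay(long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
        }
    }

    // Calls the stand-in `calls` times and returns the total wall-clock time in ms.
    static long timeRepeatedCalls(int calls, long delayMillis) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            fitWithFixedDelay(delayMillis);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Scaled down: 20 ms per call instead of the roughly 2 s seen in the log.
        long elapsed = timeRepeatedCalls(10, 20);
        System.out.println("10 calls took about " + elapsed + " ms");
    }
}
```

The total grows linearly with the number of calls, which is why a fixed pause that is harmless for one large corpus becomes the dominant cost when fitting many tiny ones.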
Ah... and i was investigating deadlocks...
Okay, 2 sleeps removed. Will merge later.
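The actual change to VocabConstructor is not shown in this thread. As a generic illustration of waiting for worker threads without a fixed sleep, here is a minimal sketch in plain Java; the class and method names are hypothetical and nothing here is taken from the DL4J source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class WaitForWorkers {

    // Starts `workers` threads, waits for each with join(), and returns how
    // many of them finished. join() returns as soon as the thread is done,
    // so no fixed pause is added on top of the real work.
    static int runAndWait(int workers) {
        AtomicInteger finished = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> finished.incrementAndGet());
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println("Workers finished: " + runAndWait(4));
    }
}
```

With this pattern the caller blocks only for as long as the workers are actually running, so a two-sentence corpus returns almost immediately.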
Thank you so much :)