I have tried the following code to retrain the spaCy model:
import spacy
import random
from sklearn.externals import joblib
nlp = spacy.load('en')
nlp.entity.add_label('Brand')
nlp.entity.add_label('Celebrity')
nlp.entity.add_label('Community')
nlp.entity.add_label('GPE')
nlp.entity.add_label('Publisher')
nlp.entity.add_label('Show')
TRAIN_DATA = l[0:1800]
print(len(TRAIN_DATA))
try:
    optimizer = nlp.begin_training()
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], drop=0.3, sgd=optimizer)
    nlp.to_disk("./model")
except Exception as e:
    print(e)
nlp = spacy.load('./model')
text = "Google is a Company.Elon Musk is world number one innovator.WordPress is good for SEO website. I live in America"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
I have 1,817 sentences in my test data, and there are no predictions for it after training. However, when I retrain the model in slices, i.e. first on l[0:200], then l[200:500], l[500:800], l[800:1000], l[1000:1300], l[1200:1600], and l[1600:1817], I do get prediction output, which suggests my dataset is correct. But when I train on l[0:1800] in one go, I get no prediction results. What is the reason for this issue?
Could you add information about the system, Python and spaCy versions you're using?
Python 3.6.7
spaCy v2.1.6
System config
system Computer
/0 bus Motherboard
/0/0 memory 31GiB System memory
/0/1 processor Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
/0/100 bridge 4th Gen Core Processor DRAM Controller
/0/100/1 bridge Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
/0/100/1/0 display GM107 [GeForce GTX 750 Ti]
/0/100/1/0.1 multimedia NVIDIA Corporation
I even get prediction results from the slices 0:500 and 500:600, but doc.ents is empty for the slice 0:600.
Just for clarification.
You expect that this code prints some predictions:
text = "Google is a Company.Elon Musk is world number one innovator.WordPress is good for SEO website. I live in America"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
And it works fine if you use smaller slices like l[0:200], but you don't get any results when you use l[0:1800], correct?
Would you mind using disable_pipes before starting the training? I assume you only want to train the ner pipe, not the tagger or parser.
Is this the correct way:
with nlp.disable_pipes('tagger', 'parser'):
    optimizer = nlp.begin_training()
for text, annotations in TRAIN_DATA:
    nlp.update([text], [annotations], drop=0.3, sgd=optimizer)
nlp.to_disk("./model")
If yes, then it has no effect; doc.ents is still empty.
You want to encapsulate the nlp.update calls inside the with statement as well (that way the optimizer is in the same scope too).
I have now tried it this way and also got prediction results. Please advise which approach I should go with. I have a large dataset: 80,000 sentences to train 5,000 entities.
import spacy
import random
from sklearn.externals import joblib
nlp = spacy.blank('en')
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe("ner")
ner.add_label('Brand')
ner.add_label('Celebrity')
ner.add_label('Community')
ner.add_label('GPE')
ner.add_label('Publisher')
ner.add_label('Show')
TRAIN_DATA = l[0:1800]
print(len(TRAIN_DATA))
try:
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        optimizer = nlp.begin_training()
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer)
    nlp.to_disk("./model")
except Exception as e:
    print(e)
nlp = None
nlp = spacy.load('./model')
text = "Google is a Company.Elon Musk is world number one innovator.WordPress is good for SEO website. I live in America"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
I am not sure why you don't just follow the example; it includes everything you need to train an NER model and is easy to adjust to your situation.
If you don't want any evaluation during training, simply remove those parts, but you really should use disable_pipes while training so that your training data is used only to train the NER pipe.
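For reference, the example's training loop shuffles and minibatches the data on every epoch. A stripped-down sketch of just that batching pattern, with dummy pairs standing in for the real (text, annotations) data and the nlp.update call elided as a comment:

```python
import random
from spacy.util import minibatch

# Dummy stand-in for the real (text, annotations) training pairs.
TRAIN_DATA = [("sentence %d" % i, {"entities": []}) for i in range(10)]

for epoch in range(3):
    random.shuffle(TRAIN_DATA)
    for batch in minibatch(TRAIN_DATA, size=4):
        texts, annotations = zip(*batch)
        # Real script: nlp.update(texts, annotations, drop=0.3, sgd=optimizer)
        pass

# Batch sizes for 10 items with size=4:
sizes = [len(b) for b in minibatch(TRAIN_DATA, size=4)]
print(sizes)  # [4, 4, 2]
```

Updating on small shuffled batches rather than one sentence at a time is also noticeably faster on a dataset of 80,000 sentences.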
Thanks for your support, sir. Since I am using a blank model now, there is no need for disable_pipes. May I ask which is the better way to retrain spaCy: training a blank model, or loading the 'en' model and retraining it? The code for both approaches follows.
With the en model:
import spacy
import random
from sklearn.externals import joblib
nlp = spacy.load('en')
nlp.entity.add_label('Brand')
nlp.entity.add_label('Celebrity')
nlp.entity.add_label('Community')
nlp.entity.add_label('GPE')
nlp.entity.add_label('Publisher')
nlp.entity.add_label('Show')
TRAIN_DATA = l[0:1800]
print(len(TRAIN_DATA))
try:
    optimizer = nlp.begin_training()
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], drop=0.3, sgd=optimizer)
    nlp.to_disk("./model")
except Exception as e:
    print(e)
nlp = spacy.load('./model')
text = "Google is a Company.Elon Musk is world number one innovator.WordPress is good for SEO website. I live in America"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
With a blank model:
import spacy
import random
from sklearn.externals import joblib
nlp = spacy.blank('en')
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe("ner")
ner.add_label('Brand')
ner.add_label('Celebrity')
ner.add_label('Community')
ner.add_label('GPE')
ner.add_label('Publisher')
ner.add_label('Show')
TRAIN_DATA = l[0:1800]
print(len(TRAIN_DATA))
try:
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        optimizer = nlp.begin_training()
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer)
    nlp.to_disk("./model")
except Exception as e:
    print(e)
nlp = None
nlp = spacy.load('./model')
text = "Google is a Company.Elon Musk is world number one innovator.WordPress is good for SEO website. I live in America"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
Kindly help. Thanks for your support.
Since you're adding a lot of new categories, you probably want to start with a blank model; otherwise you may get very unpredictable results. Your categories also overlap with categories in the existing pre-trained model, so trying to teach it that a PERSON is now sometimes a Celebrity seems hard. You might also want to revise your label scheme, because some of these categories are quite specific and may be difficult for the model to learn.
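As a cheap sanity check on the label scheme, it can help to count how often each label actually occurs in the training data, since rare, highly specific labels are the hardest to learn. A pure-Python sketch over a hypothetical mini dataset in the same (text, {"entities": [...]}) format used above:

```python
from collections import Counter

# Hypothetical mini training set in the same format as TRAIN_DATA above.
TRAIN_DATA = [
    ("Google is a company", {"entities": [(0, 6, "Brand")]}),
    ("Elon Musk is an innovator", {"entities": [(0, 9, "Celebrity")]}),
    ("I live in America", {"entities": [(10, 17, "GPE")]}),
    ("Google hired Elon Musk", {"entities": [(0, 6, "Brand"), (13, 22, "Celebrity")]}),
]

# Tally how many annotated spans each label has across the dataset.
label_counts = Counter(
    label
    for _, annotations in TRAIN_DATA
    for _, _, label in annotations["entities"]
)
print(dict(label_counts))  # {'Brand': 2, 'Celebrity': 2, 'GPE': 1}
```

Labels with only a handful of examples are good candidates for merging into a broader category or collecting more data for, before investing in a long training run.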
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.