(u"Unitnumber Level Buildingname Streetname Sublocality Locality City State ",{'entities':[(0,11,'UNITNUMBER'),(12,17,'Level'),(18,30,'Building Name'),(31,41,'STREET'),(42,53,'SUBLOCALITY'),(54,62,'LOCALITY'),(63,67,'CITY'),(68,73,'STATE')]}),

Training data should contain examples of how the data might look, as you have done in your second example. In your first example, you have simply typed the entity names you want as the training string.
From the spaCy documentation:
Training data: Examples and their annotations.
All spaCy models support learning, so you can update a pretrained model with new examples. You’ll usually need to provide many examples to meaningfully improve the system.

The string should be an example of the type of data on which you will run the model.

abinpaul1 on 20 Apr 2020

Thank you Abin for your guidance.
I now have a new model called nlp2, trained with the data below.
How do I use this model to get the result?
See the code below. The final step, labelling the input string, is not working.

from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

# training data

TRAIN_DATA = [
(u"504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ", {'entities': [ ( 0, 3, 'Unit'), (4,16,'Building'),(17,31,'Locality'),(32,42,'Road'), (43,48,'Suburb'),(49,53,'City'),(54,58,'City'),(59,65,'Pincode')]}),
(u"12 Sena Marg Mantri road New Delhi 100101 ", {'entities': [ (0,2,'Unit'), (3,12,'Road'), (13,24,'City'),(25,34,'City'),(35,41,'Pincode')]}),
(u"Nitya-Nilayam Sri Venkatesa Mills Post Udumalpet Coimbatore Tamil Nadu 642128 ",{'entities': [(0,14,'Unit'),(15,39,'Locality'),(40,49,'Suburb'),(50,60,'City'),(61,71,'State'),(72,78,'Pincode')]}),
(u"1st Cross Tavarekere Main Rd 1st Block Krishna Murthi Layout S.G. Palya Bengaluru Karnataka 560029 ", {'entities': [(0,29,'Road'),(30,61,'Sublocality'),(62,72,'Suburb'),(73,82,'City'),(83,92,'State'),(93,99,'Pincode')]}),
(u"1 Kanjurmarg Station Rd Ambedkar Nagar Kanjurmarg West Bhandup West Mumbai Maharashtra 400078 India ", {'entities': [(0,2,'Unit'),(3,24,'Road'),(25,39,'Sublocality'),(40,55,'Locality'),(56,68,'Suburb'),(69,75,'City'),(76,87,'State'),(88,94,'Pincode')]}),
("B-602,Tower 3, Mantri Apartments, Baner, Pune,India",{"entities": [(0, 5, "UNIT")]}),
]

TEST_DATA = "B-602,Tower 3, Mantri Apartments, Baner, Pune,India"

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model='en_core_web_sm', output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly - but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                # unpack the batch into parallel tuples of texts and annotations
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

        # test the model with new input
        print("Testing new Input", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

        # figure out the string-- THIS PART NOT WORKING -----
        test = nlp2(u"B-602,Tower 3, Mantri Apartments, Baner, Pune,India")
        for entity in test.ents:
            print(entity.label_,' | ',entity.text)


if __name__ == "__main__":
    #plac.call(main('en_core_web_sm', Path.cwd(), 100))
    main('en_core_web_sm', Path.cwd(), 100)

What could be wrong here? Please help.
regards
NK

Subject: Re: [explosion/spaCy] Training data issue (#5329)

narasimhankrishna on 21 Apr 2020

Is the nlp2 model returning correct values when run against the TRAIN_DATA?

What is the output when you run the part which is not working?

Also ensure your labels are consistent across examples. You have added two similar labels, Unit and UNIT. This will probably affect the model training.
Another note: you would want to store the model in a new folder instead of the current working directory. Add a folder name to your output_dir.

Also ensure your examples are consistent. Your model will only be as good as your training data.
(u"504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ", {'entities': [ ( 0, 3, 'Unit'), (4,16,'Building'),(17,31,'Locality'),(32,42,'Road'), (43,48,'Suburb'),(49,53,'City'),(54,58,'City'),(59,65,'Pincode')]}),
Here you have used City twice.
Also, when deciding on entities, make sure they are distinguishable.
For instance, take your second training example:
Sena Marg Mantri road. Both of these could be classified as roads, right? In your other examples, names containing "road" are labelled Road, but here Mantri road is tagged as City.
Inconsistency in examples can be detrimental to the final result.
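
A quick way to spot the problems described above is to print what each annotated span actually covers and to group labels that differ only by case. The following is only a sketch using plain Python; the helper names `label_variants` and `span_texts` are made up for illustration, and the two rows are copied from the TRAIN_DATA posted earlier in this thread.

```python
from collections import defaultdict

# Two rows copied from the TRAIN_DATA posted earlier in this thread.
TRAIN_DATA = [
    ("504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ",
     {"entities": [(0, 3, "Unit"), (4, 16, "Building"), (17, 31, "Locality"),
                   (32, 42, "Road"), (43, 48, "Suburb"), (49, 53, "City"),
                   (54, 58, "City"), (59, 65, "Pincode")]}),
    ("B-602,Tower 3, Mantri Apartments, Baner, Pune,India",
     {"entities": [(0, 5, "UNIT")]}),
]


def label_variants(data):
    """Group labels that differ only by letter case (hypothetical helper)."""
    groups = defaultdict(set)
    for _, ann in data:
        for _, _, label in ann["entities"]:
            groups[label.lower()].add(label)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


def span_texts(text, ann):
    """Return (label, covered substring) for every annotated span."""
    return [(label, text[start:end]) for start, end, label in ann["entities"]]


print("Case variants:", label_variants(TRAIN_DATA))
for text, ann in TRAIN_DATA:
    for label, covered in span_texts(text, ann):
        print(label, "|", repr(covered))
```

Reading the printed substrings next to their labels makes it easier to see when a span covers something other than what was intended, for example a City span that lands on part of the pincode.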

abinpaul1 on 21 Apr 2020

# figure out the string-- THIS PART NOT WORKING -----
test = nlp2(u"B-602,Tower 3, Mantri Apartments, Baner, Pune,India")
for entity in test.ents:
    print(entity.label_,' | ',entity.text)

The code above is the part that is not working: nothing is printed.

Thanks for your advice on the other points. I am editing the code and labels.
regards
NK

narasimhankrishna on 21 Apr 2020

If you get output when running nlp2 on TRAIN_DATA, then the only conclusion is that the model did not recognize any entities in your test string.

abinpaul1 on 21 Apr 2020

This is train_data output:
Entities [('504', 'Unit'), ('Purple Pride Accord IT Park Baner Road Baner Pune 411045', 'Building')]
Tokens [('504', 'Unit', 3), ('Purple', 'Building', 3), ('Pride', 'Building', 1), ('Accord', 'Building', 1), ('IT', 'Building', 1), ('Park', 'Building', 1), ('Baner', 'Building', 1), ('Road', 'Building', 1), ('Baner', 'Building', 1), ('Pune', 'Building', 1), ('411045', 'Building', 1)]
Entities [('1st', 'Unit')]
Tokens [('1st', 'Unit', 3), ('Cross', '', 2), ('Tavarekere', '', 2), ('Main', '', 2), ('Rd', '', 2), ('1st', '', 2), ('Block', '', 2), ('Krishna', '', 2), ('Murthi', '', 2), ('Layout', '', 2), ('S.G.', '', 2), ('Palya', '', 2), ('Bengaluru', '', 2), ('Karnataka', '', 2), ('560029', '', 2)]
Entities []
Tokens [('B-602,Tower', '', 2), ('3', '', 2), (',', '', 2), ('Mantri', '', 2), ('Apartments', '', 2), (',', '', 2), ('Baner', '', 2), (',', '', 2), ('Pune', '', 2), (',', '', 2), ('India', '', 2)]
Entities []
Tokens [('Nitya', '', 2), ('-', '', 2), ('Nilayam', '', 2), ('Sri', '', 2), ('Venkatesa', '', 2), ('Mills', '', 2), ('Post', '', 2), ('Udumalpet', '', 2), ('Coimbatore', '', 2), ('Tamil', '', 2), ('Nadu', '', 2), ('642128', '', 2)]
Entities [('1 Kanjurmarg Station', 'ORG')]
Tokens [('1', 'ORG', 3), ('Kanjurmarg', 'ORG', 1), ('Station', 'ORG', 1), ('Rd', '', 2), ('Ambedkar', '', 2), ('Nagar', '', 2), ('Kanjurmarg', '', 2), ('West', '', 2), ('Bhandup', '', 2), ('West', '', 2), ('Mumbai', '', 2), ('Maharashtra', '', 2), ('400078', '', 2), ('India', '', 2)]
Entities [('12', 'Unit')]
Tokens [('12', 'Unit', 3), ('Sena', '', 2), ('Marg', '', 2), ('Mantri', '', 2), ('Apartments', '', 2), ('New', '', 2), ('Delhi', '', 2), ('100101', '', 2)]
Saved model to C:\Users\NKAppData\Local\Programs\Python\Python37
Loading from C:\Users\NKAppData\Local\Programs\Python\Python37
Entities [('504', 'Unit'), ('Purple Pride Accord IT Park Baner Road Baner Pune 411045', 'Building')]
Tokens [('504', 'Unit', 3), ('Purple', 'Building', 3), ('Pride', 'Building', 1), ('Accord', 'Building', 1), ('IT', 'Building', 1), ('Park', 'Building', 1), ('Baner', 'Building', 1), ('Road', 'Building', 1), ('Baner', 'Building', 1), ('Pune', 'Building', 1), ('411045', 'Building', 1)]
Entities [('1st', 'Unit')]
Tokens [('1st', 'Unit', 3), ('Cross', '', 2), ('Tavarekere', '', 2), ('Main', '', 2), ('Rd', '', 2), ('1st', '', 2), ('Block', '', 2), ('Krishna', '', 2), ('Murthi', '', 2), ('Layout', '', 2), ('S.G.', '', 2), ('Palya', '', 2), ('Bengaluru', '', 2), ('Karnataka', '', 2), ('560029', '', 2)]
Entities []
Tokens [('B-602,Tower', '', 2), ('3', '', 2), (',', '', 2), ('Mantri', '', 2), ('Apartments', '', 2), (',', '', 2), ('Baner', '', 2), (',', '', 2), ('Pune', '', 2), (',', '', 2), ('India', '', 2)]
Entities []
Tokens [('Nitya', '', 2), ('-', '', 2), ('Nilayam', '', 2), ('Sri', '', 2), ('Venkatesa', '', 2), ('Mills', '', 2), ('Post', '', 2), ('Udumalpet', '', 2), ('Coimbatore', '', 2), ('Tamil', '', 2), ('Nadu', '', 2), ('642128', '', 2)]
Entities [('1 Kanjurmarg Station', 'ORG')]
Tokens [('1', 'ORG', 3), ('Kanjurmarg', 'ORG', 1), ('Station', 'ORG', 1), ('Rd', '', 2), ('Ambedkar', '', 2), ('Nagar', '', 2), ('Kanjurmarg', '', 2), ('West', '', 2), ('Bhandup', '', 2), ('West', '', 2), ('Mumbai', '', 2), ('Maharashtra', '', 2), ('400078', '', 2), ('India', '', 2)]
Entities [('12', 'Unit')]
Tokens [('12', 'Unit', 3), ('Sena', '', 2), ('Marg', '', 2), ('Mantri', '', 2), ('Apartments', '', 2), ('New', '', 2), ('Delhi', '', 2), ('100101', '', 2)]
Testing new Input C:\Users\NKAppData\Local\Programs\Python\Python37
Entities [('504', 'Unit'), ('Purple Pride Accord IT Park Baner Road Baner Pune 411045', 'Building')]
Tokens [('504', 'Unit', 3), ('Purple', 'Building', 3), ('Pride', 'Building', 1), ('Accord', 'Building', 1), ('IT', 'Building', 1), ('Park', 'Building', 1), ('Baner', 'Building', 1), ('Road', 'Building', 1), ('Baner', 'Building', 1), ('Pune', 'Building', 1), ('411045', 'Building', 1)]
Entities [('1st', 'Unit')]
Tokens [('1st', 'Unit', 3), ('Cross', '', 2), ('Tavarekere', '', 2), ('Main', '', 2), ('Rd', '', 2), ('1st', '', 2), ('Block', '', 2), ('Krishna', '', 2), ('Murthi', '', 2), ('Layout', '', 2), ('S.G.', '', 2), ('Palya', '', 2), ('Bengaluru', '', 2), ('Karnataka', '', 2), ('560029', '', 2)]
Entities []
Tokens [('B-602,Tower', '', 2), ('3', '', 2), (',', '', 2), ('Mantri', '', 2), ('Apartments', '', 2), (',', '', 2), ('Baner', '', 2), (',', '', 2), ('Pune', '', 2), (',', '', 2), ('India', '', 2)]
Entities []
Tokens [('Nitya', '', 2), ('-', '', 2), ('Nilayam', '', 2), ('Sri', '', 2), ('Venkatesa', '', 2), ('Mills', '', 2), ('Post', '', 2), ('Udumalpet', '', 2), ('Coimbatore', '', 2), ('Tamil', '', 2), ('Nadu', '', 2), ('642128', '', 2)]
Entities [('1 Kanjurmarg Station', 'ORG')]
Tokens [('1', 'ORG', 3), ('Kanjurmarg', 'ORG', 1), ('Station', 'ORG', 1), ('Rd', '', 2), ('Ambedkar', '', 2), ('Nagar', '', 2), ('Kanjurmarg', '', 2), ('West', '', 2), ('Bhandup', '', 2), ('West', '', 2), ('Mumbai', '', 2), ('Maharashtra', '', 2), ('400078', '', 2), ('India', '', 2)]
Entities [('12', 'Unit')]
Tokens [('12', 'Unit', 3), ('Sena', '', 2), ('Marg', '', 2), ('Mantri', '', 2), ('Apartments', '', 2), ('New', '', 2), ('Delhi', '', 2), ('100101', '', 2)]

narasimhankrishna on 21 Apr 2020

It seems your model is not sufficiently trained. This can be inferred from its poor performance on the training data itself.
To get better output, you need a sufficient number of examples. Also try increasing the number of training iterations. But with so few training examples, there is a high chance the model will overfit.
Also, your last training example (which you also use as the test string) seems to be tagged incorrectly.
("B-602,Tower 3, Mantri Apartments, Baner, Pune,India",{"entities": [(0, 5, "UNIT")]})
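
On the overfitting point above: one common way to notice it is to hold back a few annotated examples and only run the trained model on them, never train on them. Below is a minimal sketch in plain Python; `split_examples`, its `dev_fraction` and `seed` parameters, and the placeholder rows are all assumptions for illustration, not something from the spaCy API.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a held-out dev set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example aside for evaluation.
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder rows standing in for real annotated addresses.
examples = [("address %d" % i, {"entities": []}) for i in range(10)]
train, dev = split_examples(examples)
print(len(train), "training rows,", len(dev), "held-out rows")
```

If the model labels the training rows well but the held-out rows badly, that points to overfitting rather than to a problem in the prediction code.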

abinpaul1 on 21 Apr 2020

Thanks again.
In the example, what should the start and end indexes be? Should I include spaces or commas in the index count?
Should I start with zero for the first character?

Please correct my indexes.
Regards. NK.

narasimhankrishna on 21 Apr 2020

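The span for an entity need not include the commas, because you are tagging the entities themselves. (Spaces and commas still count as characters when you work out the positions.) The first character has index zero.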
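
Rather than counting characters by hand, the offsets can be computed from the substrings. This is a rough sketch, assuming the usual spaCy convention of zero-based character offsets with an exclusive end (so `text[start:end]` gives back the entity text). The helper `annotate` is hypothetical, and the label assigned to each part is only a guess for illustration.

```python
def annotate(text, parts):
    """Build a (text, {"entities": [...]}) training example.

    parts is a list of (substring, label) pairs in left-to-right order.
    Offsets are zero-based character positions and the end is exclusive.
    """
    entities = []
    cursor = 0
    for substring, label in parts:
        start = text.find(substring, cursor)
        if start == -1:
            raise ValueError("%r not found after position %d" % (substring, cursor))
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end  # continue searching after this entity
    return (text, {"entities": entities})


text = "B-602,Tower 3, Mantri Apartments, Baner, Pune,India"
# Which label goes with which part is an illustrative guess.
example = annotate(text, [
    ("B-602", "Unit"),
    ("Mantri Apartments", "Building"),
    ("Baner", "Suburb"),
    ("Pune", "City"),
])
for start, end, label in example[1]["entities"]:
    print(start, end, label, "|", text[start:end])
```

One caveat, stated tentatively: the token output posted in this thread shows 'B-602,Tower' as a single token, and the entity recognizer works on tokens, so a span that ends in the middle of a token may not be usable as written. Putting a space after each comma in the input is one possible way around that.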

The indexes are not the main issue; your choice of entities to label is. Make sure your entities are thoroughly distinguishable. Since your examples are mostly unstructured, you have to make sure the entities can be differentiated.

Try thinking from the perspective of the model.
For instance, from your example, on simply seeing Nitya-Nilayam Sri Venkatesa Mills
it is possible to tag it in more than one way, due to the lack of contextual information.

Also, some entities like State would be much more easily extracted by comparing against a list of states (or you could use spaCy's rule-based matcher). Pincode can also be extracted using pattern matching. Maybe Unit can be done in the same way.
The remaining entities, Suburb, Locality and Road, are all basically 'LOC' entities in spaCy's provided models. Try applying the model to tag these as such, and then try writing some logic to separate them.
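
The list lookup and pattern matching suggested above could look roughly like this. It is a sketch with the standard `re` module rather than spaCy's matcher; the six-digit pincode pattern, the short `STATES` list and the function name `rule_based_entities` are assumptions made for illustration.

```python
import re

# Assumption: a pincode is a standalone run of exactly six digits.
PINCODE_RE = re.compile(r"\b\d{6}\b")

# Illustrative subset only; a real list would cover every state.
STATES = ["Tamil Nadu", "Karnataka", "Maharashtra"]


def rule_based_entities(text):
    """Return sorted (start, end, label) spans for pincodes and known states."""
    spans = [(m.start(), m.end(), "Pincode") for m in PINCODE_RE.finditer(text)]
    for state in STATES:
        for m in re.finditer(re.escape(state), text):
            spans.append((m.start(), m.end(), "State"))
    return sorted(spans)


text = "Nitya-Nilayam Sri Venkatesa Mills Post Udumalpet Coimbatore Tamil Nadu 642128 "
for start, end, label in rule_based_entities(text):
    print(label, "|", text[start:end])
```

Handling these predictable fields with rules leaves the statistical model with fewer, harder labels to learn.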

abinpaul1 on 21 Apr 2020

Thanks a ton Abin Paul for your wonderful support. I am closing this issue.

narasimhankrishna on 22 Apr 2020

Sorry to reopen this issue; I don't want this thread to get lost.
I trained a model with about 10,000 rows of data.
It took about 2 hours. After that it could identify only two strings as labels,
and many strings in the original input are lost.

But if I train with some 20 rows of data, the results are better :-)
It returns all the strings, with about 40% labelled correctly.

What is going wrong here?
Should I remove duplicate rows from the training set?
Should I have only one label per row in the training set?

narasimhankrishna on 23 Apr 2020

> Should I remove duplicate rows from the training set;

Duplicating data will not improve your training data. It is not needed.

> But if I train with some 20 rows of data, the results are better :-)
> it returns all the strings with about 40% correct labelling.
> It took about 2 hours. After that it could identify only two strings as labels;
> also many strings in the original input is lost

There might also be the problem of 'catastrophic forgetting' mentioned in the spaCy docs. Also try changing the training parameters and training again. You might have to experiment with the parameters to get the right model.

> have only one label per row in the training set

In each row, annotate all the labels that are present in that row's text.
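
For the duplicate rows, something along these lines could drop exact repeats before training. It is only a sketch: `deduplicate` is a made-up helper, and the sample rows are invented to show rows that each carry every label present in their text.

```python
def deduplicate(examples):
    """Drop repeated (text, annotations) rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for text, ann in examples:
        # Lists are not hashable, so turn the entity list into a tuple key.
        key = (text, tuple(tuple(ent) for ent in ann["entities"]))
        if key not in seen:
            seen.add(key)
            unique.append((text, ann))
    return unique


# Invented rows; each one annotates every entity present in its text.
rows = [
    ("Baner Pune 411045", {"entities": [(0, 5, "Suburb"), (6, 10, "City"), (11, 17, "Pincode")]}),
    ("Baner Pune 411045", {"entities": [(0, 5, "Suburb"), (6, 10, "City"), (11, 17, "Pincode")]}),
    ("Pune 411045", {"entities": [(0, 4, "City"), (5, 11, "Pincode")]}),
]
unique_rows = deduplicate(rows)
print(len(rows), "rows before,", len(unique_rows), "after")
```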

abinpaul1 on 23 Apr 2020

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] on 10 Jun 2020

Spacy: Training data issue
