spaCy: How to find which named entity labels an existing model contains?

Created on 1 Jul 2016 · 9 Comments · Source: explosion/spaCy

Hi,

I want to find all the existing NER labels in a spaCy model.
Can anyone tell me how to find them?

Thank you

usage

Most helpful comment

For anyone still looking for this list, the English ones are listed here.

If you want to view it for your current model, they appear to be stored in the model's entity.cfg attribute. Namely:

>>> nlp = spacy.load('en')
>>> nlp.entity.cfg[u'actions']
{u'1': [u'CARDINAL', u'DATE', u'EVENT', u'FAC', u'GPE', u'LANGUAGE', u'LAW', u'LOC', u'MONEY', u'NORP', u'ORDINAL', u'ORG', u'PERCENT', u'PERSON', u'PRODUCT', u'QUANTITY', u'TIME', u'WORK_OF_ART'], u'0': [u''], u'3': [u'CARDINAL', u'DATE', u'EVENT', u'FAC', u'GPE', u'LANGUAGE', u'LAW', u'LOC', u'MONEY', u'NORP', u'ORDINAL', u'ORG', u'PERCENT', u'PERSON', u'PRODUCT', u'QUANTITY', u'TIME', u'WORK_OF_ART'], u'2': [u'CARDINAL', u'DATE', u'EVENT', u'FAC', u'GPE', u'LANGUAGE', u'LAW', u'LOC', u'MONEY', u'NORP', u'ORDINAL', u'ORG', u'PERCENT', u'PERSON', u'PRODUCT', u'QUANTITY', u'TIME', u'WORK_OF_ART'], u'5': [u''], u'4': [u'CARDINAL', u'DATE', u'EVENT', u'FAC', u'GPE', u'LANGUAGE', u'LAW', u'LOC', u'MONEY', u'NORP', u'ORDINAL', u'ORG', u'PERCENT', u'PERSON', u'PRODUCT', u'QUANTITY', u'TIME', u'WORK_OF_ART']}

If you add a new label, it is stored under entity.cfg['extra_labels']
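
Since several keys in the output above repeat the same labels, the dict can be collapsed into one sorted list. A minimal sketch, assuming `actions` has the shape shown above (string keys mapping to lists of label strings, with empty strings for the actions that carry no label). The helper name `collect_labels` is hypothetical and not part of spaCy:

```python
def collect_labels(actions):
    """Flatten a dict of action -> label lists into one sorted list of unique labels."""
    labels = set()
    for label_list in actions.values():
        # Skip the empty '' entries that some actions carry.
        labels.update(label for label in label_list if label)
    return sorted(labels)


# Abbreviated stand-in for nlp.entity.cfg[u'actions'], same shape as the output above.
sample_actions = {
    u'0': [u''],
    u'1': [u'GPE', u'ORG', u'PERSON'],
    u'2': [u'GPE', u'ORG', u'PERSON'],
    u'5': [u''],
}
print(collect_labels(sample_actions))
```

With a loaded model, the same helper would be called on `nlp.entity.cfg[u'actions']`, provided that key exists in the installed version.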

All 9 comments

You can find it in the docs.

from spacy.en import English
nlp = English()
tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
ents = list(tokens.ents)

Hi anasamoudi,

Thanks for the reply.

But what you suggested only gives the list of entities that a document contains. What I wanted to ask about is all the existing (trained) NER labels in the model.

@lgenerknol thank you, I was digging through the cython source trying to find this!

I'm experimenting with how the data is stored in this attribute, because I want to write a training routine that checks whether entities already exist in the model and adds them if they do not.

I've found

>>> nlp.entity.cfg[u'actions'][u'1'] == nlp.entity.cfg[u'actions'][u'2']
True

>>> nlp.entity.add_label('TEST')
>>> nlp.entity.cfg['extra_labels']
['TEST', 'TEST', 'TEST', 'TEST', 'TEST']

#add again to see what happens
>>> nlp.entity.add_label('TEST')
>>> nlp.entity.cfg['extra_labels']
['TEST', 'TEST', 'TEST', 'TEST', 'TEST']

#add an in-built type to see what happens
>>> nlp.entity.add_label('CARDINAL')
>>> nlp.entity.cfg['extra_labels']
['TEST', 'TEST', 'TEST', 'TEST', 'TEST', 'CARDINAL']

A few questions

  1. What are the numeric keys in nlp.entity.cfg[u'actions']? They appear to hold identical in-built types, and my guess is that this makes each label a valid 'state' corresponding to each action in the parser, so they will be identical while the parser is in its initial state?
  2. Is it safe to assume, therefore, that each key in nlp.entity.cfg[u'actions'] holds identical labels, and I can lazily check if my entity is not in nlp.entity.cfg[u'actions'][u'1'] before adding it?
  3. Nothing terrible appears to happen when you add an existing MANUALLY ADDED label - a duplicate does not appear in nlp.entity.cfg['extra_labels']. Is it safe then to add a label which I have already added?
  4. Trying to manually add an in-built label (like 'CARDINAL') creates a single element in nlp.entity.cfg['extra_labels'], rather than creating one for each element in nlp.entity.cfg['actions']. Obviously, this is something you shouldn't do anyway, but I wonder what the consequence is?
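
The check described in questions 2 and 3 could look roughly like the following. This is only a sketch under the assumption from question 2 (that the built-in labels can be read from the `actions` lists) and the observation that manually added labels show up in `extra_labels`. `missing_labels` is a hypothetical helper, and whether re-adding a label is actually safe remains the open question above:

```python
def missing_labels(wanted, actions, extra_labels):
    """Return the labels in `wanted` that are neither built in nor manually added."""
    known = set(extra_labels)
    for label_list in actions.values():
        known.update(label_list)
    result = []
    for label in wanted:
        # Keep the caller's order and ignore repeats within `wanted`.
        if label not in known and label not in result:
            result.append(label)
    return result


# Stand-in values with the same shape as nlp.entity.cfg in this thread.
actions = {u'0': [u''], u'1': [u'CARDINAL', u'PERSON'], u'2': [u'CARDINAL', u'PERSON']}
extra_labels = ['TEST']

for label in missing_labels(['TEST', 'CARDINAL', 'ANIMAL'], actions, extra_labels):
    # With a loaded model, this is where nlp.entity.add_label(label) would be called.
    print('would add:', label)
```

Checking every key instead of only u'1' costs little and does not rely on all keys holding identical lists.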

I am trying the following code:

import spacy
nlp = spacy.load('en')

tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
ents = list(tokens.ents)

print 'ents:', ents

e = nlp.entity.cfg[u'actions']
print 'all entitiy cfg info:', e

This gave me an error:
ents: [Best, New York, Saturday, morning]
Traceback (most recent call last):
  File "spacy-103.py", line 10, in <module>
    e = nlp.entity.cfg[u'actions']
KeyError: u'actions'

Has something changed?

@honnibal @ines: can you help here?
What we now see from nlp.entity.cfg is a dict without the actions key.

>>> nlp.entity.cfg
{u'beam_density': 0.0,
 u'beam_width': 1,
 u'cnn_maxout_pieces': 3,
 u'hidden_depth': 1,
 u'hidden_width': 200,
 u'hist_size': 0,
 u'hist_width': 0,
 u'maxout_pieces': 2,
 u'nr_class': 73,
 u'pretrained_dims': 300,
 u'token_vector_width': 128}

One way to get an idea of the labels (though NOT the best way) is to look at the moves file:

spacy/data/en_core_web_md/en_core_web_md-2.0.0/ner$ vi moves

This shows labels such as:
"NORP", "DATE", "CARDINAL", "GPE", "PERCENT", "ORG", "EVENT", "MONEY" and so on...

Another way is to look at the data the model was trained on:
https://spacy.io/models/en#en_core_web_md

and then find the documentation for that source, e.g. OntoNotes 5:
https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

2.6 Entity Names Annotation
Names (often referred to as “Named Entities”) are annotated according to the following set of types:

PERSON        People, including fictional
NORP          Nationalities or religious or political groups
FACILITY      Buildings, airports, highways, bridges, etc.
ORGANIZATION  Companies, agencies, institutions, etc.
GPE           Countries, cities, states
LOCATION      Non-GPE locations, mountain ranges, bodies of water
PRODUCT       Vehicles, weapons, foods, etc. (Not services)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK OF ART   Titles of books, songs, etc.
LAW           Named documents made into laws
LANGUAGE      Any named language

The following values are also annotated in a style similar to names:

DATE          Absolute or relative dates or periods
TIME          Times smaller than a day
PERCENT       Percentage (including “%”)
MONEY         Monetary values, including unit
QUANTITY      Measurements, as of weight or distance
ORDINAL       “first”, “second”
CARDINAL      Numerals that do not fall under another type
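
Some of the guideline's type names are longer than the label strings the model reports (compare the `actions` output earlier in this thread). A small lookup sketch follows. The descriptions are copied from the excerpt above, but the pairing of FAC, ORG, LOC and WORK_OF_ART with FACILITY, ORGANIZATION, LOCATION and WORK OF ART is an assumption based on the names, and `describe` is a hypothetical helper:

```python
# Descriptions from the OntoNotes excerpt, keyed by the short label strings
# seen in the model output in this thread (assumed correspondence).
ONTONOTES_DESCRIPTIONS = {
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "PRODUCT": "Vehicles, weapons, foods, etc. (Not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws",
    "LANGUAGE": "Any named language",
    "DATE": "Absolute or relative dates or periods",
    "TIME": "Times smaller than a day",
    "PERCENT": 'Percentage (including "%")',
    "MONEY": "Monetary values, including unit",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": '"first", "second"',
    "CARDINAL": "Numerals that do not fall under another type",
}


def describe(label):
    """Look up the guideline description for a label, or report it as unknown."""
    return ONTONOTES_DESCRIPTIONS.get(label, "unknown label")


print(describe("GPE"))
```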


But there has to be an easier way of getting this.
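
In more recent spaCy releases, the entity recognizer itself exposes its labels, which avoids reading `cfg` or the moves file. A sketch, assuming a version where components are fetched with `nlp.get_pipe` and the NER component has a `labels` attribute. The helper name `ner_labels` is hypothetical, and it takes an already loaded pipeline so it can be tried on any model:

```python
def ner_labels(nlp, pipe_name="ner"):
    """Return the sorted, de-duplicated labels of the named pipeline component.

    Assumes `nlp.get_pipe(pipe_name)` returns a component with a `labels`
    attribute, as the entity recognizer does in recent spaCy versions.
    """
    return sorted(set(nlp.get_pipe(pipe_name).labels))


# Usage with a real model (requires spaCy and the model package to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   print(ner_labels(nlp))
```

If the installed version has no `get_pipe` or no `labels` attribute, this raises an AttributeError, and the approaches above remain the fallback.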

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
