spaCy: Best way to train a new entity type from a list of more than 10,000 words

Created on 7 Dec 2017 · 4 Comments · Source: explosion/spaCy

label = "CHEMICAL"

train_data = ['Alcohol', 'methanol', 'methyl alcohol', 'Ethanol', ....]

Once the new entity type has been created, is there any way to update the model with the words directly, or do we have to pass in sentences containing each of the words?

Your Environment

Operating System: Windows 10
Python Version Used: 3.6.2
spaCy Version Used: 2.0.3
Environment Information:

training usage

akshay-1993

All 4 comments

Hi! The question in #1655 is pretty similar to what you're asking – I've posted a longer reply there with a few examples and strategies.

I assume you want to train an NER model that can recognise your chemicals in context, and generalise from there, right? If you're only looking to find the exact 10,000 words and label them as entities, a rule-based approach might be enough – you could write a custom pipeline component that uses the PhraseMatcher to find the chemicals, and create a new CHEMICAL entity for them.
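As a rough illustration of the rule-based route, here is a minimal sketch of such a component. It assumes the spaCy v3 API (`Language.component`, `nlp.add_pipe` with a string name, and `matcher.add` taking a list of patterns); the v2.0.x release mentioned in the question registers components and patterns differently. The `CHEMICALS` list and the component name `chemical_matcher` are made up for the example.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Hypothetical stand-in for the full 10,000-entry list.
CHEMICALS = ["alcohol", "methanol", "methyl alcohol", "ethanol"]

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")

# attr="LOWER" makes the matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CHEMICAL", [nlp.make_doc(name) for name in CHEMICALS])


@Language.component("chemical_matcher")  # name is an assumption
def chemical_matcher(doc):
    spans = [Span(doc, start, end, label="CHEMICAL")
             for _, start, end in matcher(doc)]
    # Where matches overlap, keep the longest one
    # ("methyl alcohol" rather than just "alcohol").
    # Note: this overwrites any entities already set on the doc,
    # which is fine here because the blank pipeline sets none.
    doc.ents = filter_spans(spans)
    return doc


nlp.add_pipe("chemical_matcher")

doc = nlp("Methyl alcohol and Ethanol are both flammable.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

This only finds exact (case-insensitive) matches from the list; it will not generalise to unseen chemical names the way a trained NER model can.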

If you want spaCy to recognise the entities and similar words in context, you also need training examples of those entities in context. The examples in context should also be similar to the data you later want to use the model on. This is very important – for example, are you looking to process clinical notes, shipping receipts from a manufacturing company, or posts from an online message board for Chemistry students? All those texts are very different and will require different training data.

Here's a quick summary of the ideas in #1655:

Create sentence templates that are similar to the data you're looking to analyse, and randomly fill them in with entries from your database.

Use the PhraseMatcher and create match patterns [...] and run it over a large corpus of sentences. Then you can use the sentences containing matches to create training data that's closer to real-world examples.
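The template idea could look roughly like the sketch below. It uses only the standard library and produces examples in the `(text, {"entities": [(start, end, label)]})` format used for NER training in spaCy v2. The templates, the `{chem}` placeholder, and the helper names are all hypothetical; real templates should resemble the texts you want to analyse.

```python
import random

# Hypothetical sample of the chemicals database.
CHEMICALS = ["methanol", "ethanol", "methyl alcohol"]

# Hypothetical templates; each contains exactly one {chem} slot.
TEMPLATES = [
    "The sample was dissolved in {chem} before analysis.",
    "Exposure to {chem} was noted in the patient history.",
    "We ordered 20 litres of {chem} last week.",
]


def make_example(template, chemical, label="CHEMICAL"):
    """Fill one template and record the character span of the chemical."""
    start = template.index("{chem}")
    end = start + len(chemical)
    text = template.replace("{chem}", chemical)
    return text, {"entities": [(start, end, label)]}


def generate(n, seed=0):
    """Create n randomly filled examples (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [make_example(rng.choice(TEMPLATES), rng.choice(CHEMICALS))
            for _ in range(n)]


train_data = generate(5)
for text, annotations in train_data:
    print(text, annotations)
```

Templates alone tend to produce repetitive data, so mixing them with matches found in a real corpus (the second idea above) is probably worthwhile.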

Given the size of your database, you'll likely end up with a very large training corpus as well. So you should also look into some tips and strategies for batching up your training examples and experimenting with different hyperparameters.
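For the batching part, a minimal sketch using `spacy.util.minibatch` with a fixed batch size might look like this. The example sentences and the `BATCH_SIZE` value are invented for illustration, and the actual model update step is left out on purpose.

```python
import random

from spacy.util import minibatch

# Hypothetical training examples in (text, annotations) format.
train_data = []
for i in range(25):
    text = "Batch {} contained traces of ethanol.".format(i)
    start = text.index("ethanol")
    entities = [(start, start + len("ethanol"), "CHEMICAL")]
    train_data.append((text, {"entities": entities}))

BATCH_SIZE = 8  # a hyperparameter worth experimenting with

random.seed(0)
random.shuffle(train_data)  # reshuffle before every training pass

batches = list(minibatch(train_data, size=BATCH_SIZE))
for batch in batches:
    texts, annotations = zip(*batch)
    # A real training loop would update the model on this batch here.
    print(len(texts))
```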

ines on 7 Dec 2017

Thanks for such a detailed explanation. It will surely help me.

akshay-1993 on 7 Dec 2017

@ines: How about using the default NER model (doc.ents) or the noun chunks (doc.noun_chunks) to find all the named entities?

I would expect chemicals like methanol and ethanol to each be recognised as an individual entity; the only problem at times might be a wrong tag (labelled as PER, for instance). Since we are interested in a particular domain, let's say chemistry, we could then look up the entities we found in the chemistry database. How do you feel about it?

For example:
input sentence is:
Ethanol, also called alcohol, ethyl alcohol, and drinking alcohol, is a chemical compound

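```python
import spacy

# Assumption: an English model with a parser and NER (name is illustrative).
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Ethanol, also called alcohol, ethyl alcohol, and drinking alcohol, is a chemical compound")

# print named entities:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# print noun chunks:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
```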

Output:

named entities:
[] (comes back empty)

noun chunks:
(u'Ethanol', u'Ethanol', u'nsubj', u'called')
(u'alcohol', u'alcohol', u'dobj', u'called')
(u'ethyl alcohol', u'alcohol', u'conj', u'alcohol')
(u'alcohol', u'alcohol', u'dobj', u'drinking')
(u'a chemical compound', u'compound', u'attr', u'is')

In my experience, noun chunks work better here than the ents.

@honnibal: In one of the videos, you talk about annotating drugs and building a new drug entity type.
Do you think the above approach could have worked as well?

Example:
input sentence is:
I have taken paracetamol to cure my fever

named entities:
[] (again empty)

noun chunks:
(u'I', u'I', u'nsubj', u'taken')
(u'paracetamol', u'paracetamol', u'dobj', u'taken')
(u'my fever', u'fever', u'dobj', u'cure')

The only catch here is to query the db smartly. Perhaps it's a naive approach. Let me know your feedback.
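To make the lookup idea concrete, here is a rough sketch in plain Python. It takes `(chunk text, root text)` pairs like the ones printed above and checks them against a set of known names, trying the whole chunk first and then its root. The `KNOWN_CHEMICALS` set stands in for the real db, and the function names are made up.

```python
def normalize(text):
    """Lowercase and collapse whitespace so lookups are forgiving."""
    return " ".join(text.lower().split())


# Hypothetical stand-in for the chemistry db.
KNOWN_CHEMICALS = {normalize(name) for name in [
    "Ethanol", "alcohol", "ethyl alcohol", "methanol", "paracetamol",
]}


def find_chemicals(chunks):
    """Return the chunk text (or its root) for every chunk found in the db."""
    hits = []
    for text, root in chunks:
        # Try the full chunk first, then fall back to the root token,
        # so that a chunk like "the methanol" would still match on its root.
        for candidate in (text, root):
            if normalize(candidate) in KNOWN_CHEMICALS:
                hits.append(candidate)
                break
    return hits


# (chunk.text, chunk.root.text) pairs taken from the two outputs above.
chunks_1 = [("Ethanol", "Ethanol"), ("alcohol", "alcohol"),
            ("ethyl alcohol", "alcohol"), ("alcohol", "alcohol"),
            ("a chemical compound", "compound")]
chunks_2 = [("I", "I"), ("paracetamol", "paracetamol"), ("my fever", "fever")]

print(find_chemicals(chunks_1))
print(find_chemicals(chunks_2))
```

Note that this is still exact matching against the db, so it shares the limits of a rule-based approach: names missing from the db are never found.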