label = "CHEMICAL"
train_data = [ 'Alcohol','methanol','methyl alcohol','Ethanol',....]
Is there any way to update the words directly once the new_entity is created, or do we have to pass sentences for all the words?
Hi! The question in #1655 is pretty similar to what you're asking – I've posted a longer reply there with a few examples and strategies.
I assume you want to train an NER model that can recognise your chemicals in context, and generalise from there, right? If you're only looking to find the exact 10,000 words and label them as entities, a rule-based approach might be enough – you could write a custom pipeline component that uses the PhraseMatcher to find the chemicals, and create a new CHEMICAL entity for them.
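A minimal sketch of that rule-based approach, assuming spaCy v3 (the `Language.component` decorator and the list-based `matcher.add` call are v3-style). The four terms stand in for the real list, and the component name `chemical_matcher` is made up for illustration:

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Stand-in for the real chemical list (illustrative terms only).
CHEMICALS = ["Alcohol", "methanol", "methyl alcohol", "Ethanol"]

# A blank pipeline only tokenizes, so no trained model is required here.
nlp = spacy.blank("en")

# attr="LOWER" matches on lowercase token text, i.e. case-insensitively.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CHEMICAL", [nlp.make_doc(term) for term in CHEMICALS])


@Language.component("chemical_matcher")  # hypothetical component name
def chemical_matcher(doc):
    spans = [Span(doc, start, end, label="CHEMICAL") for _, start, end in matcher(doc)]
    # filter_spans drops overlapping matches, preferring the longest span
    # (e.g. "methyl alcohol" over the nested "alcohol").
    doc.ents = filter_spans(spans)
    return doc


nlp.add_pipe("chemical_matcher")

doc = nlp("Methanol is also known as methyl alcohol.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Note that this sketch overwrites `doc.ents`. In a pipeline that already has an entity recognizer, you would need to merge the new spans with the existing entities instead.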
If you want spaCy to recognise the entities and similar words in context, you also need training examples of those entities in context. The examples in context should also be similar to the data you later want to use the model on. This is very important – for example, are you looking to process clinical notes, shipping receipts from a manufacturing company, or posts from an online message board for Chemistry students? All those texts are very different and will require different training data.
Here's a quick summary of the ideas in #1655:
- Create sentence templates that are similar to the data you're looking to analyse, and randomly fill them in with entries from your database.
- Use the PhraseMatcher and create match patterns [...] and run it over a large corpus of sentences. Then you can use the sentences containing matches to create training data that's closer to real-world examples.
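The template idea from the first bullet could look roughly like this. It is only a sketch: the templates, the term list and the helper `make_examples` are invented for illustration, and real templates should mirror the texts you want to process. The output uses the `(text, {"entities": [(start, end, label)]})` layout with character offsets.

```python
import random

# Hypothetical templates; "{}" marks the slot for the chemical name.
TEMPLATES = [
    "The sample contained traces of {}.",
    "We diluted the {} before running the test.",
    "Please order two litres of {} for the lab.",
]
# Stand-in for entries from the database.
CHEMICALS = ["methanol", "methyl alcohol", "ethanol"]


def make_examples(templates, terms, n, seed=0):
    """Fill random templates with random terms and record character offsets."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    examples = []
    for _ in range(n):
        template = rng.choice(templates)
        term = rng.choice(terms)
        start = template.index("{}")  # the slot position is the entity start
        text = template.replace("{}", term, 1)
        entities = [(start, start + len(term), "CHEMICAL")]
        examples.append((text, {"entities": entities}))
    return examples


examples = make_examples(TEMPLATES, CHEMICALS, n=5)
for text, annotations in examples:
    print(text, annotations)
```

Recording the offsets while filling the template avoids having to search for the term afterwards, which would be ambiguous if a term also appeared in the template text.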
Given the size of your database, you'll likely end up with a very large training corpus as well. So you should also look into some tips and strategies for batching up your training examples and experimenting with different hyperparameters.
Thanks for such a detailed explanation. It will surely help me.
@ines: How about using the default NER model (doc.ents) or the noun chunks (doc.noun_chunks) to find all the named entities?
I would expect chemicals like methanol and ethanol to each be an individual entity. The only problem at times might be a wrong tag (PER, for instance). Since we are interested in a particular domain, let's say the chemistry database, we could query the db for the named entities found. How do you feel about it?
For example:
input sentence is:
Ethanol, also called alcohol, ethyl alcohol, and drinking alcohol, is a chemical compound
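```
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: the model used is not named in this thread
doc = nlp("Ethanol, also called alcohol, ethyl alcohol, and drinking alcohol, is a chemical compound")
# print named entities:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# print noun chunks:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
```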
Output:
named entities:
## comes [] <EMPTY>
noun chunks:
(u'Ethanol', u'Ethanol', u'nsubj', u'called')
(u'alcohol', u'alcohol', u'dobj', u'called')
(u'ethyl alcohol', u'alcohol', u'conj', u'alcohol')
(u'alcohol', u'alcohol', u'dobj', u'drinking')
(u'a chemical compound', u'compound', u'attr', u'is')
In my experience, noun chunks work better here than the ents.
@honnibal: In one of the videos, you talk about annotating drugs and building a new drug entity type.
Do you think that the above approach could have worked as well?
Example:
input sentence is:
I have taken paracetamol to cure my fever
named entities:
## Again empty
noun chunks:
(u'I', u'I', u'nsubj', u'taken')
(u'paracetamol', u'paracetamol', u'dobj', u'taken')
(u'my fever', u'fever', u'dobj', u'cure')
The only catch here is to query the db smartly. Perhaps it's a naive way - let me know your feedback.
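To make the lookup step concrete, here is a rough sketch of what I mean. The set `CHEMICAL_DB` is a stand-in for the real database, and `normalise` and `lookup_chunks` are hypothetical helpers. The chunk texts are passed in as plain strings, as they would come from `chunk.text`:

```python
# Hypothetical in-memory stand-in for the chemistry database (lowercase names).
CHEMICAL_DB = {"ethanol", "alcohol", "ethyl alcohol", "paracetamol"}

# Leading words to strip before the lookup (assumption: a small hand-picked list).
LEADING_WORDS = {"a", "an", "the", "my", "this", "that"}


def normalise(chunk_text):
    """Lowercase the chunk and drop leading determiners/possessives."""
    words = chunk_text.lower().split()
    while words and words[0] in LEADING_WORDS:
        words = words[1:]
    return " ".join(words)


def lookup_chunks(chunk_texts, db=CHEMICAL_DB):
    """Return the chunks whose normalised text is an exact entry in the db."""
    return [text for text in chunk_texts if normalise(text) in db]


chunks = ["I", "paracetamol", "my fever"]
print(lookup_chunks(chunks))
```

An exact lookup like this will miss chemicals that sit inside a longer chunk, so the query would probably need to be fuzzier than this.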