Spacy: NER Training does not work when using BILOU tagging

Created on 1 Dec 2016 · 7Comments · Source: explosion/spaCy

NER training is not working per document/tutorials.

Specifically, offsets _do_ appear to work. entity labels _do not_ appear to work. Also the documentation is in conflict with itself which confuses the situation.

Using the entity-label NER training example

e.g.

nlp = spacy.load('en')
doc = Doc(nlp.vocab, [u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, [u'U-ANIMAL', u'O', u'O', u'O'])
ner = EntityRecognizer(nlp.vocab, entity_types=['ANIMAL'])
ner.update(doc, gold)

As far as I know my syntax is correct. This doesn't work either:

doc = Doc(nlp.vocab, words=["The", "Law", "and", "Justice", "party", "is", "growing", "in", "Poland"])
gold = GoldParse(doc, ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'GPE'])
ner.update(doc, gold)

I get the following error:

TypeError Traceback (most recent call last)
in ()
----> 1 ner.update(doc, gold)

/usr/local/lib/python3.5/dist-packages/spacy/syntax/parser.cpython-35m-x86_64-linux-gnu.so in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:7788)()

/usr/local/lib/python3.5/dist-packages/spacy/syntax/ner.cpython-35m-x86_64-linux-gnu.so in spacy.syntax.ner.BiluoPushDown.preprocess_gold (spacy/syntax/ner.cpp:4782)()

/usr/local/lib/python3.5/dist-packages/spacy/syntax/ner.cpython-35m-x86_64-linux-gnu.so in spacy.syntax.ner.BiluoPushDown.lookup_transition (spacy/syntax/ner.cpp:5145)()

TypeError: argument of type 'NoneType' is not iterable

I think this is a bug in GoldParse since offsets appear to work.

e.g.

ner = EntityRecognizer(nlp.vocab, entity_types=['ORG'])
doc = nlp.make_doc('The Law and Justice party is growing')
gold = GoldParse(doc, entities=[(4,25,'ORG')])
ner.update(doc, gold)
ner(doc)
print(doc.ents[0].text,doc.ents[0].label_)
--> Law and Justice party ORG

Also, the documentation is very inconsistent/confusing right now.

Conflicting examples:

example 1 does not work
example 2 works for token offsets, does not work for token-level entity annotation.
example 3 is linked from 1 (as the 'full example'), and they are totally different examples.

Your Environment

Ubuntu
Python 3.5.2
1.2, latest PIP

bug docs

Source

brianrusso

Most helpful comment

Fixed in v1.8.0 :tada:

>>> nlp = spacy.load('en')
>>> from spacy.gold import GoldParse
>>> doc = nlp.make_doc(u'Facebook is a company')
>>> nlp.tagger(doc)
>>> gold = GoldParse(doc, entities=['U-ORG', 'O', 'O', 'O'])
>>> [t for t in gold.ner]
['U-ORG', 'O', 'O', 'O']
>>> nlp.entity.update(doc, gold)
1.0
>>> nlp.entity.update(doc, gold)
1.0
>>> nlp.entity.update(doc, gold)
0.0
>>> nlp.entity(doc)
>>> for ent in doc:
...   print(ent.text, ent.label_)
Facebook ORG
...

honnibal on 17 Apr 2017

👍3

All 7 comments

Same here. I get this error:

  File "temp.py", line 18, in <module>
    tagger.update(doc, gold)

  File "spacy/tagger.pyx", line 253, in spacy.tagger.Tagger.update (spacy/tagger.cpp:6803)
    self.vocab.morphology.assign_tag(&tokens.c[i], eg.guess)

  File "spacy/morphology.pyx", line 41, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3943)
    tag_id = self.reverse_index[tag]

KeyError: 0

when executing this:

pos_tag_map = {
    'N': {"pos": "NOUN"},
    'V': {"pos": "VERB"},
    'J': {"pos": "ADJ"}
}
vocab = Vocab(tag_map=pos_tag_map)
tagger = Tagger(vocab)
train_words = ["I", "like", "green", "eggs"]
doc = Doc(vocab, words=train_words)
train_tags = ["N", "V", "J", "N"]
gold = GoldParse(doc, tags=train_tags)
tagger.update(doc, gold)

I've traced the value being passed to tag_id = self.reverse_index[tag] and it's coming from linalg in thinc

    @staticmethod
    cdef inline int arg_max_if_true(
            const weight_t* scores, const int* is_valid, const int n_classes) nogil:
        cdef int i
        cdef int best = 0
        cdef weight_t mode = -900000
        for i in range(n_classes):
            if is_valid[i] and scores[i] > mode:
                mode = scores[i]
                best = i
        return best

Now this returns an integer, not a string which tag_id = self.reverse_index[tag] is expecting.

himarora on 1 Dec 2016

Fixed in v1.8.0 :tada:

>>> nlp = spacy.load('en')
>>> from spacy.gold import GoldParse
>>> doc = nlp.make_doc(u'Facebook is a company')
>>> nlp.tagger(doc)
>>> gold = GoldParse(doc, entities=['U-ORG', 'O', 'O', 'O'])
>>> [t for t in gold.ner]
['U-ORG', 'O', 'O', 'O']
>>> nlp.entity.update(doc, gold)
1.0
>>> nlp.entity.update(doc, gold)
1.0
>>> nlp.entity.update(doc, gold)
0.0
>>> nlp.entity(doc)
>>> for ent in doc:
...   print(ent.text, ent.label_)
Facebook ORG
...

honnibal on 17 Apr 2017

👍3

I am using spacy 1.9.0 for updating 'en' model with my own tag.

Here is a code snippet:

model_name = 'en'
entity_label = 'U-CATCHPHRASE'
output_directory = './data/spacy_ner.model'

def train_ner(nlp, train_data, output_dir):
    # Add new words to vocab
    for raw_text, _ in train_data:
        doc = nlp.make_doc(raw_text)
        for word in doc:
            _ = nlp.vocab[word.orth]

    for itn in range(5):
        random.shuffle(train_data)
        print("NER Training iteration {}...".format(itn))
        for raw_text, entity_tags in train_data:
            print("NER Training raw text {}".format(raw_text))
            print("NER Training tags {}".format(entity_tags))

            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_tags)
            nlp.tagger(doc)
            loss = nlp.entity.update(doc, gold)
    nlp.end_training()
    nlp.save_to_directory(output_dir)

Raw text is like : "view of the fact that the suit "
Entity tags are: [ 'O', 'O', 'O', 'U-CATCHPHRASE', 'O', 'U-CATCHPHRASE', 'U-CATCHPHRASE']
Getting error:

Traceback (most recent call last):

  File "spacy/syntax/parser.pyx", line 320, in spacy.syntax.parser.Parser.update (spacy/syntax/parser.cpp:10286)
    self.moves.preprocess_gold(gold)
  File "spacy/syntax/ner.pyx", line 100, in spacy.syntax.ner.BiluoPushDown.preprocess_gold (spacy/syntax/ner.cpp:4960)
    gold.c.ner[i] = self.lookup_transition(gold.ner[i])
  File "spacy/syntax/ner.pyx", line 136, in spacy.syntax.ner.BiluoPushDown.lookup_transition (spacy/syntax/ner.cpp:5653)
    raise KeyError(name)
KeyError: 'U-CATCHPHRASE'

What's wrong here? Any suggestions?

yogeshhk on 1 Sep 2017

Try nlp.entity.add_label('CATCHPHRASE') before training?

honnibal on 1 Sep 2017

Yes that helped. Thanks!!

So add_label takes the new NER tag word, but the training data needs to be provided with "U-" at the start, else there is error of not finding U- or B- at the start of the tags.

I ran with a few sample sentences and IOB tags, converted them to iob_to_biluo and that seems to work ok.

But when I ran with another training set, I ran into another problem (sorry to keep bugging you).
Now the new label is 'LEGAL' and here is the code:

model_name = 'en'
entity_label = 'LEGAL'
output_directory = './data/spacy_ner.model'

nlp = spacy.load(model_name)
nlp.entity.add_label(entity_label)
train_data = [(" ".join(text), tags) for text, tags in zip(X,y)]
ner = train_ner(nlp, train_data, output_directory)

def train_ner(nlp, train_data, output_dir):
    for raw_text, _ in train_data:
        doc = nlp.make_doc(raw_text)
        for word in doc:
            _ = nlp.vocab[word.orth]

    for itn in range(10):
        random.shuffle(train_data)
        for raw_text, entity_tags in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_tags)
            nlp.tagger(doc)
            loss = nlp.entity.update(doc, gold)
    nlp.end_training()
    nlp.save_to_directory(output_dir)

and the error is:

Traceback (most recent call last):

    gold = GoldParse(doc, entities=entity_tags)
  File "spacy/gold.pyx", line 294, in spacy.gold.GoldParse.__init__ (spacy/gold.cpp:10834)
    self.ner[i] = entities[gold_i]
IndexError: list index out of range

I checked that number of words in doc and tags in entity_tags are of same number. So there is one-to-one correspondence.
Then I saw that some of the tags were not converted by iob_to_biluoto BILUO format. That could be a bug for further investigation. As a workaround, when I converted them manually like:

        tags = [w.replace('B-LEGAL', "U-" + entity_label) for w in tags]
        tags = [w.replace('I-LEGAL', "U-" + entity_label) for w in tags]

still the error of "index out of range" persisted. Can you guess?

yogeshhk on 2 Sep 2017

For experiment, I skipped the Index Exception, using Try block, and allowed non-error sentences to be passed for training.
The output was as below:

LEGAL consideration of other aspects
LEGAL consideration of other aspects
LEGAL qualification awarded by
LEGAL qualification is
LEGAL circumstantial evidence see
LEGAL circumstantial evidence its not
LEGAL circumstantial evidence the
LEGAL circumstantial evidence is

This obviously does not look very appropriate. Some extra tokens are appearing towards end. Am I missing something, say hyperparamaters? Is it unidirectional or bidirectional RNN/LSTM that is being used? Or just that low volume of training data is causing it?

yogeshhk on 6 Sep 2017

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.