spaCy: Using POS tagger on existing tokens

Created on 1 Feb 2016 · 5 comments · Source: explosion/spaCy

I've got a large amount of text that's already been tokenised into lists of strings by an external process.

Is there any way to pass that through spaCy's part-of-speech tagger? I can see that it can be called on a Doc object, but I can't see any way of creating one other than by calling an English object on a string.

Label: docs

All 5 comments

Hey,

The docs should be clearer on this. You want:

tokens = nlp.tokenizer.tokens_from_list(token_strings)  # build a Doc from pre-tokenized strings
nlp.tagger(tokens)  # annotates the Doc in place

If you want to apply preset tags, use:

nlp.tagger.tag_from_strings(tokens, tag_strs)

The doc.from_array method is also useful for loading annotations.
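
For example, here's a rough sketch of loading preset tags through from_array (this assumes the attribute constants in spacy.attrs and the integer-array API of spaCy versions from this era; the expected array dtype may differ between releases):

import numpy
from spacy.attrs import TAG

# one row per token, one column per attribute; the values are the
# vocab's integer IDs for the tag strings
tag_ids = numpy.array([[nlp.vocab.strings[t]] for t in tag_strs], dtype=numpy.int32)
tokens.from_array([TAG], tag_ids)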

Btw, note that tagging text with tokenization that differs from what was used during training can sometimes result in low accuracy, depending on the divergence between the tokenization schemes. Be particularly careful of escaping schemes. If brackets are escaped in your data, you'll want to unescape them before sending them to the tagger.
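
For instance, a minimal sketch of undoing PTB-style bracket escapes before tagging (this mapping is the standard PTB convention, not a spaCy API):

PTB_UNESCAPE = {
    '-LRB-': '(', '-RRB-': ')',
    '-LSB-': '[', '-RSB-': ']',
    '-LCB-': '{', '-RCB-': '}',
}
words = [PTB_UNESCAPE.get(w, w) for w in words]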

Thanks for that; knowing I can create tokens from a list of strings should help with a few things.

By the way, is there anything else I need to do to initialise the tagger? I'm currently getting None when using your code above, e.g.:

import spacy.en
NLP = spacy.en.English()

sent = 'This is a sentence .'
words = sent.split()
tokens = NLP.tokenizer.tokens_from_list(words)
print('Tokens:', tokens)
tags = NLP.tagger(tokens)
print('Tags:', tags)

outputs:

Tokens: This is a sentence . 
Tags: None

The tagger, parser and NER all modify the tokens object in place. The design is optimized around using the annotations via the Doc, Token and Span objects, so that it's easy to hop between the different representations. This also lets us store everything efficiently and handle the string<->integer encodings.

To get a list of strings do:

[token.tag_ for token in doc]

This produces the PTB-style tags like VBZ, VBD, etc. You can instead get the Google Universal tags at token.pos_. The integer-encoded versions are at token.tag and token.pos, without the trailing underscore.
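
Putting it together, your script becomes (the tagger returns None by design, so read the tags off the tokens afterwards):

import spacy.en
NLP = spacy.en.English()

words = 'This is a sentence .'.split()
tokens = NLP.tokenizer.tokens_from_list(words)
NLP.tagger(tokens)  # annotates the tokens in place; the return value is None
print([(t.orth_, t.tag_, t.pos_) for t in tokens])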

Ah, thanks again, that works perfectly.

