Hi, I have a question regarding the tokenizers. I am using this for biomedical text with sentences such as: "Enhanced expression of AEG-1 via a replication-incompetent adenovirus (Ad.AEG-1) in HeLa cells markedly increased binding of the transcriptional activator p50/p65 complex of NF-kappaB."
The default tokenizer unfortunately splits `(Ad.AEG-1)` into `(`, `Ad`, `.`, `AEG-1`, and `)`, which also causes the sentence splitter to break the sentence at that `.`. For my purposes, the correct tokenization would be `(`, `Ad.AEG-1`, and `)`.
I was hoping you could give me some help with implementing my own tokenizer, so I can avoid this issue.
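To make the behaviour concrete, here is a minimal reproduction of what I am seeing (the `spacy.en` import reflects the current release; adjust for your install):

```python
from spacy.en import English

nlp = English()
doc = nlp(u"Enhanced expression of AEG-1 via a replication-incompetent "
          u"adenovirus (Ad.AEG-1) in HeLa cells markedly increased binding "
          u"of the transcriptional activator p50/p65 complex of NF-kappaB.")
print([t.orth_ for t in doc])
# ... '(', 'Ad', '.', 'AEG-1', ')', ... and the sentence may be split at '.'
```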
It is possible to implement your own tokenizer. Do search the other threads; this has been explained there.
```python
from spacy.en import English  # 2015-era import; newer versions use spacy.load(...)

nlp = English()

def my_split_function(string):
    # Whitespace-only split: "(Ad.AEG-1)" survives as a single token.
    return string.split()

# tokens_from_list builds a Doc directly from pre-split strings,
# bypassing the default tokenization rules entirely.
old_tokenizer = nlp.tokenizer
nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(my_split_function(string))
```
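One caveat: a plain whitespace split keeps the parentheses attached, so `(Ad.AEG-1)` comes through as a single token including the brackets. If you want exactly `(`, `Ad.AEG-1`, `)`, a sketch of an alternative is to register a tokenizer special case (available as `add_special_case` in later spaCy versions); special cases are re-checked after prefixes and suffixes are stripped, so the surrounding punctuation is still split off:

```python
from spacy.symbols import ORTH

# Teach the tokenizer that "Ad.AEG-1" is an atomic token. Special cases
# are consulted again after prefix/suffix stripping, so "(Ad.AEG-1)"
# still tokenizes as "(", "Ad.AEG-1", ")".
nlp.tokenizer.add_special_case("Ad.AEG-1", [{ORTH: "Ad.AEG-1"}])
```

In practice you would add one special case per identifier you want protected, or adjust the tokenizer's infix rules if the pattern is regular enough to express as a regex.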
@aliabbasjp Thank you so much!