Hi, I have a question regarding the tokenizers. I am using this for biomedical text with sentences such as: "Enhanced expression of AEG-1 via a replication-incompetent adenovirus (Ad.AEG-1) in HeLa cells markedly increased binding of the transcriptional activator p50/p65 complex of NF-kappaB."
The default tokenizer unfortunately splits `(Ad.AEG-1)` into `(`, `Ad`, `.`, `AEG-1`, and `)`, which also causes the sentence splitter to break the sentence at that `.`. For my purposes, the correct tokenization would be `(`, `Ad.AEG-1`, and `)`.
I was hoping you could give me some help with implementing my own tokenizer, so I can avoid this issue.
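To make the behaviour concrete, here is a minimal reproduction of what I am seeing (the `spacy.en` import reflects the current release; adjust for your install):

```python
from spacy.en import English

nlp = English()
doc = nlp(u"Enhanced expression of AEG-1 via a replication-incompetent "
          u"adenovirus (Ad.AEG-1) in HeLa cells markedly increased binding "
          u"of the transcriptional activator p50/p65 complex of NF-kappaB.")
print([t.orth_ for t in doc])
# ... '(', 'Ad', '.', 'AEG-1', ')', ... and the sentence may be split at '.'
```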
It is possible to implement your own tokenizer. Do search the other threads; this has been explained there.
```python
from spacy.en import English  # 2015-era import; newer versions use spacy.load(...)

nlp = English()

def my_split_function(string):
    # Whitespace-only split: "(Ad.AEG-1)" survives as a single token.
    return string.split()

# tokens_from_list builds a Doc directly from pre-split strings,
# bypassing the default tokenization rules entirely.
old_tokenizer = nlp.tokenizer
nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(my_split_function(string))
```
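One caveat: a plain whitespace split keeps the parentheses attached, so `(Ad.AEG-1)` comes through as a single token including the brackets. If you want exactly `(`, `Ad.AEG-1`, `)`, a sketch of an alternative is to register a tokenizer special case (available as `add_special_case` in later spaCy versions); special cases are re-checked after prefixes and suffixes are stripped, so the surrounding punctuation is still split off:

```python
from spacy.symbols import ORTH

# Teach the tokenizer that "Ad.AEG-1" is an atomic token. Special cases
# are consulted again after prefix/suffix stripping, so "(Ad.AEG-1)"
# still tokenizes as "(", "Ad.AEG-1", ")".
nlp.tokenizer.add_special_case("Ad.AEG-1", [{ORTH: "Ad.AEG-1"}])
```

In practice you would add one special case per identifier you want protected, or adjust the tokenizer's infix rules if the pattern is regular enough to express as a regex.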
@aliabbasjp Thank you so much!