Hello, how can I pass a tokenizer that I implemented myself?
I use this code:

```python
def my_tokenizer(x):
    # This is my tokenizer implementation.
    # It is similar to the space tokenizer, except that it splits
    # on \x02 instead of spaces, because I want to segment the
    # sentence with \x02 and then pass it into Sentence.
    ...

Sentence('hello\x02world', use_tokenizer=my_tokenizer)
```

But it doesn't work. So how can I implement my own tokenizer?
All I can find is this comment in flair.data. I know that we cannot pass a function to it, just as the comment says, but how can I accomplish what I need?
Thank you in advance!
Hi @bbruceyuan, a small example would be:

```python
from typing import List

from flair.data import Sentence, Token


def own_tokenizer(text: str) -> List[Token]:
    """Tokenizer based on the space character only."""
    tokens: List[Token] = [Token(token) for token in text.split()]
    return tokens


s = Sentence("Berlin and Munich .", use_tokenizer=own_tokenizer)
print(s)
```

Make sure that your own tokenizer returns a List of Token objects :)

Notice: this is just a simple example. A more sophisticated solution would also store token offsets, see this example.
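The offset bookkeeping can be sketched without flair at all. The function below is a hypothetical illustration (the name `tokenize_with_offsets` is mine, not flair's): it splits on a delimiter while recording each token's start offset in the original string, which is the same information you would pass to flair's Token as its start position.

```python
def tokenize_with_offsets(text, delimiter="\x02"):
    # Split on the delimiter while recording each token's start
    # offset, analogous to storing start positions on Token objects.
    tokens = []
    offset = 0
    for part in text.split(delimiter):
        if part:
            tokens.append((part, offset))
        offset += len(part) + len(delimiter)
    return tokens


print(tokenize_with_offsets("Berlin\x02and\x02Munich\x02."))
# [('Berlin', 0), ('and', 7), ('Munich', 11), ('.', 18)]
```

In a real tokenizer you would construct Token objects from these pairs instead of returning tuples.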
Thanks for your reply. My implementation is just like yours, but it doesn't work.

In your example you split the input text on the space character and, as a result, get a Sentence with 4 tokens: if you print s, the output is Sentence: "Berlin and Munich ." - 4 Tokens.
But if you replace the code with:

```python
from typing import List

from flair.data import Sentence, Token


def own_tokenizer(text: str) -> List[Token]:
    """Tokenizer based on the \x02 character only."""
    tokens: List[Token] = [Token(token) for token in text.split('\x02')]
    return tokens


s = Sentence("Berlin\x02and\x02Munich\x02.", use_tokenizer=own_tokenizer)
print(s)
```

that is, just replace split() with split('\x02'), then the output should be

    Sentence: "Berlin and Munich ." - 4 Tokens

but instead we get

    Sentence: "Berlin and Munich ." - 7 Tokens

It should be 4 tokens, because we split the text on \x02.
So is there something wrong with my understanding?
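For what it's worth, the splitting itself behaves as expected in plain Python, which is why 4 tokens are the expected result:

```python
# Plain-Python check: splitting on \x02 yields exactly 4 pieces,
# so a tokenizer built on this split should produce 4 tokens.
text = "Berlin\x02and\x02Munich\x02."
tokens = text.split("\x02")
print(tokens)       # ['Berlin', 'and', 'Munich', '.']
print(len(tokens))  # 4
```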
OTHER INFO
OS info (from screenfetch):
OS: Ubuntu 18.04 bionic
Kernel: x86_64 Linux 4.15.0-65-generic
Uptime: 11h 29m
Packages: 1816
Shell: bash 4.4.19
CPU: Intel Core i7-7800X @ 12x 4GHz
GPU: GeForce RTX 2080 Ti, GeForce RTX 2080 Ti
RAM: 1079MiB / 96358MiB
CUDA version: 10.2
Flair package info:
flair 0.4.3
@bbruceyuan Could you try the latest master version? I tried it with b81bf847f20d785abfccccab8e37d1a087fc6ded and the error seems to be fixed now.
I'm getting:
In [5]: print(s)
Sentence: "Berlin and Munich ." - 4 Tokens
now :)
@stefan-it
I just tried the master version and got the correct result.
It seems the bug has been fixed.