Flair: how can I pass my own tokenizer to the Sentence class?

Created on 30 Sep 2019 · 4 comments · Source: flairNLP/flair

Hello, how can I pass in a tokenizer that I implemented myself?

I use this code:

from typing import List
from flair.data import Sentence, Token

def my_tokenizer(text: str) -> List[Token]:
    # this is my tokenizer implementation;
    # it is similar to the space tokenizer,
    # except that it segments the sentence on \x02
    # instead of the space character
    return [Token(token) for token in text.split('\x02')]

# then I pass it into Sentence
Sentence('hello\x02world', use_tokenizer=my_tokenizer)

but it doesn't work. So how can I implement my own tokenizer?

What I can find is this comment in flair.data. I know that we cannot pass a function to it, just as the comment says. But then how can I accomplish what I need?

thank you in advance!

question


All 4 comments

Hi @bbruceyuan, a small example would be:

from typing import List
from flair.data import Sentence, Token

def own_tokenizer(text: str) -> List[Token]:
    """
    Tokenizer based on space character only.
    """
    tokens: List[Token] = [Token(token) for token in text.split()]

    return tokens

s = Sentence("Berlin and Munich .", use_tokenizer=own_tokenizer)

print(s)

Make sure that your own tokenizer returns a List of Token objects :)

Note: this is just a simple example. A more sophisticated solution would also store token offsets; see this example.
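A minimal sketch of such an offset-aware tokenizer could look like this (a sketch only, assuming the Token constructor accepts a start_position keyword, as in flair 0.4.x; the name offset_tokenizer is chosen here for illustration):

from typing import List
from flair.data import Sentence, Token

def offset_tokenizer(text: str) -> List[Token]:
    """
    Space tokenizer that also records each token's character offset.
    """
    tokens: List[Token] = []
    position = 0
    for word in text.split(' '):
        if word:
            # start_position is assumed to be supported by Token (flair 0.4.x)
            tokens.append(Token(word, start_position=position))
        position += len(word) + 1  # advance past the word and the following space
    return tokens

s = Sentence("Berlin and Munich .", use_tokenizer=offset_tokenizer)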

Thanks for your reply. My implementation does exactly this, but it doesn't work.

Take your example: you split the input text on the space character, and as a result you get a Sentence with 4 Tokens.

If you print the result s, you get the output Sentence: "Berlin and Munich ." - 4 Tokens.

But if you replace the code with:

from typing import List
from flair.data import Sentence, Token

def own_tokenizer(text: str) -> List[Token]:
    """
    Tokenizer based on space character only.
    """
    tokens: List[Token] = [Token(token) for token in text.split('\x02')]

    return tokens

s = Sentence("Berlin\x02and\x02Munich\x02.", use_tokenizer=own_tokenizer)

print(s)

I only replaced split() with split('\x02'), so the output should be
Sentence: "Berlin and Munich ." - 4 Tokens
but instead I get
Sentence: "Berlin and Munich ." - 7 Tokens

It should be 4 Tokens, because we split the text on \x02.

So is there something wrong with my understanding of it?
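A quick way to see where the extra tokens come from is to print each token individually; a flair Sentence can be iterated token by token (debugging sketch only):

for token in s:
    print(token)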

OTHER INFO

OS info (via screenfetch):

OS: Ubuntu 18.04 bionic
Kernel: x86_64 Linux 4.15.0-65-generic
Uptime: 11h 29m
Packages: 1816
Shell: bash 4.4.19
CPU: Intel Core i7-7800X @ 12x 4GHz
GPU: GeForce RTX 2080 Ti, GeForce RTX 2080 Ti
RAM: 1079MiB / 96358MiB

CUDA version: 10.2

Flair package info:
flair                     0.4.3

@bbruceyuan Could you try the latest master version? I tried it with b81bf847f20d785abfccccab8e37d1a087fc6ded and the error seems to be fixed now.

I'm getting:

In [5]: print(s)
Sentence: "Berlin and Munich ." - 4 Tokens

now :)
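For reference, one common way to install the current master version (assuming a standard pip and git setup) is:

pip install --upgrade git+https://github.com/flairNLP/flair.git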

@stefan-it

I have just tried the master version and got the correct result.
It seems the bug has been fixed.

