When using the tokenizer.texts_to_sequences() function with unicode inputs, the call breaks here:
File "build/bdist.macosx-10.10-x86_64/egg/keras/preprocessing/text.py", line 104, in texts_to_sequences
File "build/bdist.macosx-10.10-x86_64/egg/keras/preprocessing/text.py", line 118, in texts_to_sequences_generator
File "build/bdist.macosx-10.10-x86_64/egg/keras/preprocessing/text.py", line 31, in text_to_word_sequence
TypeError: character mapping must return integer, None or unicode
Digging further, the failure is at line 31 specifically:
    text = text.translate(maketrans(filters, split*len(filters)))
From the stackoverflow gods [1], it appears there's a solution for this, like so:
    def translate_non_alphanumerics(to_translate, translate_to=u'_'):
        not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'
        translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
        return to_translate.translate(translate_table)
    >>> translate_non_alphanumerics(u'<foo>!')
    u'_foo__'
Is this something that would be worth adding, or am I overlooking something obvious?
Thanks!
[1] http://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings
Same problem. Want to get it to work with mongodb.
Same problem. I think that this simple transformation should work.
Instead of line 31:

    text = text.translate(maketrans(filters, split*len(filters)))
We need:
    try:
        text = unicode(text, "utf-8")
    except TypeError:
        pass
    translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
    text = text.translate(translate_table)
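To see why the dict-based table fixes the error, here's a minimal, self-contained Python 3 sketch (the filter string mirrors Keras's default; the sample text is just an illustration): str.translate accepts a mapping from code points to code points, so it handles unicode directly, without string.maketrans.

```python
# Minimal Python 3 sketch of the dict-based translation table.
# str.translate accepts a {codepoint: codepoint} mapping, so it
# works on unicode text without string.maketrans.
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
split = ' '
translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
cleaned = 'héllo, wörld!'.translate(translate_table)
print(cleaned)  # 'héllo  wörld '
```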
hello everyone
I faced the same problem, but I sorted it out by converting the list:

    list = [s.encode('ascii') for s in list]

And it works.
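One caveat worth noting (my observation, not stated in the comment above): encoding to ASCII sidesteps the translate() bug only when the text is pure ASCII; any accented character raises UnicodeEncodeError, so this workaround is lossy for genuinely unicode corpora. A quick sketch:

```python
# The .encode('ascii') workaround succeeds only for pure-ASCII text;
# any accented character raises UnicodeEncodeError.
texts = [u'hello', u'world']
encoded = [s.encode('ascii') for s in texts]  # fine: ASCII-only input
print(encoded)  # [b'hello', b'world']

try:
    u'héllo'.encode('ascii')
except UnicodeEncodeError:
    print('non-ASCII input cannot be encoded this way')
```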
Can we please have this fix released? I cannot use the Keras Tokenizer for this reason. Thanks!
I have the same issue. Is it fixed?
Will it be fixed in Keras 2?
I edited the code in the keras/preprocessing/text.py file as one of the comments above suggested, and it worked like a charm for me:

    try:
        text = unicode(text, "utf-8")
    except TypeError:
        pass
    translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
    text = text.translate(translate_table)
Thank you @frnsys @atyamsriharsha. I cloned the repo, edited the file to replace line 38, and installed from the local repository. It worked.
This works! Is there a plan to merge this into Keras master?
While the fix is still missing (1.5 years and counting to merge a one-liner, wat?), here's a temporary monkey-patch. Calling this code before you start using Tokenizer seems to do the job without the need to edit your site-packages or pull from separate branches.
    import keras.preprocessing.text
    from string import maketrans  # needed for the Python 2 (non-unicode) branch

    def text_to_word_sequence(text,
                              filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                              lower=True, split=" "):
        if lower:
            text = text.lower()
        if type(text) == unicode:
            translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
        else:
            translate_table = maketrans(filters, split * len(filters))
        text = text.translate(translate_table)
        seq = text.split(split)
        return [i for i in seq if i]

    keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence
You can also do:

    from unidecode import unidecode
    fit_on_texts(unidecode(textdata))
Note: if you're using Python 3, then you should replace

    text = unicode(text, "utf-8")

with

    text = str(text, "utf-8")

in @rsgreen's solution.
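To make the Python 2 → 3 change concrete, a small sketch (the sample bytes are illustrative): in Python 3, str(raw, "utf-8") plays the role that unicode(raw, "utf-8") played in Python 2, and is equivalent to raw.decode("utf-8").

```python
# Python 3: bytes must be decoded to str before translate() can use
# a dict table; str(raw, 'utf-8') is the analogue of Python 2's
# unicode(raw, 'utf-8') and is equivalent to raw.decode('utf-8').
raw = 'héllo'.encode('utf-8')   # bytes, e.g. as read from a file
text = str(raw, 'utf-8')
assert text == raw.decode('utf-8') == 'héllo'
```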
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
While the solution presented by @konstantint worked for me, I had to make a small change for it to work in Python 3. Contrary to what was the case in Python 2, strings in Python 3 are unicode by default (stackoverflow answer source).
As such, on line 7 of that snippet, if type(text) == unicode: has to be changed to if type(text) == str: for it to work in Python 3.
In the end, the code that worked for me is the following:

    import keras.preprocessing.text

    def text_to_word_sequence(text,
                              filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                              lower=True, split=" "):
        if lower:
            text = text.lower()
        if type(text) == str:
            translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
        else:
            # Python 2 fallback; never reached on Python 3, where all strings are str
            translate_table = maketrans(filters, split * len(filters))
        text = text.translate(translate_table)
        seq = text.split(split)
        return [i for i in seq if i]

    keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence
@rsgreen It works for me. Thanks a lot.
Closing as this is resolved