When using the tokenizer.texts_to_sequences() function with unicode inputs, the call breaks here:
File "build/bdist.macosx-10.10-x86_64/egg/keras/preprocessing/text.py", line 104, in texts_to_sequences
File "build/bdist.macosx-10.10-x86_64/egg/keras/preprocessing/text.py", line 118, in texts_to_sequences_generator
File "build/bdist.macosx-10.10-x86_64/egg/keras/preprocessing/text.py", line 31, in text_to_word_sequence
TypeError: character mapping must return integer, None or unicode
Digging further, the failure is at line 31 specifically:
    text = text.translate(maketrans(filters, split*len(filters)))
From the stackoverflow gods [1], it appears there's a solution for this, like so:
    def translate_non_alphanumerics(to_translate, translate_to=u'_'):
        not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'
        translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
        return to_translate.translate(translate_table)
    >>> translate_non_alphanumerics(u'<foo>!')
    u'_foo__'
Is this something that would be worth adding, or am I overlooking something obvious?
Thanks!
[1] http://stackoverflow.com/questions/1324067/how-do-i-get-str-translate-to-work-with-unicode-strings
Same problem. Want to get it to work with mongodb.
Same problem. I think that this simple transformation should work.
Instead of line 31:

    text = text.translate(maketrans(filters, split*len(filters)))
We need:
    try:
        text = unicode(text, "utf-8")
    except TypeError:
        pass
    translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
    text = text.translate(translate_table)
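To see why the dict-based table fixes the error, here's a minimal, self-contained Python 3 sketch (the filter string mirrors Keras's default; the sample text is just an illustration): str.translate accepts a mapping from code points to code points, so it handles unicode directly, without string.maketrans.

```python
# Minimal Python 3 sketch of the dict-based translation table.
# str.translate accepts a {codepoint: codepoint} mapping, so it
# works on unicode text without string.maketrans.
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
split = ' '
translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
cleaned = 'héllo, wörld!'.translate(translate_table)
print(cleaned)  # 'héllo  wörld '
```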
hello everyone
I faced the same problem, but I sorted it out by converting the list:

    list = [s.encode('ascii') for s in list]

And it works.
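One caveat worth noting (my observation, not stated in the comment above): encoding to ASCII sidesteps the translate() bug only when the text is pure ASCII; any accented character raises UnicodeEncodeError, so this workaround is lossy for genuinely unicode corpora. A quick sketch:

```python
# The .encode('ascii') workaround succeeds only for pure-ASCII text;
# any accented character raises UnicodeEncodeError.
texts = [u'hello', u'world']
encoded = [s.encode('ascii') for s in texts]  # fine: ASCII-only input
print(encoded)  # [b'hello', b'world']

try:
    u'héllo'.encode('ascii')
except UnicodeEncodeError:
    print('non-ASCII input cannot be encoded this way')
```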
Can we please have this fix released? I cannot use the Keras Tokenizer for this reason. Thanks!
I have the same issue. Is it fixed?
Will it be fixed in Keras 2?
I edited the code in the keras/preprocessing/text.py file as one of the comments above suggested, and it worked like a charm for me:

    try:
        text = unicode(text, "utf-8")
    except TypeError:
        pass
    translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
    text = text.translate(translate_table)
Thank you @frnsys @atyamsriharsha. I cloned the repo, edited the file to replace line 38, and installed from the local repository. It worked.
This works! Is there a plan to merge this into Keras master?
While the fix is still missing (1.5 years and counting to merge a one-liner, wat?), here's a temporary monkey-patch. Calling this code before you start using Tokenizer seems to do the job without the need to edit your site-packages or pull from separate branches.
    import keras.preprocessing.text
    from string import maketrans  # needed for the Python 2 (non-unicode) branch

    def text_to_word_sequence(text,
                              filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                              lower=True, split=" "):
        if lower:
            text = text.lower()
        if type(text) == unicode:
            translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
        else:
            translate_table = maketrans(filters, split * len(filters))
        text = text.translate(translate_table)
        seq = text.split(split)
        return [i for i in seq if i]

    keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence
You can also do:

    from unidecode import unidecode
    fit_on_texts(unidecode(textdata))
Note: if you're using Python 3, then you should replace

    text = unicode(text, "utf-8")

with

    text = str(text, "utf-8")

in @rsgreen's solution.
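To make the Python 2 → 3 change concrete, a small sketch (the sample bytes are illustrative): in Python 3, str(raw, "utf-8") plays the role that unicode(raw, "utf-8") played in Python 2, and is equivalent to raw.decode("utf-8").

```python
# Python 3: bytes must be decoded to str before translate() can use
# a dict table; str(raw, 'utf-8') is the analogue of Python 2's
# unicode(raw, 'utf-8') and is equivalent to raw.decode('utf-8').
raw = 'héllo'.encode('utf-8')   # bytes, e.g. as read from a file
text = str(raw, 'utf-8')
assert text == raw.decode('utf-8') == 'héllo'
```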
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
While the solution presented by @konstantint worked for me, I had to make a small change for it to work in Python 3. Contrary to what was the case in Python 2, strings in Python 3 are unicode by default (stackoverflow answer source).
As such, on line 7 of that snippet, if type(text) == unicode: has to be changed to if type(text) == str: for it to work in Python 3.
In the end, the code that worked for me is the following:

    import keras.preprocessing.text

    def text_to_word_sequence(text,
                              filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                              lower=True, split=" "):
        if lower:
            text = text.lower()
        if type(text) == str:
            translate_table = {ord(c): ord(t) for c, t in zip(filters, split * len(filters))}
        else:
            # Python 2 fallback; never reached on Python 3, where all strings are str
            translate_table = maketrans(filters, split * len(filters))
        text = text.translate(translate_table)
        seq = text.split(split)
        return [i for i in seq if i]

    keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence
@rsgreen It works for me. Thanks a lot.
Closing as this is resolved