The unidecode example in https://nim-lang.org/docs/unidecode.html#unidecode,string does not work:
unidecode("x53x17x4ExB0")
The "x" are interpreted as literal "x" chars.
If I fix the example to do:
unidecode("北京")
where 北 is 0x5317 and 京 is 0x4eac, I get the string "Qiong Mu", not "Bei Jing".
unidecode("北京") is equivalent to unidecode("\xe5\x8c\x97\xe4\xba\xac").
Both evaluate to "Qiong Mu".
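As a quick sanity check of the code points and the byte sequence quoted above, here is a small Python 3 sketch. It uses only built-ins and does not call unidecode itself:

```python
# -*- coding: utf-8 -*-
# Check the code points and the UTF-8 bytes of the two characters.
text = "北京"

code_points = [hex(ord(ch)) for ch in text]
utf8_bytes = text.encode("utf-8")

print(code_points)  # ['0x5317', '0x4eac']
print(utf8_bytes)   # b'\xe5\x8c\x97\xe4\xba\xac'
```

So the literal and the escaped byte string really are the same input, and the wrong output has to come from the lookup rather than from the string encoding.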
Now I wonder what evaluates to "Bei Jing" ..
I tried the python version and unidecode("北京") does actually result in Bei Jing, so I think this is a bug.
@GULPF Interestingly.
I don't know Chinese, so I cannot comment on that for sure.
But I know Gujarati, and I use that in https://scripter.co/notes/nim/#unidecode.
Nim is evaluating echo unidecode("મારુ નામ કૌશલ મોદી છે. હૂં અમદાવાદ થી છું.") correctly.
OK, I confirm that unidecode gives the same output for that Gujarati script between Nim and Python.
But gives different output for the Chinese script.
So yep, looks like a bug.
To fix it, run python gen.py.
@Araq I don't get it.. what is gen.py .. OK, got it: https://github.com/nim-lang/Nim/blob/devel/lib/pure/unidecode/gen.py
I tried updating the .dat file, but couldn't.. looks like that gen.py doesn't work for python3:
File "gen.py", line 12
u = eval("u'\u%04x'" % x)
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
So hoping someone else fixes that file.
Maybe changing it to u = eval("u'\\u%04x'" % x) works?
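For what it's worth, the eval could probably be dropped altogether. A minimal sketch, assuming Python 3 only (on Python 2 the equivalent built-in would be unichr); the helper name is made up for illustration and this is not what gen.py does:

```python
def code_point_to_char(x):
    # Hypothetical helper: build the character directly from its code point.
    return chr(x)

via_eval = eval("u'\\u%04x'" % 0x5317)  # the escaped form suggested above
via_chr = code_point_to_char(0x5317)

print(via_eval == via_chr)
```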
I just tried that.. got a new error:
Traceback (most recent call last):
File "gen.py", line 24, in <module>
main2()
File "gen.py", line 11, in main2
for x in xrange(128, 0xffff + 1):
NameError: name 'xrange' is not defined
I think that script should be updated to work with python3, and the unidecode.dat generation should be part of the Nim build process.
Change xrange to range.
Heh, thanks, was just reading https://stackoverflow.com/questions/15014310/why-is-there-no-xrange-function-in-python3#comment21094703_15014310 :D
Now I got hundreds of warnings, again ending in another error:
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffc' will be ignored. You might be using a narrow Python build.
return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffd' will be ignored. You might be using a narrow Python build.
return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffe' will be ignored. You might be using a narrow Python build.
return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udfff' will be ignored. You might be using a narrow Python build.
return _unidecode(string)
Traceback (most recent call last):
File "gen.py", line 24, in <module>
main2()
File "gen.py", line 20, in main2
f.write("%s\n" % d)
TypeError: a bytes-like object is required, not 'str'
f.write(bytes(...))
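To illustrate that TypeError without touching the real file, here is a rough Python 3 sketch. io.BytesIO stands in for a file opened in binary mode, and the "ascii" encoding is an assumption (unidecode output is meant to be ASCII):

```python
import io

buf = io.BytesIO()       # stands in for a file opened in binary mode
line = "%s\n" % "Bei"    # a str, as produced by the % formatting in gen.py

try:
    buf.write(line)      # str into a binary stream: raises TypeError on Python 3
    wrote_str = True
except TypeError:
    wrote_str = False

buf.write(line.encode("ascii"))  # encoding to bytes first works
print(wrote_str, buf.getvalue())
```

The other way out is to open the file in text mode and keep writing str.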
I can port it, wait a minute.

thanks
The following works with Python 3 and Python 2
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf
from unidecode import unidecode
import warnings
warnings.simplefilter("ignore")

def main2():
    data = []
    for x in range(128, 0xffff + 1):
        u = eval("u'\\u%04x'" % x)
        val = unidecode(u)
        data.append(val)
    f = open("unidecode.dat", "w+")
    for d in data:
        f.write("%s\n" % d)
    f.close()

main2()
@tim-st Thank you! I was able to regenerate the unidecode.dat file using your script, but ..
@Araq The bug that unidecode outputs "Qiong Mu" instead of "Bei Jing" still remains.
So for debug, I wrote this code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf
from unidecode import unidecode
try:
    import warnings
    warnings.simplefilter("ignore")
except ImportError:
    pass

def main2():
    f = open("unidecode.dat", "w+")
    for x in range(128, 0xffff + 1):
        u = eval("u'\\u%04x'" % x)
        val = unidecode(u)
        # Debug only: prefix each line with its code point in hex.
        f.write("%x | " % x)
        f.write("%s\n" % val)
    f.close()

main2()
That produces https://ptpb.pw/QMlT/text.
Now in "北京":
But when I run:
import unidecode
let datfile = "/home/kmodi/downloads/git/Nim/lib/pure/unidecode/unidecode.dat" # location where I generated that .dat file
loadUnidecodeTable(datfile)
echo unidecode("北京")
I get:
: 5314 | Qiong 4ea9 | Mu
So, somewhere in unidecode.nim, that offset of 3 (5317-3=5314; 4eac-3=4ea9) is introduced.
Found the culprit!
2027 | .
2028 |
2029 |
202a |
202b |
3 blank lines in the dat file!
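To see how three extra line breaks turn into an offset of 3, here is a small simulation. It is only a sketch: the placeholder values, the assumption that U+2028 and U+2029 transliterate to "\n" and "\n\n", and the line-indexed lookup are stand-ins for illustration, not the real unidecode data or the Nim reader:

```python
FIRST = 128  # first code point stored in the table, as in gen.py

def build_table(special):
    # One value per code point, each followed by a newline.
    parts = []
    for cp in range(FIRST, 0xffff + 1):
        val = special.get(cp, "v%04x" % cp)  # placeholder transliteration
        parts.append(val + "\n")
    return "".join(parts)

def lookup(table_text, cp):
    # Hypothetical reader: line N of the file is taken to be code point FIRST + N.
    return table_text.split("\n")[cp - FIRST]

# Assumed values: U+2028 -> "\n" (1 extra line), U+2029 -> "\n\n" (2 extra lines).
broken = build_table({0x2028: "\n", 0x2029: "\n\n"})
fixed = build_table({0x2028: "", 0x2029: ""})

print(lookup(broken, 0x5317), lookup(broken, 0x4eac))  # v5314 v4ea9
print(lookup(fixed, 0x5317), lookup(fixed, 0x4eac))    # v5317 v4eac
```

Every entry after U+2029 lands three lines later than the reader expects, which matches the 5314 and 4ea9 seen from the Nim side.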
@Araq This fixes it:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf
from unidecode import unidecode
try:
    import warnings
    warnings.simplefilter("ignore")
except ImportError:
    pass

def main2():
    f = open("unidecode.dat", "w+")
    for x in range(128, 0xffff + 1):
        u = eval("u'\\u%04x'" % x)
        val = unidecode(u)
        # f.write("%x | " % x)
        # Blank out values containing newlines so every code point stays on one line.
        if x == 0x2028:  # U+2028 = LINE SEPARATOR
            val = ""
        elif x == 0x2029:  # U+2029 = PARAGRAPH SEPARATOR
            val = ""
        f.write("%s\n" % val)
    f.close()

main2()
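A more general variant of the same idea, as a sketch only: strip line breaks from every value instead of special-casing two code points, then check that the table has exactly one line per code point. The helper names are hypothetical and this is not the fix that was applied here:

```python
EXPECTED_LINES = 0xffff + 1 - 128  # one line per code point in range(128, 0xffff + 1)

def to_table_line(val):
    # Hypothetical helper: drop any line breaks so the value stays on one line.
    return val.replace("\r", "").replace("\n", "") + "\n"

def count_lines(table_text):
    # Number of newline-terminated lines in the generated table.
    return table_text.count("\n")

# Four sample values, two of them consisting only of line breaks.
sample = "".join(to_table_line(v) for v in ["Bei", "\n", "\n\n", "Jing"])
print(count_lines(sample))  # 4
```

Comparing count_lines of the generated file contents against EXPECTED_LINES would catch this kind of shift before the .dat file is committed.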
@tim-st Any idea if the refactoring can be done so that the .dat file can be written in binary format ("wb+") as before? Then the above hack won't be needed.
> the unidecode.dat generation should be part of the Nim build process.
No, then the results would depend on the version of Python's unidecode implementation that happens to be installed on the building machine. Just create a PR that patches the script and updates unidecode.dat. And while we're at it, this should be a Nimble package.
There's a typo in the fix:
unidecode.nim(72, 3) Error: undeclared identifier: 'doAassert'
unidecode.nim(72, 13) Error: attempting to call undeclared routine: 'doAassert'
unidecode.nim(72, 13) Error: attempting to call undeclared routine: 'doAassert'
unidecode.nim(72, 13) Error: expression 'doAassert' cannot be called
I noticed, working on it.
Thanks! Nim devel now builds fine.