Nim: [unidecode] Fix the `unidecode` example

Created on 24 Aug 2018 · 25 Comments · Source: nim-lang/Nim

The unidecode example in https://nim-lang.org/docs/unidecode.html#unidecode,string does not work:

unidecode("x53x17x4ExB0")

The "x" are interpreted as literal "x" chars.

Stdlib Unicode

kaushalmodi

All 25 comments

If I fix the example to do:

unidecode("北京")

where 北 is 0x5317 and 京 is 0x4eac, I get the string "Qiong Mu", not "Bei Jing".

kaushalmodi on 24 Aug 2018

unidecode("北京") is equivalent to unidecode("\xe5\x8c\x97\xe4\xba\xac").

Both evaluate to "Qiong Mu".
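The equivalence of the two spellings can be checked outside Nim. The following is a small Python sketch (not part of the Nim stdlib) that prints the code points of the two characters and compares their UTF-8 encoding with the escaped byte string above:

```python
# -*- coding: utf-8 -*-
# Sketch: show that the escaped byte string is just the UTF-8 form of "北京".
text = u"北京"

# Code point of each character, formatted like U+5317.
code_points = ["U+%04X" % ord(ch) for ch in text]

# UTF-8 encoding of the whole string.
utf8_bytes = text.encode("utf-8")

print(code_points)
print(utf8_bytes == b"\xe5\x8c\x97\xe4\xba\xac")
```

So both calls hand the same six bytes to unidecode, which is why they produce the same (wrong) result.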

Now I wonder what evaluates to "Bei Jing" ..

kaushalmodi on 24 Aug 2018

I tried the python version and unidecode("北京") does actually result in Bei Jing, so I think this is a bug.

GULPF on 24 Aug 2018

@GULPF Interesting.

I don't know Chinese, so I cannot comment on that for sure.

But I know Gujarati, and I use that in https://scripter.co/notes/nim/#unidecode.

Nim is evaluating echo unidecode("મારુ નામ કૌશલ મોદી છે. હૂં અમદાવાદ થી છું.") correctly.

kaushalmodi on 24 Aug 2018

OK, I confirm that unidecode gives the same output for that Gujarati script between Nim and Python.

But gives different output for the Chinese script.

So yep, looks like a bug.

kaushalmodi on 24 Aug 2018

To fix it, run python gen.py.

Araq on 24 Aug 2018

@Araq I don't get it.. what is gen.py .. OK, got it: https://github.com/nim-lang/Nim/blob/devel/lib/pure/unidecode/gen.py

kaushalmodi on 24 Aug 2018

I tried updating the .dat file, but couldn't.. looks like that gen.py doesn't work for python3:

  File "gen.py", line 12
    u = eval("u'\u%04x'" % x)
            ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape

So hoping someone else fixes that file.

kaushalmodi on 24 Aug 2018

Maybe changing it to u = eval("u'\\u%04x'" % x) works?
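For context, the original line fails on Python 3 because every string literal there is a unicode literal, so `\u%04x` is parsed as a truncated `\uXXXX` escape before `%` formatting ever runs. Doubling the backslash defers the escape to `eval`. A sketch of an alternative that avoids `eval` altogether is shown below; `to_char`, `char_via_eval` and `char_direct` are illustrative names, not part of gen.py:

```python
import sys

# Pick the built-in that maps a code point to a one-character unicode string.
if sys.version_info[0] >= 3:
    to_char = chr
else:
    to_char = unichr  # noqa: F821  (exists on Python 2 only)


def char_via_eval(x):
    # The escaped form suggested above: the literal is built at run time.
    return eval("u'\\u%04x'" % x)


def char_direct(x):
    # Same result without eval.
    return to_char(x)


samples = [128, 0x4EAC, 0x5317, 0xFFFF]
same = all(char_via_eval(x) == char_direct(x) for x in samples)
print(same)
```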

tim-st on 24 Aug 2018

I just tried that.. got a new error:

Traceback (most recent call last):
  File "gen.py", line 24, in <module>
    main2()
  File "gen.py", line 11, in main2
    for x in xrange(128, 0xffff + 1):
NameError: name 'xrange' is not defined

I think that script should be updated to work with python3, and the unidecode.dat generation should be part of the Nim build process.

kaushalmodi on 24 Aug 2018

Change xrange to range.
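If the script should keep running unchanged on Python 2 as well, a common compatibility idiom is to alias the name once at the top. This is only a sketch of that idiom, not what gen.py does:

```python
# Use the lazy xrange on Python 2 and fall back to range on Python 3,
# where range is already lazy.
try:
    xrange  # noqa: F821  (Python 2 only)
except NameError:
    xrange = range

# Number of code points the generator script walks over.
count = sum(1 for _ in xrange(128, 0xFFFF + 1))
print(count)
```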

tim-st on 24 Aug 2018

Heh, thanks, was just reading https://stackoverflow.com/questions/15014310/why-is-there-no-xrange-function-in-python3#comment21094703_15014310 :D

kaushalmodi on 24 Aug 2018

Now I got hundreds of warnings, again ending in another error:

/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffc' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffd' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffe' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udfff' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
Traceback (most recent call last):
  File "gen.py", line 24, in <module>
    main2()
  File "gen.py", line 20, in main2
    f.write("%s\n" % d)
TypeError: a bytes-like object is required, not 'str'

kaushalmodi on 24 Aug 2018

f.write(bytes(...))
I can port it, wait a minute.
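The TypeError suggests the file is opened in binary mode, where Python 3 accepts only bytes. The thread does not show that part of gen.py, so this is an assumption. A minimal sketch of the `bytes` route, using a hypothetical `write_table` helper and a temporary file instead of the real unidecode.dat:

```python
import os
import tempfile


def write_table(path, entries):
    # Binary mode: every line has to be encoded explicitly before writing.
    with open(path, "wb") as f:
        for entry in entries:
            f.write((u"%s\n" % entry).encode("utf-8"))


# Write two sample entries to a scratch file and read the raw bytes back.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
write_table(path, [u"Bei ", u"Jing "])

with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

The other option is to open the file in text mode and keep writing `str` values.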

tim-st on 24 Aug 2018

thanks

kaushalmodi on 24 Aug 2018

The following works with both Python 3 and Python 2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf

from unidecode import unidecode
import warnings

warnings.simplefilter("ignore")

def main2(): 
  data = []
  for x in range(128, 0xffff + 1):
    u = eval("u'\\u%04x'" % x)

    val = unidecode(u)
    data.append(val)

  f = open("unidecode.dat", "w+") 
  for d in data:
    f.write("%s\n" % d)
  f.close()


main2()

tim-st on 24 Aug 2018

@tim-st Thank you! I was able to regenerate the unidecode.dat file using your script, but ..

@Araq The bug that unidecode outputs "Qiong Mu" instead of "Bei Jing" still remains.

kaushalmodi on 24 Aug 2018

So, for debugging, I wrote this code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf

from unidecode import unidecode
try:
  import warnings
  warnings.simplefilter("ignore")
except ImportError:
  pass

def main2():
  f = open("unidecode.dat", "w+")
  for x in range(128, 0xffff + 1):
    u = eval("u'\\u%04x'" % x)

    val = unidecode(u)

    f.write("%x | " % x)
    f.write("%s\n" % val)

  f.close()

main2()

That produces https://ptpb.pw/QMlT/text.

Now in "北京":

北 is U+5317. From https://ptpb.pw/QMlT/text#L-21147, 5317 does translate to "Bei"
京 is U+4eac. From https://ptpb.pw/QMlT/text#L-20016, 4eac does translate to "Jing"

But when I run:

import unidecode

let datfile = "/home/kmodi/downloads/git/Nim/lib/pure/unidecode/unidecode.dat" # location where I generated that .dat file
loadUnidecodeTable(datfile)

echo unidecode("北京")

I get:

: 5314 | Qiong 4ea9 | Mu

So, somewhere in unidecode.nim, that offset of 3 (5317-3=5314; 4eac-3=4ea9) is introduced.

kaushalmodi on 24 Aug 2018

Found the culprit!

3 blank lines in the dat file!

kaushalmodi on 24 Aug 2018

@Araq This fixes it:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf

from unidecode import unidecode
try:
  import warnings
  warnings.simplefilter("ignore")
except ImportError:
  pass

def main2():
  f = open("unidecode.dat", "w+")
  for x in range(128, 0xffff + 1):
    u = eval("u'\\u%04x'" % x)

    val = unidecode(u)

    # f.write("%x | " % x)
    if x==0x2028: # U+2028 = LINE SEPARATOR
      val = ""
    elif x==0x2029: # U+2029 = PARAGRAPH SEPARATOR
      val = ""
    f.write("%s\n" % val)

  f.close()

main2()
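The following self-contained simulation is one way to see how three stray lines produce exactly the shift reported earlier. It does not call the real unidecode package. `fake_translit` is a hypothetical stand-in that returns the hex code point for every character and, as an assumption, a one-newline and a two-newline string for U+2028 and U+2029. `build_dat` and `lookup` are illustrative names too, and the line-index lookup is only a guess at how a one-line-per-code-point table would be read:

```python
FIRST = 128  # the table starts at the first non-ASCII code point


def fake_translit(cp):
    # Hypothetical transliteration: the hex code point itself, except for
    # the two separators, which are ASSUMED to map to newline strings.
    if cp == 0x2028:
        return "\n"
    if cp == 0x2029:
        return "\n\n"
    return "%04x" % cp


def build_dat(sanitize):
    # One entry per line, like the generator script in this thread.
    out = []
    for cp in range(FIRST, 0xFFFF + 1):
        val = fake_translit(cp)
        if sanitize and "\n" in val:
            val = ""  # same idea as blanking U+2028 / U+2029
        out.append("%s\n" % val)
    return "".join(out)


def lookup(dat, cp):
    # Line-indexed lookup: line number == code point - FIRST.
    return dat.split("\n")[cp - FIRST]


broken = build_dat(sanitize=False)
fixed = build_dat(sanitize=True)

print(lookup(broken, 0x5317), lookup(broken, 0x4EAC))
print(lookup(fixed, 0x5317), lookup(fixed, 0x4EAC))
```

In the unsanitized table, every code point after U+2029 resolves to the entry three positions earlier, matching the 5314 and 4ea9 seen above. Code points before U+2028 are unaffected, which would be consistent with the Gujarati text transliterating correctly.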

kaushalmodi on 24 Aug 2018

@tim-st Any idea if the refactoring can be done so that the .dat file can be written in binary format ("wb+") as before? Then the above hack won't be needed.

kaushalmodi on 24 Aug 2018

> the unidecode.dat generation should be part of the Nim build process.

No, then the results would depend on the version of Python's unidecode implementation that happens to be installed on the building machine. Just create a PR that patches the script and updates unidecode.dat. And while we're at it, this should be a Nimble package.

Araq on 25 Aug 2018

There's a typo in the fix:

unidecode.nim(72, 3) Error: undeclared identifier: 'doAassert'
unidecode.nim(72, 13) Error: attempting to call undeclared routine: 'doAassert'
unidecode.nim(72, 13) Error: attempting to call undeclared routine: 'doAassert'
unidecode.nim(72, 13) Error: expression 'doAassert' cannot be called

kaushalmodi on 30 Aug 2018

I noticed, working on it.

Araq on 30 Aug 2018

Thanks! Nim devel now builds fine.

kaushalmodi on 30 Aug 2018
