Nim: [unidecode] Fix the `unidecode` example

Created on 24 Aug 2018  ·  25 Comments  ·  Source: nim-lang/Nim

The unidecode example in https://nim-lang.org/docs/unidecode.html#unidecode,string does not work:

unidecode("x53x17x4ExB0")

The "x" are interpreted as literal "x" chars.

Stdlib Unicode

All 25 comments

If I fix the example to do:

unidecode("北京")

where 北 is 0x5317 and 京 is 0x4eac, I get the string "Qiong Mu", not "Bei Jing".

unidecode("北京") is equivalent to unidecode("\xe5\x8c\x97\xe4\xba\xac").

Both evaluate to "Qiong Mu".
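The equivalence between the two spellings can be checked outside Nim. The following is a minimal Python sketch (not part of the original thread) that prints the code points and the UTF-8 bytes of the two characters; the variable names are only illustrative.

```python
# "\u5317\u4eac" is the two-character string from the example above.
text = "\u5317\u4eac"

# Code points of each character, as hex strings.
code_points = [hex(ord(ch)) for ch in text]

# UTF-8 encoding: each of these characters takes three bytes.
utf8_bytes = text.encode("utf-8")

print(code_points)  # ['0x5317', '0x4eac']
print(utf8_bytes)   # b'\xe5\x8c\x97\xe4\xba\xac'
```

So the escaped byte string and the literal characters are the same input, and the wrong output cannot be blamed on how the string was typed.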

Now I wonder what evaluates to "Bei Jing" ..

I tried the python version and unidecode("北京") does actually result in Bei Jing, so I think this is a bug.

@GULPF Interesting.

I don't know Chinese, so I cannot comment on that for sure.

But I know Gujarati, and I use that in https://scripter.co/notes/nim/#unidecode.

Nim is evaluating echo unidecode("મારુ નામ કૌશલ મોદી છે. હૂં અમદાવાદ થી છું.") correctly.

OK, I confirm that unidecode gives the same output for that Gujarati script between Nim and Python.

But it gives different output for the Chinese script.

So yep, looks like a bug.

To fix it, run python gen.py.

@Araq I don't get it.. what is gen.py .. OK, got it: https://github.com/nim-lang/Nim/blob/devel/lib/pure/unidecode/gen.py

I tried updating the .dat file, but couldn't.. looks like that gen.py doesn't work for python3:

  File "gen.py", line 12
    u = eval("u'\u%04x'" % x)
            ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape

So hoping someone else fixes that file.

Maybe changing it to u = eval("u'\\u%04x'" % x) works?
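For context on why the extra backslash helps: in Python 3 every string literal is Unicode, so "\u%0" is parsed as an incomplete \uXXXX escape before the % formatting ever runs. In Python 2 a plain (non-u) literal does not interpret \u, which is why the old script worked there. A small sketch of the escaped template, using one code point from this thread as the sample value:

```python
x = 0x5317  # sample code point

# The doubled backslash keeps "\u" as two literal characters,
# so the escape is only interpreted later, inside eval().
source = "u'\\u%04x'" % x
print(source)            # u'\u5317'

ch = eval(source)

# chr() builds the same character directly (Python 3; unichr() in Python 2).
print(ch == chr(x))      # True
```

If only Python 3 needs to be supported, chr(x) would avoid eval altogether; that is a possible simplification, not what the script in this thread does.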

I just tried that.. got a new error:

Traceback (most recent call last):
  File "gen.py", line 24, in <module>
    main2()
  File "gen.py", line 11, in main2
    for x in xrange(128, 0xffff + 1):
NameError: name 'xrange' is not defined

I think that script should be updated to work with python3, and the unidecode.dat generation should be part of the Nim build process.

Change xrange to range.
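If the script should keep running under both interpreters without materializing a list on Python 2, a common compatibility shim looks like the sketch below. The name range_ is hypothetical and is not used in gen.py.

```python
try:
    range_ = xrange   # Python 2: lazy range
except NameError:
    range_ = range    # Python 3: range is already lazy

# Same bounds as the loop in gen.py: code points 128 through 0xffff.
count = sum(1 for _ in range_(128, 0xffff + 1))
print(count)  # 65408
```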

Now I got hundreds of warnings, again ending in another error:

/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffc' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffd' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udffe' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
/home/kmodi/.local/lib/python3.7/site-packages/unidecode/__init__.py:50: RuntimeWarning: Surrogate character '\udfff' will be ignored. You might be using a narrow Python build.
  return _unidecode(string)
Traceback (most recent call last):
  File "gen.py", line 24, in <module>
    main2()
  File "gen.py", line 20, in main2
    f.write("%s\n" % d)
TypeError: a bytes-like object is required, not 'str'

f.write(bytes(...))
I can port it, wait a minute.
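The TypeError comes from writing a str to a file opened in binary mode. One way to keep binary mode is to encode each line before writing. This is a sketch only: io.BytesIO stands in for the real file, write_lines is a hypothetical helper, and the sample values are illustrative.

```python
import io

def write_lines(stream, values):
    """Write one value per line to a binary stream, encoding each line as UTF-8."""
    for v in values:
        stream.write(("%s\n" % v).encode("utf-8"))

# BytesIO stands in for open("unidecode.dat", "wb").
buf = io.BytesIO()
write_lines(buf, ["Bei ", "Jing "])
print(buf.getvalue())  # b'Bei \nJing \n'
```

The other option, which the ported script in this thread takes, is to open the file in text mode ("w+") and write str directly.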

thanks

The following works with Python 3 and Python 2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf

from unidecode import unidecode
import warnings

warnings.simplefilter("ignore")

def main2(): 
  data = []
  for x in range(128, 0xffff + 1):
    u = eval("u'\\u%04x'" % x)

    val = unidecode(u)
    data.append(val)

  f = open("unidecode.dat", "w+") 
  for d in data:
    f.write("%s\n" % d)
  f.close()


main2()

@tim-st Thank you! I was able to regenerate the unidecode.dat file using your script, but ..

@Araq The bug that unidecode outputs "Qiong Mu" instead of "Bei Jing" still remains.

So for debug, I wrote this code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf

from unidecode import unidecode
try:
  import warnings
  warnings.simplefilter("ignore")
except ImportError:
  pass

def main2():
  f = open("unidecode.dat", "w+")
  for x in range(128, 0xffff + 1):
    u = eval("u'\\u%04x'" % x)

    val = unidecode(u)

    f.write("%x | " % x)
    f.write("%s\n" % val)

  f.close()

main2()

That produces https://ptpb.pw/QMlT/text.

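Now, in "北京", 北 is 0x5317 and 京 is 0x4eac, and the table generated from Python's unidecode maps those code points to "Bei" and "Jing".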

But when I run:

import unidecode

let datfile = "/home/kmodi/downloads/git/Nim/lib/pure/unidecode/unidecode.dat" # location where I generated that .dat file
loadUnidecodeTable(datfile)

echo unidecode("北京")

I get:

: 5314 | Qiong 4ea9 | Mu 

So, somewhere in unidecode.nim, that offset of 3 (5317-3=5314; 4eac-3=4ea9) is introduced.

Found the culprit!

2027 | . 
2028 | 

2029 | 


202a |  
202b |  

3 blank lines in the dat file!
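To see why three extra lines produce exactly the offset of 3 observed above, here is a toy model. It assumes the loader simply treats line i of the file as entry i; that is an assumption about unidecode.nim, not a quote from it. The miniature table and the helper names build_dat and lookup are hypothetical.

```python
def build_dat(entries):
    # One entry per line, mirroring f.write("%s\n" % val) in gen.py.
    return "".join("%s\n" % e for e in entries)

def lookup(dat_text, index):
    # Assumed loader behavior: entry i is simply line i of the file.
    return dat_text.split("\n")[index]

# Hypothetical miniature table. Indices 1 and 2 stand in for U+2028 and
# U+2029, whose transliterations contain one and two newlines.
entries = ["a", "\n", "\n\n", "b", "c", "d", "Bei ", "Jing "]

broken = build_dat(entries)
fixed = build_dat(["" if e.strip("\n") == "" else e for e in entries])

print(repr(lookup(broken, 6)))  # 'b'    (three entries too early)
print(repr(lookup(fixed, 6)))   # 'Bei '
```

Every entry after the two separators is read three lines too early, which matches 5317 resolving to the entry for 5314.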

@Araq This fixes it:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Generates the unidecode.dat module
# (c) 2010 Andreas Rumpf

from unidecode import unidecode
try:
  import warnings
  warnings.simplefilter("ignore")
except ImportError:
  pass

def main2():
  f = open("unidecode.dat", "w+")
  for x in range(128, 0xffff + 1):
    u = eval("u'\\u%04x'" % x)

    val = unidecode(u)

    # f.write("%x | " % x)
    if x==0x2028: # U+2028 = LINE SEPARATOR
      val = ""
    elif x==0x2029: # U+2029 = PARAGRAPH SEPARATOR
      val = ""
    f.write("%s\n" % val)

  f.close()

main2()
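A quick way to catch this class of bug after regenerating the file is to check that the line count matches the number of code points written (128 through 0xffff). The sketch below is not part of gen.py; expected_lines and check_table are hypothetical helpers, and synthetic text is used instead of reading the real unidecode.dat.

```python
def expected_lines(first=128, last=0xffff):
    # One line per code point in the inclusive range written by gen.py.
    return last - first + 1

def check_table(dat_text, first=128, last=0xffff):
    # Returns (is_ok, actual_line_count).
    actual = dat_text.count("\n")
    return actual == expected_lines(first, last), actual

# Synthetic stand-ins for the file contents.
good = "x\n" * expected_lines()
bad = good + "\n\n\n"   # three stray newlines, as found in this thread

print(check_table(good))  # (True, 65408)
print(check_table(bad))   # (False, 65411)
```

With a real file, the text would come from reading unidecode.dat in text mode. Any mismatch means some transliteration contains a line break.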

@tim-st Any idea if the refactoring can be done so that the .dat file can be written in binary format ("wb+") as before? Then the above hack won't be needed.

"the unidecode.dat generation should be part of the Nim build process."

No, then the results would depend on the version of Python's unidecode implementation that happens to be installed on the building machine. Just create a PR that patches the script and updates unidecode.dat. And while we're at it, this should be a Nimble package.

There's a typo in the fix:

unidecode.nim(72, 3) Error: undeclared identifier: 'doAassert'
unidecode.nim(72, 13) Error: attempting to call undeclared routine: 'doAassert'
unidecode.nim(72, 13) Error: attempting to call undeclared routine: 'doAassert'
unidecode.nim(72, 13) Error: expression 'doAassert' cannot be called

I noticed, working on it.

Thanks! Nim devel now builds fine.
