Nim: walkDirRec returns incorrect Chinese file names on Windows

Created on 19 Aug 2015 · 8 Comments · Source: nim-lang/Nim

Given two files with Chinese characters in their names:
傅温良.txt
涓甲午.txt

the following code raises an exception:

import os

for fileName in walkDirRec("", {pcFile}):
   var file = open(fileName)

with the error message:

Traceback (most recent call last)
bug.nim(4) bug
system.nim(2431) sysFatal
Error: unhandled exception: cannot open: Õéàµ©®´┐¢.txt [IOError]

However, it works if I call open("傅温良.txt") directly:

var file = open("傅温良.txt")
var file2 = open("涓甲午.txt")

It looks like walkDirRec has a problem with these characters. The same thing happens with other files whose names contain Chinese characters.

nim -v
Nim Compiler Version 0.11.3 (2015-08-10) [Windows: i386]
Copyright (c) 2006-2015 by Andreas Rumpf

active boot switches:

Do I need to use a certain encoding for walkDirRec?

High Priority Stdlib

jangko

All 8 comments

Everything in os.nim uses the Windows *W functions which are Unicode aware and Nim strings are converted from UTF-8 to UTF-16 for this to work. No idea what's going on here.
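
The round trip described here can be sketched in a few lines. This is only an illustration, assuming a default build in which newWideCString, `$` and len for WideCString are available from the system module without an import, and assuming the source file is saved as UTF-8:

```nim
# Round-trip a UTF-8 file name through Nim's wide-string helpers.
let name = "傅温良.txt"           # UTF-8 bytes, as stored in a Nim string
let wide = newWideCString(name)   # UTF-16, the form handed to the *W functions
let back = $wide                  # UTF-16 converted back to UTF-8

echo wide.len      # number of UTF-16 code units in the name
echo back == name  # should be true when both conversions are correct
```

If the second line prints false, one of the two conversions is losing information, which narrows the search to the wide-string code rather than os.nim itself.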

Araq on 19 Aug 2015

Try changing the file encoding to UTF-8?

egmkang on 19 Aug 2015

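import os, strutils

const
  one = "傅温良.txt"

# Hex-dump the raw bytes of a string, two digits per byte.
# Named hexBytes to avoid confusion with strutils.toHex.
proc hexBytes(input: string): string =
  result = ""
  for c in input:
    result.add toHex(ord(c), 2)

for fileName in walkDirRec("", {pcFile}):
  let path = splitFile(fileName)
  if path.ext.len() == 0: continue
  # toLowerAscii replaces the older toLower on current Nim versions
  let ext = toLowerAscii(path.ext)
  if ext != ".txt": continue
  echo hexBytes(fileName)  # name as returned by walkDirRec
  echo hexBytes(one)       # name as typed in the source file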

The result is (I added the '-' separators):

E58285E6B8A9 - EFBFBD - 2E747874
E58285E6B8A9 - E889AF - 2E747874

The first line is what comes out of walkDirRec.
The second line is the UTF-8 from the editor.
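
To see which character differs, the two dumps can be decoded back into code points. A minimal sketch, assuming a Nim version whose strutils provides parseHexStr; codePoints is a hypothetical helper introduced only for this illustration:

```nim
import std/[strutils, unicode]

# Hypothetical helper: decode a hex dump of UTF-8 bytes into code points.
proc codePoints(hexDump: string): seq[int32] =
  result = @[]
  for r in parseHexStr(hexDump).runes:
    result.add int32(r)

let fromWalkDirRec = codePoints("E58285E6B8A9EFBFBD2E747874")
let fromEditor = codePoints("E58285E6B8A9E889AF2E747874")

# Report every position where the two names disagree.
for i in 0 ..< fromEditor.len:
  if fromWalkDirRec[i] != fromEditor[i]:
    echo "position ", i, ": U+", toHex(fromWalkDirRec[i], 4),
         " instead of U+", toHex(fromEditor[i], 4)
```

The only mismatch is the third character: the dump from walkDirRec contains U+FFFD, the Unicode replacement character, where the editor's version has U+826F (良).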

I think the problem lies in Nim's internal conversion from UCS-2 to UTF-8,
and as far as I know, UCS-2 != UTF-16.

jangko on 19 Aug 2015

I found the culprit. It is in widestrs.nim:

proc `$`*(w: WideCString, estimate: int): string =

It produces a different result from WideCharToMultiByte:

proc `$` returns: E58285E6B8A9 - EFBFBD - 2E747874
WideCharToMultiByte returns: E58285E6B8A9 - E889AF - 2E747874
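
One way to cross-check such a conversion without calling the Windows API is a small reference decoder. The sketch below is an illustration under stated assumptions, not the code in widestrs.nim: utf16ToUtf8 and hexBytes are hypothetical helpers, and the input code units are the UTF-16 form of 傅温良.txt written out by hand.

```nim
import std/[strutils, unicode]

# Hypothetical reference decoder: UTF-16 code units to a UTF-8 string.
# Code units are treated as unsigned 16-bit values throughout.
proc utf16ToUtf8(units: openArray[uint16]): string =
  result = ""
  var i = 0
  while i < units.len:
    var cp = int(units[i])
    inc i
    if cp >= 0xD800 and cp <= 0xDBFF and i < units.len and
        int(units[i]) >= 0xDC00 and int(units[i]) <= 0xDFFF:
      # High surrogate followed by low surrogate: combine into one code point.
      cp = 0x10000 + ((cp - 0xD800) shl 10) + (int(units[i]) - 0xDC00)
      inc i
    elif cp >= 0xD800 and cp <= 0xDFFF:
      # Unpaired surrogate: substitute the replacement character.
      cp = 0xFFFD
    result.add Rune(cp).toUTF8

# Hex-dump the raw bytes of a string, two digits per byte.
proc hexBytes(s: string): string =
  result = ""
  for c in s:
    result.add toHex(ord(c), 2)

# 傅 温 良 . t x t as UTF-16 code units
let units = [0x5085'u16, 0x6E29, 0x826F, 0x2E, 0x74, 0x78, 0x74]
echo hexBytes(utf16ToUtf8(units))  # E58285E6B8A9E889AF2E747874
```

The output matches the WideCharToMultiByte line, so a converter that yields EFBFBD for the third character is mishandling a valid code unit rather than encountering invalid input.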

Wide characters on Windows are encoded in UCS-2, not UTF-16. That is probably the problem, but it needs further investigation.

jangko on 20 Aug 2015

UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Older Windows NT systems (prior to Windows 2000) only support UCS-2.

https://en.wikipedia.org/wiki/UTF-16
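
For context on the distinction: UCS-2 and UTF-16 encode every code point up to U+FFFF identically, as a single 16-bit unit. They differ only above U+FFFF, where UTF-16 uses a surrogate pair and UCS-2 has no representation. A minimal sketch, with utf16Units as a hypothetical helper that is not part of the Nim standard library:

```nim
import std/unicode

# Hypothetical helper: encode a UTF-8 string as UTF-16 code units.
proc utf16Units(s: string): seq[uint16] =
  result = @[]
  for r in s.runes:
    let cp = int32(r)
    if cp < 0x10000:
      # Basic Multilingual Plane: one unit, same value as in UCS-2.
      result.add uint16(cp)
    else:
      # Above U+FFFF: split into a high and a low surrogate.
      let v = cp - 0x10000
      result.add uint16(0xD800 + (v shr 10))
      result.add uint16(0xDC00 + (v and 0x3FF))

echo utf16Units("良")          # @[33391], i.e. 0x826F: a single unit
echo utf16Units("\u{1F600}")   # @[55357, 56832], i.e. 0xD83D 0xDE00
```

All three characters in 傅温良 lie below U+FFFF, so the file names in this report look the same in UCS-2 and UTF-16.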

egmkang on 20 Aug 2015

So... What version of Windows are you using?

Varriount on 20 Aug 2015

@egmkang: thanks for the clarification.
@Varriount: Windows 7 Ultimate 64-bit.

Yes, it has nothing to do with UCS-2. I compared it with utfcpp, and again the result is E58285E6B8A9E889AF2E747874, the same as WideCharToMultiByte.
I think it's time to make a PR.

jangko on 20 Aug 2015

Going to assume that #3231 fixed this.