I am replacing a Bash script with a Nodejs script. Part of my old script did this:
find dirpath -print >files
As a replacement, in Nodejs I wrote a directory traversal routine that contains this line:
var files = fs.readdirSync(dirpath)
My regression test caught that the file name ordering was different between the old Bash script and the new Nodejs script. For details I that aren't important, it's best that the Nodejs replacement functions identically as the the old Bash process, however, I couldn't find a solution.
After digging into the source, I found the line in node/deps/uv/src/unix/fs.c(342)
n = scandir(req->path, &dents, uv__fs_scandir_filter, alphasort);
Notice the alphasort? So, it looks like as implemented there is no solution to my problem.
Because fs.readdir() is not really a wrapper on readdir(3) as the documentation says (but is a wrapper for scandir) you cannot get the files in the order that readdir(3) returns. This means for instance, that a programs like find
and ls -U
are not implementable in Nodejs with identical behavior.
It appears the window implementation does not sort. So this sorting behavior is also inconsistent across platforms. Therefore if you wanted sorting and you wanted to be cross platform you'd have to do it manually away:
var files = fs.readdirSync(dirpath).sort();
Could this be changed so fs.readdir returns entries in system order? I believe this could be done by replacing the above line with (not tested just a guess):
n = scandir(req->path, &dents, uv__fs_scandir_filter, NULL);
I'm very reluctant to change the current behavior because there's almost certainly someone somewhere relying on the result being sorted, on UNIX systems at least. I remember bug reports against node v0.4 about fs.readdir()
sorting one way and fs.readdirSync()
another way, and that was when node had a fraction of the user base it has today.
All is perhaps not lost though: work is under way to implement streaming readdir (https://github.com/libuv/libuv/pull/416) that returns the results in directory order.
@bnoordhuis what about an option to disable sorting ?
Technically possible but requires semver-minor changes to libuv and node.js, in that order.
@bnoordhuis, I understand the reluctance to change it, but here's my best case to change it:
readdir(3)
)strcoll(3)
others use strcmp(3)
)strcoll(3)
sorts UTF strings properly... a quick Google search shows that in PHP strcoll()
has been a source of these types of issues ... I personally find doing sorting at the C level rather that the Javascript level odd because it will mean that fs.readdirSync()
will likely not match fs.readdirSync().sort()
depending on LC_COLLATE setting and platformClearly, the last issue is the most important practical one for me so a option to disable sorting would be sufficient.
The first issues are more about keeping Node clean... i.e. if you looked at the library 5 years from now, not knowing the history, would you think it was done right and has the most expressive power.
Thanks for reading this far. :)
I have been thinking about the legacy behavior issue. How about this as a win/win solution:
// two NEW functions
fs.scandir(path[,options],callback)
fs.scandirSync(path[,options])
// where
var options = {
filter: filterFunc, // optional
compare: compareFunc, // optional
}
and then deprecate and/or leave readdir(path,callback)
and readdirSync(path)
as is.
Pros:
fs.scandir*()
is just like scandir(3)
)Cons:
Thoughts?
Without expressing an opinion on the merits of the change request, I wanted to add a note for the record about how Windows works...
Some people think Windows sorts directory enumerations, but it doesn't. Enumerations on NTFS are usually sorted, and that's what most people use, but even there it isn't guaranteed. As per MSDN:
The order in which this function returns the file names is dependent on the file system type. With the NTFS file system and CDFS file systems, the names are usually returned in alphabetical order. With FAT file systems, the names are usually returned in the order the files were written to the disk, which may or may not be in alphabetical order. However, as stated previously, these behaviors are not guaranteed.
In addition, enumeration over a network filesystem may affect the results. Even if the server is using NTFS, and you'd get sorted names locally, you're not guaranteed to get sorted names over the network. The reverse can also be true (network filesystem may sort results even if server doesn't).
Also... There is no universal definition of what "sorted" means on Windows. "DIR" in cmd.exe does a simple alphabetic sort, but the Windows Explorer GUI (which also provides the file open/save dialogs for most applications) has additional rules (see KB319827 and this).
@bhxr You should open an issue at libuv/libuv for hashing out the C API. Be prepared to do the leg work; I'll review patches but it's not a topic I'm otherwise invested in.
@bnoordhuis which solution are you referring to?
Options:
Option 1 requires a change to libuv that I would need to bring up with them.
Option 2 is a nodejs issue and requires no change from their current path assuming opendir API gets done.
After considering all options and also reading what @mcnameej posted. It is clear that fs.readdir*()
does not behave in a consistent, cross platform and i18n friendly way. Thus, any code that is written assuming fs.readdir()
sorts is also flawed because fs.readdir()
does not consistently sort. Therefore, I think the best path forward is make it consistent across platforms and documented.
libuv/libuv/src/unix/fs.c
nodejs/node/lib/fs.js
If you are okay with my base proposal, I'll go bug the libuv team. Otherwise, how do we resolve nodejs side?
Turn of alphasort in uv__fs_scandir()
If by "turn off" you mean "add a flag", yes, that requires an (acceptable IMO) change in libuv.
Implement scandir*() in Javascript in nodejs/node/lib/fs.js in terms of new streaming opendir API
There is more than one way to skin this cat. It could be either a libuv change, a node.js change, or both.
I don't think a libuv issue was ever filed, or if there was, I can't find it. I'll go ahead and close this issue; it's been inactive for over half a year.
Thoughts?
In C readdir iterates over a directory. Modern file systems must be able to handle millions of file in a directory. ( which makes node a no go, on some large scale systems)
Irritating is unpractical with an interpreted language.
Another often used solution is to do some filtering at a low level, with wildcards. (Glob function)
Sorting should of cause always be optional :)
I tracked down the origins of doing any manner of sorting in the first place to this commit, for those who may come here from a web search in the future and are curious.
https://github.com/libuv/libuv/commit/74999f8f9974970e4d71a0df521e178e3149f3df#diff-6a16903c26af4b4035eda9922a73ecc9R182
So It seems it have to use such way manually to get an ordered file.
var files = fs.readdirSync(dirpath).sort();
after sort manually:
So It seems it have to use such way manually to get an ordered file.
var files = fs.readdirSync(dirpath).sort();
after sort manually:
@MuYunyun what do you mean by after sorting manually. Both the following will result into the "LeetCode first output".
var files = fs.readdirSync(dirpath);
var files = fs.readdirSync(dirpath).sort();
I mean the directory in my mac environment is sorted as '1.xx 2.xx 11.xx', and the result of fs.readdirSync(dirpath)
is 1.xx 11.xx 2.xx
, so I've to use fs.readdirSync(dirpath).sort(() => { // do some logic })
to get correct order.
Most helpful comment
@bnoordhuis, I understand the reluctance to change it, but here's my best case to change it:
readdir(3)
)strcoll(3)
others usestrcmp(3)
)strcoll(3)
sorts UTF strings properly... a quick Google search shows that inPHP strcoll()
has been a source of these types of issues ... I personally find doing sorting at the C level rather that the Javascript level odd because it will mean thatfs.readdirSync()
will likely not matchfs.readdirSync().sort()
depending on LC_COLLATE setting and platformClearly, the last issue is the most important practical one for me so a option to disable sorting would be sufficient.
The first issues are more about keeping Node clean... i.e. if you looked at the library 5 years from now, not knowing the history, would you think it was done right and has the most expressive power.
Thanks for reading this far. :)