Node: A proposal to add fs.scandir method to FS module

Created on 30 Sep 2017 · 7 Comments · Source: nodejs/node

Problem

Currently, telling files from directories in a directory listing looks like this:

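const fs = require('fs');
const path = require('path');

const dir = 'path_to_directory';
const entries = fs.readdirSync(dir);
// Unnecessary File System access: one extra stat call per entry
const stats = entries.map((entry) => fs.statSync(path.join(dir, entry)));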

const dirs = [];
const files = [];

entries.forEach((entry, index) => {
    if (stats[index].isDirectory()) {
        dirs.push(entry);
    } else {
        files.push(entry);
    }
});

The problem here is that we call the File System a second time for every entry, because readdir does not tell us whether an entry is a directory, a file, or a symlink.

We can halve the number of File System calls by adding an fs.scandir method that returns both d_name and d_type. This information is already returned by libuv in uv_dirent_t (scandir). For example, this is implemented in Luvit and pyuv, which also use libuv.

Motivation

  • Performance – half as many File System calls
  • More consistency with other platforms/languages
  • Should be easy to implement: String → Object with d_name + d_type

    • For fs.readdir: convert the Object back to a String and return it

    • For fs.scandir: return the Object as is

  • Immediate speed-up for node-glob (and for every package that uses fs.readdir to traverse directories) in most cases; fs.stat is still needed when d_type is DT_UNKNOWN on older file systems
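
The DT_UNKNOWN caveat in the last bullet could be handled roughly as in the sketch below. It assumes the { name, type } entry shape proposed in this issue, uses 0 as a hypothetical stand-in for DT_UNKNOWN, and takes the lstat function as a parameter only so that the fallback is easy to observe. None of these names are an existing Node API.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical stand-in for DT_UNKNOWN; not a real fs constant.
const TYPE_UNKNOWN = 0;

// Returns true when the entry is a directory. The extra lstat call is made
// only when the file system did not report a type for the entry.
function isDirectoryEntry(dir, entry, lstat = fs.lstatSync) {
  if (entry.type !== TYPE_UNKNOWN) {
    return entry.type === fs.constants.S_IFDIR;
  }
  return lstat(path.join(dir, entry.name)).isDirectory();
}
```

On file systems that report d_type, this never touches the disk a second time; the cost of the old readdir + stat pattern is paid only for entries whose type is unknown.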

Proposed solution

Add the methods fs.scandir and fs.scandirSync to the standard File System module.

const fs = require('fs');

const entries = fs.scandirSync('path_to_directory');

const dirs = [];
const files = [];

entries.forEach((entry) => {
    if (entry.type === fs.constants.S_IFDIR) {
        dirs.push(entry);
    } else {
        files.push(entry);
    }
});

Where entries is an array of objects:

  • name {String} – filename (d_name in libuv)
  • type {Number} – fs.constants.S_* (d_type in libuv)
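
For comparison, this shape can be approximated in Node versions that support the withFileTypes option of fs.readdirSync (added in v10.10.0). The sketch below is not the proposed implementation: scandirSync here is a hypothetical user-land helper, and mapping every unrecognised type to 0 is an assumption made for illustration.

```javascript
const fs = require('fs');

// Hypothetical user-land helper; fs.scandirSync itself does not exist.
function scandirSync(dir) {
  const { S_IFDIR, S_IFREG, S_IFLNK } = fs.constants;
  return fs.readdirSync(dir, { withFileTypes: true }).map((dirent) => {
    let type = 0; // assumption: 0 stands for "unknown or other"
    if (dirent.isDirectory()) type = S_IFDIR;
    else if (dirent.isFile()) type = S_IFREG;
    else if (dirent.isSymbolicLink()) type = S_IFLNK;
    return { name: dirent.name, type };
  });
}
```

With such a helper, the loop in the proposed solution works unchanged, comparing entry.type against fs.constants.S_IFDIR.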

Final words

I currently work around this with a C++ addon, but I barely speak C++ (I try 😅), and an addon has to be compiled when the package is installed, which demands extra setup from the end user (like https://github.com/nodejs/node-gyp#on-windows).

blocked feature request fs libuv

Most helpful comment

Speaking from the wild (20+ years of syseng experience): large directories are a recurring ops problem I use as an interview question because I have seen it multiple times, and as recently as 2013. Even at companies like Disney and Amazon, there are developers who don't see a problem with using a directory as a flat key-value store, or creating empty files and never removing them. Eventually an operator like me has to do something with hundreds of millions of files all in the same directory. In the past I've used Perl to deal with it, but I fell out of love with Perl years ago.

The usual tools are usually useless specifically _because_ (as I found out with strace) they stat each entry. While stat itself isn't super expensive, it ends up driving the memory and CPU cost of a scan higher than it needs to be if there's enough information in the file name to make decisions from.

(Though while writing this I discovered 'ls -1' uses mmap and does not perform any stat()s.)

If/when y'all decide to tackle this issue, I would ask that you also offer fs.*dir* so that I can do precisely what I want to:

fs   = require 'fs'
path = require 'path'

# fs.opendir and handle.readdir are the requested API, not existing methods
cleanUp = (problemDir, prefix) ->
  new Promise (resolve, reject) ->
    fs.opendir problemDir, (err, handle) ->
      return reject err if err

      do next = ->
        handle.readdir (err, stat) ->
          switch
            when err                then reject err
            when not stat           then resolve()
            when stat.isDirectory() then next()

            when (fileName = stat.name).startsWith prefix
              fullPath = path.resolve problemDir, fileName

              fs.unlink fullPath, (err) ->
                if err then reject err else next()

            else next()

cleanUp 'incoming', 'system.system'
  .then -> console.log 'Completed'
  .catch (err) -> console.log err

That said, this kind of feature probably has a small audience and third-party modules exist which address the problem, so I wouldn't blame you if this never became a high priority.

All 7 comments

I think if we're going to introduce scandir*() methods, they should work more or less exactly the same as the underlying C function of the same name. One benefit of such a function over the readdir*() methods is that there would be no forced buffering of entry names, which is nice for very large directories. To improve performance, we could allow an option that dictates how many entries to buffer before calling out to the JS callbacks.
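
Later Node releases added fs.opendir and fs.opendirSync (v12.12.0), which hand out entries one at a time, and subsequently a bufferSize option that controls how many entries are buffered internally per read. A minimal sketch of that style follows; countMatching and its parameters are hypothetical names chosen for illustration.

```javascript
const fs = require('fs');

// Counts non-directory entries whose name starts with `prefix` without
// building an array of every name in the directory.
// `countMatching` is a hypothetical helper, not part of the fs module.
function countMatching(dir, prefix, bufferSize = 32) {
  const handle = fs.opendirSync(dir, { bufferSize });
  let count = 0;
  try {
    let dirent;
    // readSync() returns null once the directory has been fully read.
    while ((dirent = handle.readSync()) !== null) {
      if (!dirent.isDirectory() && dirent.name.startsWith(prefix)) count += 1;
    }
  } finally {
    handle.closeSync();
  }
  return count;
}
```

A larger bufferSize trades memory for fewer round trips to the File System, which is the tuning knob this comment describes.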

For context: https://github.com/libuv/libuv/pull/416 - stalled, and itself a continuation of an older, also stalled PR.

Another option, not mutually exclusive, is to add a new type of stream that emits scandir-like entries or full-blown stats.

Not essential, but it would be nice to have an option to produce a recursive scan. I guess it wouldn't follow symlinks, though, to avoid falling into a closed loop.
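
A recursive scan that does not follow symlinks could look like the sketch below. It relies on fs.readdirSync with withFileTypes, where isDirectory() is false for a symbolic link, so a link pointing back up the tree is listed but never entered. walk is a hypothetical helper name.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the paths of all entries below `dir`.
// Symbolic links are reported but not descended into.
function walk(dir, found = []) {
  for (const dirent of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, dirent.name);
    found.push(fullPath);
    if (dirent.isDirectory()) walk(fullPath, found);
  }
  return found;
}
```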

This would be great to have.

I made a very simple parallel find implementation to see if I could use Node's async nature to keep a higher queue depth. I'm not sure, but I really think the lack of exposure to dirent (specifically, the inability to see that an entry is a directory) is what keeps Node from trouncing find.

I went looking into libuv and found that libuv does expose the dirent type, &c, it's just Node's readdir that is limiting. Then I found this thread. scandir would help my immediate problem.

But also: having access to things like inodes, extents, &c would open up Node's viability for a wide array of system tools. I'd expect 95% of use cases to be recursing directory trees, but I wanted to call out that there's a lot of other good helpful systems stuff that scandir does.

Proposal: add an option to fs.readdir to get full directory entries. Scandir is useful for recursing, but there are still really useful things that can be done with readdir(3) that libuv permits and Node doesn't. I'd love to see dirent results available from Node's readdir!
