Node: streaming / iterative fs.readdir

Created on 24 Jan 2015 · 93 comments · Source: nodejs/node

since we're in ES6 territory now, i'm thinking the sync version should be an iterable

var dirs = fs.readdirIter(__dirname);
for (const dir of dirs) {
  // ...
}

and have the async version be an object stream:

var stream = fs.readdirStream(__dirname);
stream.on('data', dir => { /* ... */ })

See: https://github.com/joyent/node/issues/388

Labels: feature request, fs, libuv

Most helpful comment

Landed in cbd8d715b2286e5726e6988921f5c870cbf74127 as fs{Promises}.opendir(), which returns an fs.Dir, which exposes an async iterator. 🎉

All 93 comments

Using generators/iterators in io.js where they make sense would be a good addition.

Does this depend on the work in joyent/libuv#1574 to be merged first? I don't think it's in libuv yet.

:+1:

in the future the asynchronous version could return an observable:

for (dir on fs.readdir(__dirname)) {

}

:+1:

I'm all for the new named versions. We don't want to break compatibility with existing code by changing behaviour of existing functions, but new stuff could work. I'm not sure what the policy is yet on new interfaces that deviate from node.js though.

@iojs/tc Thoughts?

Where else does an interface like this make sense? Would this be a one-off or would there be a flood of requests to add the same style of interface to other core APIs?

@rvagg I looked for any other APIs that would look like this but I couldn't find any. I think this may be a one-off.

I think the way forward on this is for someone here to propose a change in a PR and then we'll escalate that to a TC discussion, because it would be good to cover the question of whether adopting a new style of API is desirable. If it's too much work to come up with an initial implementation then we could just elevate this issue; it'd just be more of a hypothetical discussion then.

I think the streams version makes sense. I would test the iterator design in userland though. It can apply to any sort of stream, really. And it's easy enough to abstract a stream as an iterator in userland.

The eventual observables thing sounds pretty rad.

@Qard how would you create an iterator from a stream? streams are async... iterators are not...

@cjb i don't think it's merged. i'm going to open an issue in libuv now

Just use co? (or something similar)
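For illustration, a sketch of what the co suggestion could look like, paired with the OP's hypothetical readdirIter — here imagined as a sync iterator that yields promises, which co resolves before resuming the generator:

var co = require('co');
var fs = require('fs');

co(function* () {
  for (var promise of fs.readdirIter(__dirname)) { // hypothetical API
    var name = yield promise; // co resolves each yielded promise
    console.log(name);
  }
}).catch(console.error);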

OK, a few things:

  • Iterables make no sense here; they are synchronous. (Generators are just iterables with .throw() and .return() methods; they don't make any sense either.)
  • There are vague designs by one TC39 committee member for adding RxJS-style observables to the language, along with for-on syntax. I doubt that will make the ES2016 train since no implementers have yet expressed interest, but I could be wrong. Regardless, it's very tentative. (There's another competing proposal, async iterables, that might be a better fit for this use case in particular.)
  • In the meantime, Node.js has an awkward version of observables/async-iterables/etc. already: object-mode streams. That's clearly the right choice for this situation, even if in the long run a language-level primitive emerges for plural asynchronous values (similar to promises being the primitive for singular asynchronous values, or iterables being the primitive for plural synchronous values).

So I don't think this should be a new style of API. Just an object mode stream is fine.

Whoa, yeah this is a totally bogus suggestion as iterators and generators can't be used for iterating over async. This tripped me up at first as well.

But note that though it's not exactly what you had in mind, fs.readFileSync(file, 'utf8') is already iterable by way of a String being iterable, and soon fs.readFileSync(file) will be iterable too, as Buffers implement the iterable interface: https://github.com/iojs/io.js/pull/525

So re-reading the OP: readdirSync already returns an array which is iterable

It does, but readdirSync() is (too) eager: it slurps the directory in one pass. If you apply it to a directory with 1M+ entries, it will likely bring down the process.

An iterator that walks the entries piecemeal would be great but it requires a number of changes to libuv. @misterdjules already did some of the prep work but I recall that there were still some loose ends. I'm not even sure if we reached consensus that his approach was really the best way forward.

as iterators and generators can't be used for iterating over async

they can!

for (let promise of iterator) {
  let val = await promise;
}

@bnoordhuis I see, so you'd block on each call to .next(), instead of all at once.

@vkurchatkin yes, you can invent contracts on top of iterators. Especially ones that only work in ES2016. But they won't necessarily be respected. Someone could "forget" an await, or a .then, or just call .next() twice in a row. Now what? How do you know if it's end-of-iterable or I/O error or ...?

@vkurchatkin essentially you are proposing an async iterable as next(): Iteration<Promise<T>>, whereas a more coherent design is next(): Promise<Iteration<T>>.

For example, in your version the producer can't decide when to return or throw from the generator side: the producer has to wait for the async process to complete before doing so, but the only way to signal completion or errors is a synchronous return/throw.
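A minimal sketch of the two shapes, assuming a hypothetical readNextEntry(dir) that resolves to the next entry, or null at the end:

// (a) next(): Iteration<Promise<T>> — the generator must decide
// done/value synchronously, before the I/O result is known, which is
// exactly the problem described above.
function* entriesAsPromises(dir) {
  while (true) {
    yield readNextEntry(dir); // Promise<entry|null>; can never say "done"
  }
}

// (b) next(): Promise<Iteration<T>> — completion and errors travel
// inside the promise, so the producer decides them asynchronously.
function entriesAsyncIterator(dir) {
  return {
    next() {
      return readNextEntry(dir).then(entry =>
        entry === null
          ? { done: true, value: undefined }
          : { done: false, value: entry });
    }
  };
}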

you'd block on each call to .next(), instead of all at once.

Yes, but I'm mostly concerned with memory usage; an iterator brings it down from O(n) to O(1). Currently, trying to scan a large directory exhausts the heap and terminates the process.

@domenic agreed, ugly and far from ideal.

a more coherent design is next(): Promise<Iteration<T>>

I can't see how this can be achieved using existing ES2015/2016 primitives. Thanks for the link; it probably has an explanation.

Yeah the iterative version in my head was a synchronous version that doesn't load the entire array in memory at once. But i care way more about the async version.

@domenic While generators on their own are sync, they _can_ be used in an async way via co. Maybe you don't like the weird promise trampoline hack it uses, but it works and many people are already using it in that way. I suggested using co because I had already suggested doing the iterator thing in userland, which means those sort of tools are available. If you prefer some different async iterator module, you can write a streams abstraction on that.

I wouldn't mind this being in core, I just suggested a userland approach because I expected some members of the core team might disagree.

btw let's not do the iterator design. i realized that if something like this happens:

var i = 0;
for (const name of fs.readdirIter(__dirname)) {
  if (i++ > 5) throw new Error('boom');
}

assuming fs.readdirIter() creates a file descriptor, the above may cause a file descriptor leak as the iterator may remain in a paused state. this is assuming fs.readdirIter() opens a file descriptor until it's complete. having users manually close the file descriptor may be too much to ask (not worth people complaining about leaks when they don't know about this).

so... let's just do the stream version for now :D

@jonathanong as long as the iterator implements iterator.return() that closes the FD, there will be no leak.
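A sketch of that shape, with hypothetical readNextSync/closeFd helpers — for-of calls return() on an early break or throw, so the descriptor gets released (assuming the engine implements IteratorClose, which V8 didn't yet at the time; see below):

function readdirIter(fd) {
  return {
    [Symbol.iterator]() { return this; },
    next() {
      var name = readNextSync(fd); // hypothetical; null at the end
      if (name === null) {
        closeFd(fd); // hypothetical
        return { done: true, value: undefined };
      }
      return { done: false, value: name };
    },
    return(value) {
      closeFd(fd); // invoked by for-of on break/throw
      return { done: true, value: value };
    }
  };
}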

@domenic why isn't return here: https://people.mozilla.org/~jorendorff/es6-draft.html#sec-iterator-interface

@vkurchatkin same reason throw isn't.

not working:

var arr = [1,2,3,4,5];
var it = arr[Symbol.iterator]();

it.return = function() {
  console.log('return');
}


for (var v of it) {
  if (v > 3) break;
  console.log(v)
}

Ah yeah, forgot it wasn't implemented in V8 yet (bug).

aha, here it is: https://people.mozilla.org/~jorendorff/es6-draft.html#sec-iteratorclose

Still don't understand why it is not a part of the iterator interface.

Interesting. Good to know we have an es master here!

plain node stream style looks good too:

var stream = fs.readdirStream(__dirname);
stream.on('data', function(dir) { ... })

is it possible to create an iterable version on top of it?

+1 to an object stream for this, that could probably be shoehorned into an iterator (?)
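One way that shoehorning could look — wrapping an object-mode stream's events in promise-returning next() calls. This is only a sketch; it glosses over edge cases such as a stream that has already ended:

function streamToAsyncIterator(stream) {
  return {
    next() {
      return new Promise(function (resolve, reject) {
        function onReadable() {
          var value = stream.read();
          if (value !== null) {
            cleanup();
            resolve({ done: false, value: value });
          }
        }
        function onEnd() { cleanup(); resolve({ done: true, value: undefined }); }
        function onError(err) { cleanup(); reject(err); }
        function cleanup() {
          stream.removeListener('readable', onReadable);
          stream.removeListener('end', onEnd);
          stream.removeListener('error', onError);
        }
        stream.on('readable', onReadable);
        stream.on('end', onEnd);
        stream.on('error', onError);
        onReadable(); // drain anything already buffered
      });
    }
  };
}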

So backing up to the object mode stream for fs.readDir (e.g. fs.createReadDirStream?): would it make sense for there to be a deep/recursive mode (e.g. similar to what I made with spider-stream)? At the moment doing this is tricky, as you either have to stat all the things in the directory, or do what rimraf doesn't do, which would be to assume everything is a directory and learn on error that it's a file (what rimraf does do doesn't really apply here).

Sounds like something for userland still. Perhaps made easier / safer by a streaming readdir

Perhaps made easier / safer by a streaming readdir

that is what I was going for

Looks like there might be some more movement on this: https://github.com/libuv/libuv/pull/416

To my understanding, it will require a major bump of libuv, so this might take a while to actually come to io.js / node.

With libuv/libuv#416 we could simply expose an API like this:

let dir = await fs.opendir("foo"), entries;
do {
  entries = await dir.read(100)
  // deal with entries
} while (entries.length === 100);
// when dir is finalized it calls closedir

Or with simple callbacks, if promises are too crazy.

FTR, Go's readdir requires you to limit the number of entries read. By default it's 100.

Also it hasn't been discussed much here, but not limiting readdir is a security risk — it allows virtually unlimited memory allocation initiated by user input. I.e. a file server implementation (or a web server, because everyone seems to like web servers) that allows uploading/adding files and listing files could be DoS'd by creating many files and listing the entries.

when dir is finalized it calls closedir

Not this. JS doesn't have finalization. It has garbage collection, which is never guaranteed to run. Scarce system resources like directory handles should never depend on garbage collection to collect them.

The stream design seems sane. Or yours, but with dir.close(). Alternately a lower-level API like const d = await fs.opendir(dir); const entry = await fs.readdir(d); fs.closedir(d) would be reasonable for core IMO.
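Spelling that lower-level shape out, with the handle closed even on error — these promise-returning fs.opendir/fs.readdir/fs.closedir signatures are the suggestion above, not a real API:

async function listAll(path) {
  const d = await fs.opendir(path);
  try {
    let entry;
    while ((entry = await fs.readdir(d)) !== null) {
      console.log(entry);
    }
  } finally {
    await fs.closedir(d); // released even if the loop body throws
  }
}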

Also it hasn't been discussed much here, but not limiting readdir is a security risk — it allows virtually unlimited memory allocation initiated by user input. I.e. a file server implementation (or a web server, because everyone seems to like web servers) that allows uploading/adding files and listing files could be DoS'd by creating many files and listing the entries.

This shouldn't be an issue with the streaming proposal, but should be with the current implementations I think.

@domenic I'm well aware how v8 works ;)

(BTW, object streams seem to be favoured by individuals who tend to seek out opportunities to create new abstractions — it doesn't add clarity and suffers in performance IMHO.)

TL;DR: async bridge toll costs would make reading single dirents unfeasible.

libuv/libuv#416 exposes the following interface:

int uv_fs_opendir(uv_loop_t*, uv_fs_t*, const char* path, uv_fs_cb);
int uv_fs_readdir(uv_loop_t*, uv_fs_t*, const uv_dir_t*, uv_dirent_t[], size_t limit, uv_fs_cb);
int uv_fs_closedir(uv_loop_t*, uv_fs_t*, const uv_dir_t*, uv_fs_cb);

Where the readdir call reads up to limit amount of entries. There's a bridge toll for any uv async call (which can be quite dramatic if you perform many calls, like i.e. a compiler would do.) It makes sense to read a sensible amount of dirents per uv request, to avoid paying high bridge tolls.

Revised proposal for exposure in node

let fs = require("fs");
fs.openDir("foo", function (err, dir) {
  if (err) { throw err; }
  let readNext = function() {
    dir.read(100, function (err, entries) {
      if (err) { throw err; }
      // deal with entries
      if (entries.length === 100) {
        readNext();
      } else {
        dir.close(function(err) {
          if (err) { throw err; }
          // done reading directory
        });
      }
    });
  }
  readNext();
});

Addition to lib/fs.js:

let fsb = process.binding("fs")

export class Dir {
  read(limit :number, cb :(err:Error,entries:DirEnt[])=>void) :void {}
  close(cb: (err:Error)=>void) :void {}
}

interface DirEnt {
  name :string;
  type :string; // "dir"|"file"|"link"|"fifo"|"socket"|"char"|"block"|"?"
}

export function openDir(path :string, cb :(err:Error,dir:Dir)=>void) {
  fsb.opendir(path, (err, handle) => {
    let dir;
    if (!err) {
      dir = new Dir();
      dir._handle = handle;
    }
    cb(err, dir);
  })
}

Dir.prototype.read = function(limit, cb) {
  if (!limit || limit < 1) { limit = 100 }
  fsb.readdir(this._handle, limit, (err, entries) => {
    if (err) {
      fsb.closedir(this._handle)
      this._handle = null
    }
    cb(err, entries);
  })
}

Dir.prototype.close = function(cb) {
  if (!this._handle) { throw new Error('already closed') }
  fsb.closedir(this._handle, cb)
  this._handle = null;
}

Alternative w/ promises

let fs = require("fs");
let dir = await fs.openDir("foo"), entries;
do {
  entries = await dir.read(100)
  // deal with entries
} while (entries.length === 100);
await dir.close();
// done reading directory

Addition to lib/fs.js:

let fsb = process.binding("fs")

export class Dir {
  async read(limit :number) :Promise<DirEnt[]> {}
  async close() :Promise<void> {}
}

interface DirEnt {
  name :string;
  type :string; // "dir"|"file"|"link"|"fifo"|"socket"|"char"|"block"|"?"
}

export function openDir(path :string) :Promise<Dir> {
  return new Promise((resolve, reject) => {
    fsb.opendir(path, (err, handle) => {
      if (err) {
        reject(err);
      } else {
        let dir = new Dir;
        dir._handle = handle;
        resolve(dir);
      }
    })
  })
}

Dir.prototype.read = function(limit) {
  if (!limit || limit < 1) { limit = 100 }
  return new Promise((resolve, reject) => {
    fsb.readdir(this._handle, limit, (err, entries) => {
      if (err) {
        fsb.closedir(this._handle)
        this._handle = null
        reject(err);
      } else {
        resolve(entries);
      }
    })
  })
}

Dir.prototype.close = function() {
  return new Promise((resolve, reject) => {
    if (!this._handle) { throw new Error('already closed') }
    fsb.closedir(this._handle, err => {
      if (err) { reject(err); } else { resolve(); }
    })
    this._handle = null;
  })
}

How are things going with this?

@hollowdoor This is dependent on this libuv PR: https://github.com/libuv/libuv/pull/416

This has been a while and will probably be a while longer. From https://github.com/libuv/libuv/pull/416#issuecomment-218483712:

There is no defined roadmap as of yet. v2 won't happen overnight, that's for sure. Time is a scarce resource, I'm afraid. Personally I haven't had the time to work on any of the things I wanted for v2 yet.

So for this to happen, if I understand everything correctly:

  • That PR has to land in libuv.
  • libuv has to release v2.
  • Node.js has to start using the v2 release of libuv.
  • Then this issue can be addressed.

I'm not sure if this is a good use of the stalled label or not, but I'll just leave this comment for anyone curious if there's been progress since October...

Any progress on this?

@DomVinyard It would appear no. If you want to see this happen, it requires https://github.com/libuv/libuv/pull/416 so maybe head over there to see if there's a way you can help. /cc @whitlockjc

@Trott Appreciate the fwd. For anybody else awaiting this to solve the same subset of problems as me - a combination of native readdir and stream iterator gets you some of the way.

Almost three months later, still waiting eagerly on this. native-readdir is a nice shim but it's not compatible with Windows. Does anybody have any other solutions?

There's really nothing we can do on this until the appropriate support lands in libuv and that is updated here. Best would be to try to go shake the tree on that side.

I'll be picking this up tonight, I apologize for the delay. Job changes and such have pulled me away from Node.js for a few months. I'm back and will get libuv support in ASAP.

just fyi the streams wg is broadly in favor of adding async iterator support to streams once v8 supports it.

Why can't we have just a simple promise if callback is omitted, like for example

fs.readDir(path[, options])
  .then(files=>console.log(files))
  .catch(err=>console.log(err));

I know I can use https://github.com/thenables/thenify but I assumed that since node supports Promises it would use them...

@zevero Wouldn't that present the same problem as a callback for reading large masses of file names? The benefit of streams is that they save memory. A stream can be abstracted as a promise to get the benefits of a promise. A promise can't be abstracted as a stream to get the benefits of a stream.
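To illustrate the asymmetry: collapsing a (hypothetical) readdir stream into a promise of the full array is trivial, while the reverse direction cannot recover incremental delivery:

function readdirAsPromise(path) {
  return new Promise((resolve, reject) => {
    const names = [];
    fs.createReadDirStream(path) // hypothetical API from this thread
      .on('data', (entry) => names.push(entry))
      .on('end', () => resolve(names))
      .on('error', reject);
  });
}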

Indeed. Promises are a bit off-topic here.

On a related note though, there has been discussion about streams supporting async iterators in the future, enabling async for-of to work. https://github.com/nodejs/readable-stream/issues/254

Should this remain open?

@Trott Is there a streaming API for fs.readdir yet?

Here's what has to happen for this, now in a new and exciting CHECKLIST. Progress!

  • [ ] libuv/libuv#416 has to land in libuv
  • [ ] libuv has to release v2.
  • [ ] Node.js has to start using the v2 release of libuv.

Then this can be implemented.

Fair enough! Crazy to me that this still isn't possible yet.

We know what we need to make it happen, we just can't do it yet. Unfortunately. It's definitely something I'd like to have.

@whitlockjc How is progress? The PR on libuv looks complete, but failing tests.

I wonder if this is something async iterators could be used for, on top of a promise-based API.

/cc @jasnell @mcollina

@MylesBorins I think so, but we need libuv support first. Without the ability to receive all the dir content in a chunked fashion, we can't do much here.

After https://github.com/nodejs/node/pull/17755 lands, we have async iterators support here for free if we expose it as a stream.

Specifically, without the libuv support, there's absolutely zero performance advantage to this at all.
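To illustrate the earlier point: once streams implement Symbol.asyncIterator (via nodejs/node#17755), exposing the listing as a stream would make this work with no extra API surface (the stream constructor name here is hypothetical):

async function list(path) {
  for await (const entry of fs.createReadDirStream(path)) { // hypothetical
    console.log(entry);
  }
}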

For reference, here is the libuv issue

https://github.com/libuv/libuv/pull/416

Running into a situation atm where this would be nice...

Indeed. I think it's time to revisit.

IMHO This could be remedied by introducing a new dir/glob function to node_file:

fs.dir(pattern [, options])

It only needs to implement the simplest wildcards ? and *, using a C function like nftw()

I think it would be acceptable for this function to return a promise of an array, since the programmer can control the memory usage by choosing the appropriate pattern.

The advantage of this approach is that it preserves backwards compatibility, while enabling use in large scale operations, which can easily require handling of 10^6+ files, without gobbling up memory.
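Hypothetical usage of that proposed fs.dir(pattern) API (it never landed):

fs.dir('/var/log/*.log') // hypothetical
  .then((files) => files.forEach((f) => console.log(f)))
  .catch(console.error);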

Hey @paragi

It would be nice to have that fs.dir() function.
Notice this question - it might be that nftw is not the best starting point.

Anyway I don't think it really removes the need for directory streaming.
E.g. when walking an entire filesystem tree you might stumble on a large folder, or the filtering might not be based on name patterns, e.g. by file type (dir/file/other).
It is unfortunate that the PR https://github.com/libuv/libuv/pull/416 has been stuck for a while now.

@guymguym You are right. The Boost.Filesystem library is probably the right choice for C++.

The advantage of a fs.dir() over a stream version of fs.readdir() would be that the matching is done at the lowest possible level.
IMO it's two different use cases. But there is of course no good reason not to mix them :)

This looks promising! Looking forward to test :1st_place_medal:

@paragi i have created a nice wrapper around the native operating system find command that works on macOS, Linux, and Windows, and returns a real observable stream this way until libuv is ready

  • [X] libuv/libuv#2057 has to land in libuv
  • [ ] libuv has to release v1.x.
  • [ ] Node.js has to start using the v1.x release of libuv.

Then this can be implemented.

_made changes to the tasks_

It has landed, and on the v1.x branch, so it just has to be released and Node.js needs to update to use it. There are some caveats mentioned in the PR that should be read before trying to figure out how the API should look in Node.

Should this API also support returning Dirent values?

Should this API also support returning Dirent values?
To me, filename and a boolean is_directory would suffice.

You might consider implementing it as an added directory separator at the end of a directory name, so that it just returns a string per entry.

Thank you guys for the effort!

It has landed, and on the v1.x branch, so it just has to be released and Node.js needs to update to use it. There are some caveats mentioned in the PR that should be read before trying to figure out how the API should look in Node.

That's not entirely necessary, updating libuv to a non-release is pretty easy and it's what I did.

I'm working on this for now, and will hand off to Colin if I get stuck. The whole uv_dir_t thing is a bit of an internals paradigm change so there's some extra work to do under the hood making a new handle class (DirHandle::) and whatnot.

Should this API also support returning Dirent values?

My plan is to have it _only_ return Dirent values. Hooray future.

Fwiw I'm presently focused on making this work directly as just an iterator (both sync and async). I'll see how it pans out.

I'm having a hard time imagining anyone would _actually want_ a _stream_ of directory entries...

i think stream was only suggested because async iterables weren't really a thing back in 2015.

i think stream was only suggested because async iterables weren't really a thing back in 2015.

I agree.

I can imagine 3 use cases, where you want to iterate over:

  1. a list of all entries
  2. a list of some selected files. (glob functionality)
  3. (rare) a list of files selected on very complex criteria.

I don't think you need to support the 3rd.
You need to be able to do it on a very large dataset (10 million+ entries).
Preferably doing wildcard selection at the library level.
I really don't see the need to return anything other than the filename (dirs appended with a dir separator)

We're getting there — here's a taste: https://twitter.com/Fishrock123/status/1159337458091167745

There are very valid use cases for streaming a list of directory entries. We have one where we need to check for the presence of millions of directories to audit data, for instance (you don't want to know the file count). We check the actual files separately.
And it's easy to say at face value that we can implement it differently, but at the code & data level, iterating files is just not an option.

@rijnhard That's gonna add a whole bunch of overhead...

But to be clear, you want to: have an object stream of directory entries, transform those into file reads, in the same stream?

That doesn't really make sense to me. You end up multiplexing/splitting the stream anyways, and that leads to a very similar can of worms as just "iterating" and "awaiting" until a stream is finished to pull the next entry.

Maybe you could elaborate more? I am presently avoiding making this a stream due to complexity and lack of solid use cases.

@Fishrock123 I think I misunderstood, I just need to iterate through directories and not their contents.

While you are working with it, would it be a lot of trouble to add a function with glob or just '?' and '*' wildcard functionality?
I presume performance and memory usage would vastly improve when working on very large filesystems, if you do the scanning in libuv.
It would be much appreciated :1st_place_medal:

@paragi This API will do no filtering and will not return entries from subdirectories. That will be up to a module to do.

Ok folks, the pull request is up: https://github.com/nodejs/node/pull/29349

const fs = require('fs');

async function print(path) {
  const dir = await fs.promises.opendir(path);
  for await (const dirent of dir) {
    console.log(dirent.name);
  }
}
print('./').catch(console.error);

Landed in cbd8d715b2286e5726e6988921f5c870cbf74127 as fs{Promises}.opendir(), which returns an fs.Dir, which exposes an async iterator. 🎉

This has been released in 12.12.0

What about also making directories sync iterable (as initially suggested)?

I think this could be done by using dir.readSync().

The commit message could be "fs: make directories sync iterable".

@ma11hew28 sorry to tell you that sync can't be iterable as it's sync :) an iterable is an async type

There are sync iterators too. It's entirely possible to make a sync version.

@Qard why should I use an iterator for a sync call that will return after everything is already in memory? But OK, you're right, it could exist; it can be done. I myself would suggest for...of as the iteration method, but OK.

Thank you, @frank-dspeed and @Qard, for responding. I'm sorry for not responding promptly. @frank-dspeed, I'm sorry, but I think you misunderstood me. What you suggested is what I meant, as it is the first part of what @jonathanong initially suggested. I.e., if we make directories sync iterable, then you could do something like this:

const fs = require('fs')

const dir = fs.opendirSync('.')
for (const dirent of dir) {
  console.log(dirent.name)
}

instead of something like this:

const fs = require('fs')

const dir = fs.opendirSync('.')
let dirent
while ((dirent = dir.readSync()) !== null) {
  console.log(dirent.name)
}
dir.closeSync()

As for the second sentence of my initial comment on this issue, by "this", I meant "making directories sync iterable". I.e., I think a directory's default sync iterator's next() method could call dir.readSync() to get the directory's next entry.
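A sketch of that wiring on top of the existing readSync()/closeSync() — not the actual core implementation, just an illustration:

const fs = require('fs')

// Hypothetical: give fs.Dir a default sync iterator backed by readSync().
fs.Dir.prototype[Symbol.iterator] = function () {
  return {
    next: () => {
      const dirent = this.readSync()
      if (dirent === null) {
        this.closeSync()
        return { done: true, value: undefined }
      }
      return { done: false, value: dirent }
    },
    // for-of calls return() on break/throw, so the handle still closes
    return: (value) => {
      this.closeSync()
      return { done: true, value }
    }
  }
}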

@frank-dspeed It _doesn't_ have to all be loaded into memory with a sync iterator. If you have a directory with millions of entries in it, a sync iterator could read and return entries one at a time, or in batches, but synchronously.
