Go: path/filepath: add WalkDir (Walk using DirEntry)

Created on 16 Oct 2020 · 20Comments · Source: golang/go

There are a few annoyances with filepath.Walk, but the biggest problem is that it is needlessly inefficient. The new ReadDir API (#41467) provides a way to avoid the inefficiency in the implementation, but that must be paired with a new API that does not offer a FileInfo to the callback, since obtaining the FileInfo is the expensive part.

41974 proposes a new API with an iterator object. That may or may not be a good idea.

If that one doesn't work out, here's a smaller change: add WalkDir that replaces FileInfo with DirEntry but otherwise behaves exactly the same as Walk:

type WalkDirFunc func(path string, entry fs.DirEntry, err error) error
    WalkDirFunc is the type of the function called for each file or directory
    visited by WalkDir. The path argument contains the argument to WalkDir as a
    prefix; that is, if WalkDir is called with "dir", which is a directory
    containing the file "a", the walk function will be called with argument
    "dir/a". The info argument is the fs.DirEntry for the named path.

    If there was a problem walking to the file or directory named by path, the
    incoming error will describe the problem and the function can decide how to
    handle that error (and WalkDir will not descend into that directory). In the
    case of an error, the info argument will be nil. If an error is returned,
    processing stops. The sole exception is when the function returns the
    special value SkipDir. If the function returns SkipDir when invoked on a
    directory, WalkDir skips the directory's contents entirely. If the function
    returns SkipDir when invoked on a non-directory file, WalkDir skips the
    remaining files in the containing directory.

func WalkDir(root string, fn WalkDirFunc) error
    WalkDir walks the file tree rooted at root, calling fn for each file or
    directory in the tree, including root. All errors that arise visiting files
    and directories are filtered by fn. The files are walked in lexical
    order, which makes the output deterministic but means that for very large
    directories WalkDir can be inefficient. WalkDir does not follow symbolic links.

The only changes here are s/Walk/WalkDir/g and s/FileInfo/DirEntry/g.

Proposal Proposal-Accepted Proposal-FinalCommentPeriod

Source

rsc

👍4

Most helpful comment

To me, the ability to easily convert callers using filepath.Walk to use io/fs and the DirEntry API is compelling. Any cleaner / simpler / different-style API is not likely to address that use-case, so it seems to me that we will want some API very close to filepath.Walk regardless of whether we _also_ decide to add an iterator-based API.

WalkDir addresses the known use-case of “something close enough to filepath.Walk to make porting easy”, so I think we should go ahead and add it. Then we can decide the separate use-case of “something simpler or less error-prone than filepath.Walk” without the time pressure of holding up the migration of existing code to io/fs.

bcmills on 23 Oct 2020

👍3

All 20 comments

Change https://golang.org/cl/243916 mentions this issue: io/fs: add Walk

gopherbot on 18 Oct 2020

I've retracted #41974. Let's focus on this (WalkDir) as the replacement for Walk.

What I like most about this API is that updating existing code requires almost no effort at all: change Walk to WalkDir, change FileInfo to DirEntry, maybe rename info to d, and in 90% of cases you're done and have a more efficient file system traversal that does the same thing as the original code.

rsc on 21 Oct 2020

👍1

There is one detail that still bothers me that might be worth changing.

Suppose you want to do a walk but ignore all testdata directories. In the WalkFunc you write:

if filepath.Base(path) == "testdata" {
    return filepath.SkipDir
}

That works fine. But behind the scenes, filepath.Walk already did a full directory read from testdata _before_ calling the WalkFunc. Because the dir ended up being skipped, that ReadDir was entirely wasted effort. Often one reason to skip a directory is that it's big (like a cache). Doing a ReadDir on a big directory that you are going to skip is unfortunate. (It's nice that it's a ReadDir and not a Readdir, so you didn't spend tons of time calling Stat on every entry on Unix systems, but still, it's wasted effort.)

One of the complications of #41974 was defining that both entry and exit from a directory appeared in the iteration; the equivalent here would be calling the WalkFunc twice for a directory: both before and after. We clearly do not want to do that.

But I wonder if instead we should define that a directory read error (only) can result in a second call to the WalkFunc with the same path, to report the error. That is, to walk a directory, WalkDir does basically:

err := walkFn(path, d, nil)
if err != nil && err != SkipDir {
    return err // usual early out
}
if err != SkipDir {
    all, err := ReadDir(path)
    for _, child := range all {
        ... recursively handle path+/+child.Name() ...
    }
    if err != nil {
        walkFn(path, d, err)
    }
}

In addition to avoiding an expensive ReadDir that is not needed, this has the benefit of presenting the early children of a directory even if the directory read fails later in the directory. The current filepath.Walk throws away any children that were found when a read error also occurs. That's clearly a mistake, which would be good to fix in a new API but may be too subtle to fix in the existing API. And then the error is reported after the children that are available.

The one downside of course is that the WalkFunc is called twice for a directory with a read error: once for the existence of the directory itself, which is error-free, and then again when the directory read fails. This seems like a clearer separation of concerns, but at the cost of two calls with the same path.

Over on https://github.com/golang/go/issues/41974#issuecomment-708597849, @ianlancetaylor wrote:

As far as I can tell the Exiting method exists only to clearly report an error from ReadDir on a directory. Since errors are rare, I agree that this seems like an undesirable API complication. Personally I think it would be OK for the error case to return the directory a second name--same Name, same IsDir--but with a non-nil Err.

The "extra callback only for directory read error" I'm suggesting here is the equivalent to what Ian suggested, but for the callback API. It seems like a reasonable solution to me.

What do other people think about doing this?

rsc on 21 Oct 2020

I expect the vast majority of WalkFunc implementations are going to do something along the lines of:

    if err != nil {
        log.Print(err)
        return nil
    }

or equivalent, or

    if err != nil {
        return err
    }

right at the beginning.

For the latter functions, the second call for a ReadDir error seems fine: it potentially does a little more work up-front for directories that are probably just going to waste, but errors are unusual so that's no big deal.

For the former functions — the ones that just log.Print and carry on — it's also no big deal. They'll produce a little more output, but they seem intended to produce best-effort output anyway.

The only WalkFuncs that will have a problem are the ones that don't return early in case of a non-nil error _and_ are also not idempotent for a given input. But I suspect that those are rare compared to functions that return and functions that are idempotent (such as those that aggregate into a map).

bcmills on 21 Oct 2020

👍2

Using a second WalkDirFunc call to report errors appears to be the best option for this function signature. That allows callers to process and consider descending some paths, and receive errors separately when there is an Open/ReadDir failure.

Adding filepath.WalkDir as a similar but faster API would be useful. That would make it relatively easy to upgrade code in the wild - I would use it now.

Just to clarify, is the intent to add a potentially different fs.FS walk API to the standard library later (eg, go1.17+)? I would prefer a better walk API for io/fs (or path/fspath?). This would benefit from broader experimention with fs.FS walk APIs before committing.

mpx on 22 Oct 2020

@mpx, I think io/fs needs a walk API at the start, and if filepath.WalkDir exists, then fs.WalkDir should too.
Just as the original io/fs proposal adopted filepath.Walk, if we create filepath.WalkDir, then the io/fs work will adopt WalkDir instead.

That doesn't preclude adding another one later, but it does make it unlikely without a really compelling case.
Most importantly, the combination of ReadDir and io/fs mean that some other general walker can be implemented without being required to have OS-specific code in it for efficiency, so having more sophisticated APIs in third-party packages is not as big a problem as it is today.

rsc on 22 Oct 2020

👍2

@rsc

That doesn't preclude adding another one later, but it does make it unlikely without a really compelling case.
Most importantly, the combination of ReadDir and io/fs mean that some other general walker can be implemented without being required to have OS-specific code in it for efficiency, so having more sophisticated APIs in third-party packages is not as big a problem as it is today.

That seems like an equally compelling argument for not including any official API (or starting it out in golang.org/x/ until it's worth is proven)

jimmyfrasche on 22 Oct 2020

👍1

(I have no beef in this game, but...)
Maybe try a small experiment to see what the gains are, by implementing something like
func collectFileList(roots []string, includePatterns []string, excludePatterns []string) []string
?

(One of just a couple cases I have ever used Walk API was to traverse, show to user, and later compress list of log files. The collect function did have params to include and exclude patterns, which could be applied to all subitems under "roots". (if include was nil or empty, it meant "everything", unless it is not in exclude). This wasn't a big exercise to do, but maybe a useful one here to see what you can gain with a new traversal api.)

tandr on 23 Oct 2020

I'm not convinced io/fs needs to have a walk implementation on day one - especially given how little time there is until the freeze to be confident of finding a good/better API. Anyone who needs to walk fs.FS will be able to implement something for go1.16 - I'm sure a few options will appear fairly quickly. Perhaps even added to golang.org/x?. Any inconvenience from the delay will be soon forgotten after a more considered API is added in go1.17.

The opportunity to explore and provide a better API should be balanced against the benefit of having a standard library implementation immediately. If the later wins, the current proposal seems relatively safe since it's close to the existing approach.

mpx on 23 Oct 2020

bcmills on 23 Oct 2020

👍3

@mpx, I hear you, but as I noted above, I disagree. io/fs should be as capable as the existing library routines.
If we have filepath.WalkDir, we can have fs.WalkDir too.

rsc on 28 Oct 2020

Based on the discussion above, this (including the extra callback for reporting directory read errors) seems like a likely accept.

rsc on 28 Oct 2020

I agree w/@bcmills argument for supporting easy migration of existing code - this does seem like the best approach for now. Better APIs can still be provided elsewhere, and potentially considered for the standard library if there is enough benefit.

mpx on 29 Oct 2020

Change https://golang.org/cl/266240 mentions this issue: path/filepath: add WalkDir

gopherbot on 29 Oct 2020

// The files are walked in lexical order, which makes the output deterministic but means that for very
// large directories Walk can be inefficient.

I hava a question here may not related to this proposal. why we need this deterministic ? it is unfortunate for a large directory.
Seems Python os.scandir and GNU find does not sort the dir. Can we just relay on the order of ReadDir?

qingyunha on 30 Oct 2020

A lot of existing code benefits from assuming paths are walking in lexigraphical order. The proposed (sorted) API will continue to simplify new usage, and support easier migration of old code.

Ideally, it would be good to support some different tradeoffs. Eg, conserving memory vs file descriptors, performance vs ease of use,... However, this isn't practical with the proposed API since it would need to be parameterised.

It will be easy enough to create a different implementation/API supporting a different set of tradeoff when it matters. Eg, some code will see significant performance improvements from keeping file descriptors open and processing in DirEntry order. Maybe one of these APIs might be clean enough and desirable enough to add to the standard library in future.

mpx on 30 Oct 2020

No change in consensus, so accepted.

rsc on 4 Nov 2020

Change https://golang.org/cl/267719 mentions this issue: all: update to use filepath.WalkDir instead of filepath.Walk

gopherbot on 5 Nov 2020

Reopening because CL 266240 was reverted in CL 267798; it needs to be re-sent.