Fd: deterministic order - by default or by option flag

Created on 31 Oct 2017  路  18Comments  路  Source: sharkdp/fd

I'm pretty sure I get a different ordering of results on occasion when I use the app in the regular manner versus when I pipe it to some other app. (I had a concrete example of this but lost it and now can't find.) This would be the case perhaps if you collected results in parallel but did not force a deterministic output except when gathering it to deliver to a pipe (or vice versa).

If this is the case, it's bad for me since sometimes I pipe the results to other commands, for example, | line 2 to extract the second result, but now the desired item is in a different position.

Please add an option to lock it to deterministic output or enable determinism by default.

If I'm mistaken, please close.

question

All 18 comments

Thank you for the feedback.

I'm pretty sure I get a different ordering of results on occasion when I use the app in the regular manner versus when I pipe it to some other app.

That is entirely possible. The current situation is the following:

  1. fd traverses directories in parallel (for performance reasons).
  2. Due to parallelism, the output order is (in principle, see next point) non-deterministic.
  3. To compensate for (2), fd internally buffers results for a short amount of time T. If the search finishes in a time smaller than T, the buffered results are sorted before printing them to the console. If the search has not finished when T is reached, the unsorted buffer is flushed to the console and subsequent results are printed in the order they are discovered. The buffer time T can be configured with the (hidden) --max-buffer-time option. The default value for T is 100 ms (I chose this value such that a human would not notice the short lag).

Without any changes, there are two options right now:

  • You can disable parallelism by setting the number of threads to 1 via --threads 1 or -j1. In this case, the filesystem traversal is done in a depth-first fashion by a single directory-tree walker. Note that you are still left with potential non-determinism that might be coming from the OS/filesystem.
  • You can set the buffer time to a very large time-span (admittedly, a hacky solution). This has the negative side effect that search results will only become available when the search has finished.

Also, would it be an option to simply pipe to sort first?

Consider this:

If you are spawning DFSearches s1, s2, ..., sn, then you can report the results of s1...sj so long as all searches strictly below j have finished returning. For thej-th item you can report intermediate progress as it happens. (This assumes your threads partition the space in some kind of non-interlocking, ordered way.)

--threads 1 well that's no fun
| sort will always block on full result set (for obvious reasons)
--max-buffer-time useful but yes v. hacky

Thank you for your suggestion.

If you are spawning DFSearches s1, s2, ..., sn, then you can report the results of s1...sj so long as all searches strictly below j have finished returning.

Correct me if I'm mistaken, but I don't think that this will help a lot.

Let's say it takes a time T to walk the whole search space sequentially. If we parallelize the search across n threads, each of these small searches will (on average) take the same time t. Ideally, we would have t = T/n, but in practice we will have t > T/n of course. Since we will not be able to divide the search space evenly, some of these searches will be faster than t and some will be slower than t.

So the output sequence will look like this:

  • For times t < t_1, we can only show results from thread s1 -- here, t_1 is the time for s1 to finish (approximately equal to t).
  • By now, most of the other searches will also have finished and we will pretty soon be able to print out the rest of the results.

This might be a viable solution for searches that have a huge number of results. However, if there are only a few results (on the order of the number of threads or smaller), we will typically have to wait for a time t_1 until we see the first result.

The latter point is why I would not like to make this the default behavior. I'm open to the idea of implementing this as an option flag, but I'm afraid we won't gain too much over fd .. | sort (we would be able to see 1/n-th of the search results earlier).

A solution could be dividing the whole work into M chunks (could be more than N) with each chunk to be sufficiently small, and these chunks are assigned to the N threads in a sequential manner. Although looking difficult, there could be an adaptive algorithm to do this.

What about the following algorithm?

  1. Suppose the whole directory is small enough, and divide it into N chunks. Schedule each chunk to be processed by a thread.
  2. When 100ms has passed, if all threads finished, we are done. Otherwise, we choose the first unfinished thread. We pause the other unfinished threads, and recursively apply the same algorithm to the remaining work of that first unfinished thread.
  3. When the remaining work of the first unfinished thread is done, we then process the second, third, etc.

Since we are not doing the parallelism ourselves but are using the Ignore crate to handle it our hands may be relatively tied. The walker will create n worker threads and each worker will pull the directory that it will work on from a queue. The order of directory traversal is non-deterministic like @sharkdp said, and the only way it looks like we can fix that is by processing the data somehow before printing it.

I understand that it may be difficult to implement, but this could be a useful idea for future reference.

but I'm afraid we won't gain too much over fd .. | sort (we would be able to see 1/n-th of the search results earlier).

Even | sort needs to get the last input to decide the first line to print. So what a parallel algorithm can optimize is to do more (meaningful) sorting while waiting for more input.

Further, the directory traversal order does help to determine what can be printed before the rest are sorted. | sort cannot acquire this information.

@sharkdp I went back and reread the code for the Ignore package and that buffer I spoke about in my last comment is filled with paths that come from either the initial constructor of Walker or from the .add method. That means that when more than one thread is specified but only one path is provided then only one thread will actually search that path and the rest are unused. That means the buffer in the output thread should be filled in a deterministic way assuming we in fact use it like a fifo queue.

See source code for run:

for path in self.paths {
    let dent =
        if path == Path::new("-") {
            DirEntry::new_stdin()
        } else {
            match DirEntryRaw::from_link(0, path) {
                Ok(dent) => DirEntry::new_raw(dent, None),
                Err(err) => {
                    if f(Err(err)).is_quit() {
                        return;
                    }
                    continue;
                }
            }
        };
    queue.push(Message::Work(Work {
        dent: dent,
        ignore: self.ig_root.clone(),
    }));
    any_work = true;
}

where paths is a vector with each element coming from either the initial walker constructor or from the .add method.

Each thread corresponds to one worker

for _ in 0..threads {
    let worker = Worker {
        f: mkf(),
        queue: queue.clone(),
        quit_now: quit_now.clone(),
        is_waiting: false,
        is_quitting: false,
        num_waiting: num_waiting.clone(),
        num_quitting: num_quitting.clone(),
        threads: threads,
        parents: self.parents,
        max_depth: self.max_depth,
        max_filesize: self.max_filesize,
        follow_links: self.follow_links,
    };
    handles.push(thread::spawn(|| worker.run()));
}

And each worker pops one item off the queue to work on. Since only one item should be in the queue in the current implementation then only one thread ever does any work no matter how many threads we specify.

Since only one item should be in the queue in the current implementation then only one thread ever does any work no matter how many threads we specify.

No. Notice that workers will constantly push to the shared queue while traversing their current directory:

https://github.com/BurntSushi/ripgrep/blob/d73a75d6cd82068252c35c5718900b6a1acb296e/ignore/src/walk.rs#L1179-L1182

You're right the workers push directories back to the queue but when only one directory is given there can only be a maximum of one item on the queue at a time.

  1. Once the first worker thread pop's the directory off the queue all other threads wait for work
  2. The code section you linked in run_one pushes at most one item back onto the queue
  3. One item on the queue get's pop'd off the queue by some other thread, queue size is zero again
  4. continue until no more work can be done.

At most one thread is ever doing anything but multiple threads will be spawned waiting for work.

My hunch is that with the right data and processor the computation time teeters on the max buffer time and when output is piped, making run time a bit shorter, the data is sorted and when printed to the screen then it takes a bit longer and output is not sorted.

@sharkdp, Okay I concede threads don't just push the current directory they are working on onto the queue they push the contents of the directory so if a thread has a directory with n items in it n items will be pushed to the queue for other workers to work on.

Disregard my previous comments

@Doxterpepper Sorry for the late reply

they push the contents of the directory so if a thread has a directory with n items in it n items will be pushed to the queue for other workers to work on.

Yes, that's what I also thought :+1:. Should have been more clear.

I'm going to close this (for now), as:

  • There is no clear path on how to implement this without a substantial decrease in performance (concerning the availability of the first results, not the overall runtime)
  • There are several workarounds that work just fine.

If somebody feels that I'm mistaking, please tell me and we can reopen this ticket.

This has now been implemented with the new -l/--list-details option of fd which uses ls -l in the background to get a deterministic sort order.

This has now been released in fd v8.0.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

kclevenger picture kclevenger  路  3Comments

mrzool picture mrzool  路  4Comments

mathomp4 picture mathomp4  路  3Comments

codyro picture codyro  路  4Comments

christianbundy picture christianbundy  路  3Comments