Ripgrep: RFC: add ngram indexing support to ripgrep

Created on 22 Feb 2020 · 30 comments · Source: BurntSushi/ripgrep

An alternative name for this issue is "ripgrep at scale."

I've been thinking about some kind of indexing feature in ripgrep for a long
time. In particular, I created #95 with the hope of adding fulltext support to
ripgrep. But every time I think about it, I'm stymied by the following hurdles:

  1. The work required to make code oriented relevance search work well.
  2. The difficulty of correctly synchronizing the state of the index with the
    state of the file system.

I think what I'm coming to realize is that we could just declare that we will
do _neither_ of the above two things. My claim is that we will still end up
with something quite useful!

So I'd like to describe my vision and I would love to get feedback from folks.
I'm going to start by talking about the feature at a very high level and end
with a description of the new flags that would be added to ripgrep. I have
not actually built this yet, so none of this is informed by actually using
the tool and figuring out what works. I have built
indexing tools
before, and my day job revolves around information retrieval, so I do have some
relevant background that I'm starting with.

While the below is meant more to describe the UX of the feature, I have added a
few implementation details in places mostly for clarity purposes.

I'd very much appreciate any feedback folks have. I'm especially interested in
whether the overall flow described below makes sense and feels natural (sans
the fact that this initially won't come with anything that keeps the index in
sync). Are there better ways of exposing an indexing feature?

I'd also like to hear feedback if something below doesn't make sense or if a question about how things work isn't answered. In particular, please treat the section that describes flags as if it were docs in a man page. Is what I wrote sufficient? Or does it omit important details?

Search is exhaustive with no ranking

I think my hubris is exposed by ever thinking that I could do code aware
relevance ranking on my own. It's likely something that requires a paid team of
engineers to develop, at least initially. Indeed,
GitHub is supposedly working on this.

The key to simplifying this is to just declare that searching with an index
will do an exhaustive search. There should be no precision/recall trade off and
no ranking. This also very nicely clarifies the target audience for ripgrep
indexing: it is specifically for folks that want searches to go faster and are
willing to put up with creating an index. Typically, this corresponds to
frequently searching a corpus that does not fit into memory. If your corpus
fits into memory, then ripgrep's existing search is _probably_ fast enough.

Use cases like semantic search, ranking results or filtering out noise are
specifically not supported by this simplification. While I think these things
are very useful, I am now fully convinced that they don't belong in ripgrep.
They really require some other kind of tool.

ripgrep will not synchronize your index for you

I personally find the task of getting this correct to be rather daunting. While
I think this could potentially lead to a much better user experience, I think
an initial version of indexing support in ripgrep should avoid trying to do
this. In part to make it easier to ship this feature, and in part so that we
have time to collect feedback about usage patterns.

I do however think that the index should very explicitly support fast
incremental updates that are hard to get wrong. For example:

  • ripgrep should avoid creating two different index entries for the same file.
    That is, ripgrep should not allow duplicates in its index.
  • By default, ripgrep should not re-index files if their last modified date
    precedes their indexed date. That is, when you re-index a directory, the only
    cost you should pay is the overhead of traversing that directory and the cost
    of checking for and indexing files that have been changed since the last time
    they were indexed.
  • Re-indexing a single additional file should have a _similar_ performance
    overhead as re-indexing many additional files at once. While it's likely
    impossible to make the performance identical between these, the big idea here
    is that the indexing process itself should use a type of batching that makes
    re-indexing files easier for the user.

If I manage to satisfy these things, then I think it would be fairly
straightforward to build a quick-n-dirty index synchronizer on top of
something like the notify crate. So, if
it's easy, why not just put this synchronization logic into ripgrep itself?
Because I have a suspicion that I'm wrong, and moreover, I'm not sure if I want
to get into the business of maintaining a daemon.

Indexes are only indexes, they do not contain file contents

An index generated by ripgrep only contains file paths, potentially file meta
data and an inverted index identifying which ngrams occur in each file. In
order for ripgrep to actually execute a search, it will still need to read each
file.

An index does not store any offset information, so the only optimizations that
an index provides over a normal search are the following:

  1. ripgrep does not need to traverse any file hierarchy. It can read all of the
    file paths to search from the index.
  2. As mentioned above, ripgrep does not need to read and search the contents of
    every file. It only needs to read files that the index believes _may_
    contain a match. An index may report a false positive (but _never_ a false
    negative), so ripgrep will still need to confirm the match.

Moreover, in terms of incrementally updating the index for a particular
directory tree, ripgrep should only need to read the contents of files that
either weren't indexed previously, or are reported by the file system to have
been modified since the last time it was indexed. (ripgrep will have an option
to forcefully re-index everything, in case the last modified time is for some
reason unreliable.)

Also, as mentioned above, in addition to the file path, ripgrep will associate
some meta data with each file. This will include the time at which the file was
indexed, but might also include other metadata (like a file type, other
timestamps, etc.) that might be useful for post-hoc sorting/filtering.

Indexes will probably not be durable

What I mean by this is that an index should never contain critical data. It
should always be able to be re-generated from the source data. This is not to
say that ripgrep will be reckless about index corruption, but its commitment
to durability will almost certainly not rise to the level of an embedded
database. That is, it will likely be possible to cut the power (or network) at
an inopportune moment that will result in a corrupt index.

It's not clear to me whether it's feasible or worth it to detect index
corruption. Needing to, for example, checksum the contents of indexed files on
disk would very likely eat deeply into the performance budget at search time.
There are certainly some nominal checks we can employ that are cheap but will
not rise to the level of robustness that one gets from a checksum.

This is primarily because ripgrep will not provide a daemon that can amortize
this cost. Instead, ripgrep must be able to quickly read the index off disk and
start searching immediately. However, it may be worthwhile to provide the
_option_ to check the integrity of the index.

The primary implication from an end user level here isn't great, but hopefully
it's rare: it will be possible for ripgrep to panic as part of reading an index
in a way that is not a bug in ripgrep. (It is certainly possible to treat
reading the index as a fallible operation and return a proper error instead
when a corrupt index is found, but the fst crate currently does not do this
and making it do it would be a significant hurdle to overcome.)

In order to prevent routine corruption, ripgrep will adopt Lucene's "segment
indexing" strategy. A more in-depth explanation of how this works can be found
elsewhere, but effectively, it represents a partitioning of the index. Once a
segment is written to disk, it is never modified. In order to search the index,
one must consult all of the active segments. Over time, segments can be merged.
(Segmenting is not just done to prevent corruption, but also for decreasing
the latency of incremental index updates.)

Additionally, ripgrep will make use of advisory file locking to synchronize
concurrent operations.

Indexes may not have a stable format at first

I expect this feature will initially be released in an "experimental" state.
That is, one should expect newer versions of ripgrep to potentially change the
index format. ripgrep will present an error when this happens, where the remedy
will be to re-index your files.

New flags

I don't think this list of flags is exhaustive, but I do think these cover
the main user interactions with an index. I do anticipate other flags being
added, but I think those will be more along the lines of config knobs that
tweak how indexing works.

I've tried to write these as if they were the docs that will end up in
ripgrep's man page. For now, I've prefixed the flags with --x- since I
suspect there will be a lot of them, and this will help separate them from the
rest of ripgrep's flags.

-X/--index

Enables index search in ripgrep. Without this flag set, ripgrep will never
use an index to search.

ripgrep will find an index to search using the following procedure. Once a
step has found at least one index, subsequent steps are skipped. If an index
could not be found, then ripgrep behaves as if -X was not given.

  1. If any indexes are provided on the command line as directory paths to the
    index itself, then all of them are searched, in sequence.
  2. If RIPGREP_INDEX_PATH is set to a valid index, then it is searched.
  3. If there exists a .ripgrep directory in the current working directory and
    it contains a valid index, then it is searched.
  4. Step (3) is repeated, but for each parent directory of the current working
    directory.

If -X is given twice, then ripgrep will stop and report an error if an index is
present but the query cannot make use of it.

--x-crud

Indexes one or more directories or files. If a file has already been indexed,
then it is re-indexed. By default, a file is only re-indexed if its last
modified time is more recent than the time it was last indexed. To force
re-indexing, use the --x-force flag. If a file that was previously indexed is
no longer accessible, then it is removed from the index.

The files that are indexed are determined by ripgrep's normal filtering
options.

The location of the index itself is determined by the following procedure. Once
a step has found an index, subsequent steps are skipped.

  1. If --x-index-path is set, then the index is written to that path.
  2. If RIPGREP_INDEX_PATH is set, then its value is used as the path.
  3. If the current working directory contains a .ripgrep path and it is a
    valid existing index, then it is updated.
  4. Step (3) is repeated, but for each parent directory of the current working
    directory.
  5. If an index could not otherwise be determined, then it is written to the
    current working directory at .ripgrep. If .ripgrep already exists and is
    not a valid index, then ripgrep returns an error.

If another ripgrep process is writing to the same index identified via the
above process, then this process will wait until the other is finished.

As an example, the following will create an index of all files (recursively)
in the current directory that pass ripgrep's normal smart filtering:

$ rg --x-crud

And this will search that index:

$ rg -X foobar

Note that the index created previously can now be updated incrementally. For
example, if you know that foo/bar/quux has changed, then you can run:

$ rg --x-crud foo/bar/quux

to tell ripgrep to re-index just that file (or directory). This works even if
foo/bar/quux has been deleted, where ripgrep will remove that entry from its
index.

Prior work

I believe there are three popularish tools that accomplish something similar to
what I'm trying to achieve here. That is, indexed search where the input is a
regex.

(There are other similarish tools, like
Recoll
and
zoekt,
but these appear to be information retrieval systems, and thus not exhaustive
searching like what I've proposed above.)

ripgrep's index system will most closely resemble Russ Cox's codesearch. That
is, at its core, it is an inverted index with ngrams as its terms. Each ngram
will point to a postings list, which lists all of the files that contain that
trigram. The main downside of codesearch is that it's closer to a proof of
concept than a productionized thing that will scale. Its most pressing
weakness is its performance with respect to incremental updates.

qgrep and livegrep both represent completely different ways to tackle this
problem. qgrep actually keeps a compressed copy of the data from every file
in its index. This makes its index quite a bit larger than codesearch's, but
it can do well with spinning-rust hard drives due to sequential reading. qgrep
does support incremental updates, but will eventually require the entire index
to be rebuilt after too many of them. My plan with ripgrep is to make
incremental updates a core part of the design such that a complete re-index is
never _necessary_.

livegrep uses suffix arrays. I personally haven't played with this tool yet,
but mostly because it seems like incremental updates are slow or hard here.

Labels: enhancement, question

All 30 comments

Seems like a useful proposal with a sensible scope.

Sorry if this is bikeshedding minor details, but with my "what if this were a man page" hat on, I don't think --x-crud is a good flag name, especially with this definition of crud in mind. I would suggest --update-index as a possible alternative.

I currently solve this with a combination of a slightly changed csearch that just returns a list of file candidates which I then feed to ripgrep.

This RFC describes exactly what I need with a very precise scope. I'd love to see this implemented!

Cool!
Would this work with the --search-zip flag? Typically if I'm searching uncompressed on-disk data rg is already plenty fast.

The one thing that would really excite me is if this got integrated into cloud-scale search engines like Github's code search or AWS Cloudwatch Logs search. I've always wanted to use regex instead of word search for those things.

Would this work with the --search-zip flag?

Good question. I certainly think it should. It should also work with the --pre flag. Those flags will need to be specified during indexing though. And then probably again during search.

The one thing that would really excite me is if this got integrated into cloud-scale search engines like Github's code search or AWS Cloudwatch Logs search. I've always wanted to use regex instead of word search for those things.

My intent is to design this such that it should be able to scale "arbitrarily," so it's certainly possible. The main problem is that at that scale, you would need to ban regexes that search too much of the corpus. It is very possible to implement this ban, but whether that's an acceptable user experience, I'm not sure. For example, the regex \w has to search every file. (ripgrep will be able to detect trivial cases like that and just skip using the index altogether. Similar to how a relational database's query planner might not use your index if it needs to visit most of the data in the table.)

Nice. I've often thought about ngram-indexing in ack, but have never pursued it. I'm glad you did.

I would love to have this as I have been advocating for such indexing feature in the other issue as well.
Thanks for this RFC, I would love to take a look at the initial implementation and possibly consider remaking etsy/hound with a ripgrep index :D

It's great to hear that this is coming to ripgrep. I'm still using codesearch for a particularly large code base (around 8M loc). Notes about my experience with codesearch:

1) codesearch silently skips indexing of large files; I think the limit is 20K ngrams. The only fix for this was to manually change and recompile codesearch. If there is such a limit in ripgrep, it should be configurable and there should be a clear warning about it when indexing.

2) Because indexing can be quite slow, sometimes, after having some changes in the code that are not yet indexed, I prefer to use ripgrep instead of csearch to make sure that everything is included. I'm not sure if this is feasible, but it would be really nice if ripgrep could search using the index and then find files newer than the index and search them directly. In other words, have the option of using the index only for files older than the index, and use regular search for newer files.

@MikeFHay Have you tried using sourcegraph.com instead for searching repos on GitHub?

That is, it will likely be possible to cut the power (or network) at
an inopportune moment that will result in a corrupt index.

For people who index entire directory trees at once, rather than individual files, I'm hoping that indexes can get updated by atomic rename, which would prevent the majority of index corruption.

Also, I'm hoping that the index logic is provided as a separate crate that doesn't inherently depend on integration with path searching, so that it would be possible (for instance) to index the history of a git repository.

Enables index search in ripgrep. Without this flag set, ripgrep will never
use an index to search.

"Never" is a strong word; is there a reason you wouldn't want this to be the default at some point (in the far future)?

These are _amazing_ questions/feedback, folks. Really appreciated. Keep more coming like this!

@ceronman

codesearch silently ignores indexing of large files, I think the limit is 20K ngrams. Only fix for this was to manually change and recompile codesearch. If there is such limit in ripgrep, it should be configurable and there should be a clear warning about it when indexing.

Yeah, I noticed this and I've been thinking on what to do about it. It is enticing to adopt a similar heuristic in ripgrep because a single large file could really screw the pooch. That is, a large file probably has a lot of different ngrams in it, so it has a high likelihood of appearing in candidate sets generated by the index. Which would then kind of defeat some of the large gains of using an index in the first place, since you wind up searching that large file---which might take as long as it takes to search thousands of tiny files.

The main reason why I'm initially not a fan of having such a limit is because it's an _additional_ filter on files on top of what ripgrep does normally, but it's a filter that only makes sense in the context of indexing. One possible alternative is to make the filter opt-in. After all, ripgrep already has a --max-filesize flag. I guess it depends on how likely it is for a large file to make index searches unnecessarily slow for unsuspecting users.

But yes, if this does wind up being an opt-out feature specific to indexing, then ripgrep would emit a warning message for each large file skipped. And the limit would certainly be configurable.

Because indexing can be quite slow, sometimes, after having some changes in the code that are not yet indexing, I prefer to use ripgrep instead of csearch to make sure that everything is included. I'm not sure if this is feasible, but it would be really nice if ripgrep could search using the index and then find files older than the index and search them directly. In other words, have the option of using index only for files older than the index, and use regular search for newer files.

My hope is that ripgrep will make incremental indexing fast enough where you can just drop this particular behavior. :-)

I think if this were to exist, it would probably need to be an opt-in feature. In particular, one of the key benefits that an index will bring is the ability to avoid re-walking the directory tree, which can take non-trivial time. But I agree, some kind of flag to say "use the index but search newer files like normal" seems like a good idea.

@joshtriplett

For people who index entire directory trees at once, rather than individual files, I'm hoping that indexes can get updated by atomic rename, which would prevent the majority of index corruption.

Ah yeah interesting. I hadn't really thought of a "blow up the world and re-index" mode, but it probably makes sense to have something like that. I do hope that incremental indexing will negate the need for such things in most cases, but sure, "re-index the world" would indeed likely take the atomic rename strategy.

Also, I'm hoping that the index logic is provided as a separate crate

Yes, it will be contained in the grep-index crate.

that doesn't inherently depend on integration with path searching, so that it would be possible (for instance) to index the history of a git repository.

This may be trickier to avoid. I believe the implementation path in my head does indeed focus around having a path for each thing you want to search. I'll consider the git history use case though and see if something like it can be supported. Maybe the way to think about it is not "path searching" but rather "searching documents where each document has some unique ID that can be resolved to its content." Making that pluggable at the crate level certainly seems doable.

Enables index search in ripgrep. Without this flag set, ripgrep will never use an index to search.

"Never" is a strong word; is there a reason you wouldn't want this to be the default at some point (in the far future)?

I guess it just seems like surprising UX to me. Searching with an index can lead to subtle state synchronization bugs where the index is lagging behind your actual corpus (or some other sync issue). So to me, it seems like it should be opt-in.

Also, I think you might be giving more weight to "never" in this context than I intended. It's describing the behavior of a specific release of ripgrep, and not necessarily my future intent.

ripgrep's index system will most closely resemble Russ Cox's codesearch. That
is, at its core, it is an inverted index with ngrams as its terms.

You should consider using Bloom filters instead. Totally sidesteps the large file problem, has constant size which is a lot easier to manage, and uses less space. Or XOR filters, which are supposedly more compact: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/

Edit: That is using a Bloom filter instead of the inverted index, i.e. still a bloom filter over the ngrams.

@mpdn I know I dipped into implementation details, but I'd really like to keep this ticket focused on the higher level user story so that folks are more inclined to participate. If we bog it down into implementation details, then I fear the discussion may become too intimidating for most. So with that in mind, I created #1518 so that folks who want to talk details can. :-) I responded to your comment there.

There is some overlap between these goals and prior work on ngram indexing in Rust. Adding ngram support to ripgrep would be phenomenal! There was a C/C++ implementation called BigGrep, a set of tools, of which the one that does the indexing is called bgindex. At Geekweek in Nov. 2019, they started a Rust re-implementation called rs-bgindex, but we only had a week there.
https://gitlab.com/geekweek-vi/4.3-geekweek/rs-bgindex
https://gitlab.com/geekweek-vi/4.3-geekweek/rs-cccslib
https://gitlab.com/geekweek-vi/4.3-geekweek/croaring-rs
Most of the infrastructure for the command line has already been built, but the part that calls croaring-rs (roaring bitmap capability) was never glued in. Maybe you could take a crack at it there.
https://gitlab.com/geekweek-vi/4.3-geekweek/rs-cccslib/-/blob/master/src/CCCSBgIndexConf/mod.rs#L462
https://gitlab.com/geekweek-vi/4.3-geekweek/rs-cccslib/-/blob/master/src/CCCSBgIndexConf/mod.rs#L263
https://gitlab.com/geekweek-vi/4.3-geekweek/rs-bgindex/-/blob/master/src/main.rs#L297

@omac777 Please keep implementation details in #1518, as requested above.

On Mon, Mar 16, 2020 at 06:56:55AM -0700, Andrew Gallant wrote:

@joshtriplett

For people who index entire directory trees at once, rather than individual files, I'm hoping that indexes can get updated by atomic rename, which would prevent the majority of index corruption.

Ah yeah interesting. I hadn't really thought of a "blow up the world and re-index" mode, but it probably makes sense to have something like that. I do hope that incremental indexing will negate the need for such things in most cases, but sure, "re-index the world" would indeed likely take the atomic rename strategy.

I wasn't necessarily thinking of re-indexing everything, so much as
indexing enough that it makes sense to write a new file containing the
new index "segment"; that file could then be written atomically.

Also, I'm hoping that the index logic is provided as a separate crate

Yes, it will be contained in the grep-index crate.

Excellent.

that doesn't inherently depend on integration with path searching, so that it would be possible (for instance) to index the history of a git repository.

This may be trickier to avoid. I believe the implementation path in my head does indeed focus around having a path for each thing you want to search. I'll consider the git history use case though and see if something like it can be supported. Maybe the way to think about it is not "path searching" but rather "searching documents where each document has some unique ID that can be resolved to its content." Making that pluggable at the crate level certainly seems doable.

A unique ID seems fine, sure.

Some high-level feedback:

I am really delighted with your decision to not support relevance ordering and index synchronisation. I feel this is absolutely the right call as it means that ripgrep can very easily be dropped into a larger system. I'm generally satisfied with all the details mentioned. That said, a query regarding this:

ripgrep will associate some meta data with each file. This will include the time at which the file was indexed, but might also include other metadata (like a file type, other timestamps, etc.) that might be useful for post-hoc sorting/filtering.

Do you think this metadata would/should be extensible to user-defined metadata? E.g. a user might apply work, oss, gpl, etc tags to various files and want to narrow query results based on this. Should that be within the scope of ripgrep? I could go either way on this.

@aidansteele Thanks for the feedback. I suspect that the metadata will probably permit the library user to insert an arbitrary blob of bytes that they can use, yes. I doubt that will be exposed at the CLI level though. (Because ripgrep's CLI just really isn't expressive enough for something like that.)

That sounds ideal. And I think that's a reasonable judgment on the CLI. Thanks for your great work.

First of all, I really appreciate the thought and care you've put into proposing this feature! I think your decision to make a daemon that updates the index out of scope is very reasonable.

One suggestion I have is that I think it might be better to make your --x-crud into a separate command. rg is a tool with a single purpose: searching. Making it able to use a precomputed index is a natural extension, but the index update commands are going to feel shoehorned in no matter how you expose them and the rg command is not really amenable to adding subcommands.

If you get a basic implementation of index creation/updating I think it would be straightforward to prototype a continuous index update process using watchman.

@luser Thanks for the feedback. I'll noodle on it, but as of now, I'm still inclined to keep one command. It does feel a little shoe-horned, but rg has been shoe-horning subcommands-as-flags since day 1. See --files and --type-list for example. But yeah, if it feels too crazy once it actually gets implemented, then a separate command is definitely a doable option. It would itself be a lot of work though.

Another way to slice that would be "rg runs a different command when its filename is rg-index" or something like that, so you could hardlink rg to rg-index. I don't think that's something you can get out of cargo install currently, though.

Can it work like gtags? gtags only provides commands to generate and update index, but doesn't maintain the index files by itself. Editor plugins can auto update the index since they know when user change some files.

By the way, if avoiding re-walking the directory tree can yield significant gains, can rg learn from git tree blobs to speed up searching in a git repository?

Can it work like gtags? gtags only provides commands to generate and update index, but doesn't maintain the index files by itself. Editor plugins can auto update the index since they know when user change some files.

You mean ctags? I'm not quite sure what you mean when you say "it generates and updates the index" but also "doesn't maintain the index." These seem like contradictory statements. If all you mean is, "can ripgrep just provide commands to create, update and search the index" while letting external processes run those commands, then yes, that is exactly what's being proposed as an initial version of this feature. Please see the "ripgrep will not synchronize your index for you" section in the first comment of this issue.

By the way, if avoiding re-walking the directory tree can yield significant gains, can rg learn from git tree blobs to speed up searching in a git repository?

This question isn't really on-topic for this thread, but the answer is "not really." ripgrep may search files outside of git's index. For example, an .ignore file may whitelist files that are in .gitignore, and this is an explicitly documented and useful feature. If you want to discuss this further or ask other questions, then please use Discussions.

I really like this proposal.

It sounds like the underlying assumption here is that reading the index into memory for each search will provide acceptable performance. This is a trade-off with a daemon approach, which could keep the index in memory but has the complication of managing a daemon.

I don't have good intuition about the performance trade-off, but if it is substantial I would want to make sure that the daemon experience is good. Maybe the list of places to find the index could be extended to auto-discover such a (maybe third-party) daemon in some way, with a simple protocol? Alternatively, maybe storing the index on tmpfs provides enough of a speed-up?

For reference, I use ripgrep (often through vscode, sometimes from the CLI) for searching code bases that I audit. For very large projects this is usually too slow, so I use an index like OpenGrok or VsChromium. It would be great if I could use ripgrep for everything!

@nzig The index would be designed to be read via memory maps. So the entire index doesn't need to be read off disk immediately. This is a substantial simplification to the implementation, as it pushes the onus of managing what's on disk and what isn't to the OS.

If you want to get more into the weeds on implementation details, I'd recommend #1518 for that. I'm trying to keep this issue more focused on the user experience, which I think is a discussion that more people can participate in.

@BurntSushi Thanks for the reply!

If I'm understanding correctly, you're saying that you think reading the index via memory-mapped file and relying on OS caching will be fast enough so that a daemon won't be needed?

I think this is at the intersection of UX and implementation. If the performance can't be good enough without a daemon, some users will want a daemon and will want the UX for that to be good as well. If the performance is acceptable, then everyone will have a good time with this proposal 😀.

Yes, I am saying that. There is precedent for this. It's how Lucene works, for example.

And yeah, I see how this is at the intersection of the UX and implementation. More or less, if you want to cross deeper into implementation details, then the other issue is better for it. :)
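
To make the candidate-file model under discussion concrete for readers unfamiliar with trigram indexes: the index maps each ngram to a posting list of files containing it, the query's ngrams are intersected to get candidate files, and the real search then runs only over those candidates. Below is a minimal, hypothetical in-memory sketch in Rust (std-only; all names are invented for illustration — a real index like the one proposed here stores sorted posting lists on disk and reads them via memory maps):

```rust
use std::collections::{HashMap, HashSet};

/// All 3-byte windows of `text` that fall on UTF-8 boundaries.
fn trigrams(text: &str) -> HashSet<&str> {
    (0..text.len().saturating_sub(2))
        .filter_map(|i| text.get(i..i + 3))
        .collect()
}

/// Build an inverted index: trigram -> set of file names containing it.
fn build_index<'a>(files: &[(&'a str, &'a str)]) -> HashMap<&'a str, HashSet<&'a str>> {
    let mut index: HashMap<&str, HashSet<&str>> = HashMap::new();
    for &(name, contents) in files {
        for t in trigrams(contents) {
            index.entry(t).or_default().insert(name);
        }
    }
    index
}

/// Intersect the posting lists of every trigram in the query. The result
/// contains only *candidate* files; the actual search still runs over them,
/// so false positives are acceptable but misses are not. A query with fewer
/// than 3 bytes yields no trigrams here; a real system would fall back to
/// searching everything in that case.
fn candidates<'a>(index: &HashMap<&str, HashSet<&'a str>>, query: &str) -> HashSet<&'a str> {
    let mut sets = trigrams(query)
        .into_iter()
        .map(|t| index.get(t).cloned().unwrap_or_default());
    let first = match sets.next() {
        Some(s) => s,
        None => return HashSet::new(),
    };
    sets.fold(first, |acc, s| &acc & &s)
}

fn main() {
    let files = [("a.rs", "fn main() {}"), ("b.rs", "struct Foo;")];
    let index = build_index(&files);
    // Only a.rs contains every trigram of "main" ("mai", "ain").
    println!("{:?}", candidates(&index, "main"));
}
```

The key property the thread relies on is that the intersection step only narrows the file set; correctness always comes from the exhaustive search over the candidates.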

Hey, I just noticed this thread and thought I'd chime in. I'm a heavy user of rg, and I've used/built many search tools for my various day jobs. Most recently I used codesearch quite extensively for a research project totaling 30 GB across 1M files - https://freshchalk.com/blog/150k-small-business-website-teardown-2019. Thank you so much for creating and maintaining ripgrep! It's making our lives better and I appreciate your dedication.

As to your original design questions:

  • no rankings and no sync - I think it's fine to have exhaustive search with no rankings, and sync can happen as a separate process. Those constraints don't bother me at all while working with codesearch. It would be nice if search results were deterministic (alphabetical).

  • no file contents - Is this mainly a tradeoff between disk space and speed? I definitely appreciated the speed of codesearch, even for queries that had many matches. On the other hand, the index was a pig and I was disinclined to replicate it across machines. Note that excluding file contents will complicate certain use cases. I've built search systems that distributed the index to mobile apps, for example...

  • not durable/segments/corruption - Ripgrep is already fast enough to be used without indexing for many projects, especially when used repeatedly with disk caching. It might be helpful to consider the data set size at which indexing becomes important. I appreciated the fact that codesearch kept my index in a single file, which could be (atomically?) regenerated from scratch. Incremental indexing was not important and I set up a cron job to rebuild the index nightly. For which use case is this simple model insufficient?

  • --x-crud - Agree with others, I would gently suggest alternate naming. Perhaps --x-build, or even --X-build if you want to be pedantic (like me) :)
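
On the "single file, regenerated from scratch" model mentioned above: the standard way to make such a rebuild atomic is to write the new index to a temporary file in the same directory and then rename it over the old one; on POSIX systems, rename within one filesystem atomically replaces the target, so readers see either the old index or the new one, never a half-written file. A minimal sketch in Rust (hypothetical file names, not ripgrep code):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Atomically replace `dir`/index.db with a freshly built index.
/// Hypothetical names: "index.tmp" and "index.db" are illustrative only.
fn rebuild_index(dir: &Path, index_bytes: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join("index.tmp");
    let dst = dir.join("index.db");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(index_bytes)?;
    // Flush to disk *before* the rename makes the new file visible.
    f.sync_all()?;
    // Atomic on POSIX when tmp and dst are on the same filesystem.
    fs::rename(&tmp, &dst)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("rg-index-demo");
    fs::create_dir_all(&dir)?;
    rebuild_index(&dir, b"version 1 of the index")?;
    rebuild_index(&dir, b"version 2 of the index")?;
    // A concurrent reader would have seen a complete v1 or a complete v2.
    println!("{}", String::from_utf8_lossy(&fs::read(dir.join("index.db"))?));
    Ok(())
}
```

This is one reason the single-file, rebuild-from-scratch model composes well with an external cron job: the cron job never has to coordinate with searches in flight.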

This sentence makes me nervous: "If an index could not be found, then ripgrep behaves as if -X was not given." For serious use cases, I would prefer something that fails fast. I believe that indexed searches will be for "serious use cases" since ripgrep is so amazingly fast. If that's unpalatable, a command like --X-explain or --X-verbose would be helpful in tracking down issues related to missing (or misconfigured) indexes.

I anticipate many broken cases where people try to use un-indexed directories with indexed queries, since ripgrep is so wonderfully flexible with paths. This is likely to be more prevalent with a complex index location procedure... Maybe -X should always be -X [indexpath]. It might be useful to constrain the problem by only allowing a single target directory, or even using a separate command like rg-x to enforce that constraint and differentiate from good old rg. I'd recommend giving this UX issue serious thought if you aren't too attached to your initial proposal.

Happy to brainstorm more if that would be helpful. Very excited to start indexing!
