(Link to original RFC: https://github.com/rust-lang/rfcs/pull/528)
This is a tracking issue for the unstable pattern
feature in the standard library. We have many APIs which support the ability to search with any number of patterns generically within a string (e.g. substrings, characters, closures, etc), but implementing your own pattern (e.g. a regex) is not stable. It would be nice if these implementations could indeed be stable!
Some open questions are:
cc @Kimundi
String Patterns RFC tracking issue: #22477
Stabilization not only impacts implementing the pattern traits, but also of course detailed use of the Searcher trait, .next_match()
, .next_reject()
and so on.
I'm having trouble seeing the purpose of the next()
method; in all the examples I look at, it's faster and cleaner to just implement next_match()
and next_reject()
individually. For example, these two functions can be implemented for CharEqSearcher as one-liners using Iterator::find
. Moreover, if you want an optimized implementation of Searcher
for an ASCII char
then in implementing next()
you need to choose between an implementation that returns rejections quickly and an implementation that skips quickly over the input (e.g. using SIMD) in order to quickly find a match.
@jneem the StrSearcher uses the same code for next()
and next_match()
, but specialized for each case, so that the next_match()
case is much faster. It works well.
@bluss I think that helps make my point: AFAICT, all the users of Searcher only call next_match()
and next_reject()
. Therefore, the only purpose of next()
AFAICT is to make implementing Searcher easier -- you only need to implement one function instead of two. But that benefit isn't born out in practice. StrSearcher implements two methods anyway, and it would be simpler to implement next_reject()
instead of next()
. CharEqSearcher implements next()
only, but it would be simpler and cleaner to optimize if it implemented next_match()
and next_reject()
instead.
Oops, I missed one: Pattern uses Searcher::next()
for the default implementation of is_prefix_of
.
This is useful, I'd like to use it on stable.
cc @BurntSushi, Regex is the major pattern user and it would be great for everyone if it was stable because of that.
@BurntSushi Are the Pattern traits sufficient for Regex? Any design issues?
I wrote the impl for Regex
a while back, which does indeed only implement next
: https://github.com/rust-lang-nursery/regex/blob/master/src/re.rs#L1104 I seem to recall that it was tricky to get the corner cases right, but that isn't too surprising. I otherwise found it pretty straight forward.
One thing that I don't think is representable with the Pattern
API is fast retrieval of all matches from a suffix table. In particular, matches are reported as part of lexicographically sorted suffixes rather than order of occurrence in the string, so it can't satisfy the contract of Searcher
without an additional cost. (I've already talked to @Kimundi about this, and I don't think it's a major issue. Suffix tables are a pretty niche data structure.)
Nominating for discussion. Not sure whether this is ready for stabilization, but I'd like for the team to dig into it a bit.
Unfortunately the libs team didn't get a chance to talk about this in terms of stabilization for 1.8, but I'm going to leave the nominated tag as I think we should chat about this regardless.
Today I investigated a bit what would need to be done to generalize the Pattern API to arbitrary slice types. As part of that I also took a more pragmatic route to the involved types and interfaces based on these assumptions:
A very rough sketch of the result can be seen below. Note that this is only the Pattern traits itself, without actual integration in the std lib or comprehensive re implementation of the existing types.
https://github.com/Kimundi/pattern_api_sketch/blob/master/src/v5.rs
The core changes are:
Pattern
trait now has an input parameter for the slice type. This turns Pattern<'a>
into Pattern<&'a str>
and allows for Pattern<&'a mut [T]>
.next()
got removed, leaving only next_{match/reject}()
. This has been done because none of the existing types could make use of the return value of next()
, and none of the existing implementations benefited from needing to implement it.Pattern::is_{prefix/suffix}_of()
methods are no longer default-implemented. But seeing how the manual implementation for a given pattern-slice combination is usually simple and straight forward, this does not appear to be a problem.match_indices
between slice types,next_*()
are now associated types of the slice type.Changes in regard to the open questions:
SearchPtrs
trait has gained new ad-hoc named elements which would require additional name audition. Also, Pattern
continues to be a confusing name in a language with pattern matching...next()
methods,split()
be replaced by shared, generic ones there might be some regressions due to optimizations possibly not carrying over. This is optional though.Is the focus of Pattern for slices going to be specific for &[u8]
only? I think that's ok, I'm unsure how to really extend "substring search" further to generic &[T]
. Technically, the two-way algorithm that we use for str::find(&str)
can be made to work on any _ordered alphabet_, i.e. T: Ord
is enough, but I don't know how performant or realistic this is.
@bluss: I only used &[u8]
as a quick proof of concept that the traits work with different slice types. I'm assuming that there is some sensible way to make it work with generic slices.
The &[u8]
case seems like a good candidate for a specialization impl though.
I'm currently trying out some runtime scanners for scan-rules
. Basically: I want users to be able to parse text based on Pattern
s (_e.g._ accept anything that matches this pattern, consume input until this pattern is matched, _etc._). The current example is being able to do something like (where until
accepts a Pattern
):
let_scan!("before:after", (let word_0 <| until(":"), ":", let word_1: Everything));
assert_eq!(word_0, "before");
assert_eq!(word_1, "after");
The problem I'm having is that it's _really_ painful to actually do this. The interface as it exists seems to assume that ownership of the pattern can be passed to the method which will use it, and as a result, can _only_ be used _exactly once_. This doesn't make much sense to me. Searching for a pattern should not (in general) _consume_ the pattern.
What I want is P: Pattern
⇒ for<P: Pattern> &P: Pattern
. Currently, it's only true for &str
. If I had this, I could take ownership of the pattern from the caller, store it, and loan it to the underlying find
calls. If the caller wants to use the pattern in multiple places, and it's expensive to construct, they can instead pass my function a borrow, which will also work.
The more I think about this, the more it comes down to the FnMut
implementation. The only closure kind that would allow for the kind of interface I want is Fn
... but that excludes closures that test based on accumulated state, though I wonder how often that's even desireable.
I can work around this by requiring Copy
patterns, but again, callables are _the_ most useful kind of pattern (until(char::is_whitespace)
, _etc._), so it seems a deep shame to exclude them.
Oh, another thing I just realised: there doesn't appear to be any way to find out how _long_ a match is given a pattern and, say, str::starts_with
.
You can use .next_reject for that.
Hey, I don't know a lot about this API, but I noticed that StrSearcher
is not a double ended pattern, but CharSliceSearcher
is. Is this an error? The explanation of why StrSearcher
is not double ended seems to apply equally well to CharSliceSearcher
:
(&str)::Searcher is not a DoubleEndedSearcher because the pattern "aa" in the haystack "aaa" matches as either "[aa]a" or "a[aa]", depending from which side it is searched.
@withoutboats A slice of chars represents a set of possibilities, so it's not like a string; either of the chars can be matched by themselves.
@bluss That makes sense. The semantics of CharSliceSearcher
are not documented; I assumed the char slice was treated as an ordered sequence rather than a set.
Can we say that, after returning Done
once, a Searcher
must continue to return Done
forever more? That is, can I make this assumption when implementing FusedIterator
?
I'm a big fan of extending this to be more generic than just str, needed/wanted it for [u8] quite frequently. Unfortunately, there's an API inconsistency already - str::split takes a Pattern, whereas slice::split takes a predicate function that only looks at a single T in isolation.
@shahn That shouldn't be a big problem - if Pattern
is implemented for such function, then slice::split
could support Pattern
without breaking backwards compatibility.
I strongly agree with widening the API to slice and maybe Vec.
What's the status here? The original RFC is quite old -- from 2014, before 1.0 -- do we need to completely revisit the design? What are the current blockers?
With SIMD stabilizing in 1.27, Jetscii will be available on stable, but its Pattern
support won't — I never expected that SIMD would arrive before this! 😇
Would be nice to see these for OsStr, CStr, [u8] and similar, as mentioned a year ago. Non-UTF in Rust feels very much a second class citizen and this would be a start.
@jeekobu https://github.com/rust-lang/rust/issues/49802 is accepted but not yet implemented.
I think, that Searcher
need skip
method(s) that can be used to advance search positions without performing search. For example, proof-of-concept of inplace unescape requires this method to work correctly.
I'm looking at this and there doesn't seem to be any way to efficiently perform a search for a single needle in many haystacks? The only way to make a Searcher
currently is by starting with a str
for both the needle and the haystack. This leads to an expensive call to ToWaySearcher::new
which performs computations that only seem to depend on the needle
.
(see how expensive these calls are here: https://users.rust-lang.org/t/why-my-rust-code-is-2-times-slower-than-my-js-code/31189/8)
But it doesn't look like this is impossible with the current API. It looks to me like a new public type StrPattern
could be added which implements Pattern
and contains these precomputed results?
Aho-Corasick (or similar) is probably the right answer there, but the bstr crate does provide a way to do what you want (including not needing to do utf8 validation, if that helps): https://docs.rs/bstr/0.2.7/bstr/struct.Finder.html
I do think it is a somewhat niche concern though. Consider that libc APIs like memmem don't permit this either.
Most helpful comment
I'm a big fan of extending this to be more generic than just str, needed/wanted it for [u8] quite frequently. Unfortunately, there's an API inconsistency already - str::split takes a Pattern, whereas slice::split takes a predicate function that only looks at a single T in isolation.