Kibana: Add Regex Support to KQL

Created on 27 Sep 2019 · 21 Comments · Source: elastic/kibana

I would love to have the ability to search for regex pattern in KQL.

KQL AppServices enhancement

Skoetting

👍18

Most helpful comment

@rayafratkina the regex query should work on every field that supports it in KQL. A user would do well to use a wildcard field, though, if they know they will need regexp queries a lot. I'm not sure if we can do anything reasonable to advertise that from KQL, since at that point indexing is already done and it's kind of "too late".

timroes on 22 Jul 2020

👍2

All 21 comments

Pinging @elastic/kibana-app

elasticmachine on 27 Sep 2019

I am trying to perform a Kibana KQL search on a text field for any value that doesn't end in $

For instance, when parsing Windows Event Logs for successful/unsuccessful logins, I am trying to not show computer accounts (which end with $).

I have looked at several other questions around this same concept (regex search where a string field ends with $), but that solution isn't working for me as it uses Lucene, not KQL.

I know that KQL supports wildcards so I was assuming it was going to be a query along the lines of:
not accountName: *$

Full regex support would be helpful in finding these documents.

joshuasmith0 on 25 Oct 2019
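
The filter described in the comment above can be illustrated outside Kibana. The following is a minimal Python sketch, not KQL or Lucene: the sample account names are hypothetical, and Python's `re` module stands in for a real regex engine. Note that `$` is an anchor in Python and has to be escaped to mean a literal dollar sign, and `fullmatch` is used so the pattern must cover the whole value.

```python
import re

# Hypothetical sample values; only the field name accountName comes from the thread.
ACCOUNT_NAMES = ["alice", "WORKSTATION01$", "bob", "DC01$"]

# Matches a whole value whose last character is a literal dollar sign.
COMPUTER_ACCOUNT = re.compile(r".*\$")


def exclude_computer_accounts(names):
    """Return the names that do not end with a literal '$'."""
    return [name for name in names if not COMPUTER_ACCOUNT.fullmatch(name)]


if __name__ == "__main__":
    print(exclude_computer_accounts(ACCOUNT_NAMES))
```

This only shows the intended matching behaviour; how the equivalent clause would be spelled in KQL is exactly what the rest of the thread discusses.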

Pinging @elastic/kibana-app-arch (Team:AppArch)

elasticmachine on 20 Feb 2020

To provide additional background, @randomuserid was just explaining to me that lack of regexp support in KQL means that they need to fall back to the Lucene search syntax whenever they need regexps, here is an example for instance: https://github.com/elastic/detection-rules/blob/main/rules/linux/privilege_escalation_setgid_bit_set_via_chmod.toml. So there's no urgency to support regexps since we can fall back to Lucene, but it would be better if KQL supported regexps so that we would no longer need to fall back to Lucene in such cases.

jpountz on 22 Jul 2020

👍2

Wondering if we should use the new wildcard field in regexps? In that case https://github.com/elastic/kibana/issues/60933 is related

rayafratkina on 22 Jul 2020

timroes on 22 Jul 2020

👍2

@jpountz do wildcard and regexp perform equivalently when possible? For example, does the wildcard foo*bar*baz perform identically to /foo.*?bar.?baz/?

I think the wildcard functionality of KQL is a little flaky from what I've seen and those currently convert to query_string. Wondering if regex would be preferred if you're avoiding issues with double escaping, etc. that come from wildcards

rw-access on 22 Jul 2020

@rw-access sorry I'm not sure I get the question. Did you mean /foo.*bar.*baz/ as a regexp? (question marks shouldn't be required as * means "0 or more times", not "1 or more times"). But otherwise yeah, wildcard and regexp queries execute exactly the same way internally: Lucene first converts the expression to an automaton, and then runs the query using this automaton. If a wildcard expression and a regexp translate into the same automaton then they'll match exactly the same documents.

jpountz on 23 Jul 2020

👍1
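
The equivalence described above can be sketched in a few lines. This is an illustration only: Python's `re` engine is used as a stand-in for Lucene's automaton, and `wildcard_to_regex`, `same_matches` and `SAMPLES` are hypothetical names. Because `fullmatch` requires the whole string to match, the greedy and non-greedy forms accept the same strings, which is why the extra question marks make no difference here.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Translate * and ? wildcards into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # zero or more characters
        elif ch == "?":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def same_matches(wildcard: str, regex: str, samples) -> bool:
    """Check that a wildcard and a regex accept the same sample strings."""
    translated = re.compile(wildcard_to_regex(wildcard), re.DOTALL)
    handwritten = re.compile(regex, re.DOTALL)
    return all(
        bool(translated.fullmatch(s)) == bool(handwritten.fullmatch(s))
        for s in samples
    )


# Hypothetical sample terms used only for the comparison.
SAMPLES = ["foobarbaz", "foo-bar-baz", "foobar", "xfoobarbaz", "foobazbar"]

if __name__ == "__main__":
    print(wildcard_to_regex("foo*bar*baz"))
    print(same_matches("foo*bar*baz", r"foo.*bar.*baz", SAMPLES))
    print(same_matches("foo*bar*baz", r"foo.*?bar.*?baz", SAMPLES))
```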

I wonder if we should first have a discussion here about what we want the syntax of regex queries in KQL to look like. For backwards-compatibility reasons we cannot use the Lucene way, field : /someregex/, since that would already have been a valid query beforehand and we can't change the meaning of existing queries. Thus I'd suggest we use a custom operator between the field name and the value. So far we have :, : *, >=, >, <=, < as operators.

Given that we only treat the following characters as special characters that need to be escaped in a value: \():<>"*{} we could only use one of those after the : if we want to create an operator of the form :<some character>. All of them already have a meaning, so we can't use them. That means we need either a combination of the form <some character>: or a completely separate operator, e.g. field ~ regex or field ~: regex. I am not sure whether anyone currently has a preference for how the regex operator should look. I don't have a strong preference, but I think the tilde ~ might be a good choice, since it's commonly used to mean "aroundish", which is what regexes are often used for. Please share your thoughts on what would make a good regex operator.

cc @ppisljar @lukasolson

timroes on 23 Jul 2020

👍1
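
To make the constraint above concrete, here is a minimal Python sketch that escapes only the special characters listed in the comment. The helper name `escape_kql_value` is hypothetical and this is not Kibana's implementation; it ignores anything else a real KQL escaper might handle.

```python
# The special characters listed in the comment above: \ ( ) : < > " * { }
KQL_SPECIAL_CHARS = set('\\():<>"*{}')


def escape_kql_value(value: str) -> str:
    """Prefix each listed special character with a backslash."""
    return "".join(
        "\\" + ch if ch in KQL_SPECIAL_CHARS else ch
        for ch in value
    )


if __name__ == "__main__":
    print(escape_kql_value("a:b"))
    print(escape_kql_value("foo*"))
    print(escape_kql_value("~/someregex/"))
```

Characters such as `~` and `/` pass through unchanged because they are not in the list, which is why a value like /someregex/ is already a valid literal today.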

@jpountz ah, I was using .*? to indicate non-greedy. That's generally how I convert wildcard to regex for EQL, but it looks like that's an optimization for PCRE, not Lucene regex.

@timroes wouldn't that syntax be subject to the same problem? Since ~ isn't a reserved character, it would currently be interpreted as part of the field name.

We've had to make many similar decisions for EQL. One of the guiding principles for changes was that we won't reinterpret syntax that's already valid with new semantics, unless it was truly a bug. For breaking changes or limiting the syntax, we decided that we should still accept the syntax in the grammar, so that we can recognize it and raise an error message. That seems to be a good path forward for us.

I think that means ~: might be out. But we could do something like field : match(/foo.*bar.*baz.*/). It could open the door to more functions, and we could do field : wildcard("foo*bar*baz"). It would still be valid within a list of values joined by or or and.

Or we introduce a new predicate instead of : <list of values>. field LIKE "..." or field RLIKE /foo.*bar.*baz/. We wouldn't need a special character since it's not currently valid syntax.

Thoughts? It's not great, but our options are limited. And I think the feature is desired enough — both by internal Elastic teams and our users — that we might have to pick a syntax that's less than ideal.

rw-access on 23 Jul 2020
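
As an illustration of the predicate idea above, the following Python sketch parses a single clause of the proposed form field RLIKE /regex/ and builds a dictionary shaped like an Elasticsearch regexp query. The RLIKE syntax is only a proposal from this thread, and `RLIKE_CLAUSE` and `rlike_to_query` are hypothetical names; a real implementation would live in the KQL grammar, not in a standalone regular expression.

```python
import re

# Hypothetical clause syntax from the thread: <field> RLIKE /<regex>/
RLIKE_CLAUSE = re.compile(
    r"^\s*(?P<field>[^\s:]+)\s+RLIKE\s+/(?P<pattern>.*)/\s*$"
)


def rlike_to_query(clause: str) -> dict:
    """Convert one 'field RLIKE /regex/' clause into a regexp query dict."""
    match = RLIKE_CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"not an RLIKE clause: {clause!r}")
    return {"regexp": {match["field"]: {"value": match["pattern"]}}}


if __name__ == "__main__":
    print(rlike_to_query("process.name RLIKE /foo.*bar.*baz/"))
```

Because RLIKE requires whitespace on both sides and is not currently valid KQL, recognizing it does not change the meaning of any existing query, which is the property the comment is after.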

There are additional concerns about how to expose the important regex option of case insensitivity. This is done in other engines using /..../i syntax (the i meaning insensitive).

Symptoms of a broader issue - KQL is becoming a bottleneck to putting functionality in users' hands.

As long as KQL is the top-level means for users to assemble clauses with Boolean logic we will have issues:

we run out of special syntax characters
there's no encapsulation - all details are laid bare in string form
illegal syntax is easy to introduce
users have to escape everything
there's no helpful checkboxes etc for setting options

With the Sculptor object model as a top-level organiser for Boolean logic:

complex clauses like regex can have their own dedicated GUI editor with help text and arbitrary options - KQL parser changes are no longer a bottleneck
the things we clicked (aka "Filter pills") can be ORed with the things users typed (KQL). There's no good reason why things-you-click should be assumed to always be ANDed and never OR-able.
KQL can still exist as a data-entry form but also be converted to more editable objects (muting, NOTing, expand/collapse, setting advanced options like boosts etc).

markharwood on 23 Jul 2020

@rw-access As far as I understand the grammar at the moment, the field name cannot contain (unquoted) spaces, so we know that the operator is not part of the field name. Lukas is probably the better person to talk about that. I know we also experimented for some time with making everything a kind of function in KQL, which would be more in line with your match/wildcard suggestion. Here too I need to defer to @lukasolson for background information.

Regarding flags, even with a custom operator we could still put the regex in /../ and allow flags that way to support it: field ~ /foo/i.

timroes on 23 Jul 2020
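
A small Python sketch of the /../ plus flags idea mentioned above: it splits a literal such as /foo/i into its pattern and a case-insensitivity flag. The function name `parse_regex_literal` is hypothetical, and only the i flag is accepted here as an assumption; how the result would be passed on to Elasticsearch is left open.

```python
import re

# A slash-delimited body (escaped slashes allowed) followed by optional flag letters.
REGEX_LITERAL = re.compile(r"/(?P<body>(?:\\.|[^/\\])*)/(?P<flags>[a-z]*)")


def parse_regex_literal(text: str):
    """Split '/pattern/flags' into (pattern, case_insensitive)."""
    match = REGEX_LITERAL.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a regex literal: {text!r}")
    unknown = set(match["flags"]) - {"i"}
    if unknown:
        # Reject unsupported flags loudly instead of silently ignoring them.
        raise ValueError(f"unsupported flags: {sorted(unknown)}")
    return match["body"], "i" in match["flags"]


if __name__ == "__main__":
    print(parse_regex_literal("/foo/i"))
    print(parse_regex_literal("/foo.*bar/"))
```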

True, but the downside is that ~: becomes sensitive to whitespace. field ~: value is parsed differently from field~: value or field~:value. Generally, it's good to be consistent in whitespace handling across the grammar, so I'd be worried about this edge case causing confusion. Same applies to field~/foo/i. I believe that's currently valid. We use the more compact syntax in a lot of rules right now. For example:

query = '''
event.category:(network or network_traffic) and network.transport:tcp and destination.port:8000 and
  source.ip:(10.0.0.0/8 or 172.16.0.0/12 or 192.168.0.0/16) and
  not destination.ip:(10.0.0.0/8 or 127.0.0.0/8 or 172.16.0.0/12 or 192.168.0.0/16 or "::1")
'''

Agreed for flags. I was thinking about adding /i as well. I didn't see it as a flag on the 7.9 regexp page or the regular expression syntax page, so I didn't know if it was complete yet or not. We could have shorthand for the other flags as well (COMPLEMENT, INTERVAL, INTERSECTION, ANYSTRING), as long as we pick good letters that don't also start with i. Or we could leave that open for the future and use the default flags, with only the case-sensitivity toggle.

rw-access on 23 Jul 2020

I didn't see it as a flag on the 7.9 regexp page or the regular expression syntax page, so I didn't know if it was complete yet or not.

We have a PR flip-flopping on what to do - whether to make API concessions that make KQL easier with extended pattern syntax or stick to more formal APIs with named JSON flags.

markharwood on 23 Jul 2020

👀1 👍1

I have a preference for formal JSON APIs in elasticsearch with dedicated editors as counterparts in the GUI to simplify.

We could create a formal query JSON syntax for automatons (char_sequence, ORs, nots, repeats etc).
It would be validatable and we wouldn't have the silent failures we experience currently when someone uses a bit of what they think is valid regex syntax (/i \w or whatever) and it isn't supported but interpreted instead as a search for those literal characters.
A more formal JSON syntax, like the one we offer for span/interval queries, won't get an outing in Kibana though, because the assembly tool for all criteria is KQL. There is no graphical assembly of query objects that can be edited with dedicated editors. Only strings and a brittle, cryptic syntax that struggles to separate content from controls.

markharwood on 23 Jul 2020

I think that's a fine point to show how you don't think KQL fits your needs or doesn't solve its problem well. But that discussion might be a little easier to have in a separate issue that's better scoped, and we keep the scope of this issue constrained to adding regex support to KQL. I don't mean that at all to shut you down, but just that we keep those discussions separate, since it's already a little hard to keep track of the two.

rw-access on 23 Jul 2020

++ @rw-access There are already issues to discuss this. Please continue discussion in https://github.com/elastic/kibana/issues/8112 (Graphical query builder) or https://github.com/elastic/kibana/issues/14272 (more control over how filters are added to the filter bar), which are the more appropriate places for discussion around the overall concept of the filter bar. Discussion in this thread should be about regex support in KQL so everyone can keep better track of it.

timroes on 23 Jul 2020

++ Happy to keep discussion elsewhere - just wanted to flag that regex construction is complex and adding this might mark a tipping point in how much complexity we try to shoehorn into KQL. We hit this wall 15 years ago in Lucene's query syntax.

We have a proposal for an elasticsearch API that you might want to incorporate as an aid to regex authors. It could help validate that the expressions people write are actually understood correctly by Lucene's parser.

markharwood on 23 Jul 2020

👍1

I like the suggestion of using a function, field : match(/foo.*bar.*baz.*/i). It seems a bit less error prone and also leaves more doors open for the future.

ppisljar on 27 Jul 2020

I would prefer to avoid any functional syntax (like match()) since there isn't any other functional syntax in KQL at the moment.

I would definitely prefer to go with something regex users are already used to (like foo: /bar.*/i) but, like @timroes mentioned, this would break backwards compatibility. But I don't think it'd be too hard to add a migration step that escapes leading forward slashes in the "match" clause.

Adding a completely new operator for regex when we already use : for exact match as well as wildcard matches doesn't seem intuitive to me, but it would also make adding autocomplete for regex queries a bit easier.
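
The migration step mentioned above could look roughly like the following Python sketch. It is an assumption-laden illustration, not Kibana code: `escape_leading_slashes` is a hypothetical helper, it treats a backslash before a slash as a valid escape, it leaves quoted phrases untouched, and it does not handle values inside parenthesised groups.

```python
import re

# Either a quoted phrase (left untouched) or an unescaped colon operator,
# optional whitespace, and a value that starts with a forward slash.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|(?<!\\)(:\s*)/')


def escape_leading_slashes(kql: str) -> str:
    """Escape a leading '/' in unquoted values so old queries stay literal."""

    def replace(match):
        if match.group(1) is None:
            return match.group(0)  # quoted phrase: keep as-is
        return match.group(1) + "\\/"

    return TOKEN.sub(replace, kql)


if __name__ == "__main__":
    print(escape_leading_slashes("path:/usr/bin and source.ip:10.0.0.0/8"))
```

Only the slash directly after the operator is escaped, so CIDR values such as 10.0.0.0/8 are unaffected.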