Caddy: v2: proposal: Matcher to compare primitive values

Created on 17 Feb 2020  路  45Comments  路  Source: caddyserver/caddy

Currently, if we want to match based on data stored in the ReplacerCtx (e.g. http.error.status_code, one first has to store it in the the VarsCtx with a vars middleware. This leads to slightly over-complicated configurations, e.g. like this.

I feel like vars and replacers should be unified under a single concept. Is there a specific reason why they aren't currently backed by the same data?

feature request

Most helpful comment

@roblabla @mholt as one of the core maintainers of CEL, I want to say that I think this was a very thoughtful discussion and I'm super pleased with the outcome. If you have any questions, I'd be happy to consult. Cheers!

All 45 comments

Great question!

Vars and replacers are orthogonal concepts, but they use each other.

Vars are named values, set by the HTTP handler chain for use in later handlers or matchers. They can be accessed through the replacer via {http.vars.*} placeholders.

The replacers simply expose various values to the config. These values can come from anywhere, including the HTTP request, system calls (like, for current time or environment variables), or custom values set by handlers like the "vars" handler.

Significantly, "vars" are specifically values that are configured by the user. Some things that the Replacer exposes can't or shouldn't be changed by the user, for example, the current system time or the HTTP client address. "Vars" don't change once set, but the Replacer can retrieve a dynamic value like the current time. Replacers never write values, they only read them.

I feel like vars and replacers should be unified under a single concept. Is there a specific reason why they aren't currently backed by the same data?

Since values accessed by a replacer come from all over the place, unifying them would be quite challenging... especially to make it efficient, without causing lots of allocations, each time Replace() is called.


Anyway, this might be an X-Y problem? I agree it's a little awkward as things currently are. But unifying the Replacer and Vars isn't the right solution IMO.

We just need a new matcher that fundamentally compares two primitive values.

"match": [
    {
        "equal": {
            "{http.error.status_code}": 401
        }
    }
]

I'm not sure yet whether it should be called equal or equals. (What do you think?)

I think equal, notEqual, lessThan, greaterThan, lessThanOrEqual and greaterThanOrEqual would be useful.

I'm thinking of it like "these key-val pairs are _equal_" or "these key-val pairs are _not equal_" when reading it.

So maybe we need a more general comparison or compare matcher where we can specify an operator?

While they could be all separate, remember that multiple different matchers are AND'ed together; so different operators would be AND'ed, necessarily. If we wanted to support OR logic, we'd need to combine them under one matcher that treats them like OR.

Alternatively, could starlark be used here? Starlark should be able to compare stuff, but I don't think it can currently access replacer data.

I think equal, notEqual, lessThan, greaterThan, lessThanOrEqual and greaterThanOrEqual would be useful.

I'm thinking of it like "these key-val pairs are _equal_" or "these key-val pairs are _not equal_" when reading it.

For the Caddyfile, maybe something like:

compare {
    {http.error.status_code} == 401

    # syntax:
    <left-hand> <operator> <right-side>
}

And I assume we'd AND all of them?

Should be a simple switch statement on the operators: ==, !=, >, <, >=, <=.

@roblabla

Alternatively, could starlark be used here? Starlark should be able to compare stuff, but I don't think it can currently access replacer data.

Definitely. But, I pulled out the Starlark stuff recently before the 2.0 release because it deserves more attention than I can give it right now. I want to make sure that feature isn't half-baked when we do add it.

I do think, though, that for something as simple as a string comparison, one shouldn't need to resort to an embedded scripting language.

@francislavoie That would work. Although, usually multiple options/values within a matcher are OR'ed (for example, "host": ["foo1", "foo2"] matches either foo1 or foo2).

What would the equivalent JSON look like? That's the most important thing to figure out, and the first thing that should be implemented.

Something like this?

{
  "compare": [
    {
      "operator": "equal",
      "operands": {
        "{http.error.status_code}": 401
      }
    },
    {
      "operator": "greaterThan",
      "operands": {
        "{http.header.content-length}": 100
      }
    }
  ]
}

A bit verbose but I think it would be fine

Looks mostly good to me. But the naming, greaterThan is inconsistent with our typical names, which should be like greater_than. I wonder if we should prefer the actual operator symbols, like "==" or ">" instead? I haven't decided yet, maybe that's just better for the Caddyfile.

Sure, could just use the symbols. 馃憤

@francislavoie What would the combination logic be? Are different operands (within the same operator) OR'ed? Or AND'ed? And what about across multiple operators? Would the Caddyfile syntax be the same as you proposed above?

@roblabla How is this looking to you so far?

Makes sense to me, and should work for my needs.

I do question the need for an explicit vars matcher though, if we go down that route. It seems like the vars matcher would be equivalent to

{
  "compare": [{
    "operator": "equal",
    "operands": {
      "{http.vars.VAR_NAME}": "VAR_VALUE"
    }
  }]
}

And at that point I don't know what vars would even be used for? Basically, it seems to me like "vars" would just become a way to inject data into the replacer from the JSON config. What am I missing? What's the use-case for vars?

I'd say within the same operator is AND, and if you want to OR you just make a new operator? But that's just for JSON I guess. Could also just not allow more than one key-value pair in operands to simplify. And I don't see a reason the Caddyfile syntax would be different.

For OR-ing matchers, maybe we should have a standalone or matcher, similar to how we have a not matcher? NVM I'm dumb.

https://caddyserver.com/docs/json/apps/http/servers/routes/match/ says this: "The matcher sets which will be used to qualify this route for a request (essentially the "if" statement of this route). Each matcher set is OR'ed, but matchers within a set are AND'ed together." So what @francislavoie says makes sense to me, and matches how every other matcher works already.

Is the ordering important?

If not, how about this structure:

{
    "match": {
        "compare": {
            "equal": {
                "{http.error.status_code}": 401,
                "{foo}": "bar"
            },
            "greater_than": {
                "{http.error.status_code}": 500
            }
        }
    }
}

I think ordering is important, i.e. short circuiting. If a user puts them in the Caddyfile in a certain order, I'd assume them to be evaluated in that order.

Short-circuiting doesn't seem very important since there shouldn't be any side-effect though? Evaluation order is only visible if you can observe side-effects.

Well, performance is a side-effect 馃槢 but yeah, true. Just feels wrong to me to not preserve order there.

@francislavoie

I think ordering is important, i.e. short circuiting. If a user puts them in the Caddyfile in a certain order, I'd assume them to be evaluated in that order.

Short circuiting is irrelevant; we can short-circuit either way when implementing OR logic. It's just a matter of whether the order in which we try comparisons matters.

Updated opinions?

Updated proposal:

{
    "match": {
        "compare": {
            "equal": {
                "{http.error.status_code}": [401],
                "{foo}": ["bar"]
            },
            "greater_than": {
                "{http.error.status_code}": [500]
            }
        }
    }
}

Note that there can be multiple right-hand operands, which are OR'ed. Everything else is AND'ed. Will that do?

So that would be like: (status_code == 401 || status_code == 402) && foo == "bar" && status_code > 500 (assuming [401, 402] for the sake of example) ?

Updated proposal

That seems rather over-complicated: it is not obvious at first glance what's going on, and falls short of being able to express every logical expression (for instance, I think this is not possible to express with your idea: (status_code == 401 || foo == "bar") && foobar == "true").

I'd rather have either something less expressive but obvious, or something maximally expressive, but this proposal stands awkwardly in-between.

It's more consistent with how current matchers work.

Assuming that a future Starlark matcher will be able to express arbitrarily complex logic, how complicated do we really need the JSON structure to be? Are comparisons like that realistically common?

I mean, I think in the presence of a starlark matcher, this feature would probably become obsolete anyways. Starlark would be easier to read, write, and be infinitely more expressive.


Thinking about this a bit more, I think I actually like your proposal though. But first, a silly thought:

Looking at the entire matcher spec, it's already possible to express the following kind of logic with it (Letters represent a match group, number represent a matcher within the group):

(A1 && A2) || !(B1 && B2 && B3) || C1

Furthermore, the ! can be arbitrarily nested. This enables us to write arbitrary logic, e.g. to express (A1 || A2) && B1, you could instead express !(!A1 && !A2) && B1. Hilariously over-verbose, but it works. Amen to boolean logic. I'm not going to test it, as writing the JSON seems very unfun, but I'm pretty sure it works.

Back to serious thought: in your proposal, compare's components are basically "flattened" into the match group, such that there'd be no difference between using compare or having several equal/less_than/greater_than. I like this, it lowers cognitive load and makes a ton of sense. In a way, I wonder if having multiple comparison matchers wouldn't be better however.

Having equals take an array and OR the result makes a lot of sense, and I can think of many cases where it could be useful (e.g. {http.error.status}: [401, 403] to check for either forbidden or unauthorized code).

Other operators don't really benefit from this though: a < 10 || a < 5 is redundant after all. And for the != operator, we almost certainly want the opposite behavior, since a != b || a != c is the same as true (assuming b and c are distinct), whereas a != b && a != c is meaninful.


Also something I thought of: having a "bounds" operator would be nice. E.G. to represent 400 <= x < 500, being able to write this would be nice

"bounds": {
    "{http.error.status_code}": { "lower": 400, "upper": 500 }
}

It's technically the same as using greaterThan and lowerThan, but is easier to reason with, I think. But now the LHS is an object...

This leads me to think that maybe we should have distinct operator. Fitting everything in a single "compare" operator doesn't seem to really work IMO. I think I'd rather write something like this, and it'd also make documentation much easier if every comparison has its dedicated page explaining its intricacies:

{
    "match": {
        "equal": {
            "{http.error.status_code}": [401, 402],
            "{foo}": ["bar"]
        },
        "bounds": {
            "{http.error.status_code}": { "lower": 400, "higher": "500" }
        }
    }
}

I'd like to mention I think we'd also want a empty matcher, i.e. "empty string or nil" useful for situations like "header is not set" etc. Obviously wrappable in not to do "not empty" allowing any value.

I think for the Caddyfile it would make sense to still wrap the comparison operators with a compare { } to allow for the more expressive <left> <operator> <right>. bounds doesn't map nicely to this though. I think between might feel nicer than bounds?

@francislavoie

useful for situations like "header is not set" etc

You can already do this:

{
    "match": [
        {
             "header": null
        }
    ]
}

I like the between idea. Although, inclusive or not? [ ] vs. ( )

We're starting to get into typed operations territory. Which, is fine, but maybe some of these are best left to future Starlark matcher and our initial matchers should be the most basic/common ops.

(If someone wants to propose and contribute a fully-fledged Starlark matcher, I'd be willing to join the discussion there.)

You can already do this: "header": null

Well, I don't mean exclusively for headers, could be used for any placeholder too. And I don't think you can set header to null in the Caddyfile, it'll just take your null as a string.

I think for the Caddyfile it would make sense to still wrap the comparison operators with a compare { } to allow for the more expressive .

So, I tried to implement this, splitting each "kind" of comparison into its own module, and immediately hit a small roadblock: In the caddyfile, matchers are decoded by looking at their name and loading the http.matchers.{name} module. This means we can't have something like compare { a == b } while also having different modules in the JSON. Not the end of the world, but yeah.


So the design I'm currently going with is something like this:

{
    "match": {
        "compare": {
            "equal": {
                "{http.error.status_code}": [401, 402]
            },
            "bounds": {
                "{http.error.status_code}": {
                    "lower": 400,
                    "upper": 500
                },
                "{http.error.status_code}": {
                    "lowerExclusive": 400,
                    "upperInclusive": 499
                }
            }
        }
    }
}

greaterThan and lowerThan disappear, they're tucked into bounds since they're basically the same thing. Each component of bounds is optional, lower/lowerExclusive are mutually exclusive, so are higher/higherInclusive.

In the caddyfile, it's used like so

compare {
    {http.error.status_code} == 401
    # Implemented with "bounds" { "{http.error.status_code}": { "upper": 401 }
    {http.error.status_code} < 401
    # Implemented with "bounds" { "{http.error.status_code}": { "lowerExclusive": 401 }
    {http.error.status_code} > 401
    # Neat chaining, specifies both a lower and upper bound.
    400 <= {http.error.status_code} < 500
}

There's one annoying thing with this Caddyfile syntax though:

compare {
    {http.error.status_code} == 401
    {http.error.status_code} == 402
    {http.error.status_code} < 410
}

This would mean http.error.status_code == 401 || http.error.status_code == 402 && http.error.status_code < 410. I really dislike that, it doesn't make intuitive sense and is hard to explain without showing the resulting JSON. Maybe we should split compare into different modules in the end...

Gonna revisit this next week after beta 16 goes out, will be sure to get this in beta 18 :)

@roblabla I've been looking at this recently.

This particular snippet:

compare {
    {http.error.status_code} == 401
    {http.error.status_code} == 402
    {http.error.status_code} < 410
}

apparently reads differently to me than it does to you. I read this as:

status_code == 401 && status_code == 402 && status_code < 410

which doesn't make sense of course.

What if... are we able to sidestep these problems if we just go ahead and implement something like a CEL-powered matcher? CEL is simply used for expressions like this, and is much easier to set up and use than Starlark. I'm still interested in Starlark for advanced embedded scripting, but for expressions, CEL seems like the ideal fit.

What would you need to hold you over in the short-term? How much boolean logic do you need right now? Because if it's just simple equals/LT/GT/contains, we could still implement simple versions of those that don't allow complex expression logic. For that, we could have CEL. I like both ideas, personally. Might play with this in a moment.

@roblabla I've pushed an initial proof of concept here: https://github.com/caddyserver/caddy/pull/3155

If we can find a way to expose all the variables to the CEL environment, I think this makes for a very promising solution!

What would you need to hold you over in the short-term?

I have a working solution already using vars for my use-case, it's just super boilerplatey. I don't need LT/GT myself though I do see a lot of value in having it.

Regarding CEL, it certainly looks promising! I think part of the problem for both Starlark and CEL is that there's no clear way to "gather" the data associated with a request. The way I see it, the ideal place to end up in is having an interface IntoCEL that could be implemented for caddyauth.User, HandlerError, etc... that would allow turning the various data into something usable from CEL. The problem is, all of this is stored in a ctx.Context and this has no way to iterate over all Values that I can see, you need to know the context key to access its values.

In the meantime, though, I think the CEL environment can be meaningfully improved by giving it access to things that are not trivially available in the other matchers:

  • The replacer (basically give access to ReplaceAll from CEL)
  • ErrorCtxKey (Need to know: the HTTP Error Status associated with the error, and maybe a string associated with the error description?)
  • UserCtxKey (Basically just the user ID for now)
  • And maybe more stuff in the http request

You'd have the following vars/functions available from the top-level in CEL:

  • ReplacePlaceholders(input, empty): Basically Replacer.ReplaceAll, allows processing placeholders. A lot of data is only accessible through this, so I think it's primordial this function exists from CEL.
  • error: Either null, or a struct:

    • message: a string, the result of running err.Error. Equivalent to ${http.error}

    • status_code: A number, the status code.

  • user: Either null if not authenticated, or:

    • id: The ID of the user.

  • req: The http.Request currently being processed:

    • path: A string

    • headers: A dictionary (can CEL handle those?)

    • host: A string, The request host

    • method: A string, The method used

    • protocol: A string, the protocol

    • query: A dictionary of query strings

    • remote_ip: A string, the ip of the remote side

This would already cover a lot of ground, and more can always be added.

What do you think?

Hmm. A CEL expression is itself a string, right? Would it be possible to call the replacer on the expression string itself before executing it? Or does the expression need to be compiled early? (I figure you'd need some logic like is_numeric() return it raw, is_boolean return true/false, otherwise return it wrapped with "")

@francislavoie So, something I could imagine doing is replacing ${([a-zA-Z.]*)} with ReplacePlaceholders("\0", ""), which would allow keeping consistency with the placeholder syntax in the rest of caddy while still allowing precompilation of CEL expressions.

So if you have ${http.status_code} == "200", it'd turn it into ReplacePlaceholders("http.status_code") == "200" and pass that to the CEL interpreter.

These are all great thoughts! I've read through these and been working on this most of the day.

@roblabla That's a great idea. I think I got this working, let me clean it up and I'll push it to a new PR.

@roblabla @francislavoie I've updated PR #3155 with the working matcher. It now supports all Caddy placeholders! Thanks to your ideas. PTAL!

Ooooooh that's fun 馃榿

I'll dig a bit deeper tonight but at a quick glance, code makes sense to me. Replaces placeholders with a method call back to userland to use the replacer. 馃帀

Closing, since I decided to just merge the CEL matcher for beta 18, but please feel free to continue discussion here and we can always change/fix things in beta 19!

@roblabla @mholt as one of the core maintainers of CEL, I want to say that I think this was a very thoughtful discussion and I'm super pleased with the outcome. If you have any questions, I'd be happy to consult. Cheers!

@TristonianJones Hi, welcome! CEL is pretty cool.

I'm glad you found this -- I would greatly appreciate your counsel for the CEL integration. I know I want to add more functions for timestamps and durations, and I am pretty sure I will have questions :) I'll ping you when we get to it! (Or if anyone beats me to it, feel free.)

@mholt Likewise, I'm happy you found CEL. It's been really interesting to watch who finds it and how they use it.

In terms of some additional improvements that you might want to make, I'd consider the following:

  1. You can also simplify compilation by using env.Compile to replace the env.Parse and env.Check phases.

  2. You probably also want to check the ast.ResultType() equals a decls.Bool type to make sure the filter is truly producing a boolean output and not just any old valid CEL expression:

if !proto.Equal(ast.ResultType(), decls.Bool) {
   // produce an error here.
}
  1. Istio took a slightly different approach to adding a custom type provider which can give you stronger typing guarantees; however, I can take this as a general feature request for better docs and helpers to make this easier / better. Here's a reference (probably overkill for what you need): https://github.com/istio/istio/blob/master/mixer/pkg/lang/cel/provider.go

@TristonianJones Excellent, thank you!

I've gone ahead and implemented the first two, those were pretty straightforward. e5dc76b05406ece7e2c1a9567bb18a47d7873793

The last one... looks like it will take some work! But I'm interested in it.

Awesome work! The last note about the type provider is definitely an area for improvement. It's very likely I could give you a Go-reflection based type-provider to make this easier. Istio happens to have its own internal type system, so that's why it gets more complex.

If the reflection is only used for unknown types, i.e. the few complex/custom types, then that sure does sound handy without having a performance hit most of the time. Thanks for being so attentive!

The job of the ref.TypeProvider is two-fold:

  1. Provide information about the valid fields (and their types) on a given object.
  2. Provide a utility to adapt types to CEL's ref.Val type.

In the first case, the list of http.Request fields supported in the matchers that was posted in the discussion thread could be statically-validated with a rich enough ref.TypeProvider. The validation could use reflection to improve correctness. Alternately you could do what Istio does, or I've also dabbled with using Open API Schema as a way to define type structure in https://github.com/google/cel-policy-templates-go (very early in development).

In the second case, the ref.TypeProvider helps with safe object traversal. However, this is often done via reflection or type-conversion. To avoid the reflection and conversion penalties when you have a very limited attribute set, you can use a cel.CustomAttributeFactory to provide your own Go-native utility for accessing fields and returning their raw value. This will avoid a vast number of the CEL-related reflection penalties you might incur otherwise. CEL will adapt the type as needed, but only when needed. There's a very small example of this in the interpreter/attributes_test.go. I would only recommend doing this if you're very performance conscious and you're re-evaluating the same expression over and over again:

type custAttrFactory struct {
    AttributeFactory
}

func (r *custAttrFactory) NewQualifier(objType *exprpb.Type,
    qualID int64, val interface{}) (Qualifier, error) {
    if objType.GetMessageType() == "http.Request" {
        return &httpReqQualifier{id: qualID, field: val.(string)}, nil
    }
    return r.AttributeFactory.NewQualifier(objType, qualID, val)
}

type httpReqQualifier struct {
    id    int64
    field string
}

func (q *httpReqQualifier) ID() int64 {
    return q.id
}

func (q *httpReqQualifier) Qualify(vars Activation, obj interface{}) (interface{}, error) {
    req := obj.(*http.Request)
        switch q.field {
        case "path":
           return req.GetPath(), nil
        case "headers":
           return req.GetHeaders(), nil
        // and so on.
        }
    return nil, fmt.Errorf("no such field: %s", q.field)
}

@TristonianJones Ah.. that makes more sense. Thank you for the example and the explanation!

This is really good to know. I think, personally, I'm pretty satisfied with the CEL matcher we have, at least as far as my priorities are concerned, but I wouldn't stop anyone from continuing to improve it further in the short term if they wanted to!

Look for the 2.0 release in the coming weeks, we'd be happy to help propagate CEL a little more!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

areve picture areve  路  37Comments

hazcod picture hazcod  路  41Comments

wallacyyy picture wallacyyy  路  53Comments

lukasbestle picture lukasbestle  路  38Comments

TribuneX picture TribuneX  路  37Comments