Vega-lite: Compare a field with selection

Created on 17 May 2019 · 7 Comments · Source: vega/vega-lite

This is in reference to issue #4740. We want to design a schema that supports comparisons between datum values and selections.

We want the design to accommodate the following cases in a generic manner.

1. We can select a field from a selection and aggregate it into a single value, for interval or multi selections:
   datum.Cylinders <= selection(sel1).Cylinders.agg(min)
   _(Simplified representation for easier readability)_
2. We can aggregate multiple selections, which may or may not use the same data source:
   datum.Cylinders <= total(selection(sel1).Cylinders.agg(min), selection(sel2).cyl.agg(avg))
3. We can have a boolean logical representation between different selections for a single datum field:
   datum.Cylinders <= selection(sel1).Cylinders.agg(max) && datum.Cylinders > selection(sel2).cyl.agg(total) || datum.Cylinders != selection(sel3).Cylinders
4. We can have logical representations between different datum fields:
   datum.a <= selection(sel1).a.agg(max) && datum.b <= selection(sel2).b.agg(max)
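
To make case 1 concrete, here is a minimal TypeScript sketch that evaluates the comparison directly over in-memory data. It is not how Vega-Lite implements selections: it assumes a selection can be modelled as the array of tuples it currently contains, and the names `Tuple`, `AggOp`, and `aggregate` are hypothetical.

```typescript
// Assumption: a selection is modelled as the list of tuples it currently contains.
type Tuple = Record<string, number>;
type AggOp = "min" | "max" | "avg" | "total";

// Collapse one field of the selected tuples into a single number.
function aggregate(selected: Tuple[], field: string, op: AggOp): number {
  const values = selected.map((t) => t[field]);
  if (values.length === 0) return NaN; // assumption: an empty selection yields NaN
  const sum = values.reduce((a, b) => a + b, 0);
  switch (op) {
    case "min": return Math.min(...values);
    case "max": return Math.max(...values);
    case "avg": return sum / values.length;
    case "total": return sum;
  }
}

// Case 1: datum.Cylinders <= selection(sel1).Cylinders.agg(min)
const sel1: Tuple[] = [{ Cylinders: 6 }, { Cylinders: 8 }];
const data: Tuple[] = [{ Cylinders: 4 }, { Cylinders: 6 }, { Cylinders: 8 }];
const threshold = aggregate(sel1, "Cylinders", "min");
const passing = data.filter((d) => d.Cylinders <= threshold);
```

Here the selection's minimum is 6, so the tuples with 4 and 6 cylinders pass the test.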

@arvind and I have come up with the following schemas. We would like to know whether there are alternative schemas that address the above requirements in a way that is consistent and fairly simple for a user with little or no programming experience.

"test": {
    "Cylinders": {
        "lte": "sel1",
        "field": "Cylinders",
        "aggregate": "min"
    }
}

_We use Cylinders instead of datum.Cylinders as the key because the datum prefix never changes._

This implies datum.Cylinders <= selection(sel1).Cylinders.agg(min)
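
The mapping from this schema to the simplified expression can be sketched in TypeScript. This is an illustration only: the type `SelectionComparison` and the function `toExpression` are hypothetical names, not part of Vega-Lite, and the output uses the simplified notation from this issue rather than a real Vega expression.

```typescript
// Hypothetical types for the proposed schema; none of these names come from Vega-Lite.
type CompareKey = "lt" | "lte" | "gt" | "gte";

interface SelectionComparison {
  field: string;     // field to read from the selection
  aggregate: string; // aggregate applied to the selected values
  lt?: string;       // each comparison key holds a selection name
  lte?: string;
  gt?: string;
  gte?: string;
}

const SYMBOLS: Record<CompareKey, string> = { lt: "<", lte: "<=", gt: ">", gte: ">=" };
const KEYS: CompareKey[] = ["lt", "lte", "gt", "gte"];

// Render one entry of "test" in the simplified notation used in this issue.
function toExpression(datumField: string, cmp: SelectionComparison): string {
  const key = KEYS.find((k) => cmp[k] !== undefined);
  if (key === undefined) throw new Error("no comparison key present");
  return `datum.${datumField} ${SYMBOLS[key]} selection(${cmp[key]}).${cmp.field}.agg(${cmp.aggregate})`;
}

const expr = toExpression("Cylinders", { lte: "sel1", field: "Cylinders", aggregate: "min" });
```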

It can be further extended with AND, OR, and NOT operators.

"test": {
    "Cylinders": {
        "and": [
            {"lte": "sel1", "field": "Cylinders", "aggregate": "min"},
            {"gt": "sel2", "field": "cyl", "aggregate": "total"}
        ]
    }
}

The above declaration would result in:

datum.Cylinders <= selection(sel1).Cylinders.agg(min) && datum.Cylinders > selection(sel2).cyl.agg(total)

For multiple datum fields, we can wrap them in another AND, OR, or NOT array.

"test": {
    "and": [
        {"Cylinders": {"lte": ...}},
        {"or": [{"Horsepower": ...}, {"Origin": ...}]}
    ]
}

datum.Cylinders <= ... && (datum.Horsepower > ... || datum.Origin > ...)
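
A minimal sketch of how such nested AND/OR/NOT structures could be rendered with explicit parentheses. It assumes each leaf has already been turned into a comparison string; the `Logic` type and `render` function are hypothetical names, and the right-hand sides `c`, `h`, and `o` are placeholders.

```typescript
// Hypothetical tree type: a leaf is an already-rendered comparison string.
type Logic = string | { and: Logic[] } | { or: Logic[] } | { not: Logic };

// Render recursively, parenthesising every group so the nesting stays explicit.
function render(node: Logic): string {
  if (typeof node === "string") return node;
  if ("and" in node) return "(" + node.and.map(render).join(" && ") + ")";
  if ("or" in node) return "(" + node.or.map(render).join(" || ") + ")";
  return "!(" + render(node.not) + ")";
}

const nested = render({
  and: [
    "datum.Cylinders <= c",
    { or: ["datum.Horsepower > h", "datum.Origin > o"] },
  ],
});
```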

Finally, to address the 2nd requirement, we have the following schema:

"test": {
    "Cylinders": {
        "aggregate": "avg",
        "selections": [
            {"name": "sel1", "field": "Cylinders", "aggregate": "min"},
            {"name": "sel2", "field": "cyl", "aggregate": "total"}
        ],
        "gt": true
    }
}

This would result in
datum.Cylinders > avg(selection(sel1).Cylinders.agg(min), selection(sel2).cyl.agg(total))
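
As a rough illustration of this two-level aggregation, the sketch below first aggregates each selection and then aggregates those results. It assumes selections are plain arrays of tuples; `Row`, `Op`, `agg`, `SelectionTerm`, and `combine` are hypothetical names.

```typescript
type Row = Record<string, number>;
type Op = "min" | "max" | "avg" | "total";

function agg(values: number[], op: Op): number {
  const sum = values.reduce((a, b) => a + b, 0);
  switch (op) {
    case "min": return Math.min(...values);
    case "max": return Math.max(...values);
    case "avg": return sum / values.length;
    case "total": return sum;
  }
}

// One entry of the "selections" array, with the selection's tuples inlined.
interface SelectionTerm { rows: Row[]; field: string; aggregate: Op }

// Aggregate each selection on its own, then aggregate the per-selection results.
function combine(terms: SelectionTerm[], outer: Op): number {
  const inner = terms.map((t) => agg(t.rows.map((r) => r[t.field]), t.aggregate));
  return agg(inner, outer);
}

const bound = combine(
  [
    { rows: [{ Cylinders: 4 }, { Cylinders: 6 }], field: "Cylinders", aggregate: "min" }, // min is 4
    { rows: [{ cyl: 3 }, { cyl: 5 }], field: "cyl", aggregate: "total" },                 // total is 8
  ],
  "avg",
); // (4 + 8) / 2 = 6
const passes = 8 > bound; // a datum with Cylinders = 8
```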

There are a couple of things we want to point out here.

The schema for the last declaration differs from the first three. This was done to accommodate both logical expressions and the overall aggregation of selections. We tried to combine them into a single generic definition, but ran into issues such as replacing name with lte, which made no sense when an aggregate was used.

"test": {
    "mean": [
        {"Cylinders": {"lte": ...}},
        {"Horsepower": {"lte": ...}}
    ]
}

_What does lte mean here?_

Currently, lte, gt, etc. can all appear as keys. What does it mean to have several of these keys together? (Using TypeScript, we can enforce that only one of the keys is present, as is done in the filter transform.) This isn't a big issue, but an alternative schema that avoids the problem would be preferable.
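
One way to enforce "only one comparison key" at the type level is a union in which the other keys are typed `never`. The sketch below is an assumption about how this could look, not a copy of Vega-Lite's filter types; `SelectionTarget`, `ExclusiveComparison`, and `comparisonKeyCount` are hypothetical names.

```typescript
// Fields shared by every variant (hypothetical names mirroring the proposal above).
interface SelectionTarget { field: string; aggregate: string }

// Exactly one comparison key is allowed; the others are typed `never`.
type ExclusiveComparison = SelectionTarget & (
  | { lt: string; lte?: never; gt?: never; gte?: never }
  | { lte: string; lt?: never; gt?: never; gte?: never }
  | { gt: string; lt?: never; lte?: never; gte?: never }
  | { gte: string; lt?: never; lte?: never; gt?: never }
);

const accepted: ExclusiveComparison = { lte: "sel1", field: "Cylinders", aggregate: "min" };

// The compiler rejects an object that carries two comparison keys, e.g.:
// const rejected: ExclusiveComparison = { lte: "sel1", gt: "sel2", field: "cyl", aggregate: "total" };

// Runtime counterpart for untyped JSON input: count the comparison keys present.
function comparisonKeyCount(spec: object): number {
  return ["lt", "lte", "gt", "gte"].filter((k) => k in spec).length;
}
```

The type check only helps TypeScript users; specs written as raw JSON would still need a runtime or JSON Schema check along the lines of `comparisonKeyCount`.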

Area - Interaction P3 RFC / Discussion

djbarnwal

All 7 comments

I think it makes sense to base the schema on our filter schema and our logical operands. Your design makes sense to me. Do you have specific questions that we want to answer here?

The one part that I am not convinced by is this:

"test": {
    "Cylinders": {
        "aggregate": "avg",
        "selections": [
            {"name": "sel1", "field": "Cylinders", "aggregate": "min"},
            {"name": "sel2", "field": "cyl", "aggregate": "total"}
        ],
        "gt": true
    }
}

"gt": true doesn't really explain what should be greater than what here.

Why not this?

"test": {
    "field": "Cylinders",
    "gt": {
        "aggregate": "avg",
        "selections": [
            {"name": "sel1", "field": "Cylinders", "aggregate": "min"},
            {"name": "sel2", "field": "cyl", "aggregate": "total"}
        ]
    }
}

domoritz on 31 May 2019

After some thought, I think this would be a better way to do it. It keeps the consistency of the above schemas, where we use the datum field as the key and provide a ComparisonOp as its value. It also helps with the implementation, as we can use the ComparisonOp as an anchor to identify all these predicates.

"test": {
    "Cylinders": {
        "gt": {
            "aggregate": "avg",
            "selections": [
                {"name": "sel1", "field": "Cylinders", "aggregate": "min"},
                {"name": "sel2", "field": "cyl", "aggregate": "total"}
            ]
        }
    }
}

djbarnwal on 31 May 2019

One issue with the last example is that it puts variables in the keys. We usually prefer not to do that. Selections are the only example where we broke this rule and it is causing some confusion at times. I think it would be good if we can avoid it but it's not a strict requirement.

domoritz on 31 May 2019

One point that I raised on Slack is that using variable names as keys in expressions is dangerous because of nesting. If you have a field called “and” or “gt”, then it can be very tricky to understand a complex expression.
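
A small sketch of that ambiguity, assuming a parser that has to decide from the key alone whether it is looking at an operator or a field name. The dataset column named "gt" and the helper `classifyKey` are hypothetical.

```typescript
const OPERATORS = new Set(["and", "or", "not", "lt", "lte", "gt", "gte"]);

// With field names as keys, a parser can only guess from the key itself.
function classifyKey(key: string): "operator" | "field" {
  return OPERATORS.has(key) ? "operator" : "field";
}

// Hypothetical dataset with a column literally named "gt".
const fieldAsKey = { gt: { lte: "sel1", field: "gt", aggregate: "min" } };
const guesses = Object.keys(fieldAsKey).map(classifyKey); // the field "gt" is taken for an operator

// With an explicit "field" property the name is a value, so nothing needs guessing.
const fieldAsValue = { field: "gt", lte: { selection: "sel1", field: "gt", op: "min" } };
```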

domoritz on 31 May 2019

@djbarnwal -- Thanks for initiating the discussion. :)

Here are some thoughts:

Treat Selection Data as a New Type of Predicate Value?

We should consider treating selection data as a new type of predicate value, which can be used with either "test" for conditional encoding or with "filter" in filter transform.

In the existing field predicates, we use "field": string, so we could re-use the same keyword in this use case.

(Even without the consistency argument, I think we should not use the field name as a key. The selection name is already causing some confusion, as Dom mentioned and as discussed on Slack. However, one could at least argue that the selection name must be unique, so having it as a key still has some rationale. On the other hand, you may want to test the same field with multiple different predicates, so there is no good reason to make the field name a key.)

We can then treat selection data (and its aggregated value) as a new variant of data object supported in gt, gte, lt, lte, range, oneOf predicates.

Basically, we can describe the following logic (from case 1)

datum.Cylinders <= selection(sel1).Cylinders.agg(min)

as:

{
  "field": "Cylinders",
  "lte": {
    "selection": "sel1",  // we already have `{"filter": {"selection": "brush"}}`, so an explicit "selection" key makes sense
    "field": "Cylinders",
    "op/aggregate": "min"  // see the discussion about "op" below
  }
}

By making an aggregated selection just another type of predicate value, we get support for cases 3 and 4 for free, without additional new syntax (as predicates already support the logical operand structure).
We also use "op" for aggregate ops in Encoding Sort/Window Transform/Aggregate Transform. We should debate whether we should use "aggregate" or "op" here. (I'm leaning towards "op" as we don't use aggregate elsewhere besides as a macro for inline-encoding, but I'm happy to hear if you have a different opinion.)
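
The proposal above can be sketched as TypeScript types plus a small renderer. This is a hedged illustration, not Vega-Lite's actual predicate definitions: `SelectionValue`, `ComparableValue`, `FieldComparisonPredicate`, and `renderPredicate` are hypothetical names, the sketch picks "op" even though the key name is still open, and the output is the simplified notation from this issue.

```typescript
// Hypothetical types sketching the proposal; Vega-Lite's real predicate types may differ.
interface SelectionValue {
  selection: string;                  // selection name, e.g. "sel1"
  field: string;                      // field read from the selection
  op: "min" | "max" | "mean" | "sum"; // "op" vs "aggregate" is still under discussion
}

type ComparableValue = number | SelectionValue;

interface FieldComparisonPredicate {
  field: string;
  lt?: ComparableValue;
  lte?: ComparableValue;
  gt?: ComparableValue;
  gte?: ComparableValue;
}

const OPS = [["lt", "<"], ["lte", "<="], ["gt", ">"], ["gte", ">="]] as const;

function renderValue(v: ComparableValue): string {
  return typeof v === "number" ? String(v) : `selection(${v.selection}).${v.field}.agg(${v.op})`;
}

// Render the first comparison key found on the predicate.
function renderPredicate(p: FieldComparisonPredicate): string {
  for (const [key, symbol] of OPS) {
    const value = p[key];
    if (value !== undefined) return `datum.${p.field} ${symbol} ${renderValue(value)}`;
  }
  throw new Error("predicate has no comparison key");
}

const rendered = renderPredicate({
  field: "Cylinders",
  lte: { selection: "sel1", field: "Cylinders", op: "min" },
});
```

Because a plain number and a selection value share the same slot, existing predicates such as `{"field": "Cylinders", "lte": 6}` keep their shape under this design.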

Should we include case 2 (aggregating multiple selections)?

It's good that we enumerate these different cases (1-4) for the discussion here. However, I doubt whether we should include case 2: we don't even support combining aggregates from multiple data streams, so I think it's a bit inconsistent to introduce such advanced support only for selections.

That said, if there is an example visualization that we should definitely support for this case, it's worth mentioning before we exclude this.

Side Notes:

~~We previously omitted comparison between multiple fields (e.g., datum.a < datum.b) in the predicate object as we wanted to avoid bloating the language and focus on critical use cases instead.~~ (Actually, commenting on this issue made me realize a solution -- see #5020.)
There are many P0 and P1 issues relevant to selection, and I think they are more important than case 2. So it might be better to help address those issues before implementing case 2.

Should we consider groupby?

Just a thought -- If there is aggregate, how do we deal with "groupby"?

For example, consider:

datum.Cylinders <= selection(sel1).Cylinders.agg(mean)

What if I want to compute this mean(Cylinders) per group, based on the Origin of the particular car being tested?

I think this "groupby" case, just like case 2, is probably too complicated to support. (And I can't think of a convincing example chart for it yet.) But it's worth mentioning as it came up in my thought process.
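
To illustrate what a grouped comparison would mean, here is a small sketch under the assumption that the selection is an in-memory array of cars. Each datum is compared against the mean of its own Origin group; `Car`, `meanByOrigin`, and `passesGrouped` are hypothetical names.

```typescript
interface Car { Origin: string; Cylinders: number }

// Mean of Cylinders per Origin, computed over the selected tuples only.
function meanByOrigin(selected: Car[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const car of selected) {
    const entry = sums.get(car.Origin) ?? { total: 0, count: 0 };
    entry.total += car.Cylinders;
    entry.count += 1;
    sums.set(car.Origin, entry);
  }
  const means = new Map<string, number>();
  sums.forEach((s, origin) => means.set(origin, s.total / s.count));
  return means;
}

const selectedCars: Car[] = [
  { Origin: "USA", Cylinders: 8 },
  { Origin: "USA", Cylinders: 6 },
  { Origin: "Japan", Cylinders: 4 },
];
const groupMeans = meanByOrigin(selectedCars); // USA: 7, Japan: 4

// Each datum is compared against the mean of its own Origin group.
// Assumption: a datum whose group is absent from the selection fails the test.
function passesGrouped(d: Car): boolean {
  const mean = groupMeans.get(d.Origin);
  return mean !== undefined && d.Cylinders <= mean;
}
```

The same six-cylinder car passes when its Origin is USA and fails when it is Japan, which is the extra behaviour a "groupby" would have to express in the schema.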

I hope this is useful for iterating on your design. Let me know if you need additional explanations.

kanitw on 2 Jun 2019

👍1

Another minor note about writing nested logics:

In general, it's good to avoid writing A && B || C. Instead, be explicit about your nesting: (A && B) || C or A && (B || C).
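
A short TypeScript check of why the grouping matters, assuming JavaScript-style operator precedence, where && binds tighter than ||. The function names are made up for this illustration.

```typescript
// In JavaScript/TypeScript, && binds tighter than ||.
function implicit(a: boolean, b: boolean, c: boolean): boolean {
  return a && b || c; // parsed as (a && b) || c
}

function andFirst(a: boolean, b: boolean, c: boolean): boolean {
  return (a && b) || c;
}

function orFirst(a: boolean, b: boolean, c: boolean): boolean {
  return a && (b || c);
}

// a = false, b = true, c = true separates the two readings.
const readings = [
  implicit(false, true, true),
  andFirst(false, true, true),
  orFirst(false, true, true),
];
```

The unparenthesised form agrees with `andFirst` and disagrees with `orFirst` for this input, so a reader who assumes the other grouping gets a different result.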

kanitw on 2 Jun 2019

😄1 👍1

@kanitw I do prefer the syntax you suggested above. It makes a lot of things simpler. Regarding case 2, I do not have an example use case right now. It's probably best if we come back to it later.

I am working on a prototype. We can iterate on it and see if we need to add more functionality.

djbarnwal on 3 Jun 2019

👍1
