This is in reference to Issue #4740. We want to design a schema that supports comparisons between datum values and selections.
We want the design to be such that it can accommodate the following cases in a generic manner.
1. We can select a field from a selection and aggregate it into a single value for interval or multi selections
_(Simplified representation for easier readability)_
datum.Cylinders <= selection(sel1).Cylinders.agg(min)
2. We can aggregate multiple selections which may or may not use the same data source
datum.Cylinders <= total(selection(sel1).Cylinders.agg(min), selection(sel2).cyl.agg(avg))
3. We can have a boolean logical representation between different selections for a single datum field
datum.Cylinders <= selection(sel1).Cylinders.agg(max) && datum.Cylinders > selection(sel2).cyl.agg(total) || datum.Cylinders != selection(sel3).Cylinders
4. We can have logical representations between different datum fields
datum.a <= selection(sel1).a.agg(max) && datum.b <= selection(sel2).b.agg(max)
@arvind and I have come up with the following schemas. We would like to know whether there are alternative schemas that address the above requirements in a way that is consistent and fairly simple for a user with little or no programming experience.
"test": {
"Cylinders": {
"lte": "sel1",
"field": "Cylinders",
"aggregate": "min"
}
}
_we are using Cylinders instead of datum.Cylinders as datum will never change_
This implies datum.Cylinders <= selection(sel1).Cylinders.agg(min)
It can be further extended by using AND, OR, and NOT operators.
"test": {
"Cylinders": {
"and" : [{"lte": "sel1", "field": "Cylinders", "aggregate": "min"},
{"gt": "sel2", "field": "cyl", "aggregate": "total"}]
}
}
The above declaration would result in -
datum.Cylinders <= selection(sel1).Cylinders.agg(min) && datum.Cylinders > selection(sel2).cyl.agg(total)
For multiple datum fields, we can wrap them in another AND OR NOT array.
"test": {
"and": [
{"Cylinders": {"lte":...}},
{"or": [{"Horsepower":...}, {"Origin":...}]}
]
}
datum.Cylinders <= ... && (datum.Horsepower > ... || datum.Origin > ...)
Finally, for addressing the 2nd requirement, we have the following schema
"test": {
"Cylinders": {
"aggregate": "avg",
"selections": [{"name": "sel1", "field": "Cylinders", "aggregate": "min"},
{"name": "sel2", "field": "cyl", "aggregate": "total"}],
"gt": true
}
}
This would result in
datum.Cylinders > avg(selection(sel1).Cylinders.agg(min), selection(sel2).cyl.agg(total))
There are a couple of things which we want to point out here.
name with lte which made no sense when an aggregate was used.
"test": {
"mean": [
{"Cylinders": {"lte":...}},
{"Horsepower": {"lte":...}}
]
}
_What does lte mean here?_
lte, gt etc. as keys. What does it mean to have both the keys together? (Though using TypeScript we can enforce that only one of the keys is present, as done in filter transform.) This isn't a big issue but an alternative schema which would overcome this problem would be favourable.
I think it makes sense to base the schema on our filter schema and our logical operands. Your design makes sense to me. Do you have specific questions that we want to answer here?
The one part that I am not convinced by is this
"test": {
"Cylinders": {
"aggregate": "avg",
"selections": [{"name": "sel1", "field": "Cylinders", "aggregate": "min"},
{"name": "sel2", "field": "cyl", "aggregate": "total"}],
"gt": true
}
}
"gt": true doesn't really explain what should be greater than what here.
Why not this?
"test": {
"field": "Cylinders",
"gt": {
"aggregate": "avg",
"selections": [{"name": "sel1", "field": "Cylinders", "aggregate": "min"},
{"name": "sel2", "field": "cyl", "aggregate": "total"}]
}
}
After some thought, I think this would be a better way to do it. It has the consistency of the above schemas where we use the datum field as the key and provide a ComparisonOp as its value. Moreover, this helps us in implementing it too as we can use the ComparisonOp as an anchor to identify all these predicates.
"test": {
"Cylinders": {
"gt": {
"aggregate": "avg",
"selections": [{"name": "sel1", "field": "Cylinders", "aggregate": "min"},
{"name": "sel2", "field": "cyl", "aggregate": "total"}]
}
}
}
One issue with the last example is that it puts variables in the keys. We usually prefer not to do that. Selections are the only example where we broke this rule and it is causing some confusion at times. I think it would be good if we can avoid it but it's not a strict requirement.
One point that I raised on slack is that using variable names in expressions is dangerous because of nesting. If you have a field called “and” or “gt”, then it can be very tricky to understand a complex expression.
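To make the nesting concern concrete, here is a minimal TypeScript sketch. The helper `interpretKey` and the set `LOGICAL_KEYS` are hypothetical names invented for this illustration; they are not part of Vega-Lite. It shows how a reader (or parser) of the field-as-key design would have to classify each key, and why a column named "and" becomes unreachable.

```typescript
// Hypothetical illustration only: how a key inside "test" would be read
// if field names and logical operators shared the same key space.
const LOGICAL_KEYS = new Set(["and", "or", "not"]);

function interpretKey(key: string): "logical operator" | "field name" {
  // Operator keywords must win, otherwise {"and": [...]} could not be parsed.
  return LOGICAL_KEYS.has(key) ? "logical operator" : "field name";
}

console.log(interpretKey("Cylinders")); // "field name"
// A data column that happens to be called "and" can no longer be addressed:
console.log(interpretKey("and")); // "logical operator"
```

With an explicit "field" property, the field name is always a value and never competes with keywords.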
@djbarnwal -- Thanks for initiating the discussion. :)
Here are some thoughts:
We should consider treating selection data as a new type of predicate value, which can be used with either "test" for conditional encoding or with "filter" in filter transform.
In the existing field predicates, we use "field": string, so we could re-use the same keyword in this use case.
(Even without the consistency argument, I think we should not use the field name as a key. The selection name is already causing some confusion, as Dom mentioned and as discussed on Slack. However, one could at least argue that the selection name must be unique, so having it as a key still has some rationale. On the other hand, you may want to test the same field with multiple different predicates, so there is no good reason to make the field name a key.)
We can then treat selection data (and its aggregated value) as a new variant of data object supported in gt, gte, lt, lte, range, oneOf predicates.
Basically, we can describe the following logic (from case 1)
datum.Cylinders <= selection(sel1).Cylinders.agg(min)
as:
{
"field": "Cylinders",
"lte": {
"selection": "sel1", // we already have `{"filter": {"selection": "brush"}}`, so having an explicit "selection" key makes sense
"field": "Cylinders",
"op/aggregate": "min" // see discussion about "op" below
}
}
By making aggregated selection just a type of predicate value, we can get support for case 3 and 4 for free without additional new syntax (as predicates already support logical operand structure).
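As a rough sketch of why cases 3 and 4 come for free, the TypeScript below models the proposal with made-up types and a small evaluator. Everything here is an assumption for illustration: the type names (`SelectionValue`, `Predicate`, `SelectionStore`), the helper functions, and the subset of aggregate ops are not the actual Vega-Lite schema or implementation, and "op" is used only because the naming question is still open.

```typescript
// Hypothetical types for illustration; not the real Vega-Lite schema.
type AggregateOp = "min" | "max" | "mean" | "sum";

interface SelectionValue {
  selection: string; // selection name, e.g. "sel1"
  field: string;     // field to read from the selected tuples
  op: AggregateOp;   // collapses the selected values into one number
}

type Comparison =
  | { field: string; lt: SelectionValue }
  | { field: string; lte: SelectionValue }
  | { field: string; gt: SelectionValue }
  | { field: string; gte: SelectionValue };

// Logical composition reuses the same and/or/not structure as other predicates.
type Predicate = Comparison | { and: Predicate[] } | { or: Predicate[] } | { not: Predicate };

type Row = Record<string, number>;
type SelectionStore = Record<string, Row[]>; // selected tuples per selection name

function aggregate(values: number[], op: AggregateOp): number {
  // Note: an empty selection yields Infinity, -Infinity, NaN or 0 depending on op.
  switch (op) {
    case "min": return Math.min(...values);
    case "max": return Math.max(...values);
    case "sum": return values.reduce((a, b) => a + b, 0);
    case "mean": return values.reduce((a, b) => a + b, 0) / values.length;
  }
}

function resolve(value: SelectionValue, store: SelectionStore): number {
  const rows = store[value.selection] ?? [];
  return aggregate(rows.map((r) => r[value.field]), value.op);
}

function testPredicate(p: Predicate, datum: Row, store: SelectionStore): boolean {
  if ("and" in p) return p.and.every((q) => testPredicate(q, datum, store));
  if ("or" in p) return p.or.some((q) => testPredicate(q, datum, store));
  if ("not" in p) return !testPredicate(p.not, datum, store);
  const lhs = datum[p.field];
  if ("lt" in p) return lhs < resolve(p.lt, store);
  if ("lte" in p) return lhs <= resolve(p.lte, store);
  if ("gt" in p) return lhs > resolve(p.gt, store);
  return lhs >= resolve(p.gte, store);
}

// Usage: case 1, then case 4 built from the same pieces.
const store: SelectionStore = {
  sel1: [{ Cylinders: 4, Horsepower: 90 }, { Cylinders: 6, Horsepower: 110 }],
};
const case1: Predicate = {
  field: "Cylinders",
  lte: { selection: "sel1", field: "Cylinders", op: "min" },
};
const case4: Predicate = {
  and: [case1, { field: "Horsepower", lte: { selection: "sel1", field: "Horsepower", op: "max" } }],
};
console.log(testPredicate(case1, { Cylinders: 4, Horsepower: 100 }, store));
console.log(testPredicate(case4, { Cylinders: 4, Horsepower: 100 }, store));
```

The comparison itself never needs to know about nesting; the logical wrappers handle it.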
We also use "op" for aggregate ops in Encoding Sort/Window Transform/Aggregate Transform. We should debate whether we should use "aggregate" or "op" here. (I'm leaning towards "op" as we don't use aggregate elsewhere besides as a macro for inline-encoding, but I'm happy to hear if you have a different opinion.)
It's good that we enumerate these different cases (1-4) for the discussion here. However, I doubt whether we should include case 2: we don't even support combining aggregates from multiple data streams, so I think it's a bit inconsistent to introduce such advanced support only for selections.
That said, if there is an example visualization that we should definitely support for this case, it's worth mentioning before we exclude this.
Side Notes:
datum.a < datum.b) in the predicate object as we wanted to avoid bloating the language and focus on critical use cases instead.
Just a thought -- If there is an aggregate, how do we deal with "groupby"?
For example, consider:
datum.Cylinders <= selection(sel1).Cylinders.agg(mean)
What if I want to consider this mean(Cylinders) based on the origin of the Cylinders of the particular car being tested?
I think this "groupby" case, just like case 2), is probably too complicated to support. (And I can't think of a convincing example chart for it yet.) But it's worth mentioning as it comes up to me in my thought process.
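For reference, one possible reading of the "groupby" case can be sketched in TypeScript as follows. This is only an assumed interpretation: the `Car` type, both function names, and the choice to fail the test when no selected car shares the datum's Origin are all made up for this sketch, and nothing like it is proposed for the schema.

```typescript
// Hypothetical sketch of a grouped comparison; not a proposed feature.
type Car = { Origin: string; Cylinders: number };

// Mean of Cylinders per Origin, computed over the selected cars only.
function meanCylindersByOrigin(selected: Car[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const car of selected) {
    const s = sums.get(car.Origin) ?? { total: 0, count: 0 };
    s.total += car.Cylinders;
    s.count += 1;
    sums.set(car.Origin, s);
  }
  const means = new Map<string, number>();
  sums.forEach((s, origin) => means.set(origin, s.total / s.count));
  return means;
}

// datum.Cylinders <= mean(Cylinders) of selected cars with the same Origin.
function testGrouped(datum: Car, selected: Car[]): boolean {
  const mean = meanCylindersByOrigin(selected).get(datum.Origin);
  // Assumption: if no selected car shares the datum's Origin, the test fails.
  return mean !== undefined && datum.Cylinders <= mean;
}
```

Even in this small form, the comparison needs a per-datum lookup rather than a single aggregated value, which hints at the extra complexity.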
I hope this is useful for iterating on your design. Let me know if you need additional explanations.
Another minor note about writing nested logics:
In general, it's good to avoid writing A && B || C. Instead, be explicit about your nesting: (A && B) || C or A && (B || C).
:)
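A small TypeScript check of the point about nesting (the function names are made up for this illustration): `&&` binds tighter than `||`, so the unparenthesized form silently picks one of the two groupings, and the other grouping can give a different answer.

```typescript
// Unparenthesized: parsed as (a && b) || c because && has higher precedence.
function implicitNesting(a: boolean, b: boolean, c: boolean): boolean {
  return a && b || c;
}

function groupedLeft(a: boolean, b: boolean, c: boolean): boolean {
  return (a && b) || c;
}

function groupedRight(a: boolean, b: boolean, c: boolean): boolean {
  return a && (b || c);
}

// With a = false, b = false, c = true the two groupings disagree.
console.log(implicitNesting(false, false, true)); // true
console.log(groupedLeft(false, false, true));     // true
console.log(groupedRight(false, false, true));    // false
```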
@kanitw I do prefer the syntax you suggested above. That makes a lot of things simpler. Regarding case 2, I do not have any example use case right now. It's probably best if we focus on that sometime later.
I am working on a prototype. We can re-iterate over it and see if we need to add more functionalities.