I thought there was already an issue open for this but I couldn't find it. I don't know if people saw my JuliaCon talk but I implemented covariant namedtuples for LightQuery. If that would be useful, maybe we could get it in Base, and if its not useful, I'd like to understand why, and I can switch LightQuery over to using DataValues and invariant NamedTuples.
See also #24614 :joy:
I think the main reason invariant named tuples are useful is that a named tuple with Union{Int,Missing}
as a field type can be a concrete type with a known layout.
The main issue I see with covariance is that a tuple of Int or Missing must have either the type Tuple{Int}
or Tuple{Missing}
. If you have N elements there are 2^N possible types. Even if there is an enclosing container with element type Tuple{Union{Int,Missing},...}
the types of the elements are still all taken from the set of 2^N. With invariance, every element can have the same type InvariantTuple{Union{Int,Missing}, ...}
. As far as I recall that was our reasoning.
In addition to Jeff's great points, I've also been diving a lot into the remaining inconveniences of working w/ Union{T, Missing}
and I think we have a pretty good handle on things we can do to make them better:
With https://github.com/JuliaLang/julia/pull/32448, NamedTuples (and all structs) w/ isbits Union fields will be stored inline in Arrays.
With https://github.com/JuliaLang/julia/issues/32699, we actually get into a scenario where you can do a lazy map (a la Query.EnumerableMap
, or like this simple Tables.jl version that doesn't rely on inference at all), and the compiler is able to inline/infer everything all the way down, which == great performance and no extra allocations. Now, that does depend on who's collecting the lazy map, and this ideal scenario works for most cases: columnar outputs like DataFrame/IndexedTable/TypedTable, file-based outputs like CSV/Feather, and databases, like ODBC/MySQL/SQLite. The one case that is inconvenient is collecting into an Array of NamedTuples, since the resulting type would be Vector{NamedTuple{names, T} where T <: Tuple{Int, Union{Float64, Missing}}
, which wouldn't have an inlined/concrete layout. Now, you can still get the nice case by manually specifying the output type of your NamedTuple mapping function (like return NamedTuple{names, Tuple{Int, Union{Float64, Missing}}}((1, missing))
, and I think it'd be worth exploring some syntax ideas to make that simpler/nicer.
And honestly, by having a lazy map that can be inferred when collecting, we do away with the main reason I'm aware of for the DataValue
type (since, on a technical basis, the same inference improvement would apply to inferring join key mapping functions, etc.). (the other differences I'm aware of with Missing
vs. DataValue
are orthogonal to inference issues, like 3-valued logic, etc.). That to me seems like a huge win!
At this point, I think there's strong reasoning and momentum to not further complicate the data ecosystem by hanging on to DataValues. There are so many great developments going on between machine learning frameworks, core statistics, applied math and optimization techniques, and they've all adopted and moved forward with the Missing
approach. It would just be unfortunate if a subset of packages kept themselves isolated, needing to always wrap/unwrap themselves to go in/out w/ the rest of the ecosystem instead of "just working" out of the box.
I think sometimes we also get really caught up on temporary inconveniences and miss the forest for the trees; Jeff and I sat for maybe 20 minutes at the JuliaCon hackathon and were able to hammer out a bunch of concrete issues to improve various scenarios. I think it'd be much more productive for everyone to highlight the inconveniences and then come together to figure out solutions; having dug so deep into all of this over the last month or two, I really think it was the right choice to go w/ the Missing
representation, and I don't think it fundamentally limits things in any way. Do we need to keep evolving APIs, improving inference, and make things easier/simpler? Definitely. But I, for one, think it would be a huge win to move everything off of DataValue, avoid the need to maintain all these extra data structures, and collectively collaborate on a more cohesive data ecosystem, even if that means dealing with some short-term inconveniences.
If you have N elements there are 2^N possible types
I came to the opposite conclusion by the same reasoning, e.g.
using LightQuery: @name
using InteractiveUtils: @code_warntype
unstable() = rand(Bool) ? 1 : 1.0
test1() = (a = unstable(), b = unstable(), c = unstable())
test2() = @name (a = unstable(), b = unstable(), c = unstable())
@code_warntype test1()
@code_warntype test2()
Although maybe #32699 would solve that?
Well anyways once #32699 comes out I'll try switching LightQuery over to regular NamedTuples and see what happens to performance
If you have N elements there are 2^N possible types
I came to the opposite conclusion by the same reasoning, e.g.
The "2^N types" is not an inference issue; it's more fundamental. We have no way to efficiently store such a structure. We'd like to keep a bit for each element saying whether it's Int or Missing, but we can't because neither of the types Tuple{Int}
or Tuple{Missing}
provides for such a bit, and Tuple{Union{Int,Missing}}
is abstract so it doesn't have a layout at all.
We could fix that by having the outer structure (an array of tuples) store all the tuples inline, with N bits next to each specifying which tuple type it has. But we don't have that implemented, and it doesn't really seem worth it --- we would still only infer an abstract type when you index the array, and code_warntype would still be red.
We have no way to efficiently store such a structure
That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators
We have no way to efficiently store such a structure
That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators
I really think #32699 is the answer here; it should give inference just enough information to realize it can avoid materializing the temporary NamedTuple all together and efficiently store the computed values into output columns directly.
We have no way to efficiently store such a structure
This might not have been the best choice of word. As Jeff said later, we know how to efficiently store this structure without too much difficulty, the problem is that we can't compute the value of the NamedTuple.
That's not a problem for me; I store data column-wise in LightQuery. What is important is to get efficient iterators
Right; the problem can be avoided by never materializing the tuples. For example you can return a lazy Row{Tuple{Union{Int,Missing}}}
type that just contains the index of a row --- but that type is invariant! So I still don't see what work the covariant named tuples are doing here; maybe spell it out more?
The covariant named tuples came out of me experimenting with a way to write named tuples before NamedTuples came about, and realizing that my version of named tuples didn't suffer from #32699 . I thought it was because of covariance but I guess its just a missing optimization?
Reopen if no I guess
Missing case apart,
The structdiff
@ https://github.com/JuliaLang/julia/blob/v1.3.0-rc4/base/namedtuple.jl#L303 catch an instance or a type by following the generated function way. I wonder if there is a workaround playing with this to bring nametuple covariance.
My own case
@test NamedTuple{(:a,),Tuple{Int64}} <: NamedTuple{NS,T} where {NS,T<:Tuple}
@test_broken Type{NamedTuple{(:a,),Tuple{Int64}}} <: Type{NamedTuple{NS,T} where {NS,T<:Tuple}}
Edited: may be rather due to a Type
covariance issue than a namedtuple one
Most helpful comment
I really think #32699 is the answer here; it should give inference just enough information to realize it can avoid materializing the temporary NamedTuple all together and efficiently store the computed values into output columns directly.