jullia> f(x) = NamedTuple{(:x,)}((x,))
julia> code_typed(f, Tuple{Any})
1-element Array{Any,1}:
CodeInfo(
1 1 ─ %1 = (Core.tuple)(x)::Tuple{Any} │
│ %2 = (NamedTuple{(:x,),T} where T<:Tuple)(%1)::Any │
└── return %2 │
) => Any
julia> code_typed(f, Tuple{Union{Int64, Float64}})
1-element Array{Any,1}:
CodeInfo(
1 1 ─ %1 = (Core.tuple)(x)::Tuple{Union{Float64, Int64}} │
│ %2 = (NamedTuple{(:x,),T} where T<:Tuple)(%1)::Any │
└── return %2 │
) => Any
It would be nice if these could at least figure out that it's gonna be a NamedTuple{(:x,), <:Any}
, or even if the second one distributed over the union.
This is due to the constructors being @generated
. Currently inference doesn't look at fallback (non-generated) implementations for abstract argument types, but it could.
Yeah we're coming up against this one, too. It would be nice to infer the use Union{Missing, T}
inside NamedTuple
s. xref https://github.com/FugroRoames/TypedTables.jl/issues/34.
I didn't follow your explanation, either. I would have guessed we would arrive at the Expr(:new, ...)
line here (which should be inferrable), and the fallback branch wouldn't be invoked. (Or does generation get skipped if the signature contains tuples with Union
inside?)
OK, the reflection tools seem to be indicating that they consider the input types to be "abstract" or non-leaf types.
julia> code_lowered(NamedTuple{(:a, :b),Tuple{Int64,Union{Missing, Bool}}}, Tuple{Tuple{Int64, Union{Missing, Bool}}})
ERROR: Could not expand generator for `@generated` method (::Type{NamedTuple{names,T}})(args::T) where {names, T<:Tuple} in Core at boot.jl:539. This can happen if the provided argument types (Tuple{Tuple{Int64,Union{Missing, Bool}}}) are not leaf types, but the `generated` argument is `true`.
Stacktrace:
[1] error(::String, ::Method, ::String, ::String, ::Type, ::String, ::String) at ./error.jl:42
[2] (::getfield(Base, Symbol("##12#13")){Bool,DataType})(::Method) at ./reflection.jl:709
[3] iterate at ./generator.jl:47 [inlined]
[4] _collect(::Array{Union{Method, MethodInstance},1}, ::Base.Generator{Array{Union{Method, MethodInstance},1},getfield(Base, Symbol("##12#13")){Bool,DataType}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:640
[5] collect_similar(::Array{Union{Method, MethodInstance},1}, ::Base.Generator{Array{Union{Method, MethodInstance},1},getfield(Base, Symbol("##12#13")){Bool,DataType}}) at ./array.jl:569
[6] map(::Function, ::Array{Union{Method, MethodInstance},1}) at ./abstractarray.jl:2014
[7] #code_lowered#11 at ./reflection.jl:704 [inlined]
[8] code_lowered(::Any, ::Any) at ./reflection.jl:704
[9] top-level scope at none:0
Would it be fair to say that this heuristic is a little outdated for simple Union
types (and tuples thereof), where code can be very fast and inference and code generation can be useful?
This is not a heuristic. We decided not to ever invoke generators on abstract types, because for that to be sound they have to be monotonic: you have to make sure that your generated code infers to a subtype of what we infer in any case where the arguments are a supertype. Of course that is possible, but in practice it is very difficult to ensure, and you get a segfault if you get it wrong.
Thanks for the explanation.
So what is the best way forward in this case? As far as I can tell, the main problem with NamedTuple
methods will be with the constructors; unlike tuples, named tuples with union elements will actually be concrete types (so all those generated functions you wrote for merge
and so-on work regardless). Interstingly... it seems to me that if we had named tuple being a regular struct
with a tuple inside and getproperty
overloading, the constructor wouldn't need to be generated at all (it would just call new
with the input tuple). Is there something the current NamedTuple
implementation can do that a struct
couldn't do, nowadays?
And the other option was to enable inference on the if !@generated
branch?
I wonder if the following MWE is this issue: define functions
f(::Type{Union{Missing, T}}) where T <: Real = rand() < 0.5 ? zero(T) : missing
_fs(::Type{Tuple{}}) = ()
_fs(::Type{T}) where T <: Tuple = (f(Base.tuple_type_head(T)), _fs(Base.tuple_type_tail(T))...)
fs(::Type{NamedTuple{N,T}}) where {N,T} = NamedTuple{N,T}(_fs(T))
and apply as (output trimmed)
julia> VERSION
v"1.1.0-DEV.671"
julia> T = NamedTuple{(:a, :b), Tuple{Union{Missing, Int}, Union{Missing, Float64}}}
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}
julia> @code_warntype fs(T)
Body::Any
24 ─ %66 = (Core.tuple)(%32, %64)::Tuple{Union{Missing, Int64},Union{Missing, Float64}}
└─── goto #25 ││
25 ─ %68 = (NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}})(%66)::Any
└─── return %68 │
Also, is the current workaround something like
fs_fixed(::Type{NamedTuple{N,T}}) where {N,T} = NamedTuple{N,T}(_fs(T))::NamedTuple{N,T}
This is the best workaround I can come up with at short notice:
let
for n in 1:32
Ts = [Symbol("T$i") for i in 1:n]
xs = [:(x[$i]) for i in 1:n]
NT = :(Core.NamedTuple{names, Tuple{$(Ts...)}})
eval(quote
$NT(x::Tuple{$(Ts...)}) where {names, $(Ts...)} = $(Expr(:new, NT, xs...))
end)
end
end
It seems to work ok, in particular it fixes FugroRoames/TypedTables.jl#34. Of course if I need more than 32 elements AND Union
types it won't work, and it's not exactly elegant.
Would a workaround PR like this be welcome, or should we think of a more fundamental fix? (Another approach I explored was using the inner constructor as a thunk that makes a simple wrapper around the tuple to make it concrete, ending up calling a generated function that is not an inner constructor but calls $(Expr(:new, ...))
anyway).
EDIT: Hmm... my "workaround" seems to fix inference but generated code is horrible.
Let me just throw in another MWE that seems to be caused by this:
````julia
getfirstvals(a,b) = (a=a[1],b=b[1])
a=zeros(Union{Float64,Missing},10)
b=zeros(Union{Float64,Missing},10)
getfirstvals(a,b)
````
This does not infer as well and any fix or workaround would be welcome.
Fixed by @JeffBezanson in #30577.
So I've been doing some research around issues that affect common data processing tasks that include Union{T, Missing}
column types and tracked one issue I'm seeing down to the fact that we still have:
julia> code_typed(f, Tuple{Union{Int64, Float64}})
1-element Array{Any,1}:
CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Union{Float64, Int64}}
│ %2 = (NamedTuple{(:x,),T} where T<:Tuple)(%1)::NamedTuple{(:x,),_A} where _A
└── return %2
) => NamedTuple{(:x,),_A} where _A
which, while better than when originally reported on this issue, is still suboptimal: NamedTuple{(:x,), Tuple{Union{Int64, Float64}}}
should ideally be inferrable here, it's a concrete type with its data stored inline.
What would we need to allow this? It seems like a compiler/optimizer issue where we need to teach it how to recognize that this kind of scenario is preferable to infer. I'm happy to dive in if someone can given some guidance, or reproducible scenarios where this is an issue, or whatever I can do to help.
At run time that will give either a NamedTuple of Int64 or of Float64, so inferring the type with the union would not be correct. This is a pretty fundamental problem.
The issue boils down to the fact that named tuples are invariant, unlike tuples:
julia> Tuple{Int64} <: Tuple{Number}
true
julia> NamedTuple{(:x,), Tuple{Int64}} <: NamedTuple{(:x,), Tuple{Number}}
false
But it seems like it should be possible to infer the tighter return type
NamedTuple{(:x,), Tuple{T}} where T <: Union{Int64,Float64}
which would be correct. Not arguing whether it's reasonable to expect this or not, but it would be a correct but tighter return type that could be inferred.
Yeah, I'm not sure if that would actually be helpful for later dataflow analysis though, since then any extraction from the namedtuple would be ::Union{Int64, Float64}
; for a single-element NamedTuple that's fine, and doing a union split would be nice, but the reality is with multiple fields that all have Unions, you quickly lose any benefit.
Thanks for the explanations everyone (including @vtjnash offline); I have a couple alternative approaches I'm going to try out as work-arounds for the issue here.
@quinnj don't know if it helps, but in my specific use case I have been using this https://github.com/meggart/SentinelMissings.jl to get efficient iterable tables with missings. In my case missings are encoded through special values in my input data anyway, so simply wrapping them in a new number type helped making the table type stable. I know this is against the spirit of missings but may be useful for some use cases.
Thanks for the link @meggart! That looks like a clever little package. That would actually immediately allow resolution of a recent CSV.jl issue. I think there are definitely some ideas like that to help in these scenarios.
Making NamedTuples covariant wouldn't magically fix this, since then NamedTuple{(:x,), Tuple{Union{Int64, Float64}}}
would be an abstract type. What everybody wants is for this to return an invariant NamedTuple of that type, which appears to be the one thing it's not clear how to get...
It's indeed unfortunate because NamedTuple{(:x,), Tuple{Union{Int64, Float64}}}
is such a useful, concrete type, with data stored inline, even though Tuple{Union{Int64, Float64}}
isn't concrete.
Ok, since I've been thinking about this for the past two days straight, let me share some thoughts if only to get them out of my head onto proverbial github paper, warning that ideas may range from impossible to insane:
(x::Union{Float64, Int64}=3.14,)
; or in a common case, we could do syntax like (x::Int?=4,)
as short-hand for "make the x
field of type Union{Int64, Missing}
". we might need to be clever by having some kind of struct TypedValue{T, S}; x::S; end
, so x::Int?=4
expanded to x=TypedValue{Union{Int, Missing}}(4)
, and then a NamedTuple constructor that unwrapped TypedValue
s when constructing; but maybe there are parser tricks that could avoid all thatstruct UnionValue{T, S}
x::Union{T, S}
end
the idea here is to "first-class" the kind of field we have in struct Foo; x::Union{Float64, Int64}; end
as its own, concrete, scalar type; the type would be UnionValue{Float64, Int64}
, to be distinct from the abstract Union
, but the "builtin"-ness would be somehow allowing codegen to do union-splitting on it's type like we have for Union{T, S}
, so a runtime check of the value's actual type, then generated code branches for each type; it could have an inline layout for isbitsunion
, and would be a box otherwise. Basically, a concrete Union
value. One problem with this approach is might be too weird/impossible to allow having a first-class #undef
for non-isbitsunion values unless we just don't allow that or something. The other wrinkle with making this idea actually useful would be having some way to, like the first suggestion above, to signal that a type should expect UnionType{T, S}
. Also, I'm not sure if this could be made to scale beyond 2 parameters.
%38 = Core.tuple(%23, %37)::Tuple{Int64,Union{Missing, Float64}}
%39 = (NamedTuple{(:x, :y),T} where T<:Tuple)(%38)::NamedTuple{(:x, :y),_A} where _A
it could do:
%38 = Core.tuple(%23, %37)::Tuple{Int64,Union{Missing, Float64}}
%39 = (NamedTuple{(:x, :y),Tuple{Int64, T} where T<:Union{Missing, Float64}}})(%38)::NamedTuple{(:x, :y),Tuple{Int64, T} where T<:Union{Missing, Float64}}}}
function f(x::NamedTuple)
return (x=x.id + 1, y::?=x.salary * 2)
end
# or for Pair
function f(x::NamedTuple)
return Pair{Int, ?}(x.id, x.salary * 2)
end
this gives a kind of SQL-parameter-binding feel w/ the use of ?
, and allows the user to signal to the compiler that whatever inference came up w/ would actually be really useful, like "hey, I know Union types generally aren't useful or allowed to do this, but in this place specifically, it really would be". But, maybe getting too cozy w/ inference here is too big of a no-no; and maybe this somehow breaks the invariance rule too fundamentally.
Realistically, I think some way of allowing users to specify individual field types is probably the biggest bang for development buck; and maybe just coming up w/ an alternative API altogether that avoids the direct NamedTuple-to-NamedTuple projection and breaks the fields out individually before putting them back in the final NamedTuple.
The TypedValue
idea makes me think. Basically, we need some way to propagate the type of the field a value came from along with the value itself. A simple example would be
function map(f, nt::NT{T}) where T
x = nt.field
ismissing(x) && return nt
val = f(x)
NT{Union{typesubtract(T,typeof(x)),typeof(val)}}(val)
end
where NT
is an ultra-simplified NamedTuple type with 1 field. In the case where a named tuple expression references a field of another object a.b
, we could arrange to propagate fieldtype(typeof(a),:b)
like this. Of course, if that's just done syntactically then it breaks if you assign a.b
to a variable first, but this feels like it's going somewhere.
Yeah, this is certainly interesting. Do you imagine this would just be extra "metadata" propagated by the compiler during codegen? Or some actual type like:
struct TypedValue{T}
x::T
TypedValue(fieldtype, x) = new{fieldtype}(x)
end
that was automatically wrapped by the compiler and used internally, that was somehow treated as a "unwrapped" value when used?
The main issue with the UnionValue idea, which is similar to what many static languages call sum types but which aren’t actually sum types because they require explicit wrapping and unwrapping, is precisely that—we would also require explicit wrapping and unwrapping, which makes them inconvenient. In that respect, Julia’s union types are close to true sum types, except, of course that they don’t enforce disjointness (although that’s rarely actually an issue, except for a theoretical standpoint). If setting/getting a field/slot of such a type always did automatic wrapping/unwrapping but the type was concrete—I’m not even sure that’s coherent—that feels like some of what’s wanted. To some extent this seems like it’s about hoisting or lowering the selector bits outwards or inwards as necessary. The trouble now seems to be that sometimes we have the bits on an individual field and want them on the containing structure, selecting between different concrete layouts, and vice versa. And of course, in those situations you want a guarantee that the layouts are compatible or it’s not all that helpful. Also just thinking out loud on this.
Yeah, my hope w/ the UnionValue
idea was that, during code generation, it would always participate in union-splitting + unwrapping in the respective union-split branch. I think for wrapping, we would also want to wrap the final result of each union-split branch in something like Jeff mentioned, like:
# something like
f(nt::NamedTuple{(:x,), Tuple{UnionValue{Int64, Missing}}}) = (x=nt.x + 1,)
# would expand in codegen to something like
function f(nt::NamedTuple{(:x,), Tuple{UnionValue{Int64, Missing}}})
x = nt.x
if x.value isa Missing
y = UnionValue{Int64, Missing}(x.value + 1)
else # x.value isa Int64
y = UnionValue{Int64, Missing}(x.value + 1)
end
return (x=y,)
end
The part that then doesn't make a lot of sense in some ways is allowing x::UnionValue{Int64, Missing}
to be a 1st-class value. It gets confusing because if I called sin(x)
, I'd want it to automatically compile:
function sin(x)
if x.value isa Int64
UnionValue{Int64, Missing}(sin(x.value))
else
UnionValue{Int64, Missing}(sin(x.value))
end
end
but we don't really do/allow complete function call overloading like this. It makes me think that UnionValue
really only makes sense as a fieldtype and use for codegen/union-splitting, which is basically what we have Union{Int, Missing}
for already (i.e. it can be a fieldtype, but not a 1st-class type of a value, and we get union-splitting in codegen).
So maybe as an alternative to having UnionValue
, we really want something like:
f(nt) = (x=?(nt.x) + 1,)
where ?(nt.x)
is this little signal to the compiler that nt.x
should be a "propagating" Union field, i.e. after doing union-split, the result would be wrapped in the Union{typesubtract(T,typeof(x)),typeof(val)}
that Jeff mentioned above.
I'm still not sure that completely solves things, because we still want the NamedTuple constructor to be happy a potential incoming Union field, so maybe we'd need f(nt) = (x::?=nt.x + 1,)
where the ::?
on the field would expand to NamedTuple{(:x,), Tuple{Union{typesubtract(T, typeof(x)), typeof(val)}}}
the type would be
UnionValue{Float64, Int64}
, to be distinct from the abstract Union, but the "builtin"-ness would be somehow allowing codegen to do union-splitting on it's type like we have forUnion{T, S}
This is an interesting line of thought related to https://github.com/JuliaLang/julia/pull/32448#issuecomment-507010029: it would be neat from an ABI standpoint if we hoisted all selector bits outward. In this way arrays of nested structs-and-unions-of-bits could become ABI compatible with C structs and unions when the parallel array of selector data is ignored.
If we followed this rule, your UnionValue
wouldn't need to be builtin because the rules would naturally say that the selector bits for the contained union would be hoisted out and stored separately.
To some extent this seems like it’s about hoisting or lowering the selector bits outwards or inwards as necessary.
Yes I think so. Do you have an example of where always hoisting the selectors out would be a bad thing and we'd prefer the selector bits to be stored with the data?
Most helpful comment
It's indeed unfortunate because
NamedTuple{(:x,), Tuple{Union{Int64, Float64}}}
is such a useful, concrete type, with data stored inline, even thoughTuple{Union{Int64, Float64}}
isn't concrete.Ok, since I've been thinking about this for the past two days straight, let me share some thoughts if only to get them out of my head onto proverbial github paper, warning that ideas may range from impossible to insane:
(x::Union{Float64, Int64}=3.14,)
; or in a common case, we could do syntax like(x::Int?=4,)
as short-hand for "make thex
field of typeUnion{Int64, Missing}
". we might need to be clever by having some kind ofstruct TypedValue{T, S}; x::S; end
, sox::Int?=4
expanded tox=TypedValue{Union{Int, Missing}}(4)
, and then a NamedTuple constructor that unwrappedTypedValue
s when constructing; but maybe there are parser tricks that could avoid all thatthe idea here is to "first-class" the kind of field we have in
struct Foo; x::Union{Float64, Int64}; end
as its own, concrete, scalar type; the type would beUnionValue{Float64, Int64}
, to be distinct from the abstractUnion
, but the "builtin"-ness would be somehow allowing codegen to do union-splitting on it's type like we have forUnion{T, S}
, so a runtime check of the value's actual type, then generated code branches for each type; it could have an inline layout forisbitsunion
, and would be a box otherwise. Basically, a concreteUnion
value. One problem with this approach is might be too weird/impossible to allow having a first-class#undef
for non-isbitsunion values unless we just don't allow that or something. The other wrinkle with making this idea actually useful would be having some way to, like the first suggestion above, to signal that a type should expectUnionType{T, S}
. Also, I'm not sure if this could be made to scale beyond 2 parameters.it could do:
this gives a kind of SQL-parameter-binding feel w/ the use of
?
, and allows the user to signal to the compiler that whatever inference came up w/ would actually be really useful, like "hey, I know Union types generally aren't useful or allowed to do this, but in this place specifically, it really would be". But, maybe getting too cozy w/ inference here is too big of a no-no; and maybe this somehow breaks the invariance rule too fundamentally.Realistically, I think some way of allowing users to specify individual field types is probably the biggest bang for development buck; and maybe just coming up w/ an alternative API altogether that avoids the direct NamedTuple-to-NamedTuple projection and breaks the fields out individually before putting them back in the final NamedTuple.