Julia: Suboptimal inference for incompletely specified named tuples

Created on 9 Nov 2018 · 22Comments · Source: JuliaLang/julia

jullia> f(x) = NamedTuple{(:x,)}((x,))
julia> code_typed(f, Tuple{Any})
1-element Array{Any,1}:
 CodeInfo(
1 1 ─ %1 = (Core.tuple)(x)::Tuple{Any}                                      │
  │   %2 = (NamedTuple{(:x,),T} where T<:Tuple)(%1)::Any                    │
  └──      return %2                                                        │
) => Any

julia> code_typed(f, Tuple{Union{Int64, Float64}})
1-element Array{Any,1}:
 CodeInfo(
1 1 ─ %1 = (Core.tuple)(x)::Tuple{Union{Float64, Int64}}                    │
  │   %2 = (NamedTuple{(:x,),T} where T<:Tuple)(%1)::Any                    │
  └──      return %2                                                        │
) => Any

It would be nice if these could at least figure out that it's gonna be a NamedTuple{(:x,), <:Any}, or even if the second one distributed over the union.

inference

Source

Keno

👍1

Most helpful comment

It's indeed unfortunate because NamedTuple{(:x,), Tuple{Union{Int64, Float64}}} is such a useful, concrete type, with data stored inline, even though Tuple{Union{Int64, Float64}} isn't concrete.

Ok, since I've been thinking about this for the past two days straight, let me share some thoughts if only to get them out of my head onto proverbial github paper, warning that ideas may range from impossible to insane:

allow someway for a user to "forward declare" the type for a specific field, for a NamedTuple, something like (x::Union{Float64, Int64}=3.14,); or in a common case, we could do syntax like (x::Int?=4,) as short-hand for "make the x field of type Union{Int64, Missing}". we might need to be clever by having some kind of struct TypedValue{T, S}; x::S; end, so x::Int?=4 expanded to x=TypedValue{Union{Int, Missing}}(4), and then a NamedTuple constructor that unwrapped TypedValues when constructing; but maybe there are parser tricks that could avoid all that
add an entirely new builtin type, that would essentially be:

struct UnionValue{T, S}
    x::Union{T, S}
end

the idea here is to "first-class" the kind of field we have in struct Foo; x::Union{Float64, Int64}; end as its own, concrete, scalar type; the type would be UnionValue{Float64, Int64}, to be distinct from the abstract Union, but the "builtin"-ness would be somehow allowing codegen to do union-splitting on it's type like we have for Union{T, S}, so a runtime check of the value's actual type, then generated code branches for each type; it could have an inline layout for isbitsunion, and would be a box otherwise. Basically, a concrete Union value. One problem with this approach is might be too weird/impossible to allow having a first-class #undef for non-isbitsunion values unless we just don't allow that or something. The other wrinkle with making this idea actually useful would be having some way to, like the first suggestion above, to signal that a type should expect UnionType{T, S}. Also, I'm not sure if this could be made to scale beyond 2 parameters.

Maybe inference could be even more aggressive and instead of having:

%38 = Core.tuple(%23, %37)::Tuple{Int64,Union{Missing, Float64}}
%39 = (NamedTuple{(:x, :y),T} where T<:Tuple)(%38)::NamedTuple{(:x, :y),_A} where _A

it could do:

%38 = Core.tuple(%23, %37)::Tuple{Int64,Union{Missing, Float64}}
%39 = (NamedTuple{(:x, :y),Tuple{Int64, T} where T<:Union{Missing, Float64}}})(%38)::NamedTuple{(:x, :y),Tuple{Int64, T} where T<:Union{Missing, Float64}}}}

(and you thought I was done w/ crazy ideas??), adding a way to basically tell inference/compiler, "I'd like to take whatever you managed to infer and use it as a type parameter here. thank you", something in combination w/ the first example, like:

function f(x::NamedTuple)
    return (x=x.id + 1, y::?=x.salary * 2)
end
# or for Pair
function f(x::NamedTuple)
    return Pair{Int, ?}(x.id, x.salary * 2)
end

this gives a kind of SQL-parameter-binding feel w/ the use of ?, and allows the user to signal to the compiler that whatever inference came up w/ would actually be really useful, like "hey, I know Union types generally aren't useful or allowed to do this, but in this place specifically, it really would be". But, maybe getting too cozy w/ inference here is too big of a no-no; and maybe this somehow breaks the invariance rule too fundamentally.

Realistically, I think some way of allowing users to specify individual field types is probably the biggest bang for development buck; and maybe just coming up w/ an alternative API altogether that avoids the direct NamedTuple-to-NamedTuple projection and breaks the fields out individually before putting them back in the final NamedTuple.

quinnj on 28 Jun 2019

❤3

All 22 comments

This is due to the constructors being @generated. Currently inference doesn't look at fallback (non-generated) implementations for abstract argument types, but it could.

JeffBezanson on 10 Nov 2018

Yeah we're coming up against this one, too. It would be nice to infer the use Union{Missing, T} inside NamedTuples. xref https://github.com/FugroRoames/TypedTables.jl/issues/34.

I didn't follow your explanation, either. I would have guessed we would arrive at the Expr(:new, ...) line here (which should be inferrable), and the fallback branch wouldn't be invoked. (Or does generation get skipped if the signature contains tuples with Union inside?)

andyferris on 17 Nov 2018

OK, the reflection tools seem to be indicating that they consider the input types to be "abstract" or non-leaf types.

julia> code_lowered(NamedTuple{(:a, :b),Tuple{Int64,Union{Missing, Bool}}}, Tuple{Tuple{Int64, Union{Missing, Bool}}})
ERROR: Could not expand generator for `@generated` method (::Type{NamedTuple{names,T}})(args::T) where {names, T<:Tuple} in Core at boot.jl:539. This can happen if the provided argument types (Tuple{Tuple{Int64,Union{Missing, Bool}}}) are not leaf types, but the `generated` argument is `true`.
Stacktrace:
 [1] error(::String, ::Method, ::String, ::String, ::Type, ::String, ::String) at ./error.jl:42
 [2] (::getfield(Base, Symbol("##12#13")){Bool,DataType})(::Method) at ./reflection.jl:709
 [3] iterate at ./generator.jl:47 [inlined]
 [4] _collect(::Array{Union{Method, MethodInstance},1}, ::Base.Generator{Array{Union{Method, MethodInstance},1},getfield(Base, Symbol("##12#13")){Bool,DataType}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:640
 [5] collect_similar(::Array{Union{Method, MethodInstance},1}, ::Base.Generator{Array{Union{Method, MethodInstance},1},getfield(Base, Symbol("##12#13")){Bool,DataType}}) at ./array.jl:569
 [6] map(::Function, ::Array{Union{Method, MethodInstance},1}) at ./abstractarray.jl:2014
 [7] #code_lowered#11 at ./reflection.jl:704 [inlined]
 [8] code_lowered(::Any, ::Any) at ./reflection.jl:704
 [9] top-level scope at none:0

Would it be fair to say that this heuristic is a little outdated for simple Union types (and tuples thereof), where code can be very fast and inference and code generation can be useful?

andyferris on 17 Nov 2018

This is not a heuristic. We decided not to ever invoke generators on abstract types, because for that to be sound they have to be monotonic: you have to make sure that your generated code infers to a subtype of what we infer in any case where the arguments are a supertype. Of course that is possible, but in practice it is very difficult to ensure, and you get a segfault if you get it wrong.

JeffBezanson on 17 Nov 2018

❤1

Thanks for the explanation.

So what is the best way forward in this case? As far as I can tell, the main problem with NamedTuple methods will be with the constructors; unlike tuples, named tuples with union elements will actually be concrete types (so all those generated functions you wrote for merge and so-on work regardless). Interstingly... it seems to me that if we had named tuple being a regular struct with a tuple inside and getproperty overloading, the constructor wouldn't need to be generated at all (it would just call new with the input tuple). Is there something the current NamedTuple implementation can do that a struct couldn't do, nowadays?

And the other option was to enable inference on the if !@generated branch?

andyferris on 17 Nov 2018

I wonder if the following MWE is this issue: define functions

f(::Type{Union{Missing, T}}) where T <: Real = rand() < 0.5 ? zero(T) : missing
_fs(::Type{Tuple{}}) = ()
_fs(::Type{T}) where T <: Tuple = (f(Base.tuple_type_head(T)), _fs(Base.tuple_type_tail(T))...)
fs(::Type{NamedTuple{N,T}}) where {N,T} = NamedTuple{N,T}(_fs(T))

and apply as (output trimmed)

julia> VERSION
v"1.1.0-DEV.671"

julia> T = NamedTuple{(:a, :b), Tuple{Union{Missing, Int}, Union{Missing, Float64}}}
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}

julia> @code_warntype fs(T)
Body::Any
   24 ─ %66 = (Core.tuple)(%32, %64)::Tuple{Union{Missing, Int64},Union{Missing, Float64}}
   └───       goto #25                                   ││            
   25 ─ %68 = (NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}})(%66)::Any
   └───       return %68                                 │

Also, is the current workaround something like

fs_fixed(::Type{NamedTuple{N,T}}) where {N,T} = NamedTuple{N,T}(_fs(T))::NamedTuple{N,T}

tpapp on 17 Nov 2018

This is the best workaround I can come up with at short notice:

let
    for n in 1:32
        Ts = [Symbol("T$i") for i in 1:n]
        xs = [:(x[$i]) for i in 1:n]
        NT = :(Core.NamedTuple{names, Tuple{$(Ts...)}})
        eval(quote
            $NT(x::Tuple{$(Ts...)}) where {names, $(Ts...)} = $(Expr(:new, NT, xs...))
        end)
    end
end

It seems to work ok, in particular it fixes FugroRoames/TypedTables.jl#34. Of course if I need more than 32 elements AND Union types it won't work, and it's not exactly elegant.

Would a workaround PR like this be welcome, or should we think of a more fundamental fix? (Another approach I explored was using the inner constructor as a thunk that makes a simple wrapper around the tuple to make it concrete, ending up calling a generated function that is not an inner constructor but calls $(Expr(:new, ...)) anyway).

EDIT: Hmm... my "workaround" seems to fix inference but generated code is horrible.

andyferris on 19 Nov 2018

Let me just throw in another MWE that seems to be caused by this:

````julia
getfirstvals(a,b) = (a=a[1],b=b[1])
a=zeros(Union{Float64,Missing},10)
b=zeros(Union{Float64,Missing},10)

getfirstvals(a,b)
````
This does not infer as well and any fix or workaround would be welcome.

meggart on 19 Nov 2018

Fixed by @JeffBezanson in #30577.

Keno on 12 Feb 2019

So I've been doing some research around issues that affect common data processing tasks that include Union{T, Missing} column types and tracked one issue I'm seeing down to the fact that we still have:

julia> code_typed(f, Tuple{Union{Int64, Float64}})
1-element Array{Any,1}:
 CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Union{Float64, Int64}}
│   %2 = (NamedTuple{(:x,),T} where T<:Tuple)(%1)::NamedTuple{(:x,),_A} where _A
└──      return %2
) => NamedTuple{(:x,),_A} where _A

which, while better than when originally reported on this issue, is still suboptimal: NamedTuple{(:x,), Tuple{Union{Int64, Float64}}} should ideally be inferrable here, it's a concrete type with its data stored inline.

What would we need to allow this? It seems like a compiler/optimizer issue where we need to teach it how to recognize that this kind of scenario is preferable to infer. I'm happy to dive in if someone can given some guidance, or reproducible scenarios where this is an issue, or whatever I can do to help.

quinnj on 27 Jun 2019

At run time that will give either a NamedTuple of Int64 or of Float64, so inferring the type with the union would not be correct. This is a pretty fundamental problem.

JeffBezanson on 27 Jun 2019

The issue boils down to the fact that named tuples are invariant, unlike tuples:

julia> Tuple{Int64} <: Tuple{Number}
true

julia> NamedTuple{(:x,), Tuple{Int64}} <: NamedTuple{(:x,), Tuple{Number}}
false

But it seems like it should be possible to infer the tighter return type

NamedTuple{(:x,), Tuple{T}} where T <: Union{Int64,Float64}

which would be correct. Not arguing whether it's reasonable to expect this or not, but it would be a correct but tighter return type that could be inferred.

StefanKarpinski on 27 Jun 2019

Yeah, I'm not sure if that would actually be helpful for later dataflow analysis though, since then any extraction from the namedtuple would be ::Union{Int64, Float64}; for a single-element NamedTuple that's fine, and doing a union split would be nice, but the reality is with multiple fields that all have Unions, you quickly lose any benefit.

Thanks for the explanations everyone (including @vtjnash offline); I have a couple alternative approaches I'm going to try out as work-arounds for the issue here.

quinnj on 27 Jun 2019

@quinnj don't know if it helps, but in my specific use case I have been using this https://github.com/meggart/SentinelMissings.jl to get efficient iterable tables with missings. In my case missings are encoded through special values in my input data anyway, so simply wrapping them in a new number type helped making the table type stable. I know this is against the spirit of missings but may be useful for some use cases.

meggart on 27 Jun 2019

Thanks for the link @meggart! That looks like a clever little package. That would actually immediately allow resolution of a recent CSV.jl issue. I think there are definitely some ideas like that to help in these scenarios.

quinnj on 27 Jun 2019

Making NamedTuples covariant wouldn't magically fix this, since then NamedTuple{(:x,), Tuple{Union{Int64, Float64}}} would be an abstract type. What everybody wants is for this to return an invariant NamedTuple of that type, which appears to be the one thing it's not clear how to get...

JeffBezanson on 27 Jun 2019

It's indeed unfortunate because NamedTuple{(:x,), Tuple{Union{Int64, Float64}}} is such a useful, concrete type, with data stored inline, even though Tuple{Union{Int64, Float64}} isn't concrete.

allow someway for a user to "forward declare" the type for a specific field, for a NamedTuple, something like (x::Union{Float64, Int64}=3.14,); or in a common case, we could do syntax like (x::Int?=4,) as short-hand for "make the x field of type Union{Int64, Missing}". we might need to be clever by having some kind of struct TypedValue{T, S}; x::S; end, so x::Int?=4 expanded to x=TypedValue{Union{Int, Missing}}(4), and then a NamedTuple constructor that unwrapped TypedValues when constructing; but maybe there are parser tricks that could avoid all that
add an entirely new builtin type, that would essentially be:

struct UnionValue{T, S}
    x::Union{T, S}
end

Maybe inference could be even more aggressive and instead of having:

%38 = Core.tuple(%23, %37)::Tuple{Int64,Union{Missing, Float64}}
%39 = (NamedTuple{(:x, :y),T} where T<:Tuple)(%38)::NamedTuple{(:x, :y),_A} where _A

it could do:

%38 = Core.tuple(%23, %37)::Tuple{Int64,Union{Missing, Float64}}
%39 = (NamedTuple{(:x, :y),Tuple{Int64, T} where T<:Union{Missing, Float64}}})(%38)::NamedTuple{(:x, :y),Tuple{Int64, T} where T<:Union{Missing, Float64}}}}

(and you thought I was done w/ crazy ideas??), adding a way to basically tell inference/compiler, "I'd like to take whatever you managed to infer and use it as a type parameter here. thank you", something in combination w/ the first example, like:

function f(x::NamedTuple)
    return (x=x.id + 1, y::?=x.salary * 2)
end
# or for Pair
function f(x::NamedTuple)
    return Pair{Int, ?}(x.id, x.salary * 2)
end

quinnj on 28 Jun 2019

❤3

The TypedValue idea makes me think. Basically, we need some way to propagate the type of the field a value came from along with the value itself. A simple example would be

function map(f, nt::NT{T}) where T
    x = nt.field
    ismissing(x) && return nt
    val = f(x)
    NT{Union{typesubtract(T,typeof(x)),typeof(val)}}(val)
end

where NT is an ultra-simplified NamedTuple type with 1 field. In the case where a named tuple expression references a field of another object a.b, we could arrange to propagate fieldtype(typeof(a),:b) like this. Of course, if that's just done syntactically then it breaks if you assign a.b to a variable first, but this feels like it's going somewhere.

JeffBezanson on 28 Jun 2019

Yeah, this is certainly interesting. Do you imagine this would just be extra "metadata" propagated by the compiler during codegen? Or some actual type like:

struct TypedValue{T}
    x::T
    TypedValue(fieldtype, x) = new{fieldtype}(x)
end

that was automatically wrapped by the compiler and used internally, that was somehow treated as a "unwrapped" value when used?

quinnj on 30 Jun 2019

The main issue with the UnionValue idea, which is similar to what many static languages call sum types but which aren’t actually sum types because they require explicit wrapping and unwrapping, is precisely that—we would also require explicit wrapping and unwrapping, which makes them inconvenient. In that respect, Julia’s union types are close to true sum types, except, of course that they don’t enforce disjointness (although that’s rarely actually an issue, except for a theoretical standpoint). If setting/getting a field/slot of such a type always did automatic wrapping/unwrapping but the type was concrete—I’m not even sure that’s coherent—that feels like some of what’s wanted. To some extent this seems like it’s about hoisting or lowering the selector bits outwards or inwards as necessary. The trouble now seems to be that sometimes we have the bits on an individual field and want them on the containing structure, selecting between different concrete layouts, and vice versa. And of course, in those situations you want a guarantee that the layouts are compatible or it’s not all that helpful. Also just thinking out loud on this.

StefanKarpinski on 1 Jul 2019

Yeah, my hope w/ the UnionValue idea was that, during code generation, it would always participate in union-splitting + unwrapping in the respective union-split branch. I think for wrapping, we would also want to wrap the final result of each union-split branch in something like Jeff mentioned, like:

# something like
f(nt::NamedTuple{(:x,), Tuple{UnionValue{Int64, Missing}}}) = (x=nt.x + 1,)
# would expand in codegen to something like
function f(nt::NamedTuple{(:x,), Tuple{UnionValue{Int64, Missing}}})
    x = nt.x
    if x.value isa Missing
        y = UnionValue{Int64, Missing}(x.value + 1)
    else # x.value isa Int64
        y = UnionValue{Int64, Missing}(x.value + 1)
    end
    return (x=y,)
end

The part that then doesn't make a lot of sense in some ways is allowing x::UnionValue{Int64, Missing} to be a 1st-class value. It gets confusing because if I called sin(x), I'd want it to automatically compile:

function sin(x)
    if x.value isa Int64
        UnionValue{Int64, Missing}(sin(x.value))
    else
        UnionValue{Int64, Missing}(sin(x.value))
    end
end

but we don't really do/allow complete function call overloading like this. It makes me think that UnionValue really only makes sense as a fieldtype and use for codegen/union-splitting, which is basically what we have Union{Int, Missing} for already (i.e. it can be a fieldtype, but not a 1st-class type of a value, and we get union-splitting in codegen).

So maybe as an alternative to having UnionValue, we really want something like:

f(nt) = (x=?(nt.x) + 1,)

where ?(nt.x) is this little signal to the compiler that nt.x should be a "propagating" Union field, i.e. after doing union-split, the result would be wrapped in the Union{typesubtract(T,typeof(x)),typeof(val)} that Jeff mentioned above.

I'm still not sure that completely solves things, because we still want the NamedTuple constructor to be happy a potential incoming Union field, so maybe we'd need f(nt) = (x::?=nt.x + 1,) where the ::? on the field would expand to NamedTuple{(:x,), Tuple{Union{typesubtract(T, typeof(x)), typeof(val)}}}

quinnj on 1 Jul 2019

the type would be UnionValue{Float64, Int64}, to be distinct from the abstract Union, but the "builtin"-ness would be somehow allowing codegen to do union-splitting on it's type like we have for Union{T, S}

This is an interesting line of thought related to https://github.com/JuliaLang/julia/pull/32448#issuecomment-507010029: it would be neat from an ABI standpoint if we hoisted all selector bits outward. In this way arrays of nested structs-and-unions-of-bits could become ABI compatible with C structs and unions when the parallel array of selector data is ignored.

If we followed this rule, your UnionValue wouldn't need to be builtin because the rules would naturally say that the selector bits for the contained union would be hoisted out and stored separately.

To some extent this seems like it’s about hoisting or lowering the selector bits outwards or inwards as necessary.

Yes I think so. Do you have an example of where always hoisting the selectors out would be a bad thing and we'd prefer the selector bits to be stored with the data?