Julia: Allow `new` to accept keyword arguments for more control over incomplete initialization

Created on 24 Jul 2020 · 25Comments · Source: JuliaLang/julia

The Problem

Currently, when defining inner constructors, we can decide to incompletely initialize a structure simply by passing to new(...) fewer values than the number of fields:

struct Incomplete
    x1
    x2
    x3
    ...
    x719
    Incomplete(x1, x2, x3, x4) = new(x1, x2, x3, x4) # x5, ..., x719 are uninitialized
end

Since the call to new(...) accepts only positional arguments, this has the obvious limitation that we can only initialize a prefix of all the fields and leave the remaining ones uninitialized. It is impossible to initialize only x1, x42, x470 and x666.

The Proposal

My proposal is to allow new(...) to accept keyword arguments, so that the following code becomes legal:

struct Incomplete
    x1
    x2
    x3
    ...
    x719
    Incomplete(a, b, c, d) = new(x1=a, x42=b, x470=c, x666=d) # the other fields are uninitialized
end

Remarks

Motivation

It is just useful sometimes to be able to do this. See for instance this thread where I asked on the forum if there was a way to achieve this with the current state of the language. There are plenty of workarounds, with more or less significant downsides (complexity-wise and performance-wise), but no direct solution.

Backward compatibility

This should be :100:.

Downsides

I honestly don't see any.

Labels

design, feature, keyword arguments, types and dispatch

design feature keyword arguments types and dispatch

Source

FedericoStra

👍5 ❤1

Most helpful comment

The example in the manual about a recursive struct is cute, but I am not sure that this pattern occurs in the wild so much that it would be missed

Trees and linked-lists? Nodes have to refer to other nodes, but for a list the head and tail don't (yet) refer to anything. Likewise for the root and child nodes of trees. You might think you could circumvent this by using mutable structs and have the head/tail refer to themselves (so all fields are initialized), but you still have to temporarily leave a field uninitialized:

# A demo where the user never sees an object with an uninitialized field, but you still need to be able to construct one transiently

"""
    Node{T}(data)

Construct the head of the list.
"""
function Node{T}(data) where T
    head = new{T}(data)
    head.preceeding = head
    head.succeeding = head
    return head
end

"""
    Node(data, preceeding)

Add a new link after `preceeding`
"""
function Node(data, preceeding::Node{T}) where T
    node = new{T}(data, preceeding)
    preceeding.succeeding = node
    return node
end

These are pretty fundamental data structures, and need to be supported.

timholy on 27 Jul 2020

👍4

All 25 comments

Orthogonally to incomplete initialization, just having a form of new with keyword arguments would be kind of nice for composite types with a nontrivial amount of fields.

tpapp on 24 Jul 2020

👍3

Downsides

I honestly don't see any.

Complicates the parser and lowering passes of the front-end;
Compiler needs to be able to handle arbitrary subsets of fields being uninitialized.

StefanKarpinski on 24 Jul 2020

😄1 👍1

Complicates the parser and lowering passes of the front-end

How does it complicate the _parser_? The parser is already capable of parsing keyword arguments elsewhere. Is the new function special-cased in this regard? I'll try to dig into the code as soon as I find the time. I'm not familiar at all with the internals, but it seems strange to me that this proposal would require a big change to the parser.

Compiler needs to be able to handle arbitrary subsets of fields being uninitialized.

Does this mean that currently the compiler relies on the fact that only a suffix of the fields can be uninitialized? I don't see why the compiler should care about this. The implementation should be pretty straightforward:

keep what is already in memory for the fields which are "plain data",
possibly initialize to 0, NULL or whatever the fields which are references, or skip initialization altogether if possible.

Then the logic for accessing fields doesn't need to change at all. We don't have to keep track of which fields are initialized or not.

How does the compiler currently "handle" which fields are initialized? I think it just emits the asm to read them whenever they are accessed, right? There shouldn't be anything new to do w.r.t. the current implementation.

FedericoStra on 24 Jul 2020

It's probably not a huge complication, but it will certainly complicate some of the front-end-code. The new constructor isn't really a function, so it's handled a bit specially.

Regarding the compiler, it needs to defensively emit code that checks any ref fields that can be uninitialized to make sure they're not NULL pointers before accessing them. When accessing fields that are always initialized, this doesn't need to be done. Currently, which fields need to be checked and which don't can be tracked with a single number; if it becomes an arbitrary subset, a bitfield will be required instead.

I'm not arguing against this feature, just pointing out that it isn't without implementation and maintenance costs.

StefanKarpinski on 24 Jul 2020

👍3

I do think this feature makes sense. It's easy to implement for mutable structs, since we can just lower it to a series of setfield!s. For immutable we would probably need the "mutating immutables" feature (#11902) before we could do this.

JeffBezanson on 24 Jul 2020

@StefanKarpinski

Regarding the compiler, it needs to defensively emit code that checks any ref fields that can be uninitialized to make sure they're not NULL pointers before accessing them. When accessing fields that are always initialized, this doesn't need to be done. Currently, which fields need to be checked and which don't can be tracked with a single number; if it becomes an arbitrary subset, a bitfield will be required instead.

Now I completely get what you mean. Sorry :) Of course I agree with what you say.

I believe the counter you mention is ninitialized in julia.h

typedef struct _jl_datatype_t {
    /* ... */
    int32_t ninitialized;
    /* ... */
} jl_datatype_t;

right?

@JeffBezanson

For immutable we would probably need the "mutating immutables" feature (#11902) before we could do this.

I also thought about having temporary mutability for immutable structs inside inner constructors, but that seemed somewhat a bigger change to the language, therefore I limited myself to this proposal, which could potentially be implemented in an easier way than the much bigger rework of immutability.

It's easy to implement for mutable structs, since we can just lower it to a series of setfield!s.

That's also the case that I consider less interesting, because I could just do it myself by using an emptynew() in the inner constructor and then explicitly setting the fields that I want. The only downside is that those fields would probably not be marked as "always initialized", so accessing them requires a NULL check. Of course having the support for this feature would still be handy, but not as indispensable as it is for immutable structs, for which there is no possible workaround in userland.

FedericoStra on 24 Jul 2020

FWIW, I've always questioned the need for partially initialized immutable structs. I guess that it can be useful in some situations, but since you can't finish initializing them, it seems like a bit of a dicey feature to me.

StefanKarpinski on 24 Jul 2020

👍3

Well, maybe you don't need to "finish initializing" them; they are already finished as they are. They just carry redundant fields for homogeneity with other instances of the same type.

I'll reproduce here the example I gave on the forum. Imagine that you want to represent bounded/unbounded open/closed intervals with endpoints of type T. In C you would do

typedef struct _interval_t {
    T left;
    T right;
    uint8_t kind;
} interval_t;

where kind is a bitmask that indicates the kind of endpoints: there are 3 possibilities (closed, open, unbounded) for each endpoint, so you can use 2 bits for each.

When you need to represent the interval [42, ∞), you just don't assign to the member right (and of course never read it afterwards!). The produced struct has an uninitialized member, but it is in no way "unfinished". It's like this just to blend in with her friends (-∞, 13] and [0, 1].

FedericoStra on 24 Jul 2020

In that case, if T is a reference type you're better off setting the unused field to some value anyway, so we don't have to check fields on access.

JeffBezanson on 24 Jul 2020

True, but it isn't straightforward to produce "some value" of type T in a generic fashion. One would need to force library users to define a

default(::Type{T}) = ...?...

for their types if they want to use them in the structure I provide them.

Also, either you always return exactly the same instance, or you'll have some allocations.

default(::Type{BigInt}) = BigInt(0)

would be bad. One has to do

const BIGINT_DEFAULT = BigInt(0)
default(::Type{BigInt}) = BIGINT_DEFAULT

This could be seen as a possible tradeoff once the two approaches are available:

either create incompletely initialized structs, which require no cooperation on the part of the library user, but require NULL checks;
or arbitrarily initialize non plain data fields, possibly requiring some cooperation on the part of the library user, but leading to more efficient generated code because it avoids NULL checks.

FedericoStra on 24 Jul 2020

@FedericoStra: as I suggested on the forum, I think that Union types are the perfect solution for your original problem.

Generally, incomplete initialization may be a vestigial feature of Julia since Union became efficient. I don't think that relying on it to effectively emulate NULL in other languages should be encouraged, since we have Union types which was developed for precisely this.

tpapp on 25 Jul 2020

@tpapp: You know what? I think if you keep telling me you'll probably convince me in the end! I might be a bit stubborn, but not to the point of being completely unreasonable :stuck_out_tongue_closed_eyes:

So here is my idea: I plan to have a look at builtins.c, cgutils.c, codegen.cpp, datatype.c, and jltypes.c, the places where ninitialized is mainly used. This is an interesting study for me, so even if I don't conclude anything, it'll be an instructive learning experience. I'll start doing this on monday, and I'll try to keep posting here some notes documenting my findings and what needs to be changed. I wish you all a good weekend!

FedericoStra on 25 Jul 2020

The main difference between using a union with nothing and undef fields is that accessing an undef field is an immediate error whereas nothing is a first class value that you can access, pass to functions, define methods for, etc. Using a union with nothing is also more generic since not all types of fields can be undef, so if you're relying on the "error on access" feature of undef fields then you have to be careful of what type of field you have—it must be a reference type. The union approach seems generally more useful and usable to me, but that is of course a judgement call.

StefanKarpinski on 26 Jul 2020

👍2

The downside of the Union approach is that every field keeps track of its own state of definedness, so every access is always checked. I feel like a benefit of the undef approach is when the field is "plain data" (hence no NULL check) and the state of definedness is kept implicitly by other fields, or it is even determined outside of the struct by its usage. Imagine having

struct P
    x::Int
    y::Int
end

and then knowing that in a collection Vector{P} every element has both x and y, except for the last which has only y. This is a knowledge that is kept outside of the struct, it is given by the problem at hand, and the undef approach allows you to avoid any checks upon access.

I realize that this is a quite specific usage, so the Union approach is definitely generally more useful, but this necessity can come up and the undef alternative allows some (possibly tiny) performance improvement.

I would never suggest relying on the "error on access" feature of ref fields. I'm more for the "look before you leap" than the "it’s easier to ask for forgiveness than permission" ideology (when I'm outside of Python).

FedericoStra on 26 Jul 2020

I plan to have a look at builtins.c, cgutils.c, codegen.cpp, datatype.c, and jltypes.c, the places where ninitialized is mainly used.

Of course do so if you're interested, but I don't think ninitialized is the limiting factor here. We could very well implement this without changing it, and still just set it to whatever prefix of fields is always initialized. In particular, this won't cause regressions in any existing code, so generalizing ninitialized would not be urgent.

JeffBezanson on 27 Jul 2020

@FedericoStra:

downside of the Union approach is that every field keeps track of its own state of definedness
[...]
knowing that in a collection Vector{P} every element has both x and y, except for the last which has only y.

What about

struct P
    xy::Union{Tuple{Int,Int},Tuple{Int,Nothing}}
end

with a convenient interface via Base.getproperty?

tpapp on 27 Jul 2020

Generally, the more I think about incomplete initialization in the post-Union-optimized era of Julia, the less the use case makes sense, especially in a style which favors immutables and generic code. The example in the manual about a recursive struct is cute, but I am not sure that this pattern occurs in the wild so much that it would be missed. Maybe in 2.0 the whole feature could be removed, or at least the language should enforce that no uninitialized fields can escape the constructor.

If there are examples of incomplete initialization used in the wild, I would appreciate some links.

tpapp on 27 Jul 2020

Please don't remove incomplete initialization for no reason. There are a lot of applications outside of number crunching, where immutable types are not an option and incomplete initialization and delayed computation of fields is important. Also, Unions are not a solution to every problem, in particular in view of the recent changes towards reducing compiler latency.

thofma on 27 Jul 2020

👍2

There are a lot of applications

Just to clarify: I am not (yet) suggesting that incomplete initialization is removed (in any case, this is a breaking change so the earliest this could happen is 2.0), but I think that real-world use cases would still be very interesting.

tpapp on 27 Jul 2020

The example in the manual about a recursive struct is cute, but I am not sure that this pattern occurs in the wild so much that it would be missed

# A demo where the user never sees an object with an uninitialized field, but you still need to be able to construct one transiently

"""
    Node{T}(data)

Construct the head of the list.
"""
function Node{T}(data) where T
    head = new{T}(data)
    head.preceeding = head
    head.succeeding = head
    return head
end

"""
    Node(data, preceeding)

Add a new link after `preceeding`
"""
function Node(data, preceeding::Node{T}) where T
    node = new{T}(data, preceeding)
    preceeding.succeeding = node
    return node
end

These are pretty fundamental data structures, and need to be supported.

timholy on 27 Jul 2020

👍4

@JeffBezanson

Of course do so if you're interested, but I don't think ninitialized is the limiting factor here. We could very well implement this without changing it, and still just set it to whatever prefix of fields is always initialized. In particular, this won't cause regressions in any existing code, so generalizing ninitialized would not be urgent.

Sorry, maybe what I wrote was unclear. My intention is of course to retain ninitialized with exactly the same meaning that it currently has: the length of the longest prefix of fields that are always initialized by every inner constructor. What I meant is that I would probably add a member

typedef struct _jl_datatype_t {
    /* ... */
    jl_svec_t *types;
    jl_svec_t *names;
    jl_svec_t *alwaysinit; /* this is the new member */
    int32_t ninitialized;
    /* ... */
} jl_datatype_t;

which is a simple vector of jl_bool_types keeping track of whether each individual field is always initialized by every inner constructor.

This way ninitialized would still be available and nothing should break. What I meant is that I was looking for all the call sites of jl_new_datatype, which need to be updated to pass also alwaysinit.

FedericoStra on 27 Jul 2020

@timholy: thanks for the example. I agree that mutable struct is indeed a relevant use case. Note that for this, incomplete initialization of tail fields works fine, even though it has no dedicated syntax:

julia> mutable struct Incomplete
       a
       b
       Incomplete(b) = (result = new(); result.b = b; result)
       end

julia> Incomplete(2)
Incomplete(#undef, 2)

If a keyword-based syntax is desired, a macro simple should be able to take care of this just fine.

As for immutable structs, I still think that leaving fields uninitialized is a vestigial feature after Union optimizations, and there is no compelling reason for it to be supported (this, of course, is a question for 2.0). I may be mistaken of course, in which case I would appreciate examples.

tpapp on 28 Jul 2020

@tpapp I believe I've already given elsewhere some compelling examples where immutable structs with uninitialized fields can be of use. I'll try to repeat here the main advantage they can provide from my point of view.

They allow to work with homogeneous collections of objects with efficient access (no isa Nothing check) of "plain data" fields in situations where the state of initialization is not necessarily kept by some internal flags (checking which would probably be analogous to isa Nothing) but rather by some external invariants (for instance, being the first or last element in a list).

FedericoStra on 28 Jul 2020

In the fullness of time (i.e. for Julia 2.0), I think we should consider eliminating uninitialized fields entirely. This could be done with a combination of union types and field defaults. For example, we could only allow fields to not be explicitly given a value if they have a default and then double down on the convention of using nothing as the default for fields that may need to be uninitialized for a bit. This need not change the memory representation of values at all since it's strightforward to represent a field of type Union{List, Nothing} as a single pointer with NULL indicating that the value is the singleton nothing rather than actually storing a pointer to a nothing value, although, that's a constant as well, so maybe there's no real point.

The only hesitation I have about that is that it forces the type of a field to be a union even if the field never escapes the constructor uninitialized. So an intermediate solution would be to somehow enforce that undef fields get defined by the time an object is fully constructed. Of course, that leaves the question of when the object is considered fully constructed. One potential definition is when it is returned from the method in which new is called, so we would enforce that it have all fields assigned by the time it is returned from that function body and then not have to worry about uninitialized fields in any other code.

In any case, I think this is mostly tangential to this issue. None that clashes with this feature that I can see.

StefanKarpinski on 28 Jul 2020

Another avenue to explore is to have an explicit unconstructable token. Union{List, Never} can thus be used to represent a field that is set to either List or not set (as the Never type would be known to have no instances). This makes Never similar to Union{} in the (un)constructable sense, without also defining it to be a subtype of all other types.