Zig: allow integer types to be any range

Created on 29 Nov 2019 · 22 Comments · Source: ziglang/zig

Zig already has ranged integer types; however, the range is required to be a signed or unsigned power of 2. This proposal is for generalizing them further, to allow any arbitrary range.

comptime {
    assert(i32 == @Int(-1 << 31, 1 << 31));
    assert(u32 == @Int(0, 1 << 32));
    assert(u0 == @Int(0, 1));
    assert(noreturn == @Int(0, 0));
}

Let's consider some reasons to do this:

One common practice for C developers is to use -1 or UINT32_MAX (and related) constants as an in-band indicator of metadata. For example, the stage1 compiler uses a size_t field to indicate the ABI size of a type, but the value SIZE_MAX is used to indicate that the size is not yet computed.
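A minimal sketch of what that sentinel pattern looks like in today's Zig (the struct and field names here are made up for illustration):

const std = @import("std");

const TypeEntry = struct {
    // std.math.maxInt(usize) plays the role of SIZE_MAX: "not yet computed".
    abi_size: usize,
};

fn abiSizeIsComputed(t: TypeEntry) bool {
    return t.abi_size != std.math.maxInt(usize);
}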

In Zig we want people to use Optionals for this, but there's a catch: the in-band special value uses less memory for the type. In Zig on 64-bit targets, @sizeOf(usize) == 8 and @sizeOf(?usize) == 16. That's a huge cost to pay for something that could take up 0 bits of information, if you are willing to give up a single value inside the range of a usize.
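That cost is easy to demonstrate on a 64-bit target (a small sketch; the exact layout of optionals is an implementation detail):

const std = @import("std");
const assert = std.debug.assert;

comptime {
    assert(@sizeOf(usize) == 8);
    // The is-null flag needs its own ABI-aligned storage, doubling the size.
    assert(@sizeOf(?usize) == 16);
}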

With ranged integers, this could be made type-safe:

const AbiSize = @Int(0, (1 << usize.bit_count) - 1);
const MyType = struct {
    abi_size: ?AbiSize,
};
var my_type: MyType = undefined;

test "catching a bug" {
    var other_thing: usize = 1234;

    my_type.abi_size = other_thing; // error: expected @Int(0, 18446744073709551615), found usize
}

Now, not only do we have the Optionals feature of zig protecting against accidentally using a very large integer when it is supposed to indicate null, but we also have the compile error helping out with range checks. One can choose to deal with the larger ranged value by handling the possibility, and returning an error, or with @intCast, which inserts a handy safety check.
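For example, the buggy assignment above could be made to compile either by handling the out-of-range value or by narrowing with @intCast (a hypothetical sketch building on the declarations above; the error name is illustrative):

const std = @import("std");

fn setAbiSize(t: *MyType, size: usize) !void {
    if (size == std.math.maxInt(usize)) return error.SizeOutOfRange;
    // @intCast narrows usize into @Int(0, maxInt(usize)) and, under this
    // proposal, would insert a runtime safety check on the range.
    t.abi_size = @intCast(AbiSize, size);
}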

How about if there are 2 special values rather than 1?

const N = union(enum) {
    special1,
    special2,
    normal: @Int(0, (1 << 32) - 2),
};

Here, the size of N would be 4 bytes.


Let's consider another example, with enums.

Enums allow defining a set of possible values for a type:

const E = enum {
    one,
    two,
    three,
};

There are 3 possible values of this type, so Zig chooses to use u2 for the tag type. It will require 1 byte to represent it, wasting 6 bits. If you wrap it in an optional, that will be 16 bits to represent something that, according to information theory, requires only 2 bits. And Zig's hands are tied; because currently each field requires ABI alignment, each byte is necessary.
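Concretely, under the status quo:

const std = @import("std");
const assert = std.debug.assert;

comptime {
    assert(@sizeOf(E) == 1); // 3 values need only a u2 tag, but storage is a full byte
    assert(@sizeOf(?E) == 2); // the is-null flag costs another ABI-aligned byte
}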

If #3802 is accepted and implemented, and the is_null bit of optionals becomes align(0), then ?E can remain 1 byte, and ?E in a struct with align(0) will take up 3 bits.

However, consider if the enum was allowed to choose a ranged integer type. It would choose @Int(0, 3). Wrapped in an optional, it actually could choose to use the integer value 3 as the is_null bit. Then ?E in a struct will take up 2 bits.

Again, assuming #3802 is implemented, Zig would even be able to "flatten" several enums into the same integer:

const Mode = enum { // 2-bit tag type
    Debug,
    ReleaseSafe,
    ReleaseFast,
    ReleaseSmall
};
const Endian = enum { // 1-bit tag type
    big,
    little,
};
pub const AtomicOrder = enum { // 3-bit tag type
    Unordered,
    Monotonic,
    Acquire,
    Release,
    AcqRel,
    SeqCst,
};
pub const AtomicRmwOp = enum { // 4-bit tag type
    Xchg,
    Add,
    Sub,
    And,
    Nand,
    Or,
    Xor,
    Max,
    Min,
};
const MyFancyType = struct {
    mode: Mode align(0),
    endian: Endian align(0),
    atomic_order: AtomicOrder align(0),
    op: AtomicRmwOp align(0),
};

If you add up all the bits of the tag types, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tag values here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.


This proposal would solve #747. Peer type resolution of comptime ints would produce a ranged integer:

export fn foo(b: bool) void {
    // master branch: error: cannot store runtime value in type 'comptime_int'
    const x = if (b) -10 else 100;
    // proposal: @typeOf(x) == @Int(-10, 101)
}

With optional pointers, Zig has an optimization to use the zero address as the null value. The allowzero property can be used to indicate that the address 0 is valid. This is effectively treating the address as a ranged integer type! This optimization for optional pointers could now be described in userland types:

const PointerAddress = @Int(1, 1 << usize.bit_count);
const Pointer = ?PointerAddress;

comptime {
    assert(@sizeOf(PointerAddress) == @sizeOf(usize));
    assert(@sizeOf(Pointer) == @sizeOf(usize));
}

One possible extension to this proposal would be to allow pointer types to override the address integer type. Rather than allowzero which is single purpose, they could do something like this:

comptime {
    assert(*addrtype(usize) i32 == *allowzero i32);
    assert(*addrtype(@Int(1, 1 << usize.bit_count)) i32 == *i32);
}

This would also introduce type safety when using more than just 0x0 as a special pointer address, which is perfectly acceptable on most hosted operating systems and is typically set up the same way in freestanding environments. The entire first page of memory is usually unmapped, and the virtual address space is often limited to 48 bits, making @Int(os.page_size, 1 << 48) a good default address type for pointers on many targets! Combined with the alignment bits that pointers already have to play with, this would give Zig's type system the ability to pack a lot of data into pointers annotated with align(0).
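As a hedged sketch of what that could enable (assuming this proposal plus #3802; the names and bounds are illustrative):

// Addresses below the first page and at or above 1 << 48 are never valid,
// so they are free for the optional/null optimization, and the pointee's
// alignment frees the low bits for neighboring align(0) fields.
const UserAddress = @Int(os.page_size, 1 << 48);
const PackedRef = struct {
    addr: UserAddress align(0),
    tag: u3 align(0), // fits in the alignment bits of an 8-byte-aligned pointee
};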


What about two's complement wrapping math operations? Two's complement wrapping only works on power-of-two integer types, so wrapping math operations on non-power-of-two integer types would not be allowed: compile error.
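For example (the error message is hypothetical):

test "wrapping math requires a power-of-two range" {
    var x: @Int(0, 100) = 50;
    x +%= 60; // error: wrapping arithmetic requires a power-of-two integer type
}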

breaking proposal

All 22 comments

I'm worried this might result in a code explosion as different types are passed to a generic.

Idea: input to a generic could be "marked" for which properties are used.
e.g. if I have a pure-storage generic, like ArrayList, then the range of the members doesn't really matter for the implementation of that generic: the only property it cares about at runtime is the size of the type, the rest is only useful at comptime.

@andrewrk I cannot understand how you got the required size of the MyFancyType struct as 'less than one byte ... leaving room for 235 more tags'.

Since a struct is a product type, MyFancyType has a total of 4 * 2 * 6 * 9 = 432 possible states, requiring at least ⌈log₂(432)⌉ = 9 bits to store them.

@Rocknest my math is wrong, thanks for the correction :+1:

wouldn't it be @Type(TypeInfo.Int{...}) where TypeInfo.Int looks like:

pub const Int = struct {
    is_signed: bool,
    bits: comptime_int,
    from: comptime_int,
    to: comptime_int,
};

because of the @Type builtin?

Yes except only from and to are needed - the rest is redundant information. I'd call it start and end to hint that the upper bound is exclusive. Good point, I forgot that @Int is deprecated.

@Int(0, 1) == @Type(.{ .Int = .{ .start = 0, .end = 1}})

@Type is a bit more verbose though, for such a fundamental type as an integer. I'm not 100% sold on @Type yet.

should this be marked a breaking proposal as well?

Yes except only from and to are needed - the rest is redundant information.

You still need the bits. e.g. I might have an API that takes a usize but only allows values between 1 and 4096

  • 1 is >= 0, so it's unsigned
  • log2(4096) = 12, so it's 12 bits

You're thinking of a different, already planned issue, which is zig keeping track of additional information beyond the type, for expressions and const values, which will allow, for example, this test to pass:

test "runtime value hints" {
    var a: u64 = 99999;
    // `a` is runtime known because it is `var`.
    const b: u64 = a % 256;
    const c: u8 = b; // allowed without @intCast because `b` has a runtime value hint
}

I don't think so. Take for example the WebSocket protocol: the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.

I don't think so. Take for example the WebSocket protocol: the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.

That sounds like a usecase for a non-exhaustive enum:

enum(u4) {
    Continuation = 0,
    Text = 1,
    Binary = 2,
    Close = 8,
    Ping = 9,
    Pong = 10,
    _,
}

This is very nice and fancy, but I would like to note that it might have a negative effect on performance. It is definitely more efficient memory-wise, so for storing data (on disk) this is perfect. However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back.

Just something to think about.

However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back.

This is not correct, except with align(0) fields (see #3802). Zig already has arbitrary sized integers. If you use i3, for example, @sizeOf(i3) == 1 because @alignOf(i3) == 1.

Also in many cases if you were to use align(0) fields, the smaller memory use on the cache lines would cause more performance gains than lost to the extra bit shifting.

Yes, sorry I was thinking about align(0) cases. For the other cases this proposal only adds range checks, right?

Also in many cases if you were to use align(0) fields, the smaller memory use on the cache lines would cause more performance gains than lost to the extra bit shifting.

I'm a bit sceptical about this, as I've had experience of the opposite. It probably depends a lot on the circumstances, which is why I don't think a compiler should be sneaky and perform bit fiddling behind your back. However, if it just applies to align(0), it is something you ask for, so it's fine.

Somehow I feel though that this is more something to do with optionals and other constructs which have more information than arbitrary numbers and enums, right?

It's more like in this example:

const N = union(enum) {
    special1,
    special2,
    normal: @Int(0, (1 << 32) - 2),
};

Where each case has a special number or range of numbers, such as:

const N = union(enum) {
    special1 = max32,
    special2 = max32 -1,
    normal = 0..max32 - 2,
};

And I don't think it applies to your other example, where you have four seemingly independent enums:

const MyFancyType = struct {
    mode: Mode align(0),
    endian: Endian align(0),
    atomic_order: AtomicOrder align(0),
    op: AtomicRmwOp align(0),
};

If you add up all the bits of the tag types, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tag values here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.

Aren't there 4 * 2 * 6 * 9 = 432 possible combinations of those four enums? Which means you still need 9 bits (or two bytes)?

In general I do think it would be great if the optionals can work more generically. Not having to write all the -1 checks would be cool. Especially since sometimes people work with indices instead of pointers and the special value trick is quite common.

Note that integer ranges have a long history in some other languages like Pascal and Ada. There they are heavily used and can be used as ranges for arrays. This leads to some nice code where you can use ranged for loops without any possibility of going out of bounds.
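In terms of this proposal, that pattern might look roughly like this (a sketch, assuming the @Int builtin; the bounds are illustrative):

const Index = @Int(0, 16); // exactly the valid indices of the array below
var table: [16]u8 = undefined;

fn get(i: Index) u8 {
    return table[i]; // in bounds by construction, so no runtime check is needed
}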

It has been decades since I used Pascal or Ada in anger but even back then I seem to remember that most run time checks could be dropped as the compiler could prove that the use was safe. A very naive compiler would do the checks all the time, but existing compilers show that the possible overhead is quite low. As long as the programmer is aware (such as choosing different arithmetic with overflow wrapping) that this could have a cost, I think this is fine.

The driving case here: more compact representation of enums and optionals is a very strong one IMHO. Having just optionals use the same space as non-optional types would be a huge argument to get programmers to use them. The cache argument is a strong one too. A cache miss can cost 25-50 clock cycles (if you hit L2, worse if you fall all the way to RAM) and you can run a lot of bit twiddling code in that time.

I like the core of this proposal, but there are a few things I'm worried about.

First, I feel like the optimizations are too implicit. You have to hope that you left the correct number of values open. If you do it wrong, there is no indication that you have messed up, you just have bigger types than you expected. As code changes over time, it's easy for additional values to be added to enums that are sensitive to this, which may break these optimizations without warning.
If I'm writing Zig and my goal is to remove the overhead of the optional, I would want a guarantee from the compiler that my code has received the optimization, or a compile error. The original suggestion is difficult to think about when optional optimization is your end goal, and it fails silently when the optimization can't be applied.

I'm worried that, because it's so implicit, general zig advice may become "put explicit bounds on every integer type you use, so that the compiler can optimize it". This could create a lot of additional cognitive load when reading and writing zig code. Does ?@Int(0, 262144) receive the optimization? It doesn't, but it's not immediately apparent unless you happen to know off the top of your head that 262144 is 1<<18.

It also seems to me like working with these types would in many cases require @intCast peppered all over the place. If you want to iterate over the range, you need to be able to test when you've gone past the end. Which means you need to use a larger integer type for the iterator and cast it inside the loop. If you want to do math between ranged ints and store the result in a ranged int, you will very probably need a range cast. Those casts can harden the code, making it more difficult to change the ranges if your requirements change. They also have the potential to make generic code that uses ranged ints very difficult to write correctly. And if you get the range wrong, it can't be a compile error because it's a cast. It can only manifest as checked UB. How much mental bandwidth are you willing to spend on calculating the ranges of your intermediate values?
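A sketch of the iteration problem being described, assuming the @Int builtin from the proposal:

const Small = @Int(0, 300);

fn sum() u32 {
    var total: u32 = 0;
    // The counter must be a wider type so it can legally reach 300,
    // and every use of it as a Small needs a cast back into the range.
    var i: u16 = 0;
    while (i < 300) : (i += 1) {
        const v: Small = @intCast(Small, i); // cast back into range on every iteration
        total += v;
    }
    return total;
}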

Finally, while applying range information to type resolution would solve #747, it also introduces a problem when assigning directly. var x = 4; would now compile, but set the type of x to @Int(4, 5), which is not a very useful type for a var. This would then move the compile failure to the line where x is modified, with a pretty cryptic error message (or worse, checked UB, depending on how arithmetic between ranged types resolves).

There is a precedent for an opaque optional optimization in Zig: ?*. This generates a sentinel value zero, and I think we can all agree that this is a Good Idea. Where I think the above proposal differs is that this new optimization is dependent on picking range constants that obey properties that are not immediately obvious. I think that we could go a long way with a simpler guarantee like "If there are unused bits in an integer type in a struct, an optional of that integer type is guaranteed to put a valid flag in the unused bits.". This is mechanically the same as making the flag in optionals align(0), but (at least for me) is a much easier wording to load into Brain RAM and work with. We could then add explicit mechanisms to trigger the other optimizations suggested above if you need them.

Here are some off-the-cuff suggestions. There's definitely room for improvement for these, but I think they do a good job of showing the direction I'm looking.

Range-checked Explicit Integers

These are the same as the original proposal, but the integer bit-width is specified by the programmer. For example, @RangedInt(u32, 5, (1 << 32) - 1) or @RangedInt(comptime_int, 10, 101). If the range cannot be represented by the underlying type, it is a compile error. This provides a path for type resolution of arithmetic on ranged integers that won't introduce surprises if ranges change. The type resolution of addition of these values is the result type of addition of the explicit integer types of the two operands, restricted to the maximum range of addition of the two operands, saturated to the maximum representable range of the resolved bit type. For convenience, we could supply a library function SmallestRangedInt(comptime min_value: comptime_int, comptime exclusive_max_value: comptime_int) type. But I don't think an inferred type should be the language default.
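An illustration of that resolution rule (hypothetical types and values):

const assert = @import("std").debug.assert;

const A = @RangedInt(u8, 10, 20); // values 10..19
const B = @RangedInt(u8, 5, 10);  // values 5..9

comptime {
    var a: A = 12;
    var b: B = 7;
    // u8 + u8 resolves to u8; the sum ranges over 15..28, which still fits,
    // so @typeOf(a + b) == @RangedInt(u8, 15, 29). If the combined range
    // exceeded u8, it would saturate to @RangedInt(u8, 0, 256) rather than widen.
    assert(@typeOf(a + b) == @RangedInt(u8, 15, 29));
}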

Integer Unions:

An integer union (or alternatively, "partition" or "ranged enum") would be a union of non-overlapping ranged ints. It is guaranteed to be the size of its backing integer type.

const Foo = partition(u32) {
    normal = 0..(1 << 32) - 2,
    special1 = (1 << 32) - 2,
    special2 = (1 << 32) - 1,
};

This is very similar to the union example in the original proposal, but with this if the range is specified incorrectly or a new value is added, there would be a compile error instead of a drastic change in representation. If the integer type is not specified, it would be inferred from the ranges as the smallest type that can represent the range. But the ability to specify the bit width and receive an error if it can't be represented is valuable. Ranged unions could also support the '_' option from extensible unions.

Explicit Sentinel Optionals:

If an optional type could be specially constructed by specifying a sentinel, this would solve sentinel checks in a way that guarantees the optimization. Something like @OptionalSentinel(comptime IntType: type, comptime sentinel: meta.BackingIntegerType(IntType)). This documents the sentinel value, which makes debugging much less painful. It also guarantees that the optimization is made, so you don't need to worry that changes to the underlying int range will reintroduce the optional flag bit. IntType could be an integer, ranged integer, enum, or integer union.
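Hypothetical usage, following that signature:

const std = @import("std");

// maxInt(usize) is the documented sentinel; no separate flag bit is stored.
const AbiSize = @OptionalSentinel(usize, std.math.maxInt(usize));

test "explicit sentinel optional" {
    var size: AbiSize = null; // represented as maxInt(usize)
    size = 16;
    if (size) |s| {
        // s unwraps to a plain usize, guaranteed != maxInt(usize)
        std.debug.assert(s == 16);
    }
}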


None of these counter-proposals are perfect, but I like that they take more direct control over the optimization. E.g. instead of thinking "I'd like to optimize this optional to use a sentinel value, I should put ranges on this int type everywhere it's used", I can instead think "I'd like to optimize this optional to use a sentinel value, I should tell zig to optimize this optional to use a sentinel value.".

I think that we could go a long way with a simpler guarantee like "If there are unused bits in an integer type in a struct, an optional of that integer type is guaranteed to put a valid flag in the unused bits.".

Sorry, small nitpick. It's not actually bits, it's unused values. For example in the pointer case 0 isn't a specific bit, it's just a special value. It's a slight difference. Same with the common -1/max-uint, it's not just the last bit, it's actually one value. But I see from your examples/proposals that you understood.

I agree with you that it would be cleaner if it's more explicit.

Sorry, yeah, my wording was unclear. I'm trying to convey that the simpler guarantee of a bit for optionals of non-power-of-two integers is easier to guarantee and think about than an inferred sentinel, and that something like @OptionalSentinel might be better for when a sentinel value is needed.

If we make it so an optional ranged type interprets any out-of-range value as null, we could do things like this:

// Zero-cost wrapper for an ABI that uses negative values as error codes
const Result = fn (Int: Type) Type {
    return union {
        // Nullable limits to avoid generating unnecessary bounds checks
        success: @Ranged(Int, 0, null),
        failure: @Ranged(Int, null, -1),
    };
};

Define canonical null as the most negative value above the range if it exists, else the most positive value below the range. That way, even if a type is multiple-optionalised, the span of non-null values at any level is always contiguous, and does not change with type widening/narrowing in the most common cases.
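Some worked examples of that canonical-null rule (hypothetical ranges, written as comments only since the @Ranged builtin does not exist):

// A ranged i8 covering -10..99: 100 still fits in i8 above the range, so the
// canonical null is 100, the most negative value above the range.
// A ranged u8 covering 1..255: nothing above the range fits in u8, so the
// canonical null falls back to 0, the most positive value below the range --
// the same trick the existing ?*T optimisation uses for the zero address.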

Possible optimisation: allow nesting of @Ranged optionals, and only allow the ? type operator to be used on ranged types (pointer types are implicitly ranged), like so:

// Can be optimised to u32 with tag
// -- don't have to set global rules
const OptionalU32 = ?@Ranged(u32, null, maxInt(u32));

// Don't need another range -- canonical null of
// OptionalU32 is inner null, anything else is outer null
const OptionalOptionalU32 = ?OptionalU32;

// ;)
const Bool = ?@Ranged(u1, 1, null);
const True: Bool = 1;
const False: Bool = null;
const And = fn (T: Type, a: ?T, b: ?T) ?T {
    return if (a) { b } else { null };
};
const Or = fn (T: Type, a: ?T, b: ?T) ?T {
    return a orelse b;
};
const Not = fn (a: Bool) Bool {
    return if (a) { null } else { 1 };
};

Pro

  • More explicit in intention
  • Optionals can now be used in packed objects
  • Booleans don't have to be primitive ;-)

Con

  • Optionalising arbitrary calltime-known type now impossible, though I can't think of a use case for this that isn't possible another way

You could use INT_MIN for null, as it already has some pretty funny properties. This still requires a new type (that, interestingly, would be agnostic to two's complement and ones' complement), but it is far less invasive than this proposal, which IMHO belongs in an optimizing compiler and not in zig.

We already have a ranged integer type. It is called enum. This proposal creates far more problems than it solves.

An enum is not a ranged integer type. It does not expose any arithmetic properties, but only distinct values that have no relation to each other except for being in the same set.

The virtual machine that zig inherited from llvm, which inherited it from C, is a computer that thinks in bits, and this is successful because it is how a computer thinks. If you are interpreting a language rather than modeling a computer, then you should use language theory, of which enum is a building block[1], and then _translate_ the language into the language of the computer, which thinks in bits. This proposal doesn't solve any of these problems; it only creates problems, because it is a premature optimization of an ill-understood (and non-existent) problem.

[1] but you also need this https://codasip.com/2019/09/04/better-benchmarks-through-compiler-optimizations-codasip-jump-threading/ as I mentioned here https://bugs.llvm.org/show_bug.cgi?id=39043 and is necessary to port re2c to zig. (we know this is possible, so it is not a problem for the zig language, only an implementation detail)
