Zig: Proposal: `usize` definition should be refined

Created on 27 Apr 2020 · 13 Comments · Source: ziglang/zig

I came across the definition of usize, which is currently defined as an unsigned pointer-sized integer, and a question arose: The size of what pointer? Function pointer? Pointer to constant data? Pointer to mutable data?

For most platforms, the answer is simple: There is only one address space.

But as Zig tries to target all platforms, we should bear in mind that this is not true for all platforms.

Case Study:
Zig supports AVR at the moment which has two memory spaces:

  • Data
  • Code

Both memory spaces have different addressing modes which can be used with the Z register, which is a 16 bit register. Thus, we could conclude that the pointer size is 16 bit. But the AVR instruction set also has a RAMPZ register that is prepended to the Z register to extend the memory space to 24 bit. Some modern AVRs have more than 128k ROM (e.g. Mega2560). This means that the effective pointer size is 24 bit.
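As a rough illustration of the RAMPZ:Z combination described above, the following C sketch models the extended address as an 8 bit value placed above a 16 bit value. The function names are hypothetical, and the code only models the arithmetic on a host machine; it does not access real AVR registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: RAMPZ supplies the upper 8 bits, Z the lower 16 bits,
   giving a 24 bit address stored here in a 32 bit integer. */
static uint32_t avr_extended_address(uint8_t rampz, uint16_t z)
{
    return ((uint32_t)rampz << 16) | (uint32_t)z;
}

/* Small self-check that can be called from main(). */
static void avr_extended_address_example(void)
{
    /* With RAMPZ = 0, only the first 64 KiB are reachable through Z. */
    assert(avr_extended_address(0x00, 0xFFFF) == 0x00FFFFu);
    /* RAMPZ = 1 selects the next 64 KiB bank. */
    assert(avr_extended_address(0x01, 0x0000) == 0x010000u);
}
```

A 16 bit integer cannot hold the second value, which is the core of the argument that a pointer-sized integer on such a part needs more than 16 bits.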

The same problem arises when targeting the 8086 CPU with segmentation. The actual pointer is a 20 bit value that is calculated by combining two 16 bit values (segment + offset).
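The segment + offset calculation can be sketched in C as follows. This models real-mode addressing on the original 8086 (segment shifted left by 4 bits, plus offset, wrapped to 20 bits); the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode 8086 model: linear = segment * 16 + offset, truncated to the
   20 address lines of the original 8086. */
static uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + (uint32_t)offset) & 0xFFFFFu;
}

/* Small self-check that can be called from main(). */
static void linear_address_example(void)
{
    /* Two different segment:offset pairs can name the same byte. */
    assert(linear_address(0x1234, 0x0010) == 0x12350u);
    assert(linear_address(0x1235, 0x0000) == 0x12350u);
}
```

The example also shows why an integer holding a segment:offset pair is not a unique representation of an address: two distinct 32 bit pairs map to the same 20 bit linear address.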

Problem:
usize communicates that it stores the size of something, not an address. Right now, usize can contain values larger than the biggest (contiguously) addressable object in the language, and it takes up more space than needed.

C has two distinct types for that reason:

  • size_t (can store the size of an addressable object)
  • uintptr_t (can store any pointer)
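A minimal C sketch of that split, assuming a hosted implementation that provides the optional uintptr_t type: sizes are reported as size_t, while a pointer that needs to be treated as a number goes through uintptr_t. The struct and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative object type; its size is reported as a size_t. */
struct packet {
    char payload[64];
};

static size_t packet_size(void)
{
    return sizeof(struct packet);
}

/* Convert a data pointer to an integer and back. The C standard guarantees
   this round trip for pointers to void when uintptr_t exists. */
static int pointer_roundtrip_ok(const void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (const void *)bits == p;
}
```

Nothing in C requires the two types to have the same width, which is the property the proposal wants Zig to be able to express.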

AVR-GCC solves the problem of 24 bit pointers by ignoring it and creating shims for functions that are linked beyond the 128k boundary. Data beyond the 64k boundary cannot be addressed, and AFAIK LLVM has the same restriction. I don't think Zig should ignore such platform specifics; it should be able to represent them correctly.

Proposal:
Redefine usize as an integer that can store the size of any object or array, and introduce a new type upointer that is a pointer-sized integer. Same for isize and ipointer.

It should also be discussed if a upointer will have a guaranteed unique representation or may be ambiguous ("storing a linear address or storing segment + descriptor")?

Changes that should be made as well:

  • @ptrToInt and @intToPtr should now use upointer instead of usize (@ptrToInt returns it, @intToPtr accepts it)
  • @sizeOf will still return usize

Pro:

  • Communicates intent more precisely by using distinct types for int-encoded pointers and object sizes / indices
  • Saves memory as object sizes may be 50% smaller than pointers

Con:

  • One more type
  • May spark confusion for people who assume that pointer size is always object size

Example:

// AVR:
const usize = u16;
const upointer = u24;

// 8086:
const usize = u16;
const upointer = u32;
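To put a number on the memory-saving argument, here is a small C sketch using the 8086 widths from the example above. The typedef names and the table length of 1000 entries are assumptions made for illustration, and the byte counts assume an 8 bit char on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed 8086 types. */
typedef uint16_t usize_8086;
typedef uint32_t upointer_8086;

enum { ENTRY_COUNT = 1000 };

/* Bytes needed for a table of object sizes stored in the size type. */
static size_t table_bytes_as_usize(void)
{
    return ENTRY_COUNT * sizeof(usize_8086);
}

/* Bytes needed for the same table stored in the pointer-sized type. */
static size_t table_bytes_as_upointer(void)
{
    return ENTRY_COUNT * sizeof(upointer_8086);
}
```

On a host with 8 bit bytes the first table takes 2000 bytes and the second 4000 bytes, matching the "50% smaller" estimate for this particular pair of widths.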

Note:
I'm not quite sure about all of this yet, as this is a very special case that only affects some platforms, whereas most platforms don't have the "object size is not pointer size" restriction.

Resources:

Edit: Included answer to the question of @LemonBoy, added pro/con discussion, added example

proposal

All 13 comments

Counter proposal:

Size of what pointer? Function pointer? Pointer to constant data? Pointer to mutable data?

The maximum across all the address spaces. This way we can also keep the ptr-to-usize (and usize-to-ptr) relationship (with the help of the addresspace pointer metadata).

The maximum across all the address spaces. This way we can also keep the ptr-to-usize (and usize-to-ptr) relationship (with the help of the addresspace pointer metadata).

I thought about that, and it has one problem: it will waste a lot of space. @ptrToInt and @intToPtr should use upointer, but if I want to store the size of something (which is the standard case), I should use usize.

This has two advantages: upointer should always store pointers, usize should always store object size. So @sizeOf will return usize.

Otherwise I would waste 50% of my memory on zero-padded bytes by storing pointer-sized values where I could've used a type with only half the size.

Added some changes and updates to the original proposal

Since Zig supports arbitrarily sized integer types, each OS could define the bit length of their virtual memory system. This would reduce the well known 'pointer bloat' in the executable.

would you have to change the default typing rules on certain arithmetic events? like would the following operations make sense?

usize + usize OK
upointer + upointer NO
upointer + usize OK
usize * iX OK
usize * usize NO
usize * upointer NO
upointer * iX NO
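For readers who want to see what such rules could look like in practice, here is a hedged C sketch that wraps each kind of integer in its own struct and only provides functions for two of the "OK" rows. The type and function names are invented for this example, and the widths follow the 8086 values from the proposal; this is not how Zig implements or plans to implement anything.

```c
#include <assert.h>
#include <stdint.h>

/* Distinct wrapper types so the compiler refuses to mix them implicitly. */
typedef struct { uint32_t value; } upointer_demo;
typedef struct { uint16_t value; } usize_demo;

static upointer_demo make_upointer(uint32_t v) { upointer_demo p = { v }; return p; }
static usize_demo make_usize(uint16_t v) { usize_demo s = { v }; return s; }

/* usize + usize: OK (wraps modulo 2^16 in this sketch). */
static usize_demo usize_add(usize_demo a, usize_demo b)
{
    return make_usize((uint16_t)(a.value + b.value));
}

/* upointer + usize: OK, yields another upointer. */
static upointer_demo upointer_offset(upointer_demo p, usize_demo n)
{
    return make_upointer(p.value + (uint32_t)n.value);
}

/* upointer + upointer: deliberately not provided. Writing `p + q` on two
   structs is a compile error in C, so the "NO" row is enforced by omission. */
```

The cost of this style is visible immediately: every permitted operation needs an explicit function, which is the kind of friction the question is weighing.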

would you have to change the default typing rules on certain arithmetic events? like would the following operations make sense?

No. I don't think this is such a huge source of errors that it would be more of a benefit than a hassle. Subtracting (and thus adding) values of upointer is quite helpful sometimes; also note that upointer - upointer may still not fit into a usize.
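The last point can be checked with a small C sketch, assuming 20 bit linear addresses held in a 32 bit integer and a 16 bit size type as in the 8086 example. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute distance between two linear addresses. */
static uint32_t address_distance(uint32_t a, uint32_t b)
{
    return a > b ? a - b : b - a;
}

/* Would the value fit into a 16 bit size type? */
static int fits_in_u16(uint32_t value)
{
    return value <= UINT16_MAX;
}

/* Small self-check that can be called from main(). */
static void address_distance_example(void)
{
    /* Two nearby addresses: the distance fits. */
    assert(fits_in_u16(address_distance(0x12350u, 0x12340u)));
    /* First and last byte of a 1 MiB space: the distance does not fit. */
    assert(!fits_in_u16(address_distance(0x00000u, 0xFFFFFu)));
}
```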

Is there any defense for isize and ipointer? The only time I can imagine a negative value in those contexts is to check for overflow, or something esoteric. The former can be done with existing operators. The latter could be done with an explicitly safe cast.

The purpose we're trying to imply is "size of a pointer" and "size of a datablock", so I would simply lean toward psize and msize storing "pointer sized unsigned integer" and "maximum-allocatable-memory sized unsigned integer".

This would keep the number of types the same, while increasing functionality and clarity.
isize, usize -> psize, msize

There are OSes that treat pointers as signed. Solaris maybe? That's not a big market :-)

Real Life Use Case? :thinking:

A pointer being treated as signed or not will not affect pointer arithmetic. The only possible situation for signed pointers is some sort of special bit value, but aside from null, what would it be? Zig doesn't even support non-0 null.

In OSes with signed pointers, we can just explicitly cast to the appropriate signed type to shuffle the psize data across language boundaries. I suspect the cast could even be implicit.

Disclaimer: I don't think the following is a well thought out idea as it stands, but my brain's completely pulling out the stops and I would feel remiss to not write it.

What if we dropped the idea of an intrinsic platform-specific pointer size and memory size altogether, and allowed platforms to define their memory spaces?

Something like this.

const avr = @import("avr");

const avr_code_pointer_size = @PointerSize(avr.code_memoryspace);
const avr_code_memory_size = @MemorySize(avr.code_memoryspace);
const avr_data_pointer_size = @PointerSize(avr.data_memoryspace);
const avr_data_memory_size = @MemorySize(avr.data_memoryspace);

User-defined names are obviously a no-go for cross-platform code, but any code that uses multiple memoryspaces wouldn't be cross-platform anyway, and at worst, converting a snippet to a single memoryspace architecture would be a find-replace.

Feels a bit like vkDevice. Zig intends to be very specific about allocations. This takes it one level deeper.

Real Life Use Case? :thinking:

There is sometimes the need to do pointer/type erasure. Primary use case would be anything OS-relevant (as in you're coding an OS). That's where you work a lot with pointers-as-numbers instead of actual memory slices.

Is there any defense for isize and ipointer?

Yes! Memory/pointer distances. You cannot express an object size delta with usize and you cannot express a pointer distance with upointer. You cannot store "this object is 15 bytes smaller and 240 bytes before X" with only unsigned types.

I like the idea of comptime inspectable memory spaces, but upointer should be able to store all pointers and usize should be able to store all sizes.
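The "15 bytes smaller and 240 bytes before X" case from this comment can be written down as a C sketch. The struct and function names are assumptions; both deltas are stored as int32_t here so that any difference between two 16 bit sizes or two 24 bit addresses is representable.

```c
#include <assert.h>
#include <stdint.h>

/* Signed description of one object relative to a reference object X. */
struct relative_layout {
    int32_t size_delta;    /* negative: smaller than X */
    int32_t address_delta; /* negative: located before X */
};

static struct relative_layout relative_to(uint16_t size, uint32_t address,
                                          uint16_t ref_size, uint32_t ref_address)
{
    struct relative_layout r;
    /* Assumption of this sketch: addresses fit in 24 bits, so the signed
       subtraction below cannot overflow int32_t. */
    assert(address <= 0xFFFFFFu && ref_address <= 0xFFFFFFu);
    r.size_delta = (int32_t)size - (int32_t)ref_size;
    r.address_delta = (int32_t)address - (int32_t)ref_address;
    return r;
}
```

For an object of 100 bytes at 0x1000 compared against X of 115 bytes at 0x10F0, the result is a size delta of -15 and an address delta of -240, neither of which an unsigned type can hold.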

If one memory space could allocate 2^16-1 bytes and the other 2^24-1, using a 24 bit value for both could waste memory more often than not, depending on which address space is used most often?

I agree with this, but I have an issue with the naming: it's not clear at a glance whether the "size" in usize/isize refers to object size or machine word size. I propose, for object sizes, we use ulen/ilen, for symmetry with slice .len and .ptr, and for ALU words we use uarg/iarg, for arguments of arithmetic/logic ops; and we drop usize/isize entirely, so as not to mislead the programmer. I also propose we add udata/idata, for datapath-sized ints (in case an arch can't load a full register in one access, or it can load multiple), and for architectures capable of sub-byte or restricted to super-byte addressing, ucell/icell, the smallest addressable memory unit (so an allocator would return [n]ucell, and @sizeOf(T) returns the size of T in ucells). So, the size in bits of the largest contiguous object is @max(ulen) * @bitSizeOf(ucell).

So the complete list of ints would be:

  • uX/iX: explicit number of bits
  • uarg/iarg: Largest int that the ALU can take as input in one operation
  • udata/idata: Largest int that can be loaded/stored in a single access
  • ucell/icell: Smallest int that can be loaded/stored
  • uptr/iptr: Int sufficient to encode any pointer
  • ulen/ilen: Int sufficient to encode any contiguous object size
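The size formula from this comment, the maximum ulen value times the bit size of ucell, can be evaluated with a short C sketch. The function name and the parameter limits are assumptions chosen to keep the arithmetic inside 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bits of the largest contiguous object, given the bit width of the
   length type (ulen) and of the smallest addressable unit (ucell). */
static uint64_t max_object_bits(unsigned ulen_bits, unsigned ucell_bits)
{
    /* Limits assumed by this sketch so the product cannot overflow. */
    assert(ulen_bits >= 1 && ulen_bits <= 32);
    assert(ucell_bits >= 1 && ucell_bits <= 64);
    uint64_t max_len = ((uint64_t)1 << ulen_bits) - 1; /* largest ulen value */
    return max_len * (uint64_t)ucell_bits;
}
```

With a 16 bit ulen and an 8 bit ucell, as on AVR data memory, this gives 65535 * 8 = 524280 bits.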

While we're at it, let's change align() to take bit values rather than byte values, so we can remove all assumption of byte-addressing from the language.
