Following lines certainly lead to wrong behavior on 64bit platforms with large arrays cause of #to_u32
.
https://github.com/crystal-lang/crystal/blob/46e9d81b965d9e00dea4aeb0002999c756545b6a/src/pointer.cr#L245
https://github.com/crystal-lang/crystal/blob/46e9d81b965d9e00dea4aeb0002999c756545b6a/src/pointer.cr#L258
https://github.com/crystal-lang/crystal/blob/46e9d81b965d9e00dea4aeb0002999c756545b6a/src/pointer.cr#L499
BTW, why Crystal has neither size_t
nor intptr_t
equivalent? Some places uses UInt64
, some uses UInt32
(like this one). It is just... infantile.
For example, Go defines int
to be ssize_t
and uint
to be size_t
, and has intptr
and uintptr
(though, looks like on all supported platforms they have same size to int
).
Having platform dependent int
has profit in low-level programming, and dis-convenient for application level programming, so it certainly should not be default type.
But still any place that deals with low-level memory handling should operate on platform dependent types/aliases instead of raw UInt32
and/or UInt64
. This bogus lines are just tip of iceberg.
I wouldn't really consider it inherently broken, since, if you're copying >2GB of data at once, your program is probably screwed up beyond belief anyway.
Actually, no; it'd have to be four gigabytes of data in one copy in order to cause trouble.
If you think, real programs doesn't do such "crazy" things, you are not right.
It is broken. You may say "it is workable on useful sizes". But it is still broken on "acceptable" size.
Either error should be thrown, or behavior should be fixed.
Crystal isn't C. Please keep this in mind and be nice when asking questions or reporting issues.
All sizes in Crystal (e.g. String, Array, StaticArray, Slice or Pointer) are clamped at Int32::MAX
. This is by design and that won't change. We use a 32 bit number because larger values are considered broken (i.e. don't bloat memory, stream). We use a signed number because we don't want to leak unsigned integers that would lead to confusing behaviors (e.g. when operating on Slice#size
). A maximum of 2GB is already plenty enough.
Now, since we clamp values, and users may try inconsiderate sizes, we should have bound checks and raise an ArgumentError when values are too large. Would you be willing to open a pull request, with passing specs?
Relates to #3209.
@ysbaddaden I think that String
and Array
should use Int32
but Slice
should have a 64-bit size. Slice is more low level than the other types and isn't resizable so it makes sense for it to support larger sizes: it's a "window" on some memory, you don't always control the memory or it's size. Pointer
obviously supports 64bit already.
Reading pointer.cr
you're right. It does support 64 bit addresses, but it's always 64 bit (even on 32 bit platforms?), it sometimes casts sizes to UInt64
(malloc, realloc) but to UInt32
in other places (clear, copy, move). This seems illogical and incompatible from a quick look.
@RX14 that would defeat the purpose: if I can allocate a Slice with Int64 size, while should String or Array be limited?
@RX14, @ysbaddaden unfortunately mentioned lines are from pointer.cr , so it is Pointer broken.
At least you should fix this lines of Pointer.
We use a 32 bit number because larger values are considered broken (i.e. don't bloat memory, stream).
A maximum of 2GB is already plenty enough.
I'm not against using Int32 as base type for size of all standard-library containers. That is really ok.
But language that claims it suitable for some low-level programming have to have convenient types for.
Lets have Int32 for container's size and IntSize for allocation and IntPtr for pointer arithmetic.
I don't think It is ok to use UInt64 in many places if your platform is 32bit.
All sizes in Crystal (e.g. String, Array, StaticArray, Slice or Pointer) are clamped at Int32::MAX
Sizes, but not memory size of allocated pointer. How much will consume a = Array({Float64, Float64, Float64, Float64}).new(1<<29 - 1)
? It will allocate 1<<34
bytes. Yes, it will be correctly allocated now cause in appropriate places you use to_u64
(hope it never breaks on 32bit platforms). But a.unshift 1.0
is currently broken for such array.
@ysbaddaden because array uses realloc while slice doesn't. Array owns the memory it allocates and uses realloc assuming the memory is small. String because 2gb strings you've gone too far. Whereas slice represents a pointer and a size. A window on memory. It makes no assumptions about what's in that memory it just let's you access it. That memory could come from a C program which allocates more than 2gb of memory. A slice could be used to represent a mmap of a file on disk. These are perfectly valid usecases for a slice larger than 2gb.
@RX14 This is a valid claim, but Crystal doesn't aim for this, and we shouldn't care about potential large allocations C. If we assume Int32 is plaintly enough, then Slice must follow. Developers are discouraged to use Pointer
and to use Slice
instead. If Slice.new(5.gibabytes)
is allowed, we'll get complains that making a String
from a large slice is broken... and so on.
We're already kinda forced to have Pointer
handle large sizes, due to Int32::MAX * sizeof(T)
being possible 鈥攖hough maybe we should just raise, because nobody should do that; I mean with T=Int32, that's 8GB, with T=Int64, that's 16GB... seriously?
Either you should raise, or just fix move_from, copy_from and clear.
I agree with you intention to simplify application programming.
But since language provides low-level features, they should just work.
Lets be clear: Slice
, Array
, String
- they all are part of Standard Library, cause they could be implemented without intrisicts/primitives.
Pointer
is a part of language, cause it is known by compiler and implemented with intrisicts.
Reading very large files is a common task even for applications with 32-bit addressing. These are tools such as grep, diff, databases, video processing and much more. In this case, the Flle MMAP is fine and yes, it requires 64-bit addressing in files and memory.
Disappointed with these decisions. "Crystal is not C" isn't a valid excuse for not making a low level feature work properly.
I hope that these are just leftovers from the past. There certainly has been an effort to clean up these "32isms". Also note that LibC::SizeT
etc. do already exist.
https://github.com/crystal-lang/crystal/issues/823 https://github.com/crystal-lang/crystal/commit/4d350892e0755d9838b1a6601f1d726c3b9ed388
And yeah, some comments here have been worrying. It's never OK to have randomly breaking code when someone happens to allocate a memory region over 2GB.
All of this should be fixed, but it's hard. The main issue is that if Pointer#address
changes of type depending on the architecture, then some methods might break. For example maybe a method accepts UInt32
but Pointer#address
is sometimes UInt32
, sometimes UInt64
. I don't see an easy way to fix this. But having Pointer#address
be UInt64
all the time seems good, because UInt32
fits in UInt64
.
Those to_u32
calls are wrong, they are probably there from before Instrinsics.memcpy
was updated to provide a 64-bit alternative.
As @oprypin says, most of these things are left-over from the past. We should fix them slowly, though I don't know what maximum sizes should Slice, String, Array, Hash, etc. have. Maybe Slice
can have a size of Int64. When trying to create a big string from it we can simply raise an exception.
There's an easy fix for this, it's making a new numeric class like PointerAddress that uses 32 or 64 bit ints under the hood.
@zatherz As we pointed - there is LibC::SizeT and we need simply LibC::PtrDiffT too. But it need some time to rework things inside the compiler (https://github.com/crystal-lang/crystal/pull/4624#pullrequestreview-46246944)
So, what's the real issue here? What code breaks?
I think we can:
Pointer#address
as UInt64 (in 32 bits systems, this will just be a value less than or equal to UInt32::MAX
)Pointer#copy_from
, etc., to not cast some values with to_u32
Slice
have a size
that's UInt64
, and also update String
to check that the slice given to it in the constructor has a size less than some maximum allowed.Is there any other thing that needs a fix?
So, what's the real issue here? What code breaks?
$ # still correct, cause sizeof(array) = 2 GB
$ crystal eval 'a=Array(Int64).new((1<<28)-1){|i| i.to_i64}; p a.reduce{|s,v|s+v}; a.unshift(1_i64); p a.reduce{|s,v|s+v}'
36028796616310785
36028796616310786
$ # wrong, cause sizeof(array) = 4 GB (but number of elements = 512M)
$ crystal eval 'a=Array(Int64).new((1<<29)-1){|i| i.to_i64}; p a.reduce{|s,v|s+v}; a.unshift(1_i64); p a.reduce{|s,v|s+v}'
144115187270549505
144115186733678596
$ # segfault?
$ crystal eval 'a=Array(Int64).new(1<<28){|i| i.to_i64}; p a.reduce{|s,v|s+v}; a.unshift(1_i64); p a.reduce{|s,v|s+v}'
36028796884746240
Invalid memory access (signal 11) at address 0x20000000
[0x55cc41f78e55] *CallStack::print_backtrace:Int32 +117
[0x55cc41f6d71d] __crystal_sigfault_handler +61
[0x7f972e499670] ???
[0x55cc41fb505e] *Pointer(Int64) +14
[0x55cc41fb4ecb] *Array(Int64) +267
[0x55cc41fb4db2] *Array(Int64) +18
[0x55cc41f5e0cf] ???
[0x55cc41f6d619] main +41
[0x7f972dabe3f1] __libc_start_main +241
[0x55cc41f5d76a] _start +42
[0x0] ???
We can probably limit the size of an array. Why do you want an array that big? What's a use case for that?
@asterite you asked wrong question.
Whenever one write program that accepts arbitrary data, there is a chance to have enough data to trigger limit.
512M elements is not too big array. I had to deal with such arrays at least twice solving real world issues. (I didn't use crystal for solving them). I don't remember exact tasks. Something about collecting and aggregating data.
What will change if it will be not Int64
, but Tuple{Int64, Int64, Int32, String}
? Array size will be limited by (1<<32)/sizeof(Tuple{Int64, Int64, Int32, String})
?
2G elements is annoying, but acceptable limit. If there will be stricter limit, Crystal will loose a lot of usability.
And, any way, Pointer
is not container. Pointer
shall not have any restrictions beside hardware/operating system. Otherwise Crystal is just a toy.
Java's List#size
is int
: https://docs.oracle.com/javase/7/docs/api/java/util/List.html#size()
So I guess Java is a toy?
I understand now. We can have Int32::MAX elements, but it still behaves wrong. The problem is that I can't seem to understand the issue from those one-liners. Would you care to explain what's wrong with Array implementation that makes it work wrong, and why my comments here ( https://github.com/crystal-lang/crystal/issues/4589#issuecomment-328832169 ) wouldn't fix that issue?
Would you care to explain what's wrong with Array implementation that makes it work wrong
I think, problem is on Pointer side.
why my comments here ( #4589 (comment) ) wouldn't fix that issue?
I think, it will fix. (Note, if flag?
in intrinsics.cr should check aarch64 as well).
Part of discussion is about: isn't it better to use SizeT and IntPtrT in such places?
@funny-falcon adding type aliases for integers which will probably change between platforms probably isn't something we want to do, as unlike C our integers don't automatically convert (a good thing). We already have that issue with some C types (for example, the bigint PR that broke on 32bit), lets not spread it around the stdlib too.
I agree with @asterite's suggestion of what to do entirely.
I'll send a PR to fix all of this later today.
@RX14
bigint PR is a bad example: it was broken exactly cause every one forget about distinction between types on different platforms, and how libgmp uses this type.
Now there is a risk to forget, how external libraries uses intptr_t, uintptr_t, ptrdiff_t. So, not using native platform type increases risk of bugs, not decrease, as you assume.
Most helpful comment
Disappointed with these decisions. "Crystal is not C" isn't a valid excuse for not making a low level feature work properly.