I am looking at the Crystal code array in src/array.cr, comparing it with this Ruby documentation, trying to learn and understand how Crystal arrays are implemented, particularly how they are initialized and allocated. A few relevant questions follow:
1) What is the difference between @length and @capacity members? The initialize(initial_capacity = 3 : Int) initializer sets length to 0, so does this mean that the array is then expanded every time a new item is pushed to it? Does this cause a performance penalty in comparison to the initialize(size, value : T) method for example?
2) How/when is memory (dedicated to arrays or tuples) freed by the garbage collector?
3) Typing puts Array.new(3) in Ruby returns [nil, nil, nil], while typing puts Array(Float64).new(3) in Crystal returns []. What is the most efficient way in Crystal to allocate memory for an array of say 3 uninitialized float64 elements?
1) Array is a Dynamic array (also known as "array list"). An initial buffer of @capacity capacity is allocated, and once @length exceeds it a new buffer, double the size, is allocated and previous elements are copied to it (using realloc). Ruby's Array is the same. For example here's where they double the capacity (they call it "capa").
To answer your question, the array is not expanded every time an item is pushed. With an initial capacity of 4, when wanting to add a 5th element the internal buffer will be resized, and capacity will be set to 8. Then at the 9th this will happen again. Then at the 17th, then at 33th and so on. So it always takes longer for a resize to happen. This is what "amortized cost" is, and is referred to in that wikipedia article.
Dynamic arrays are very fast and efficient, but they might waste memory. But performance is almost always more important.
The initialize(size, value : T) overload allocates a buffer of exactly that size and assigns value at each position. For this there's no need to do an "index out of bounds" check. If you create an array and push elements to it, every push has to check the length of the array and check if a resize is needed. So it's always better to use initialize(size, value : T) or initialize(size, &block) if you can.
2) Tuples are allocated on the stack and their memory is freed after a method exits. For types allocated on the heap (Array, Reference... and even when you do Pointer.malloc) their memory will be freed when no other object refer to them. "Another object" is another object, a variable in the stack, an instance variable, a global variable, etc. This is also similar to Ruby.
3) This is because Ruby's Array can have any type. But we can't have nil inside an Array(Float64). This is something different from Ruby. When you do Array(Float64).new(3) this just allocates a buffer of size 3 (with @capacity = 3) in the array's internal buffer. Then you can push elements to that array, but you can push more than 3 if you want. There's no such thing as "uninitialized values" in Crystal, at least not in the safe parts of the language (when you don't use Pointer or stack allocate).
To have an array of exactly 3 elements you can use a StaticArray. But an Array is almost always more versatile.
What do you want to do?
As a side note, Ruby's Array don't have a concept of "index out of bounds". This is valid:
a = []
a[1_000_000] = 1
puts a.length #=> 1000001
The above (with a typed array) fails in Crystal with "Index out of bounds". We didn't copy that behaviour because index out of bounds almost always mean there's a bug in your code.
This was a very informative and helpful in-depth reply, thanks, now I get to understand gradually how Crystal arrays work.
In relation to your question about what I want to do, and as a natural continuation to this discussion, here is 3 follow-up questions:
1) There are several numerical C libraries for which we may want to write Crystal bindings. Some representative examples are dSFMT (for which I already wrote the bindings), BLAS and LAPACK. I think the general question here is this; when a C library expects a pointer to array as input argument and possibly the array's length as a second input argument, then should we pass a Crystal array or static array or tuple? What should we pass as the length input argment, the length or the capacity in the case of dynamic arrays?
2) If I get this right, on the Crystal side of things we cover so far one-dimensional arrays (vectors)? Are we going to add two-dimensional arrays (matrices) or higher-dimensional arrays?
3) This is a kind of more specialized question specific to libdSFMT, which is less likely to appear in the broader context of other C bindings. When we ask dSFMT to generate an array of random numbers, there is a min array size. In other words we must ask for an array of at least 312 random numbers if I remember well the number. The natural question-problem to solve here is what we do if the user wants to generate fewer random numbers, let's say only 100 of them. I have been trying to understand how Julia solved this. If I read their source code right, they have some sort of buffer array of size 312 as a data member to their (equivalent) Mersenne twister class. They generate 312 samples and then depending on how many samples the user asks for, they pop the relevant number of samples from their buffer and then refill it in the background, sth like this...
A couple more thoughts I had on the matter today, related to the first 2 questions:
StaticArray(T, N) is great, yet a modification would be required with respect to C bindings. In particular, instead of having 2 generic parameters T and N, it would make more sense to define a new array type CArray(T) and in there have an @N and/or @length member. Of course we should retain the StaticArray definition, and we could have the CArray too. Then, when we call a bounded C function, we can pass for example 2 input arguments to the function, namely CArray(Float64) for example and n, which wil be CArray.length.I'll answer item by item:
1) As I said elsewhere, we should try to make users use Array. StaticArray are mostly for a library's internal usage.
2) Defining Matrix2 or Matrix3 classes is easy, as you can define a [] operator that receives many arguments. For example:
class Matrix2(T)
def initialize(@rows, @cols)
@buffer = Pointer(T).malloc(@rows * @cols)
end
def [](row, col)
check_in_bounds i, j
@buffer[@rows * row + col] # quick typing, not sure the math is ok
end
def []=(row, col, value : T)
# ...
end
end
We might want to have these in the standard library. Same goes with Vector, and to be able to operate between them. Ruby has Matrix and Vector but I never used them. But maybe these are too math/stats oriented and it would be better to have them in a separate library (something like what numpy is for python). Crystal is a general purpose language, I wouldn't want to orient it too much to math/stats.
3) Ah, I think we discusses this in other places.
The last two points:
a) There's another type we have, Slice. I think this is what you want. It's a struct with two things: a pointer and a length. So in a way it's like a safe pointer (Pointer doesn't know its length). I think a CArray in Crystal is Pointer, because C arrays don't know their length. But according to what you want that CArray is Slice. We have lots of types and it might be confusing, but all of these other types (StaticArray, Slice and Pointer) are very handy for implementing efficient things, and for interfacing with C, possibly in the safest way. At the application level users should always use Array because it's the most versatile (and it's pretty efficient too). But if they want more efficiency at the cost of writing more code, and maybe going to unsafe territory, they can use the other types.
b) Above :-)
Thanks @asterite, this is all very helpful. I will try the Slice.
That's a good idea, we could have an external library similar to numpy. Along these lines, I could create an organization (perhaps CrystalScience or perhaps small orgs per topic such as CrystalStats) and place there random number generators, stats, matrices etc.
Ok, I changed arrays to slices in random.cr and this seems to work (I only have one question, which I posted in random.cr by annotating the relevant code).
A thought struck me; would it be better that I maintain this "numpy" library independently or you prefer to have it under the manastech organization? If you choose the latter, you may want to add me as a collaborator specifically for the library so that I can work on it directlly - whichever of the 2 options works for you, both are fine with me :)
For now I would keep separately. If later we see that it makes sense to merge it to the standard library, or put it under an organization, we'll do :-)
Ok, will keep it under my account - if that's ok, will only post an announcement on the Crystal google group soon to let other users and developers know about the random generation capabilities added so that anyone interested can use it (I will give it another week or so to finish some loose ends).
I'll close this, the question was answered (I believe) :-)
open = {} of String => String # < ? pwiz your correction
open["f"] = "fa"
open["f"]["a"] = "noob"
puts open["f"]["a"]
Most helpful comment
1) Array is a Dynamic array (also known as "array list"). An initial buffer of
@capacitycapacity is allocated, and once@lengthexceeds it a new buffer, double the size, is allocated and previous elements are copied to it (using realloc). Ruby's Array is the same. For example here's where they double the capacity (they call it "capa").To answer your question, the array is not expanded every time an item is pushed. With an initial capacity of 4, when wanting to add a 5th element the internal buffer will be resized, and capacity will be set to 8. Then at the 9th this will happen again. Then at the 17th, then at 33th and so on. So it always takes longer for a resize to happen. This is what "amortized cost" is, and is referred to in that wikipedia article.
Dynamic arrays are very fast and efficient, but they might waste memory. But performance is almost always more important.
The
initialize(size, value : T)overload allocates a buffer of exactly that size and assignsvalueat each position. For this there's no need to do an "index out of bounds" check. If you create an array and push elements to it, every push has to check the length of the array and check if a resize is needed. So it's always better to useinitialize(size, value : T)orinitialize(size, &block)if you can.2) Tuples are allocated on the stack and their memory is freed after a method exits. For types allocated on the heap (Array, Reference... and even when you do Pointer.malloc) their memory will be freed when no other object refer to them. "Another object" is another object, a variable in the stack, an instance variable, a global variable, etc. This is also similar to Ruby.
3) This is because Ruby's Array can have any type. But we can't have
nilinside anArray(Float64). This is something different from Ruby. When you doArray(Float64).new(3)this just allocates a buffer of size 3 (with@capacity = 3) in the array's internal buffer. Then you can push elements to that array, but you can push more than 3 if you want. There's no such thing as "uninitialized values" in Crystal, at least not in the safe parts of the language (when you don't use Pointer or stack allocate).To have an array of exactly 3 elements you can use a StaticArray. But an Array is almost always more versatile.
What do you want to do?
As a side note, Ruby's Array don't have a concept of "index out of bounds". This is valid:
The above (with a typed array) fails in Crystal with "Index out of bounds". We didn't copy that behaviour because index out of bounds almost always mean there's a bug in your code.