The previous discussion was in 2015 and tested the pidigits benchmark with BigInteger. How do we implement this with GMP?
https://groups.google.com/forum/#!topic/crystal-lang/cY8bImdHWV4
Testing PHP 7.1 on macOS Mojave, it completed in around 2 seconds for the same number of iterations:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-php-5.html
C, Ruby, and Crystal source code:
https://github.com/kostya/crystal-benchmarks-game#pidigits
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-yarv-1.html
PHP is modifying instances of big int (instead of always returning new ones). Right now we don't have such methods, but we could add them. They would be `add`, `sub`, etc., because you can't redefine `+=`.
If someone wants to implement that, please go ahead.
What I mean is something like this:
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add(other : Int) : self
    if other < 0
      sub(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add(other.to_big_i)
    end
    self
  end

  def sub(other : Int) : self
    if other < 0
      add(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub(other.to_big_i)
    end
    self
  end

  def sub(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end
end

i = 0
k = 0
ns = 0.to_big_i
a = 0.to_big_i
t = 0
u = 0.to_big_i
k1 = 1
n = 1.to_big_i
d = 1.to_big_i

while true
  k += 1
  t = n << 1
  n.mul(k)
  k1 += 2
  a.add(t)
  a.mul(k1)
  d.mul(k1)
  if a >= n
    t, u = (n * 3 + a).divmod(d)
    u.add(n)
    if d > u
      ns.mul(10)
      ns.add(t)
      i += 1
      if i % 10 == 0
        print "%010d\t:%d\n" % {ns.to_u64, i}
        ns = 0.to_big_i
      end
      break if i >= N
      a.sub(d * t)
      a.mul(10)
      n.mul(10)
    end
  end
end
That already works almost twice as fast as the original code.
Do you still want someone else to implement this? I can have a look at it later today I think :)
True, it's twice as fast, closing the gap to ~3x slower than PHP.
Using gtime to measure:
PHP consumes 14.5 MB memory (CPU 99%)
Crystal consumes 3.4 MB memory (CPU 140%)
PHP gains a small performance advantage using ob_start for output buffer.
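The same buffering idea can be sketched in Ruby (using `StringIO`; the Crystal analogue would be `String::Builder`): accumulate the formatted lines in memory and flush once at the end, instead of paying for one `print` per line. The loop body and format string below are illustrative, not taken from the benchmark.

```ruby
require "stringio"

# Accumulate formatted lines in memory, then flush once at the end,
# instead of issuing one write per line of output.
buf = StringIO.new
10.times do |i|
  buf << format("%010d\t:%d\n", i, (i + 1) * 10)  # cheap in-memory append
end
print buf.string  # single write at the end
```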
@cfsamson I wish we could allocate an output buffer for the console as well. The other benchmark in the repo could be improved, since it hasn't been updated for some time.
@proyb6 Are you talking about the pidigits algorithm now or about the changes to big_int.cr asterite is talking about?
About the pidigits algorithm.
@proyb6 Well, we could buffer like the example below. It gives me around 15-20% faster execution. But it's still a long way from closing in on e.g. Rust, which uses GMP, and I can't figure out the exact reason.
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add(other : Int) : self
    if other < 0
      sub(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add(other.to_big_i)
    end
    self
  end

  def sub(other : Int) : self
    if other < 0
      add(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub(other.to_big_i)
    end
    self
  end

  def sub(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end
end

# calculate size of buffer
buf_size = 0
order = 2
(0..N).each do |i|
  if i % 10 == 0 && i > 0
    buf_size += 13 + order
    if (i + 10) % 10**order == 0
      order += 1
    end
  end
end

buffer = String::Builder.new(buf_size)

i = 0
k = 0
ns = 0.to_big_i
a = 0.to_big_i
t = 0
u = 0.to_big_i
k1 = 1
n = 1.to_big_i
d = 1.to_big_i

while true
  k += 1
  t = n << 1
  n.mul(k)
  k1 += 2
  a.add(t)
  a.mul(k1)
  d.mul(k1)
  if a >= n
    t, u = (n * 3 + a).divmod(d)
    u.add(n)
    if d > u
      ns.mul(10)
      ns.add(t)
      i += 1
      if i % 10 == 0
        # print "%010d\t:%d\n" % {ns.to_u64, i}
        buffer << "%010d\t:%d\n" % {ns.to_u64, i}
        ns = 0.to_big_i
      end
      break if i >= N
      a.sub(d * t)
      a.mul(10)
      n.mul(10)
    end
  end
end

print buffer.to_s
@cfsamson My assumption is that using unsafe memory or pointers may speed things up?
I think the main issue to figure out, which would benefit Crystal in other ways as well, is why the interaction with GMP is slower than in PHP and Python. I tried to implement the Crystal version the same way the Python code was written, and it has the same problem as your original code. But the Python code does instantiate new instances as well; there is no good reason I can think of why Python should be doing that faster than Crystal.
Asterite was probably on the right track in reducing the number of new objects, but there must be something else in the Crystal/GMP interaction that is slower than the others, and if we figured that out, the example above should be much faster too. I looked a bit into it, but I don't know enough about the GMP C library to see what could cause it. Maybe a solution is to look into what Python/PHP do in their implementations besides what Asterite mentioned?
For your information, neither Ruby nor Python use GMP for big integers. They use custom code.
On the move now, so just a brief response, but if you look at the benchmarks, they use gmpy2 and it's surprisingly fast. I haven't looked into gmpy2 closely yet but assumed it was GMP beneath it.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-python3-2.html
Ooh... The benchmark does use GMP. I was talking about Python and Ruby big integers.
Now, we are quite intrigued to learn more too, because it will be useful for e-commerce, financial, and scientific computation.
:) Well, it looks like all the fastest benchmarks are using GMP; the top 10-15 entries are probably just comparing how efficiently they integrate with an external C library. Still, Crystal should be doing really well there, so it's interesting to look into.
I've looked into this; unfortunately I haven't found the reason for the differences. Python uses GMP with a heavily customized C wrapper, it looks like, and Rust uses almost pure GMP for the heavy lifting, but in practice it looks quite similar to the implementation above, preferring to mutate an object instead of instantiating new ones. I have some questions though, maybe @asterite has some answers:
When you pass in `self` to `LibGMP`, is the value copied as a new value since it's a struct, or does it mutate? And is copying a possibly large object like `BigInt` costly?
Are C calls from Crystal generally cheap, or could they incur some overhead of significance?
> When you pass in `self` to `LibGMP`, is the value copied as a new value since it's a struct, or does it mutate? And is copying a possibly large object like `BigInt` costly?
No, because you're passing in a pointer: passing `self` to a `LibGMP` call actually calls `self.to_unsafe`, which returns a pointer.
> Are C calls from Crystal generally cheap, or could they incur some overhead of significance?
They have zero overhead; calls are generated exactly the same way a C compiler would generate them.
@cfsamson
I had a look at the Nim repo, which compares GMP `mpz_t` and BigInts; the latter is significantly slower.
https://github.com/def-/nim-benchmarksgame/blob/master/pidigits_gmp.nim
https://github.com/def-/nim-benchmarksgame/blob/master/pidigits_bigints.nim
Is it something we can test with mpz_t instead of Crystal BigInts too?
https://github.com/crystal-lang/crystal/blob/3b3bf3fa2bb51cc15b6db3829c3c54493c8924d6/src/big/big_int.cr
@proyb6 If I'm not mistaken, the BigInts library in Nim is Nim's own implementation of a BigInt library, and the other one uses GMP, which is the same as what Crystal does; Crystal only wraps calls to GMP in a Crystal struct to make it nicer to use and a natural part of the language. In the benchmarks, no one can match the optimizations in GMP with their own implementations.
But the big difference is that Go, Nim, and Rust all use the GMP library directly in the benchmarks, instead of wrapping it as a part of their language. That might enable some more efficiency specifically for this benchmark. E.g. both Rust and Nim use functions like `mpz_addmul_ui`, `mpz_submul_ui`, and `mpz_tdiv_q` for some operations; the Crystal implementation above does not use these. I don't know how much of an impact that has, but that's the only kind of difference I can think of after asterite made the adjustments above to avoid the new allocations that were there originally.
So we could make the Crystal implementation similar to Nim and Rust, and it would probably be faster; if you want to have a go at it, try it :) That would be comparing apples to apples at least.
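For what it's worth, the arithmetic those fused GMP calls perform can be sketched with plain Ruby integers. This shows only the semantics; the real `mpz_addmul_ui`/`mpz_submul_ui` mutate their first argument in place and avoid allocating a temporary for the product, which is where the speedup comes from.

```ruby
# Semantics of GMP's fused multiply-add/multiply-sub, with plain integers:
def addmul(rop, op1, op2)  # mpz_addmul_ui: rop += op1 * op2
  rop + op1 * op2
end

def submul(rop, op1, op2)  # mpz_submul_ui: rop -= op1 * op2
  rop - op1 * op2
end

addmul(6, 2, 3)  # => 12
submul(6, 2, 3)  # => 0
```

A separate `a = a + b * c` over big integers allocates an intermediate for `b * c` and a new result; the fused call does neither.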
@cfsamson I'm new to Crystal, very much open to let experienced contributors do the implementations.
I can implement all the mutating methods for `BigInt` and `BigFloat`. The only "ugly" thing is that they'll have names like `add`, `mul`, `sub`, etc., because `+=` and the other operators can't be overloaded (and it would be very difficult to do that without changing a big portion of the language to return references to memory).
So for example:
a = # some big int
b = # some big int
a.add(b) # a += b
The problem is that this might also not be the most efficient way to implement it. For example in Go we have something like this:
a = # some big int
b = # some big int
c = # some big int
a.add(b, c) # a = b + c
this is the most versatile way to implement it, because it lets you reuse all the memory you want. Say for example you need to compute `b + c` and you already have a `BigInt` variable `a` that you don't use anymore: with this you can avoid allocating new memory for the result by reusing the memory in `a`.
So I suggest we implement all the operations like in Go, which also match GMP's interface. We can still keep the usual operators if one doesn't need super extra performance and prefers more readable code. But we can provide another "uglier but more efficient" interface in case you need it.
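To illustrate the proposal, here is a rough Ruby model of the three-operand API (`MutBig` is a made-up name; a real `BigInt` would reuse the receiver's GMP limb storage rather than a boxed Integer):

```ruby
# A tiny mutable wrapper modeling `a.add(b, c)` => a = b + c,
# where the receiver is the destination whose storage is reused.
class MutBig
  attr_accessor :v

  def initialize(v = 0)
    @v = v
  end

  def add(op1, op2)  # a.add(b, c)  ~  mpz_add(a, b, c)
    @v = op1.v + op2.v
    self
  end

  def mul(op1, op2)  # a.mul(b, c)  ~  mpz_mul(a, b, c)
    @v = op1.v * op2.v
    self
  end
end

a = MutBig.new
b = MutBig.new(2)
c = MutBig.new(3)
a.add(b, c)  # a = b + c           => a.v == 5
a.mul(a, a)  # a = a * a, in place => a.v == 25
```

Note that the receiver can also appear as a source operand, which GMP supports too, so a whole loop can run with a fixed set of preallocated destinations.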
Thoughts?
/cc @bcardiff @RX14 @straight-shoota
I'm fine with the suggestion.
@asterite looks good to me. I had the thought to separate the mutable and immutable bigint into different types but I don't think it's a good idea. Anyone who implements it might want to consider the advantages/disadvantages in more depth.
I agree. I actually tried something similar, but I didn't manage to find a way to reuse the memory of a BigInt (struct) without resorting to writing something in C. It's obvious when you write it like that, though; it must be the smartest way of doing it.
Hmm, maybe keep both options available? I just fixed what I got stuck on yesterday with the suggestion from @asterite, and this is the fastest implementation I've gotten in Crystal so far; I have another idea as well that I will try later:
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add(other : Int) : self
    if other < 0
      sub(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add(other.to_big_i)
    end
    self
  end

  def sub(other : Int) : self
    if other < 0
      add(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub(other.to_big_i)
    end
    self
  end

  def sub(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end

  # --------------------- NEW METHODS -----------------------------

  def add(op1 : BigInt, op2 : BigInt) : self
    LibGMP.add(self, op1, op2)
    self
  end

  def add(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      sub(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, op1, op2)
    else
      add(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      add(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, op1, op2)
    else
      sub(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : BigInt) : self
    LibGMP.sub(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : BigInt) : self
    LibGMP.mul(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, op1, op2)
    self
  end
end

# calculate size of buffer
buf_size = 0
order = 2
(0..N).each do |i|
  if i % 10 == 0 && i > 0
    buf_size += 13 + order
    if (i + 10) % 10**order == 0
      order += 1
    end
  end
end

buffer = String::Builder.new(buf_size)

w = 0.to_big_i
k = 1
n1 = 4.to_big_i
n2 = 3.to_big_i
d = 1.to_big_i
i = 0
produced = 0
extracted = 0

while true
  u = n1.tdiv(d)
  v = n2.tdiv(d)
  if (u <=> v) == 0
    # echo
    buffer << u
    i += 1
    break if i == N
    # extract
    u = d * (-10 * u)
    n1.mul(10)
    n1.add(u)
    n2.mul(10)
    n2.add(u)
  else
    # produce
    k2 = k << 1
    u.mul(n1, (k2 - 1))
    v.add(n2, n2)
    w.mul(n1, (k - 1))
    n1.add(u, v)
    u.mul(n2, (k + 2))
    n2.add(w, u)
    d.mul((k2 + 1))
    k += 1
  end
end

print buffer.to_s
NB: I didn't bother to format the output, since that is not time-consuming.
OK, just some performance notes and then I'll stop spamming:
Calculating 10 000 digits on my machine:
The fastest Rust implementation runs at ~0.9 sec
The fastest Go implementation at ~1.15 sec (not using Go's own BigInt type, which is roughly 50% slower)
This implementation in Crystal runs in ~0.9 sec
So I think at least the performance is getting there. The code is probably not perfect with checks etc., but at least it's a proof of concept.
Note that I had to add the following lines to `lib_gmp.cr`:
fun submul_ui = __gmpz_submul_ui(rop : MPZ*, op1 : MPZ*, op2 : ULong)
fun addmul_ui = __gmpz_addmul_ui(rop : MPZ*, op1 : MPZ*, op2 : ULong)
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add_with(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add_with(other : Int) : self
    if other < 0
      sub_with(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add_with(other.to_big_i)
    end
    self
  end

  def sub_with(other : Int) : self
    if other < 0
      add_with(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub_with(other.to_big_i)
    end
    self
  end

  def sub_with(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul_with(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul_with(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul_with(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end

  # --------------------- NEW METHODS -----------------------------

  def add(op1 : BigInt, op2 : BigInt) : self
    LibGMP.add(self, op1, op2)
    self
  end

  def add(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      sub(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, op1, op2)
    else
      add(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      add(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, op1, op2)
    else
      sub(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : BigInt) : self
    LibGMP.sub(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : BigInt) : self
    LibGMP.mul(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, op1, op2)
    self
  end

  def tdiv_q(op1 : BigInt, op2 : BigInt)
    check_division_by_zero op2
    LibGMP.tdiv_q(self, op1, op2)
    self
  end

  def submul(op1 : BigInt, op2 : UInt32)
    LibGMP.submul_ui(self, op1, op2)
    self
  end

  def addmul(op1 : BigInt, op2 : UInt32)
    LibGMP.addmul_ui(self, op1, op2)
    self
  end
end

class Context
  property k = 0
  property tmp1 : BigInt = 0.to_big_i
  property tmp2 : BigInt = 0.to_big_i
  property acc : BigInt = 0.to_big_i
  property den : BigInt = 1.to_big_i
  property num : BigInt = 1.to_big_i

  def extract_digit(nth : UInt32)
    @tmp1.mul(@num, nth)
    @tmp2.add(@tmp1, @acc)
    @tmp1.tdiv_q(@tmp2, @den)
    @tmp1.to_i
  end

  def eliminate_digit(d)
    # x = 0.to_big_i.mul(@den, d)
    @acc.submul(@den, d)
    @acc.mul_with(10)
    @num.mul_with(10)
  end

  def next_term
    @k += 1
    k2 = @k * 2 + 1
    # x = 0.to_big_i.mul(@num, 2)
    @acc.addmul(@num, 2)
    @acc.mul_with(k2)
    @den.mul_with(k2)
    @num.mul_with(@k)
  end
end

# calculate size of buffer
buf_size = 0
order = 2
(0..N).each do |i|
  if i % 10 == 0 && i > 0
    buf_size += 13 + order
    if (i + 10) % 10**order == 0
      order += 1
    end
  end
end

buffer = String::Builder.new(buf_size)
context = Context.new

(1...(N + 1)).each do |i|
  loop do
    context.next_term
    next if context.num > context.acc
    d = context.extract_digit(3)
    next if d != context.extract_digit(4)
    context.eliminate_digit(d.to_u)
    buffer << d
    if i % 10 == 0
      buffer << "\t:%d\n" % i
    end
    break
  end
end

print buffer.to_s
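As a sanity check of the digit-extraction logic above, the same spigot can be written in a few lines of plain Ruby (Ruby Integers are arbitrary precision, standing in for `BigInt`/`mpz_t`; `pidigits` is just an illustrative name, and output formatting is omitted):

```ruby
# Unbounded spigot for pi digits: the same next_term / extract_digit /
# eliminate_digit structure as the Context class, with immutable integers.
def pidigits(count)
  k = 0
  num, den, acc = 1, 1, 0
  digits = +""
  while digits.size < count
    k += 1
    k2 = k * 2 + 1
    acc = (acc + num * 2) * k2          # next_term
    den *= k2
    num *= k
    next if num > acc                   # digit not yet settled
    d = (num * 3 + acc) / den           # extract_digit(3)
    next if d != (num * 4 + acc) / den  # must agree with extract_digit(4)
    digits << d.to_s
    acc = (acc - den * d) * 10          # eliminate_digit
    num *= 10
  end
  digits
end

puts pidigits(10)  # => "3141592653"
```

All the temporaries the mutating Crystal version avoids (`num * 2`, `den * d`, etc.) are allocated fresh here, which is exactly the overhead the in-place GMP calls remove.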
@cfsamson good! Time for a PR?
@j8r Yeah, maybe. And some good news too: I forgot to uncomment a line of code, so it turns out it runs roughly as fast as the Rust version on my Mac, which is pretty good progress from where we started. I updated the code and the time above.
I'm still wondering if we should have both versions, with:
a.add(b, c)
and:
a.add_with(b)
(this is just a suggestion from me; it's probably better to let them just be two overloads of `add`, since I don't know how good English it is to say "sub a with b"...)
And also whether we should add `submul` & `addmul`. However, if they are added, that should be enough to optimize the most used BigInt operations, I think 👍
@cfsamson which line did you refer to with "uncomment a line of code"?
@proyb6 I left the following line in that should not have been there:
x = 0.to_big_i.mul(@den, d)
So we have 4 versions now as you see here: https://gist.github.com/cfsamson/6b176634b83e016932d0df51e186dd9f
_(note this is all numbers from my Mac from 2013)_
Version 1: adding `submul` and `addmul` to lib_gmp.cr; exec time is roughly the same as the Rust impl
Version 2: same as version 1, but without adding `addmul` and `submul` to lib_gmp.cr; ~2.5x slower
Version 3: same as the Python implementation; ~3x slower
Version 4: the original first implementation with mutating BigInts; 3.9x slower than version 1
I removed the initialization of the buffer from the faster implementations; it seems not to be a big factor, and it actually slows down version 1 a bit.
There are probably more ways to do this, but after matching Rust's impl with a small change in the stdlib, I'm not giving the pidigits algorithm any more focus right now :)
@cfsamson I think it would be nice if any contributor could fix this before 0.27 is released.
@proyb6 I'll try a PR for this and get some feedback on the API I think is best.