The previous discussion was in 2015 and tested the pidigits benchmark with BigInteger. How do we implement this with GMP?
https://groups.google.com/forum/#!topic/crystal-lang/cY8bImdHWV4
Testing PHP 7.1 on macOS Mojave, it completed in around 2 seconds for the same number of iterations:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-php-5.html
C, Ruby, and Crystal source code:
https://github.com/kostya/crystal-benchmarks-game#pidigits
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-yarv-1.html
PHP is modifying instances of big int (instead of always returning new ones). Right now we don't have such methods, but we could add them. They would be `add`, `sub`, etc., because you can't redefine `+=`.
If someone wants to implement that, please go ahead.
What I mean is something like this:
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add(other : Int) : self
    if other < 0
      sub(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add(other.to_big_i)
    end
    self
  end

  def sub(other : Int) : self
    if other < 0
      add(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub(other.to_big_i)
    end
    self
  end

  def sub(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end
end

i = 0
k = 0
ns = 0.to_big_i
a = 0.to_big_i
t = 0
u = 0.to_big_i
k1 = 1
n = 1.to_big_i
d = 1.to_big_i

while true
  k += 1
  t = n << 1
  n.mul(k)
  k1 += 2
  a.add(t)
  a.mul(k1)
  d.mul(k1)
  if a >= n
    t, u = (n * 3 + a).divmod(d)
    u.add(n)
    if d > u
      ns.mul(10)
      ns.add(t)
      i += 1
      if i % 10 == 0
        print "%010d\t:%d\n" % {ns.to_u64, i}
        ns = 0.to_big_i
      end
      break if i >= N
      a.sub(d * t)
      a.mul(10)
      n.mul(10)
    end
  end
end
That already works almost twice as fast as the original code.
Do you still want someone else to implement this? I can have a look at it later today I think :)
True, it's twice as fast, closing the gap to ~3x slower than PHP.
Using gtime to measure:
PHP consumes 14.5 MB memory (CPU 99%)
Crystal consumes 3.4 MB memory (CPU 140%)
PHP gains a small performance advantage using ob_start for output buffer.
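The same buffering idea can be sketched in Ruby (using `StringIO`; the Crystal analogue would be `String::Builder`): accumulate the formatted lines in memory and flush once at the end, instead of paying for one `print` per line. The loop body and format string below are illustrative, not taken from the benchmark.

```ruby
require "stringio"

# Accumulate formatted lines in memory, then flush once at the end,
# instead of issuing one write per line of output.
buf = StringIO.new
10.times do |i|
  buf << format("%010d\t:%d\n", i, (i + 1) * 10)  # cheap in-memory append
end
print buf.string  # single write at the end
```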
@cfsamson I wish we could allocate an output buffer for the console as well. The other benchmark in the repo could be improved, since it hasn't been updated for some time.
@proyb6 Are you talking about the pidigits algorithm now or about the changes to big_int.cr asterite is talking about?
About the pidigits algorithm.
@proyb6 Well, we could buffer like the example below. It gives me around 15-20% faster execution. But it's still a long way from closing in on e.g. Rust, which uses GMP, and I can't figure out the exact reason.
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add(other : Int) : self
    if other < 0
      sub(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add(other.to_big_i)
    end
    self
  end

  def sub(other : Int) : self
    if other < 0
      add(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub(other.to_big_i)
    end
    self
  end

  def sub(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end
end

# calculate size of buffer
buf_size = 0
order = 2
(0..N).each do |i|
  if i % 10 == 0 && i > 0
    buf_size += 13 + order
    if (i + 10) % 10**order == 0
      order += 1
    end
  end
end

buffer = String::Builder.new(buf_size)

i = 0
k = 0
ns = 0.to_big_i
a = 0.to_big_i
t = 0
u = 0.to_big_i
k1 = 1
n = 1.to_big_i
d = 1.to_big_i

while true
  k += 1
  t = n << 1
  n.mul(k)
  k1 += 2
  a.add(t)
  a.mul(k1)
  d.mul(k1)
  if a >= n
    t, u = (n * 3 + a).divmod(d)
    u.add(n)
    if d > u
      ns.mul(10)
      ns.add(t)
      i += 1
      if i % 10 == 0
        # print "%010d\t:%d\n" % {ns.to_u64, i}
        buffer << "%010d\t:%d\n" % {ns.to_u64, i}
        ns = 0.to_big_i
      end
      break if i >= N
      a.sub(d * t)
      a.mul(10)
      n.mul(10)
    end
  end
end

print buffer.to_s
@cfsamson My assumption is that using unsafe memory or pointers may speed things up?
I think the main issue to figure out, which would benefit Crystal in other ways as well, is why the interaction with GMP is slower than in PHP and Python. I tried to implement the Crystal version the same way the Python code was written, and it has the same problem as your original code. But the Python code does instantiate new instances as well; there is no good reason I can think of why Python should be doing that faster than Crystal.
Asterite was probably on the right track in reducing the number of new objects, but there must be something else in the Crystal/GMP interaction that is slower than the others, and if we figured that out, the example above should be much faster too. I looked a bit into it, but I don't know enough about the GMP C library to see what could cause it. Maybe a solution is to look into what Python/PHP do in their implementations besides what Asterite mentioned?
For your information, neither Ruby nor Python use GMP for big integers. They use custom code.
On the move now, so just a brief response, but if you look at the benchmarks, they use gmpy2 and it's surprisingly fast. I haven't looked into gmpy2 closely yet but assumed it was GMP beneath it.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-python3-2.html
Ooh... The benchmark does use GMP. I was talking about Python and Ruby big integers.
Now, we are quite intrigued to learn more too, because it will be useful for e-commerce, financial, and scientific computation.
:) Well, it looks like all the fastest benchmarks are using GMP; the top 10-15 entries are probably just comparing how efficiently they integrate with an external C library. Still, Crystal should be doing really well there, so it's interesting to look into.
I've looked into this; unfortunately I haven't found the reason for the differences. Python uses GMP with a heavily customized C wrapper, it looks like, and Rust uses almost pure GMP for the heavy lifting, but in practice it looks quite similar to the implementation above, preferring to mutate an object instead of instantiating new ones. I have some questions though, maybe @asterite has some answers:
When you pass in `self` to `LibGMP`, is the value copied as a new value since it's a struct, or does it mutate? And is copying a possibly large object like `BigInt` costly?
Are C calls from Crystal generally cheap, or could they incur some overhead of significance?
> When you pass in `self` to `LibGMP`, is the value copied as a new value since it's a struct, or does it mutate? And is copying a possibly large object like `BigInt` costly?
No, because you're passing in a pointer: passing `self` to a `LibGMP` call actually calls `self.to_unsafe`, which returns a pointer.
> Are C calls from Crystal generally cheap, or could they incur some overhead of significance?
They have zero overhead; calls are generated exactly the same way a C compiler would generate them.
@cfsamson
I had a look at the Nim repo, which compares GMP `mpz_t` and BigInts; the latter is significantly slower.
https://github.com/def-/nim-benchmarksgame/blob/master/pidigits_gmp.nim
https://github.com/def-/nim-benchmarksgame/blob/master/pidigits_bigints.nim
Is it something we can test with mpz_t instead of Crystal BigInts too?
https://github.com/crystal-lang/crystal/blob/3b3bf3fa2bb51cc15b6db3829c3c54493c8924d6/src/big/big_int.cr
@proyb6 If I'm not mistaken, the BigInts library in Nim is Nim's own implementation of a BigInt library, and the other one uses GMP, which is the same as what Crystal does; Crystal only wraps calls to GMP in a Crystal struct to make it nicer to use and a natural part of the language. In the benchmarks, no one can match the optimizations in GMP with their own implementations.
But the big difference is that Go, Nim, and Rust all use the GMP library directly in the benchmarks, instead of wrapping it as a part of their language. That might enable some more efficiency specifically for this benchmark. E.g. both Rust and Nim use functions like `mpz_addmul_ui`, `mpz_submul_ui`, and `mpz_tdiv_q` for some operations; the Crystal implementation above does not use these. I don't know how much of an impact that has, but that's the only kind of difference I can think of after asterite made the adjustments above to avoid the new allocations that were there originally.
So we could make the Crystal implementation similar to Nim and Rust, and it would probably be faster; if you want to have a go at it, try it :) That would be comparing apples to apples at least.
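For what it's worth, the arithmetic those fused GMP calls perform can be sketched with plain Ruby integers. This shows only the semantics; the real `mpz_addmul_ui`/`mpz_submul_ui` mutate their first argument in place and avoid allocating a temporary for the product, which is where the speedup comes from.

```ruby
# Semantics of GMP's fused multiply-add/multiply-sub, with plain integers:
def addmul(rop, op1, op2)  # mpz_addmul_ui: rop += op1 * op2
  rop + op1 * op2
end

def submul(rop, op1, op2)  # mpz_submul_ui: rop -= op1 * op2
  rop - op1 * op2
end

addmul(6, 2, 3)  # => 12
submul(6, 2, 3)  # => 0
```

A separate `a = a + b * c` over big integers allocates an intermediate for `b * c` and a new result; the fused call does neither.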
@cfsamson I'm new to Crystal, very much open to let experienced contributors do the implementations.
I can implement all the mutating methods for `BigInt` and `BigFloat`. The only "ugly" thing is that they'll have names like `add`, `mul`, `sub`, etc., because `+=` and the other operators can't be overloaded (and it would be very difficult to do that without changing a big portion of the language to return references to memory).
So for example:
a = # some big int
b = # some big int
a.add(b) # a += b
The problem is that this might also not be the most efficient way to implement it. For example in Go we have something like this:
a = # some big int
b = # some big int
c = # some big int
a.add(b, c) # a = b + c
this is the most versatile way to implement it, because it lets you reuse all the memory you want. Say for example you need to compute `b + c` and you already have a `BigInt` variable `a` that you don't use anymore: with this you can avoid allocating new memory for the result by reusing the memory in `a`.
So I suggest we implement all the operations like in Go, which also match GMP's interface. We can still keep the usual operators if one doesn't need super extra performance and prefers more readable code. But we can provide another "uglier but more efficient" interface in case you need it.
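To illustrate the proposal, here is a rough Ruby model of the three-operand API (`MutBig` is a made-up name; a real `BigInt` would reuse the receiver's GMP limb storage rather than a boxed Integer):

```ruby
# A tiny mutable wrapper modeling `a.add(b, c)` => a = b + c,
# where the receiver is the destination whose storage is reused.
class MutBig
  attr_accessor :v

  def initialize(v = 0)
    @v = v
  end

  def add(op1, op2)  # a.add(b, c)  ~  mpz_add(a, b, c)
    @v = op1.v + op2.v
    self
  end

  def mul(op1, op2)  # a.mul(b, c)  ~  mpz_mul(a, b, c)
    @v = op1.v * op2.v
    self
  end
end

a = MutBig.new
b = MutBig.new(2)
c = MutBig.new(3)
a.add(b, c)  # a = b + c           => a.v == 5
a.mul(a, a)  # a = a * a, in place => a.v == 25
```

Note that the receiver can also appear as a source operand, which GMP supports too, so a whole loop can run with a fixed set of preallocated destinations.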
Thoughts?
/cc @bcardiff @RX14 @straight-shoota
I'm fine with the suggestion.
@asterite looks good to me. I had the thought to separate the mutable and immutable bigint into different types but I don't think it's a good idea. Anyone who implements it might want to consider the advantages/disadvantages in more depth.
I agree. I actually tried something similar, but I didn't manage to find a way to reuse the memory of a BigInt (struct) without resorting to writing something in C. It's obvious when you write it like that, though; it must be the smartest way of doing it.
Hmm, maybe keep both options available? I just fixed what I got stuck on yesterday with the suggestion from @asterite, and this is the fastest implementation I've gotten in Crystal so far; I have another idea as well that I will try later:
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add(other : Int) : self
    if other < 0
      sub(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add(other.to_big_i)
    end
    self
  end

  def sub(other : Int) : self
    if other < 0
      add(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub(other.to_big_i)
    end
    self
  end

  def sub(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end

  # --------------------- NEW METHODS -----------------------------

  def add(op1 : BigInt, op2 : BigInt) : self
    LibGMP.add(self, op1, op2)
    self
  end

  def add(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      sub(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, op1, op2)
    else
      add(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      add(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, op1, op2)
    else
      sub(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : BigInt) : self
    LibGMP.sub(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : BigInt) : self
    LibGMP.mul(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, op1, op2)
    self
  end
end

# calculate size of buffer
buf_size = 0
order = 2
(0..N).each do |i|
  if i % 10 == 0 && i > 0
    buf_size += 13 + order
    if (i + 10) % 10**order == 0
      order += 1
    end
  end
end

buffer = String::Builder.new(buf_size)

w = 0.to_big_i
k = 1
n1 = 4.to_big_i
n2 = 3.to_big_i
d = 1.to_big_i
i = 0
produced = 0
extracted = 0

while true
  u = n1.tdiv(d)
  v = n2.tdiv(d)
  if (u <=> v) == 0
    # echo
    buffer << u
    i += 1
    break if i == N
    # extract
    u = d * (-10 * u)
    n1.mul(10)
    n1.add(u)
    n2.mul(10)
    n2.add(u)
  else
    # produce
    k2 = k << 1
    u.mul(n1, (k2 - 1))
    v.add(n2, n2)
    w.mul(n1, (k - 1))
    n1.add(u, v)
    u.mul(n2, (k + 2))
    n2.add(w, u)
    d.mul((k2 + 1))
    k += 1
  end
end

print buffer.to_s
NB: I didn't bother to format the output, since that is not time-consuming.
OK, just some performance notes and then I'll stop spamming:
Calculating 10 000 digits on my machine:
The fastest Rust implementation runs at ~0.9 sec
The fastest Go implementation at ~1.15 sec (not using Go's own BigInt type, which is roughly 50% slower)
This implementation in Crystal runs in ~0.9 sec
So I think at least the performance is getting there. The code is probably not perfect with checks etc., but at least it's a proof of concept.
Note that I had to add the following lines to `lib_gmp.cr`:
fun submul_ui = __gmpz_submul_ui(rop : MPZ*, op1 : MPZ*, op2 : ULong)
fun addmul_ui = __gmpz_addmul_ui(rop : MPZ*, op1 : MPZ*, op2 : ULong)
require "big"

N = (ARGV[0]? || 100).to_i

struct BigInt
  def add_with(other : BigInt) : self
    LibGMP.add(self, self, other)
    self
  end

  def add_with(other : Int) : self
    if other < 0
      sub_with(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, self, other)
    else
      add_with(other.to_big_i)
    end
    self
  end

  def sub_with(other : Int) : self
    if other < 0
      add_with(other.abs)
    elsif other <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, self, other)
    else
      sub_with(other.to_big_i)
    end
    self
  end

  def sub_with(other : BigInt) : self
    LibGMP.sub(self, self, other)
    self
  end

  def mul_with(other : BigInt) : self
    LibGMP.mul(self, self, other)
    self
  end

  def mul_with(other : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, self, other)
    self
  end

  def mul_with(other : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, self, other)
    self
  end

  # --------------------- NEW METHODS -----------------------------

  def add(op1 : BigInt, op2 : BigInt) : self
    LibGMP.add(self, op1, op2)
    self
  end

  def add(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      sub(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.add_ui(self, op1, op2)
    else
      add(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : Int) : self
    if op2 < 0
      add(op1, op2.abs)
    elsif op2 <= LibGMP::ULong::MAX
      LibGMP.sub_ui(self, op1, op2)
    else
      sub(op1, op2.to_big_i)
    end
    self
  end

  def sub(op1 : BigInt, op2 : BigInt) : self
    LibGMP.sub(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : BigInt) : self
    LibGMP.mul(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveSigned) : self
    LibGMP.mul_si(self, op1, op2)
    self
  end

  def mul(op1 : BigInt, op2 : LibGMP::IntPrimitiveUnsigned) : self
    LibGMP.mul_ui(self, op1, op2)
    self
  end

  def tdiv_q(op1 : BigInt, op2 : BigInt)
    check_division_by_zero op2
    LibGMP.tdiv_q(self, op1, op2)
    self
  end

  def submul(op1 : BigInt, op2 : UInt32)
    LibGMP.submul_ui(self, op1, op2)
    self
  end

  def addmul(op1 : BigInt, op2 : UInt32)
    LibGMP.addmul_ui(self, op1, op2)
    self
  end
end

class Context
  property k = 0
  property tmp1 : BigInt = 0.to_big_i
  property tmp2 : BigInt = 0.to_big_i
  property acc : BigInt = 0.to_big_i
  property den : BigInt = 1.to_big_i
  property num : BigInt = 1.to_big_i

  def extract_digit(nth : UInt32)
    @tmp1.mul(@num, nth)
    @tmp2.add(@tmp1, @acc)
    @tmp1.tdiv_q(@tmp2, @den)
    @tmp1.to_i
  end

  def eliminate_digit(d)
    # x = 0.to_big_i.mul(@den, d)
    @acc.submul(@den, d)
    @acc.mul_with(10)
    @num.mul_with(10)
  end

  def next_term
    @k += 1
    k2 = @k * 2 + 1
    # x = 0.to_big_i.mul(@num, 2)
    @acc.addmul(@num, 2)
    @acc.mul_with(k2)
    @den.mul_with(k2)
    @num.mul_with(@k)
  end
end

# calculate size of buffer
buf_size = 0
order = 2
(0..N).each do |i|
  if i % 10 == 0 && i > 0
    buf_size += 13 + order
    if (i + 10) % 10**order == 0
      order += 1
    end
  end
end

buffer = String::Builder.new(buf_size)
context = Context.new

(1...(N + 1)).each do |i|
  loop do
    context.next_term
    next if context.num > context.acc
    d = context.extract_digit(3)
    next if d != context.extract_digit(4)
    context.eliminate_digit(d.to_u)
    buffer << d
    if i % 10 == 0
      buffer << "\t:%d\n" % i
    end
    break
  end
end

print buffer.to_s
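As a sanity check of the digit-extraction logic above, the same spigot can be written in a few lines of plain Ruby (Ruby Integers are arbitrary precision, standing in for `BigInt`/`mpz_t`; `pidigits` is just an illustrative name, and output formatting is omitted):

```ruby
# Unbounded spigot for pi digits: the same next_term / extract_digit /
# eliminate_digit structure as the Context class, with immutable integers.
def pidigits(count)
  k = 0
  num, den, acc = 1, 1, 0
  digits = +""
  while digits.size < count
    k += 1
    k2 = k * 2 + 1
    acc = (acc + num * 2) * k2          # next_term
    den *= k2
    num *= k
    next if num > acc                   # digit not yet settled
    d = (num * 3 + acc) / den           # extract_digit(3)
    next if d != (num * 4 + acc) / den  # must agree with extract_digit(4)
    digits << d.to_s
    acc = (acc - den * d) * 10          # eliminate_digit
    num *= 10
  end
  digits
end

puts pidigits(10)  # => "3141592653"
```

All the temporaries the mutating Crystal version avoids (`num * 2`, `den * d`, etc.) are allocated fresh here, which is exactly the overhead the in-place GMP calls remove.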
@cfsamson good! Time for a PR?
@j8r Yeah, maybe. And some good news too: I forgot to uncomment a line of code, so it turns out it runs roughly as fast as the Rust version on my Mac, which is pretty good progress from where we started. I updated the code and the time above.
I'm still wondering if we should have both versions, with:
a.add(b, c)
and:
a.add_with(b)
(this is just a suggestion from me; it's probably better to let them just be two overloads of `add`, since I don't know how good English it is to say "sub a with b"...)
And also whether we should add `submul` & `addmul`. However, if they are added, that should be enough to optimize the most used BigInt operations, I think 👍
@cfsamson which line did you refer to with "uncomment a line of code"?
@proyb6 I left the following line in that should not have been there:
x = 0.to_big_i.mul(@den, d)
So we have 4 versions now as you see here: https://gist.github.com/cfsamson/6b176634b83e016932d0df51e186dd9f
_(note this is all numbers from my Mac from 2013)_
Version 1: adding `submul` and `addmul` to lib_gmp.cr; exec time is roughly the same as the Rust impl
Version 2: same as version 1, but without adding `addmul` and `submul` to lib_gmp.cr; ~2.5x slower
Version 3: same as the Python implementation; ~3x slower
Version 4: the original first implementation with mutating BigInts; 3.9x slower than version 1
I removed the initialization of the buffer from the faster implementations; it seems not to be a big factor, and it actually slows down version 1 a bit.
There are probably more ways to do this, but after matching Rust's impl with a small change in the stdlib, I'm not giving the pidigits algorithm any more focus right now :)
@cfsamson I think it would be nice if any contributor could fix this before 0.27 is released.
@proyb6 I'll try a PR for this and get some feedback on the API I think is best.