Turing.jl: Simple benchmarking for Tracker, ReverseDiff, and Zygote

Created on 2 Mar 2020  Â·  27Comments  Â·  Source: TuringLang/Turing.jl

The following is a simple benchmarking for loops, on 3 different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: ReverseDiff > Tracker > Zygote.

julia> mysum(x) = begin z = 0; for i =1:length(x); z += x[i]; end; return z; end
mysum (generic function with 1 method)

julia> using Zygote, Tracker, ReverseDiff, BenchmarkTools

julia> x = rand(10000);

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum, x);
  491.095 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum, x);
  976.786 ms (280102 allocations: 1.50 GiB)

Benchmark setup:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [0bf59076] AdvancedHMC v0.2.20
  [131c737c] ArviZ v0.2.4
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [5d742f6a] CSVFiles v1.0.0
  [8f4d0f93] Conda v1.3.0
  [a93c6f00] DataFrames v0.20.0
  [163ba53b] DiffResults v1.0.2
  [31c24e10] Distributions v0.21.12
  [bbc10e6e] DynamicHMC v2.1.3
  [f6369f11] ForwardDiff v0.10.8
  [7073ff75] IJulia v1.20.2
  [e5e0dc1b] Juno v0.7.2
  [c7f686f2] MCMCChains v1.0.0
  [91a5bcdd] Plots v0.28.4
  [438e738f] PyCall v1.91.2
  [37e2e3b7] ReverseDiff v1.1.0
  [2913bbd2] StatsBase v0.32.0
  [4c63d2b9] StatsFuns v0.9.3
  [f3b207a7] StatsPlots v0.13.0
  [1d978283] TensorFlow v0.11.0
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.8.2
  [0f1e0344] WebIO v0.8.13
  [e88e6eb3] Zygote v0.4.9
  [ade2ca70] Dates 


AD discussion

Most helpful comment

thanks @devmotion, I rerun the benchmarks with $x. The results aren't very different from above ones.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0;_x=x[1]; for i =1:10000; z += _x; end; return z; end
mysum (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

This result together with the above ones suggest:

  • Zygote isn't that bad with loops (see https://github.com/FluxML/Zygote.jl/issues/371 and https://github.com/FluxML/Zygote.jl/issues/312), but bad at indexing (see https://github.com/FluxML/Zygote.jl/issues/365)
  • ReverseDiff is ok with both loops and indexing
  • Tracker is quite bad with loops, not sure about indexing.
  • Tracker is good with loops, but quite bad with indexing
  • PyTorch / ThArrays is ok with loops but bad with indexing.

Ps

All 27 comments

cc: @phipsgabler

Numbers using PyTorch's via ThArrays

julia> using ThArrays

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

This example might be over-favourable to ReverseDiff. @mohamed82008 @xukai92

@KDr2 can you check whether ThArrays have any type-instability issue?

julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
  #self#::Core.Compiler.Const(ThArrays.gradient, false)
  f::Core.Compiler.Const(mysum, false)
  data::Tuple{Array{Float64,1}}

Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│   %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└──      return %2

@yebai
define

+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)

and

-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
     outputs__ = Int[0]
     __cret = ccall((:atg_grad, :libtorch_capi),
                  Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
                  outputs__, self.pointer)
-    __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+    __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
     return __o_1
 end

can erase the type warnings, but I think (and from my experiment on this example) it has little influence on the benchmark results.

Maybe do a comparison on ThArrays and PyTorch to see if the result is sensible?

Maybe do a comparison on ThArrays and PyTorch to see if the result is sensible?

Can you code this example in C++, and report the run time?

#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>

#include <typeinfo>
#include <iostream>

int main() {
    torch::Tensor t = torch::rand({10000}, torch::requires_grad(1));
    torch::Tensor r(t[0].clone());
    r.set_requires_grad(1);

    for(int i=0; i< 10000; i++ ) {
        r += t[i];
    }
    r.backward();
    // std::cout << t.grad() << std::endl;
    t.grad();
    return 0;
}

you can replace csrc/example_app.cpp with this code, then run make in deps/lib, there will be an example.exe.

On my machine, it takes 0.5s while the Julia version takes 1.25s.

I think in the Julia version most time is spent on indexing op. In C++, these ops are single function call (and maybe inlined), while in the Julia version, each op contains many function calls.

If you use sum instead of mysum, Julia and C++ are at the same level.

I don't if we can compile our function to torchscript, if we can, these time-consuming ops would disappear at runtime.

you can replace csrc/example_app.cpp with this code, then run make in deps/lib, there will be an example.exe.

Are you running this Windows?

On my machine, it takes 0.5s while the Julia version takes 1.25s.

This result (0.5s) seems consistent with julia when using BenchmarkTools:

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

No, on linux, it's a typo, it's an ELF, sorry for the confusion.

$ file example-app
example-app: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4cbe6786f7f899d38c1cc6be65e7435aba5faa0b, for GNU/Linux 3.2.0, not stripped

On my (old) machine, it takes 1.25s in Julia, takes 0.5s in C++.

julia> @btime ThArrays.gradient(mysum, x);
  1.257 s (188995 allocations: 7.92 MiB)

Julia>

$ time ./example-app >/dev/null

real    0m0.519s
user    0m0.491s
sys     0m0.013s

thanks, @KDr2

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

Based on the C++ runtime for PyTorch, maybe the superiority of ReverseDiff is mainly from tape-caching, which avoids duplicate memory allocation?

I don't if we can compile our function to torchscript, if we can, these time-consuming ops would disappear at runtime.

This sounds plausible for a subset of Julia, given that Julia's syntax is so close with Python (torchscript is a subset of Python).

I think in the Julia version most time is spent on indexing op. In C++, these ops are single function call (and maybe inlined), while in the Julia version, each op contains many function calls.

I'm wondering if adding @inbounds to mysum could improve the performance of the Julia version?

In real-world practice, we may not do indexing in such a frequency, e.g. we should use sum instead of mysum.

About torchscript, I didn't mean a source to source translation, I think we can construct computational graph using Julia, then save the graph as torchscript or some internal format, then call it in Turing.

About torchscript, I didn't mean a source to source translation, I think we can construct computational graph using Julia, then save the graph as torchscript or some internal format, then call it in Turing.

That works too, but at a cost of losing control flows: loops are unrolled, branches are recorded for specific branches only.

The numbers are slightly better with @inbounds

julia> mysum_inbounds(x) = begin z = 0; @inbounds for i =1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
  2.436 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum_inbounds, x);
  439.518 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum_inbounds, x);
  856.985 ms (280102 allocations: 1.50 GiB)

julia> @btime ThArrays.gradient(mysum_inbounds, x);
  465.540 ms (160017 allocations: 7.32 MiB)

I guess you should also use $ to interpolate x. It might affect the results since you're benchmarking with a global variable.

thanks @devmotion, I rerun the benchmarks with $x. The results aren't very different from above ones.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0;_x=x[1]; for i =1:10000; z += _x; end; return z; end
mysum (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

This result together with the above ones suggest:

  • Zygote isn't that bad with loops (see https://github.com/FluxML/Zygote.jl/issues/371 and https://github.com/FluxML/Zygote.jl/issues/312), but bad at indexing (see https://github.com/FluxML/Zygote.jl/issues/365)
  • ReverseDiff is ok with both loops and indexing
  • Tracker is quite bad with loops, not sure about indexing.
  • Tracker is good with loops, but quite bad with indexing
  • PyTorch / ThArrays is ok with loops but bad with indexing.

Ps

Tracker is quite bad with loops, not sure about indexing.

Tracker is actually the fastest in the loop benchmark, isn't it?

Tracker is actually the fastest in the loop benchmark, isn't it?

You're right! I didn't notice the unit!

Interesting ReverseDiff is so good with indexing, and wonder whether we can apply the same trick from ReverseDiff to improve Tracker and Zygote.

On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting.

But I am trying to make RD faster.

I did some optimizations on ThArrays, https://github.com/TuringLang/ThArrays.jl/commit/c0217179fcc99e237e04ccb9161c2d0c3fc96003, indexing is a little faster(10%?) now.

I just re-run the same benchmarks. Here is an update:

julia> @btime ReverseDiff.gradient(mysum, x);

  1.194 ms (70023 allocations: 3.07 MiB)

julia> @btime Tracker.gradient(mysum, x);
  86.576 ms (190012 allocations: 769.27 MiB)

julia> @btime Zygote.gradient(mysum, x);


  90.346 ms (180107 allocations: 769.40 MiB)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

hessammehr picture hessammehr  Â·  4Comments

Vaibhavdixit02 picture Vaibhavdixit02  Â·  4Comments

xukai92 picture xukai92  Â·  5Comments

mohamed82008 picture mohamed82008  Â·  3Comments

yebai picture yebai  Â·  5Comments