Turing.jl: Simple benchmarking for Tracker, ReverseDiff, and Zygote

Created on 2 Mar 2020 · 27Comments · Source: TuringLang/Turing.jl

The following is a simple benchmarking for loops, on 3 different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: ReverseDiff > Tracker > Zygote.

julia> mysum(x) = begin z = 0; for i =1:length(x); z += x[i]; end; return z; end
mysum (generic function with 1 method)

julia> using Zygote, Tracker, ReverseDiff, BenchmarkTools

julia> x = rand(10000);

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum, x);
  491.095 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum, x);
  976.786 ms (280102 allocations: 1.50 GiB)

Benchmark setup:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [0bf59076] AdvancedHMC v0.2.20
  [131c737c] ArviZ v0.2.4
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [5d742f6a] CSVFiles v1.0.0
  [8f4d0f93] Conda v1.3.0
  [a93c6f00] DataFrames v0.20.0
  [163ba53b] DiffResults v1.0.2
  [31c24e10] Distributions v0.21.12
  [bbc10e6e] DynamicHMC v2.1.3
  [f6369f11] ForwardDiff v0.10.8
  [7073ff75] IJulia v1.20.2
  [e5e0dc1b] Juno v0.7.2
  [c7f686f2] MCMCChains v1.0.0
  [91a5bcdd] Plots v0.28.4
  [438e738f] PyCall v1.91.2
  [37e2e3b7] ReverseDiff v1.1.0
  [2913bbd2] StatsBase v0.32.0
  [4c63d2b9] StatsFuns v0.9.3
  [f3b207a7] StatsPlots v0.13.0
  [1d978283] TensorFlow v0.11.0
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.8.2
  [0f1e0344] WebIO v0.8.13
  [e88e6eb3] Zygote v0.4.9
  [ade2ca70] Dates

AD discussion

Source

yebai

Most helpful comment

thanks @devmotion, I rerun the benchmarks with $x. The results aren't very different from above ones.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0;_x=x[1]; for i =1:10000; z += _x; end; return z; end
mysum (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

This result together with the above ones suggest:

Zygote isn't that bad with loops (see https://github.com/FluxML/Zygote.jl/issues/371 and https://github.com/FluxML/Zygote.jl/issues/312), but bad at indexing (see https://github.com/FluxML/Zygote.jl/issues/365)
ReverseDiff is ok with both loops and indexing
~~Tracker is quite bad with loops, not sure about indexing.~~
Tracker is good with loops, but quite bad with indexing
PyTorch / ThArrays is ok with loops but bad with indexing.

Zygote is slow with broadcasting: https://github.com/FluxML/Zygote.jl/issues/592

yebai on 26 Mar 2020

👍3

All 27 comments

cc: @phipsgabler

trappmartin on 4 Mar 2020

There is a more comprehensive one avaiable here: https://github.com/TuringLang/TuringExamples/blob/kx/improve_functionality/benchmarks/auto_diff/loop-vs-unrolled.ipynb

xukai92 on 4 Mar 2020

👍1

Numbers using PyTorch's via ThArrays

julia> using ThArrays

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

yebai on 26 Mar 2020

👍2

This example might be over-favourable to ReverseDiff. @mohamed82008 @xukai92

yebai on 26 Mar 2020

@KDr2 can you check whether ThArrays have any type-instability issue?

julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
  #self#::Core.Compiler.Const(ThArrays.gradient, false)
  f::Core.Compiler.Const(mysum, false)
  data::Tuple{Array{Float64,1}}

Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│   %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└──      return %2

yebai on 26 Mar 2020

@yebai
define

+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)

and

-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
     outputs__ = Int[0]
     __cret = ccall((:atg_grad, :libtorch_capi),
                  Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
                  outputs__, self.pointer)
-    __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+    __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
     return __o_1
 end

can erase the type warnings, but I think (and from my experiment on this example) it has little influence on the benchmark results.

Maybe do a comparison on ThArrays and PyTorch to see if the result is sensible?

KDr2 on 26 Mar 2020

Maybe do a comparison on ThArrays and PyTorch to see if the result is sensible?

Can you code this example in C++, and report the run time?

yebai on 26 Mar 2020

#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>

#include <typeinfo>
#include <iostream>

int main() {
    torch::Tensor t = torch::rand({10000}, torch::requires_grad(1));
    torch::Tensor r(t[0].clone());
    r.set_requires_grad(1);

    for(int i=0; i< 10000; i++ ) {
        r += t[i];
    }
    r.backward();
    // std::cout << t.grad() << std::endl;
    t.grad();
    return 0;
}

you can replace csrc/example_app.cpp with this code, then run make in deps/lib, there will be an example.exe.

On my machine, it takes 0.5s while the Julia version takes 1.25s.

I think in the Julia version most time is spent on indexing op. In C++, these ops are single function call (and maybe inlined), while in the Julia version, each op contains many function calls.

If you use sum instead of mysum, Julia and C++ are at the same level.

I don't if we can compile our function to torchscript, if we can, these time-consuming ops would disappear at runtime.

KDr2 on 26 Mar 2020

you can replace csrc/example_app.cpp with this code, then run make in deps/lib, there will be an example.exe.

Are you running this Windows?

yebai on 26 Mar 2020

On my machine, it takes 0.5s while the Julia version takes 1.25s.

This result (0.5s) seems consistent with julia when using BenchmarkTools:

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

yebai on 26 Mar 2020

No, on linux, it's a typo, it's an ELF, sorry for the confusion.

$ file example-app
example-app: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4cbe6786f7f899d38c1cc6be65e7435aba5faa0b, for GNU/Linux 3.2.0, not stripped

On my (old) machine, it takes 1.25s in Julia, takes 0.5s in C++.

KDr2 on 26 Mar 2020

julia> @btime ThArrays.gradient(mysum, x);
  1.257 s (188995 allocations: 7.92 MiB)

Julia>

$ time ./example-app >/dev/null

real    0m0.519s
user    0m0.491s
sys     0m0.013s

KDr2 on 26 Mar 2020

thanks, @KDr2

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

Based on the C++ runtime for PyTorch, maybe the superiority of ReverseDiff is mainly from tape-caching, which avoids duplicate memory allocation?

yebai on 26 Mar 2020

I don't if we can compile our function to torchscript, if we can, these time-consuming ops would disappear at runtime.

This sounds plausible for a subset of Julia, given that Julia's syntax is so close with Python (torchscript is a subset of Python).

yebai on 26 Mar 2020

I think in the Julia version most time is spent on indexing op. In C++, these ops are single function call (and maybe inlined), while in the Julia version, each op contains many function calls.

I'm wondering if adding @inbounds to mysum could improve the performance of the Julia version?

devmotion on 26 Mar 2020

In real-world practice, we may not do indexing in such a frequency, e.g. we should use sum instead of mysum.

About torchscript, I didn't mean a source to source translation, I think we can construct computational graph using Julia, then save the graph as torchscript or some internal format, then call it in Turing.

KDr2 on 26 Mar 2020

About torchscript, I didn't mean a source to source translation, I think we can construct computational graph using Julia, then save the graph as torchscript or some internal format, then call it in Turing.

That works too, but at a cost of losing control flows: loops are unrolled, branches are recorded for specific branches only.

yebai on 26 Mar 2020

The numbers are slightly better with @inbounds

julia> mysum_inbounds(x) = begin z = 0; @inbounds for i =1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
  2.436 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum_inbounds, x);
  439.518 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum_inbounds, x);
  856.985 ms (280102 allocations: 1.50 GiB)

julia> @btime ThArrays.gradient(mysum_inbounds, x);
  465.540 ms (160017 allocations: 7.32 MiB)

yebai on 26 Mar 2020

I guess you should also use $ to interpolate x. It might affect the results since you're benchmarking with a global variable.

devmotion on 26 Mar 2020

thanks @devmotion, I rerun the benchmarks with $x. The results aren't very different from above ones.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0;_x=x[1]; for i =1:10000; z += _x; end; return z; end
mysum (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

This result together with the above ones suggest:

Zygote isn't that bad with loops (see https://github.com/FluxML/Zygote.jl/issues/371 and https://github.com/FluxML/Zygote.jl/issues/312), but bad at indexing (see https://github.com/FluxML/Zygote.jl/issues/365)
ReverseDiff is ok with both loops and indexing
~~Tracker is quite bad with loops, not sure about indexing.~~
Tracker is good with loops, but quite bad with indexing
PyTorch / ThArrays is ok with loops but bad with indexing.

Zygote is slow with broadcasting: https://github.com/FluxML/Zygote.jl/issues/592

yebai on 26 Mar 2020

👍3

Tracker is quite bad with loops, not sure about indexing.

Tracker is actually the fastest in the loop benchmark, isn't it?

devmotion on 26 Mar 2020

Tracker is actually the fastest in the loop benchmark, isn't it?

You're right! I didn't notice the unit!

yebai on 26 Mar 2020

Interesting ReverseDiff is so good with indexing, and wonder whether we can apply the same trick from ReverseDiff to improve Tracker and Zygote.

yebai on 26 Mar 2020

On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting.

mohamed82008 on 27 Mar 2020

But I am trying to make RD faster.

mohamed82008 on 27 Mar 2020

I did some optimizations on ThArrays, https://github.com/TuringLang/ThArrays.jl/commit/c0217179fcc99e237e04ccb9161c2d0c3fc96003, indexing is a little faster(10%?) now.

KDr2 on 27 Mar 2020

I just re-run the same benchmarks. Here is an update:

julia> @btime ReverseDiff.gradient(mysum, x);

  1.194 ms (70023 allocations: 3.07 MiB)

julia> @btime Tracker.gradient(mysum, x);
  86.576 ms (190012 allocations: 769.27 MiB)

julia> @btime Zygote.gradient(mysum, x);


  90.346 ms (180107 allocations: 769.40 MiB)

yebai on 19 Jun 2021

Was this page helpful?

0 / 5 - 0 ratings

Related issues

JuliaCon hackathon?

hessammehr · 4Comments

Argument to pass warmup sample size.

Vaibhavdixit02 · 4Comments

RFC Sampler type

xukai92 · 5Comments

rand() and broadcast question

mohamed82008 · 3Comments

Distributions not compatible with AD (ForwardDiff/Flux.Tracker)

yebai · 5Comments