Cataclysm-dda: Slowdowns since forever; more profiling needed

Created on 15 Nov 2016 · 17 Comments · Source: CleverRaven/Cataclysm-DDA

I did some basic profiling while playing a debug build for ~3 minutes, but I can't read the output well.

Half of the time in the test "disappeared". I'm not sure where it went - some might have gone to profiler-internal functions, but the entire time spent in children of the main menu (i.e. anything you can reach from the main menu) is below 50%, meaning that 50% of the profiling time wasn't spent in the game itself.
The percentage values should probably be treated as being roughly twice as big as listed.

Here's what I think I understood:

index % time    self  children    called     name
[3]     30.5    0.00   72.02       3         game::draw() [3]
                0.00   38.72       3/3           game::draw_ter(bool) [4]
                0.03   14.07       3/6           map::build_map_cache(int, bool) [8]
                0.01    9.07       3/3           game::draw_sidebar() [24]
                1.22    7.64       3/6           map::update_visibility_cache(int) [13]
                0.00    1.22       3/78          wrefresh(WINDOW*) [6]

Looks like drawing takes 30% of CPU time, but only once we include vision calculations and UI drawing.
I'm pretty sure this is drawing to a buffer, not drawing directly to the screen.

[7]     13.4    2.98   28.74      46         curses_drawwindow(WINDOW*) [7]
                1.27   11.59  135192/135194      widen(std::string const&) [17]
                5.91    0.00  136664/136664      FillRectDIB(int, int, int, int, unsigned char) [36]
                0.41    5.15  135192/265671      mk_wcwidth(unsigned int) <cycle 9> [237]

That FillRectDIB takes a fair bit of time - even in console builds, where it only does a memset.
It may be worth risking some overdraw, or even allowing overdraw as an option.

[15]     6.2    4.42   10.24  104544         map::apparent_light_at(tripoint const&, visibility_variables const&) const [15]
                1.03    4.02  104538/115126      map::get_cache_ref(int) const [39]
                0.55    2.78  104544/158306      rl_dist(tripoint const&, tripoint const&) [44]
                0.69    0.71  104538/104550      player::unimpaired_range() const [119]
                0.46    0.00  104544/105002      player::pos() const [330]

map::apparent_light_at could use a lot of caching. The vast majority of its invocations happen in a loop, but despite that it still uses safe cache access and re-reads the player's stats on every call.
Also, it builds a "distance field" for the whole reality bubble.
It may be worth scrapping the function and instead building the apparent light values as a dedicated cache, with apparent_light_at repurposed as a cache lookup only. Then all the caches and variables could be fetched just once, and distance could be estimated with some simple heuristics.
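A minimal sketch of that restructuring (all names illustrative, not the real DDA types): compute the whole apparent-light grid once per turn, capturing caches and player stats a single time, so the per-tile query becomes a plain array read.

```cpp
#include <array>

// Illustrative sketch, not the real DDA types: the apparent-light values
// for the whole reality bubble are computed once per turn, so the
// per-tile query becomes a plain array read instead of re-fetching
// caches and player stats on every call.
constexpr int GRID = 132; // stand-in for the reality bubble size

struct apparent_light_cache {
    std::array<std::array<float, GRID>, GRID> lm{};

    // Rebuild in one pass; light_at can capture the caches and player
    // stats once instead of re-acquiring them per tile.
    template <typename LightFn>
    void rebuild( LightFn &&light_at ) {
        for( int x = 0; x < GRID; x++ ) {
            for( int y = 0; y < GRID; y++ ) {
                lm[x][y] = light_at( x, y );
            }
        }
    }

    // apparent_light_at repurposed as a pure lookup.
    float at( int x, int y ) const {
        return lm[x][y];
    }
};
```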

[16]     5.9    5.81    8.09       6         map::generate_lightmap(int) [16]
                1.36    0.00  313632/313632      int_id<ter_t>::operator==(int_id<ter_t> const&) const [123]
                0.27    1.08  104544/138703      field::begin() [91]
                0.94    0.38  104556/142397      tripoint::tripoint(int, int, int) [90]
                0.36    0.92  104544/138703      field::end() [92]

A whole 1% of all CPU time wasted just to compare terrain to lava, active consoles and utility lights? Sounds like nothing, but it's just 3 int_id comparisons, so it should be a much smaller nothing.
Then almost 2% of CPU time goes to fields being stored in weird list-like containers. This could be skipped if we had a function to check whether a tile has any fields before entering the loop.
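A sketch of that early-out, assuming a hypothetical field_count() accessor (the real field storage differs): the cheap size check lets the lightmap loop skip constructing begin()/end() iterator pairs for the very common empty tiles.

```cpp
#include <cstddef>
#include <map>

// Hypothetical field_count() accessor so callers can bail out before
// building begin()/end() iterators for tiles with no fields at all.
struct field {
    std::map<int, int> contents; // stand-in for the real field storage

    std::size_t field_count() const {
        return contents.size();
    }
};

// Iterate only the tiles that actually contain fields; returns how many
// tiles needed the expensive per-field work.
template <typename Fields, typename Fn>
int for_each_nonempty( Fields &tiles, Fn &&fn ) {
    int visited = 0;
    for( field &f : tiles ) {
        if( f.field_count() == 0 ) {
            continue; // cheap early-out, no iterator pair constructed
        }
        fn( f );
        ++visited;
    }
    return visited;
}
```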

                1.27   11.59  135192/135194      curses_drawwindow(WINDOW*) [7]
[17]     5.4    1.27   11.59  135194         widen(std::string const&) [17]
                0.59    7.35  135194/135194      std::vector<wchar_t, std::allocator<wchar_t> >::vector(unsigned int, std::allocator<wchar_t> const&) [29]
                0.29    2.52  135194/135194      std::vector<wchar_t, std::allocator<wchar_t> >::~vector() [65]

5% of all CPU time in a widen function? Now that's actually pretty serious.
The function allocates a whole vector just to turn a regular (ASCII?) string into a UTF-16 one, then returns that wide (UTF-16) string just to draw one line of text, or possibly just one character.
A lot of extra speed could be squeezed out of this function.
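One way to squeeze that speed out, sketched under the assumption that the widened string is consumed immediately by the drawing code (and showing the ASCII fast path only - the real widen() must also decode multi-byte UTF-8):

```cpp
#include <string>

// Sketch: reuse one growable thread-local buffer instead of allocating
// a fresh vector<wchar_t> on every call. Valid only because the drawing
// code consumes the result before the next widen call.
// ASCII fast path only - the real widen() must also decode UTF-8.
const std::wstring &widen_cached( const std::string &s ) {
    static thread_local std::wstring buf;
    buf.clear(); // keeps capacity, so steady state allocates nothing
    for( unsigned char c : s ) {
        buf.push_back( static_cast<wchar_t>( c ) );
    }
    return buf;
}
```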

[33]     2.9    0.24    6.58   22501         wputch(WINDOW*, int, long) [33]
                0.41    4.99   22501/24042       waddch(WINDOW*, int) [38]
                0.53    0.48   22501/24188       wattron(WINDOW*, int) [148]
                0.17    0.00   22501/24197       wattroff(WINDOW*, int) [602]

3% just on drawing single characters?
I see why: most of it goes into converting single characters into whole strings and then printing them as such... Even better: those strings are wrapped in std::strings.

Everything else was either below 1% or as expected (loading time is expected to be high due to the short length of the test).

Labels: Performance, stale

All 17 comments

You really must profile on a release build; std containers in particular are orders of magnitude (yes, plural) slower in debug builds, and AFAIK this includes std::string.

Other optimizations are similarly large and uneven.

It's somewhat valid to profile a debug build to try and optimize the debug-build experience, but what you often end up doing is manually implementing optimizations that you told the compiler not to do for you.

[screenshot: profiler output from a release build, 2016-11-16 10:59:30]

This is my profiling on a release build. Walking on the country road.

You really must profile on a release build, std containers in particular are orders of magnitude (yes plural) slower in debug builds, AFAIK this includes std::string.

Also we use std::unique_ptr a lot which is guaranteed to be optimized out except in debug builds.

(ie. anything you can reach from main menu) is below 50%, meaning that 50% of the profiling time wasn't spent in the game itself.

Do you by any chance have a 2-core CPU? I'm almost certain the code base is entirely single-threaded, so it cannot benefit from parallelism - meaning 50% of capacity is entirely unused.

Profiled a release build. A few minutes of late early game (mostly melee combat and some basic crafting).

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  6.47     20.76    20.76    44194     0.47     0.62  curses_drawwindow(WINDOW*)
  4.34     34.68    13.92                             pthread_spin_unlock
  3.75     46.71    12.03                             pthread_spin_lock
  3.74     58.71    12.00                             pthread_setspecific
  3.72     70.64    11.93                             __pthread_self_lite
  2.79     79.60     8.96                             pthread_getspecific
  2.41     87.35     7.75 195631350     0.00     0.00  vehicle_part::hp() const
  2.32     94.81     7.46 195885638     0.00     0.00  item::damage() const
  2.13    101.66     6.85 255923712     0.00     0.00  map::apparent_light_at(tripoint const&, visibility_variables const&) const
  2.07    108.29     6.63 446656781     0.00     0.00  rl_dist(tripoint const&, tripoint const&)
  1.98    114.63     6.34                             _Unwind_SjLj_Register
  1.89    120.71     6.08 136518224     0.00     0.00  UTF8_getch(char const**, int*)
  1.84    126.63     5.92    10371     0.57     0.85  map::generate_lightmap(int)

Note how often damage() is used. It is also bugged - it doesn't round consistently across saves (but that's a separate issue). It may be worth caching damage_int whenever the value of damage changes. It's weird that it takes so much time, considering it's just a single conditional ceil/floor.
Vehicle HP is also called surprisingly often. I assume it's for drawing code, and that all of the extra overhead comes from item::damage above; but if it doesn't, it may be worth caching it one extra time.
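A sketch of the proposed damage_int caching (illustrative class, not the real item API): the conditional ceil/floor runs once per mutation, so the hot damage() call becomes a plain load.

```cpp
#include <cmath>

// Illustrative class, not the real item API: cache the rounded damage
// whenever the underlying value changes.
class item_damage_cache {
    double damage_ = 0.0;
    int damage_int_ = 0;

    static int round_damage( double d ) {
        // the conditional ceil/floor described in the text
        return d < 0 ? static_cast<int>( std::ceil( d ) )
               : static_cast<int>( std::floor( d ) );
    }

public:
    void set_damage( double d ) {
        damage_ = d;
        damage_int_ = round_damage( d ); // recomputed once per mutation
    }

    int damage() const {
        return damage_int_; // hot path is now a plain load
    }
};
```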

Call graph:

index % time    self  children    called     name
[1]     51.5   26.27  138.92 17627283+105033650 <cycle 1 as a whole> [1]
                3.30   12.68   14688             map::update_visibility_cache(int) <cycle 1> [9]
                0.09   14.34   16556             input_context::handle_input() <cycle 1> [14]
                1.56   10.43 1187691             vehicle::break_part_into_pieces(int, int, int, bool) <cycle 1> [20]
                0.00   11.16  118734             line_to(int, int, int, int, int) <cycle 1> [22]
                2.92    7.12 16090134             map::draw_maptile(WINDOW*, player&, tripoint const&, maptile const&, bool, bool, tripoint const&, bool, bool, bool) const <cycle 1> [23]
                1.48    8.03    5223             map::draw(WINDOW*, tripoint const&) <cycle 1> [24]
                5.92    2.89   10371             map::generate_lightmap(int) <cycle 1> [26]

Most of this looks as expected.
But one thing really, really stands out here: vehicle::break_part_into_pieces. I can't really explain it, but I'm nearly certain it didn't need to be invoked this many times in my game.

apparent_light_at still takes quite a bit of the time per turn.

In general the new vehicles all take high spots in the call graph:

                0.06    4.78  159018             vehicle::power_parts() <cycle 1> [40]
                0.52    4.20 24224803             vehicle_part::ammo_current() const <cycle 1> [41]
                0.06    1.33  159018             vehicle::idle(bool) <cycle 1> [100]
                0.16    0.67   46958             vehicle::slow_leak() <cycle 1> [124]

It would be a great idea to cache vehicle parts again. This broad YOLO approach, such as doing 4 separate passes over all parts just to find those with 3 specific flags, is actually taking a few % of game time. Note that my test was done in a standard city, so moderate vehicle density, but no death rollers.

Some drawing:

[27]     2.6    1.10    7.10 38693889         wputch(WINDOW*, int, long) [27]
                4.29    2.17 38693889/39882042     waddch(WINDOW*, int) [33]
                0.64    0.00 38693889/40508853     wattron(WINDOW*, int) [148]

Called a lot, but optimized pretty well. It could be hard to squeeze any extra speed out of this.

UTF input is surprisingly heavy:

[15]     4.5    0.35   13.99   16527         input_manager::get_input_event(WINDOW*) [15]
                0.16   13.83   16527/17372       getch() [12]
                0.00    0.00    5349/136518224     UTF8_getch(char const**, int*) [37]

Unless it includes idle time in getch(), in which case I guess there's no way for it not to be big.

Again, drawing is huge:

               20.76    6.77   44189/44194       wrefresh(WINDOW*) [6]
[7]      8.6   20.76    6.77   44194         curses_drawwindow(WINDOW*) [7]
                3.11    0.00 69765748/136518224     UTF8_getch(char const**, int*) [37]
                2.85    0.00 68472063/68472065     widen(std::string const&) [54]
                0.81    0.00 68472063/133261713     mk_wcwidth(unsigned int) [94]

Optimized better than in debug, but still not good.

Here's something I don't understand:

[4]     48.0    0.21  153.89 16421654         std::_Function_handler<VisitResponse (item const*), visitable<Character>::has_item_with(std::function<bool (item const&)> const&) const::{lambda(item const*)#1}>::_M_invoke(std::_Any_data const&, item const*) [4]
               24.47  129.42 16421654/17627283     std::function<bool (item const&)>::operator()(item const&) const <cycle 1> [422]

Why almost 50%? Did it take a second core or something like that?

Who calls the last two most?

I don't see them being assigned to anything. They just "float":

                                                 <spontaneous>
[2]     48.6    0.50  155.58                 visit_internal(std::function<VisitResponse (item*, item*)> const&, item*, item*) [2]
                0.32  138.56 24218481/24218481     std::_Function_handler<VisitResponse (item*, item*), visitable<Character>::visit_items(std::function<VisitResponse (item*)> const&)::{lambda(item*, item*)#1}>::_M_invoke(std::_Any_data const&, item*, item*) [5]

vehicle::break_part_into_pieces might be happening in mapgen; vehicle hulk generation has historically been kind of terrible.

has_item_with() and friends: unfortunately, this kind of uncertainty about who is responsible for the invocations is an unsolvable problem with functional code; without introspecting into the code state, profilers can't differentiate between adjacent call stacks.

Weird that it takes so much time, considering it's just a single conditional ceil/floor.

See http://stackoverflow.com/questions/824118/why-is-floor-so-slow

I'm going to try for a more performant implementation before we resort to caching
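For reference, a common cheaper alternative discussed in that Stack Overflow thread is a cast-plus-correction floor; whether it actually beats std::floor here depends on the target and compiler flags.

```cpp
// Cast-plus-correction floor for values known to fit in an int.
// Sketch only: measure before adopting, since with the right flags
// (e.g. SSE4.1 roundsd) std::floor can already be a single instruction.
inline int fast_floor( double x ) {
    const int i = static_cast<int>( x ); // truncates toward zero
    return i - ( x < i );                // fix up negatives like -2.7
}
```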

On Windows or tiles (anything that uses cursesport.cpp drawing), it's pretty clearly drawing rather than calculations that is the bottleneck.

It was hard to get that info from the profiler (we don't have a drawing test to properly separate drawing from game logic and loading), so I did some crude "manual" testing.

Test | FPS
--- | ---
Pure console build | 5
Tiles build, console mode | 9
No curses_drawwindow | 15
No FillRectDIB | 6.5
No ExtTextOutW | 9
Cygwin | 45

By "No curses_drawwindow" I mean I added if( debug_mode ) return; as the first line of said function. In general, by "No [function name]" I mean I returned on the first line whenever debug_mode == true, then ran the benchmark with debug mode on.
It seems that the caching done by the tiles build essentially removes the cost associated with ExtTextOutW.

The performance of drawing is utterly abysmal for a game where 99% of the display is ASCII.

One of the biggest reasons for this is the horrible UTF support, which goes kinda like:

  • Save character on screen as a pair of (narrow) strings
  • Convert the first of those strings into an UTF-8 character
  • Discard that character
  • widen the string
  • Draw the widened string, which depicts a single character
  • Discard the widened string

As a result, std::string::assign, UTF8_getch and widen consume a significant part of the CPU time.

Another bunch of profiling data, cleaned up to only show the meaningful (and not pthread_spin_unlock type stuff):

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  4.48      9.83     2.16     4774     0.45     0.65  curses_drawwindow(WINDOW*)
  2.42     16.04     1.17 85924282     0.00     0.00  rl_dist(tripoint const&, tripoint const&)
  2.20     17.10     1.06 24618532     0.00     0.00  vehicle_part::hp() const
  2.07     18.10     1.00                             std::string::assign(std::string const&)
  1.60     19.71     0.77  4287493     0.00     0.00  map::draw_maptile(WINDOW*, player&, tripoint const&, maptile const&, bool, bool, tripoint const&, bool, bool, bool) const
  1.49     20.43     0.72     1416     0.51     0.67  map::generate_lightmap(int)
  1.45     21.84     0.70                             std::istream::sentry::sentry(std::istream&, bool)
  1.16     24.21     0.56 11066627     0.00     0.00  UTF8_getch(char const**, int*)
  1.06     26.36     0.51 35179056     0.00     0.00  map::apparent_light_at(tripoint const&, visibility_variables const&) const
  0.87     28.69     0.42  8365076     0.00     0.00  widen(std::string const&)

Note that most of it is drawing. std::string::assign is also part of drawing, because of the UTF support described above.

As for non-drawing-related: it looks like vehicle_part::hp() really needs to be used less.

What can be done about the UTF support?

I am no expert on Unicode, but couldn't we just save the wide characters?

Storing two strings for every single character is a giant waste.

PDCurses (or perhaps what appears to be the currently active fork at
http://www.projectpluto.com/win32a.htm) is the closest thing to a standard
in this area, might be worth a look.

Looked up PDCurses and they have it like this:

  • Size of character can be defined (compile time) as 1/2 of short, int or long
  • The other half of the integer is attributes
  • Screen is just a 2D array of those wide characters

Sounds like just saving wide characters instead of strings would be enough.
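A sketch of that cell layout (names are illustrative, not the actual PDCurses types): one 32-bit value per screen cell, with the wide character in the low half and the attributes in the high half.

```cpp
#include <cstdint>

// Illustrative packed cell, PDCurses-style: character in the low
// 16 bits, attributes in the high 16 bits.
using chtype_cell = std::uint32_t;

constexpr chtype_cell make_cell( char16_t ch, std::uint16_t attr ) {
    return static_cast<chtype_cell>( ch ) |
           ( static_cast<chtype_cell>( attr ) << 16 );
}

constexpr char16_t cell_char( chtype_cell c ) {
    return static_cast<char16_t>( c & 0xFFFF );
}

constexpr std::uint16_t cell_attr( chtype_cell c ) {
    return static_cast<std::uint16_t>( c >> 16 );
}
```

The screen then becomes a flat 2D array of chtype_cell, with no per-cell heap allocation at all.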

That implementation is really hard to "unwrap".
Each cell can contain an arbitrary number of characters and has an arbitrary on-screen width, but is still drawn one character at a time.

The one in PDCurses is also rather involved, though. It has more optimizations, such as not redrawing identical writes, but probably more importantly it draws the whole string at once rather than one character-described-by-a-string at a time.

Some more examples of functions being too slow, this time not drawing-related:

[5]     14.1    0.02    6.77   24305         monster::move() [5]
                0.00    2.64   24251/24251       monster::move_effects(bool) [12]
                0.00    1.41   20129/20136       monster::stumble() [23]
                0.00    1.31   16661/16661       monster::scent_move() [24]
                0.05    0.47   48556/1495886     Creature::has_effect(string_id<effect_type> const&, body_part) const <cycle 1> [202]
                0.03    0.24   24251/1495886     monster::attitude(Character const*) const <cycle 1> [205]
                0.00    0.14    3781/10510       monster::move_to(tripoint const&, bool, float) [78]

That stumble slowdown is mostly my fault - in z-level mode, stumbling up and down takes too much time. I'm writing a PR to help it a bit.

The move_effects part isn't mine, though. Still, I have an idea: instead of checking "is the monster tied, snared, bound, gagged, delitized or webbed?", it would be better to add an "impairs movement" flag to effects and just check that in process_effects (as in "does any effect this monster actually has impair movement?").
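A sketch of that flag idea (all types illustrative): each effect type carries one precomputed flag, so the movement check becomes a single pass over the monster's active effects instead of one has_effect() lookup per condition.

```cpp
#include <vector>

// Illustrative types: each effect type carries one precomputed
// "impairs movement" flag, checked in a single pass.
struct effect_type {
    bool impairs_movement = false;
};

struct active_effect {
    const effect_type *type;
};

bool movement_impaired( const std::vector<active_effect> &effects ) {
    for( const active_effect &e : effects ) {
        if( e.type->impairs_movement ) {
            return true; // one pass replaces N has_effect() lookups
        }
    }
    return false;
}
```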

[24]     2.7    0.00    1.31   16661         monster::scent_move() [24]
                0.10    0.83   84627/1495886     monster::can_move_to(tripoint const&) const <cycle 1> [120]
                0.00    0.36   16661/16661       monster::is_fleeing(player&) const [81]
                0.01    0.01  270911/270911      scent_map::get(tripoint const&) const [472]

can_move_to may be expensive simply because it is structured wrongly. I don't think the compiler understands side effects across compilation units all that well, so it can't rearrange variables to where they're used.

[15]     4.9    0.00    2.39    4604         game::cleanup_dead() [15]
                0.24    2.12  217376/1495886     Creature_tracker::find(int) const <cycle 1> [361]
                0.02    0.00  217385/374437      monster::is_dead() const [351]

Now that is atrocious. Almost 10% of cycles (the percentages are for both cores, and DDA only uses one) go to a function that looks perfectly fine and non-threatening. Pointer overhead resulting in cache problems?

My newest profiling, this time in tiles build:

 34.24     45.15    45.15                             _mcount_private
 14.61     64.42    19.27                             exp
  2.51     67.73     3.31                             __tcf_32
  2.05     70.44     2.71 242751075     0.00     0.00  square_dist(tripoint const&, tripoint const&)
  1.62     72.58     2.14    11774     0.18     0.49  Font::draw_window(WINDOW*, int, int)
  1.27     74.25     1.68   487933     0.00     0.00  HandleDPad()

exp? That must be the lighting + vision calculation.
I'm pretty sure we don't need much precision there, so we could approximate that exp somehow.
http://stackoverflow.com/questions/6984440/approximate-ex has this:

public static double exp(double val) {
    final long tmp = (long) (1512775 * val) + (1072693248 - 60801);
    return Double.longBitsToDouble(tmp << 32);
}

EDIT: Also here https://github.com/whackashoe/fastapprox/blob/master/fastapprox/src/fastexp.h
This one is better because it's float-based. But it has a disadvantage: it explodes near an argument of ~88.8, which is dangerously close to the ranges we still expect to cover with shadowcasting.
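For comparison, here is the bit trick from the Java snippet above (Schraudolph's approximation) translated to C++; accuracy is only within a few percent, which may be acceptable for lighting falloff.

```cpp
#include <cstdint>
#include <cstring>

// Schraudolph-style exp approximation: write the scaled argument
// straight into the exponent field of an IEEE-754 double. Error is a
// few percent, and it is only valid roughly for |val| < 700, where the
// intermediate stays in range.
inline double fast_exp( double val ) {
    const std::int64_t tmp =
        static_cast<std::int64_t>( 1512775 * val ) + ( 1072693248LL - 60801 );
    const std::int64_t bits = tmp << 32;
    double out;
    std::memcpy( &out, &bits, sizeof out ); // bit cast without UB
    return out;
}
```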

Discussion stalled almost a year ago.
