I did some basic profiling while playing a debug build for ~3 minutes, but I can't read the output well.
Half of the time in the test "disappeared". I'm not sure where it went; some might have gone to profiler-internal functions, but the entire time spent in children of the main menu (i.e. anything you can reach from the main menu) is below 50%, meaning that 50% of the profiling time wasn't spent in the game itself.
The percentage values should probably be treated as being 2 times as big as listed.
Here's what I think I understood:
```
index  % time    self  children    called      name
[3]      30.5    0.00     72.02          3    game::draw() [3]
                 0.00     38.72        3/3        game::draw_ter(bool) [4]
                 0.03     14.07        3/6        map::build_map_cache(int, bool) [8]
                 0.01      9.07        3/3        game::draw_sidebar() [24]
                 1.22      7.64        3/6        map::update_visibility_cache(int) [13]
                 0.00      1.22       3/78        wrefresh(WINDOW*) [6]
```
Looks like drawing takes 30% of CPU time, BUT only after we include vision calculations and UI drawing.
I'm pretty sure this is drawing to buffer, not drawing directly on the screen.
```
[7]      13.4    2.98     28.74         46    curses_drawwindow(WINDOW*) [7]
                 1.27     11.59  135192/135194    widen(std::string const&) [17]
                 5.91      0.00  136664/136664    FillRectDIB(int, int, int, int, unsigned char) [36]
                 0.41      5.15  135192/265671    mk_wcwidth(unsigned int) <cycle 9> [237]
```
That FillRectDIB takes a fair bit of time, even in console builds, where it only does a memset.
May be worth considering risking some overdraw or even allowing overdraw as an option.
```
[15]      6.2    4.42     10.24     104544    map::apparent_light_at(tripoint const&, visibility_variables const&) const [15]
                 1.03      4.02  104538/115126    map::get_cache_ref(int) const [39]
                 0.55      2.78  104544/158306    rl_dist(tripoint const&, tripoint const&) [44]
                 0.69      0.71  104538/104550    player::unimpaired_range() const [119]
                 0.46      0.00  104544/105002    player::pos() const [330]
```
map::apparent_light_at could use a lot of caching. The vast majority of its invocations happen in a loop, but despite that it still uses safe cache access and re-fetches the player's stats on every call.
Also, it builds a "distance field" for the whole reality bubble.
It may be worth scrapping the function and instead building the apparent light cache as a dedicated cache, with apparent_light_at being repurposed as a cache lookup only. Then all the caches and variables could be gotten just once and distance could be estimated with some simple heuristics.
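A minimal sketch of what that dedicated cache could look like; `apparent_light_cache`, `MAPSIZE` and the `compute` callback are all illustrative names and sizes, not the real API:

```cpp
#include <array>

// Hypothetical sketch: build the whole apparent-light grid once per turn,
// resolving the shared state (visibility cache, unimpaired_range, player
// position) a single time, then make the per-tile query a plain lookup.
constexpr int MAPSIZE = 132; // reality-bubble width in tiles (assumed)

struct apparent_light_cache {
    std::array<float, MAPSIZE * MAPSIZE> light{};
    bool dirty = true;

    template <typename F>
    void rebuild( F &&compute ) {
        // compute() is the expensive per-tile calculation; the caller
        // fetches all caches and player stats once before this loop runs.
        for( int y = 0; y < MAPSIZE; y++ ) {
            for( int x = 0; x < MAPSIZE; x++ ) {
                light[y * MAPSIZE + x] = compute( x, y );
            }
        }
        dirty = false;
    }

    // What apparent_light_at would be repurposed into: O(1), no re-fetching.
    float at( int x, int y ) const {
        return light[y * MAPSIZE + x];
    }
};
```

The distance heuristics the comment mentions would then live inside `compute`, evaluated once per tile per rebuild rather than once per query.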
```
[16]      5.9    5.81      8.09          6    map::generate_lightmap(int) [16]
                 1.36      0.00  313632/313632    int_id<ter_t>::operator==(int_id<ter_t> const&) const [123]
                 0.27      1.08  104544/138703    field::begin() [91]
                 0.94      0.38  104556/142397    tripoint::tripoint(int, int, int) [90]
                 0.36      0.92  104544/138703    field::end() [92]
```
A whole 1% of all CPU time wasted just to compare terrain to lava, active console and utility light? Sounds like nothing, but it's just 3 int_id comparisons, so it should be much smaller nothing.
Then there are almost 2% of CPU time wasted on fields being stored in weird list-like containers. This could be skipped if we had a function to check if there are any fields before doing the loop.
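That early-out check could be as simple as the following; `field` here is a simplified stand-in for the real class, and `empty()` is the hypothetical helper the comment proposes:

```cpp
#include <map>

// Simplified stand-in for the game's field container.
struct field {
    std::map<int, int> entries; // field_id -> intensity (simplified)

    // Hypothetical helper: a cheap emptiness check so the begin()/end()
    // machinery of the list-like container is never touched in the common
    // no-fields case.
    bool empty() const { return entries.empty(); }
};

int sum_field_intensity( const field &f ) {
    if( f.empty() ) {
        return 0; // early out: skip iterator construction entirely
    }
    int total = 0;
    for( const auto &e : f.entries ) {
        total += e.second;
    }
    return total;
}
```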
```
                 1.27     11.59  135192/135194    curses_drawwindow(WINDOW*) [7]
[17]      5.4    1.27     11.59     135194    widen(std::string const&) [17]
                 0.59      7.35  135194/135194    std::vector<wchar_t, std::allocator<wchar_t> >::vector(unsigned int, std::allocator<wchar_t> const&) [29]
                 0.29      2.52  135194/135194    std::vector<wchar_t, std::allocator<wchar_t> >::~vector() [65]
```
5% of all CPU time on a widen function? Now that's actually pretty serious.
The function allocates a whole vector just to turn a regular string (ASCII?) into a UTF-16 one. Then it returns a wide (UTF-16) string, just to draw one line of text or possibly just one character.
A lot of extra speed can be squeezed out of this function.
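For instance, a pure-ASCII fast path writing into a reused buffer would avoid the per-call vector allocation entirely. `widen_fast` below is a hypothetical sketch of the idea, not the real `widen()`:

```cpp
#include <string>

// Sketch: most of what the game draws is plain ASCII, which maps 1:1 to
// wide characters. A thread-local buffer is reused across calls, so in
// steady state no allocation happens at all.
const std::wstring &widen_fast( const std::string &s ) {
    static thread_local std::wstring buf;
    buf.clear();
    buf.reserve( s.size() );
    for( unsigned char c : s ) {
        if( c < 0x80 ) {
            buf.push_back( static_cast<wchar_t>( c ) ); // ASCII fast path
        } else {
            // real code would fall back to the full UTF-8 decoder here
            buf.push_back( L'?' );
        }
    }
    return buf;
}
```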
```
[33]      2.9    0.24      6.58      22501    wputch(WINDOW*, int, long) [33]
                 0.41      4.99    22501/24042    waddch(WINDOW*, int) [38]
                 0.53      0.48    22501/24188    wattron(WINDOW*, int) [148]
                 0.17      0.00    22501/24197    wattroff(WINDOW*, int) [602]
```
3% just on drawing single characters?
I see why: most of it goes into converting single characters into whole strings, then printing them as such... Even better: those strings are wrapped into std::strings.
Everything else was either below 1% or as expected (loading time is expected to be high due to short length of the test).
You really must profile on a release build; std containers in particular are orders of magnitude (yes, plural) slower in debug builds, and AFAIK this includes std::string. Other optimizations are similarly large and uneven.

It's somewhat valid to profile a debug build to try and optimize the debug build experience, but what you often end up doing is implementing optimizations manually that you told the compiler not to do for you.

This is my profiling on a release build. Walking on the country road.
> You really must profile on a release build, std containers in particular are orders of magnitude (yes plural) slower in debug builds, AFAIK this includes std::string.
Also we use std::unique_ptr a lot which is guaranteed to be optimized out except in debug builds.
> (ie. anything you can reach from main menu) is below 50%, meaning that 50% of the profiling time wasn't spent in the game itself.
Do you by any chance have a 2-core CPU? I'm almost certain the code base is entirely single-threaded, so it cannot benefit from parallelism, meaning 50% of capacity is entirely unused.
Profiled release build. A few minutes of late early game (mostly melee combat and some basic crafting).
```
  %   cumulative    self                self     total
 time   seconds    seconds      calls  ms/call  ms/call  name
 6.47      20.76     20.76      44194     0.47     0.62  curses_drawwindow(WINDOW*)
 4.34      34.68     13.92                               pthread_spin_unlock
 3.75      46.71     12.03                               pthread_spin_lock
 3.74      58.71     12.00                               pthread_setspecific
 3.72      70.64     11.93                               __pthread_self_lite
 2.79      79.60      8.96                               pthread_getspecific
 2.41      87.35      7.75  195631350     0.00     0.00  vehicle_part::hp() const
 2.32      94.81      7.46  195885638     0.00     0.00  item::damage() const
 2.13     101.66      6.85  255923712     0.00     0.00  map::apparent_light_at(tripoint const&, visibility_variables const&) const
 2.07     108.29      6.63  446656781     0.00     0.00  rl_dist(tripoint const&, tripoint const&)
 1.98     114.63      6.34                               _Unwind_SjLj_Register
 1.89     120.71      6.08  136518224     0.00     0.00  UTF8_getch(char const**, int*)
 1.84     126.63      5.92      10371     0.57     0.85  map::generate_lightmap(int)
```
Note how often damage() is used. It is also bugged: it doesn't round consistently across saves (but that's a separate issue). It may be worth caching damage_int whenever the value of damage changes. It's weird that it takes so much time, considering it's just a single conditional ceil/floor.
Vehicle HP is also called surprisingly often. I assume it's for drawing code, and that all of the extra overhead comes from item::damage above, but if it doesn't, it may be worth caching it one extra time.
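The caching suggested above could look roughly like this; `item_dmg`, `set_damage` and `damage_int_` are made-up names for the sketch:

```cpp
#include <cmath>

// Hypothetical sketch: recompute the rounded value only when damage is
// assigned, so the hot damage() getter becomes a plain field read.
class item_dmg {
    double damage_ = 0.0;
    int damage_int_ = 0; // cached rounded value

    static int round_dmg( double d ) {
        // the "single conditional ceil/floor" from the comment:
        // round away from zero so partial damage always counts
        return d >= 0 ? static_cast<int>( std::ceil( d ) )
                      : static_cast<int>( std::floor( d ) );
    }
public:
    void set_damage( double d ) {
        damage_ = d;
        damage_int_ = round_dmg( d ); // pay the rounding cost once, on write
    }
    int damage() const { return damage_int_; } // hot path: just a load
    double damage_raw() const { return damage_; }
};
```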
Call graph:
```
index  % time    self  children    called      name
[1]      51.5   26.27    138.92  17627283+105033650  <cycle 1 as a whole> [1]
                 3.30     12.68      14688    map::update_visibility_cache(int) <cycle 1> [9]
                 0.09     14.34      16556    input_context::handle_input() <cycle 1> [14]
                 1.56     10.43    1187691    vehicle::break_part_into_pieces(int, int, int, bool) <cycle 1> [20]
                 0.00     11.16     118734    line_to(int, int, int, int, int) <cycle 1> [22]
                 2.92      7.12   16090134    map::draw_maptile(WINDOW*, player&, tripoint const&, maptile const&, bool, bool, tripoint const&, bool, bool, bool) const <cycle 1> [23]
                 1.48      8.03       5223    map::draw(WINDOW*, tripoint const&) <cycle 1> [24]
                 5.92      2.89      10371    map::generate_lightmap(int) <cycle 1> [26]
```
Most of this looks as expected.
But one thing really, really stands out here: vehicle::break_part_into_pieces. I can't really explain it, but I'm nearly certain it didn't need to be invoked this many times in my game.
apparent_light_at still takes quite a bit of the time per turn.
In general the new vehicles all take high spots in the call graph:
```
                 0.06      4.78     159018    vehicle::power_parts() <cycle 1> [40]
                 0.52      4.20   24224803    vehicle_part::ammo_current() const <cycle 1> [41]
                 0.06      1.33     159018    vehicle::idle(bool) <cycle 1> [100]
                 0.16      0.67      46958    vehicle::slow_leak() <cycle 1> [124]
```
Would be a great idea to cache vehicle parts again. This broad YOLO approach, such as doing 4 separate runs over all parts just to find those with 3 specific flags, is actually taking a few percent of game time. Note that my test was done in a standard city, so moderate vehicle density, but no death rollers.
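One way to replace those repeated scans is a single pass that indexes parts by flag, so later queries become hash lookups. Everything here (`part`, `part_flag_cache`, `with_flag`) is illustrative, not the real vehicle API:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for a vehicle part.
struct part {
    std::vector<std::string> flags;
};

// One pass over all parts builds the flag -> part-indices index; every
// subsequent "find parts with flag X" is a single hash lookup instead of
// another full scan.
class part_flag_cache {
    std::unordered_map<std::string, std::vector<int>> by_flag;
public:
    explicit part_flag_cache( const std::vector<part> &parts ) {
        for( size_t i = 0; i < parts.size(); i++ ) {
            for( const auto &f : parts[i].flags ) {
                by_flag[f].push_back( static_cast<int>( i ) );
            }
        }
    }
    const std::vector<int> &with_flag( const std::string &f ) const {
        static const std::vector<int> none;
        auto it = by_flag.find( f );
        return it == by_flag.end() ? none : it->second;
    }
};
```

The real cache would of course need to be invalidated when parts are installed, removed, or broken.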
Some drawing:
```
[27]      2.6    1.10      7.10   38693889    wputch(WINDOW*, int, long) [27]
                 4.29      2.17  38693889/39882042    waddch(WINDOW*, int) [33]
                 0.64      0.00  38693889/40508853    wattron(WINDOW*, int) [148]
```
Called a lot, but optimized pretty OK. Could be hard to squeeze any extra speed here.
UTF input is surprisingly heavy:
```
[15]      4.5    0.35     13.99      16527    input_manager::get_input_event(WINDOW*) [15]
                 0.16     13.83    16527/17372        getch() [12]
                 0.00      0.00   5349/136518224      UTF8_getch(char const**, int*) [37]
```
Unless it includes idle time in getch(), in which case I guess there's no way for it not to be big.
Again, drawing is huge:
```
                20.76      6.77    44189/44194        wrefresh(WINDOW*) [6]
[7]       8.6   20.76      6.77      44194    curses_drawwindow(WINDOW*) [7]
                 3.11      0.00  69765748/136518224   UTF8_getch(char const**, int*) [37]
                 2.85      0.00  68472063/68472065    widen(std::string const&) [54]
                 0.81      0.00  68472063/133261713   mk_wcwidth(unsigned int) [94]
```
Optimized better than in debug, but still not good.
Here's something I don't understand:
```
[4]      48.0    0.21    153.89   16421654    std::_Function_handler<VisitResponse (item const*), visitable<Character>::has_item_with(std::function<bool (item const&)> const&) const::{lambda(item const*)#1}>::_M_invoke(std::_Any_data const&, item const*) [4]
                24.47    129.42  16421654/17627283    std::function<bool (item const&)>::operator()(item const&) const <cycle 1> [422]
```
Why almost 50%? Did it take a second core or something like that?
Who calls the last two most?
I don't see them being assigned to anything. They just "float":
```
                                 <spontaneous>
[2]      48.6    0.50    155.58             visit_internal(std::function<VisitResponse (item*, item*)> const&, item*, item*) [2]
                 0.32    138.56  24218481/24218481    std::_Function_handler<VisitResponse (item*, item*), visitable<Character>::visit_items(std::function<VisitResponse (item*)> const&)::{lambda(item*, item*)#1}>::_M_invoke(std::_Any_data const&, item*, item*) [5]
```
vehicle::break_part_into_pieces might be happening in mapgen; vehicle hulk generation has historically been kind of terrible.

has_item_with() and friends: unfortunately, this kind of uncertainty about who is responsible for the invocations is an unsolvable problem with functional code; without introspecting into the code state, profilers can't differentiate between adjacent call stacks.
> Weird that it takes so much time, considering it's just a single conditional ceil/floor.
See http://stackoverflow.com/questions/824118/why-is-floor-so-slow
I'm going to try for a more performant implementation before we resort to caching
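One possible direction, assuming the cost really is in std::floor's rounding-mode handling (per the SO link above): for values known to fit in an int, which item damage certainly is, floor/ceil can be done with an integer cast plus a correction. This is a sketch, not the implementation the comment refers to:

```cpp
// static_cast truncates toward zero, so a correction is needed for
// negative non-integer inputs (floor) and positive ones (ceil).
inline int fast_floor( double d ) {
    const int i = static_cast<int>( d );
    return d < i ? i - 1 : i;
}

inline int fast_ceil( double d ) {
    const int i = static_cast<int>( d );
    return d > i ? i + 1 : i;
}
```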
On Windows or tiles (anything that uses cursesport.cpp drawing), it's pretty clearly drawing rather than calculations that is the bottleneck.
It was hard to get that info from the profiler (we don't have a drawing test to properly separate drawing from game logic and loading), so I did some crude "manual" testing.
Test | FPS
--- | ---
Pure console build | 5
Tiles build, console mode | 9
No curses_drawwindow | 15
No FillRectDIB | 6.5
No ExtTextOutW | 9
Cygwin | 45
By "No curses_drawwindow" (and the other "No [function name]" rows) I mean I added `if( debug_mode ) return;` as the first line of said function, then ran the benchmark with debug mode turned on.
It seems that the caching done by tiles build essentially removes the cost associated with ExtTextOutW.
The performance of drawing is utterly abysmal for a game where 99% of display is ASCII.
One of the biggest reasons for this is the horrible UTF support, which goes roughly like: wrap each character into a std::string, then widen the string. As a result, std::string::assign, UTF8_getch and widen consume a significant part of the CPU time.
Another bunch of profiling data, cleaned up to only show the meaningful (and not pthread_spin_unlock type stuff):
```
  %   cumulative    self               self     total
 time   seconds    seconds     calls  ms/call  ms/call  name
 4.48       9.83      2.16      4774     0.45     0.65  curses_drawwindow(WINDOW*)
 2.42      16.04      1.17  85924282     0.00     0.00  rl_dist(tripoint const&, tripoint const&)
 2.20      17.10      1.06  24618532     0.00     0.00  vehicle_part::hp() const
 2.07      18.10      1.00                              std::string::assign(std::string const&)
 1.60      19.71      0.77   4287493     0.00     0.00  map::draw_maptile(WINDOW*, player&, tripoint const&, maptile const&, bool, bool, tripoint const&, bool, bool, bool) const
 1.49      20.43      0.72      1416     0.51     0.67  map::generate_lightmap(int)
 1.45      21.84      0.70                              std::istream::sentry::sentry(std::istream&, bool)
 1.16      24.21      0.56  11066627     0.00     0.00  UTF8_getch(char const**, int*)
 1.06      26.36      0.51  35179056     0.00     0.00  map::apparent_light_at(tripoint const&, visibility_variables const&) const
 0.87      28.69      0.42   8365076     0.00     0.00  widen(std::string const&)
```
Note that most of it is drawing. std::string::assign is also part of drawing, because of the UTF support described above.
As for non-drawing-related: it looks like vehicle_part::hp() really needs to be used less.
What can be done about the UTF support?
I am no expert on Unicode, but couldn't we just save the wide characters?
Storing two strings for every single character is a giant waste.
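The proposal would amount to something like this, with `cursecell` modeled loosely on the game's cursesport structure but simplified to one wide character plus attributes (names are illustrative):

```cpp
// One wide character plus color attributes per cell: no std::string, no
// per-cell heap allocation at all.
struct cursecell {
    wchar_t ch = L' ';
    int FG = 0;
    int BG = 0;

    bool operator==( const cursecell &o ) const {
        return ch == o.ch && FG == o.FG && BG == o.BG;
    }
};

// With cheap equality comparison, identical writes can be skipped, which
// is one of the optimizations PDCurses does.
bool put_cell( cursecell &dst, const cursecell &src ) {
    if( dst == src ) {
        return false; // nothing changed, skip the expensive redraw
    }
    dst = src;
    return true;
}
```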
PDCurses (or perhaps what appears to be the currently active fork at http://www.projectpluto.com/win32a.htm) is the closest thing to a standard in this area; might be worth a look.
Looked up PDCurses and they have it like this: each cell is a single integer type (short, int or long). Sounds like just saving wide characters instead of strings would be enough.
That implementation is really hard to "unwrap".
Each cell can contain an arbitrary number of characters, has arbitrary on-screen width, but is still drawn one character at a time.
Though the one in PDCurses is also rather hard. It has more optimizations, such as not redrawing identical writes, but probably more importantly drawing the whole string at once rather than one character-described-by-string at a time.
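That whole-string-at-once idea can be sketched as run coalescing: scan a row, merge consecutive cells with identical attributes, and emit one draw call per run instead of one per character. All names here are illustrative:

```cpp
#include <string>
#include <vector>

struct cell {
    char ch;
    int attr;
};

struct run {
    std::string text;
    int attr;
};

// Coalesce a row of cells into attribute runs; the renderer would then
// issue one ExtTextOut-equivalent per run instead of per character.
std::vector<run> coalesce_row( const std::vector<cell> &row ) {
    std::vector<run> runs;
    for( const cell &c : row ) {
        if( !runs.empty() && runs.back().attr == c.attr ) {
            runs.back().text += c.ch; // extend the current run
        } else {
            runs.push_back( { std::string( 1, c.ch ), c.attr } );
        }
    }
    return runs;
}
```

For a mostly-ASCII, mostly-uniform-color screen this collapses hundreds of draw calls per line into a handful.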
Some more examples of functions being too slow, this time not drawing-related:
```
[5]      14.1    0.02      6.77      24305    monster::move() [5]
                 0.00      2.64    24251/24251        monster::move_effects(bool) [12]
                 0.00      1.41    20129/20136        monster::stumble() [23]
                 0.00      1.31    16661/16661        monster::scent_move() [24]
                 0.05      0.47   48556/1495886       Creature::has_effect(string_id<effect_type> const&, body_part) const <cycle 1> [202]
                 0.03      0.24   24251/1495886       monster::attitude(Character const*) const <cycle 1> [205]
                 0.00      0.14     3781/10510        monster::move_to(tripoint const&, bool, float) [78]
```
That stumble slowdown is mostly my fault - in z-level mode, stumbling up and down takes too much time. I'm writing a PR to help it a bit.
The move_effects part isn't mine, though. Still, I have a good idea: instead of checking "is the monster tied, snared, bound, gagged, delitized or webbed?", it's better to add an "impairs movement" flag to effects and just check that in process_effects (as in, "does any effect this monster actually has impair movement?").
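A sketch of the proposed flag, with made-up structure names; the point is that the check iterates only the effects the monster actually has, rather than calling has_effect() once per candidate effect id:

```cpp
#include <string>
#include <vector>

// Each effect type declares up front whether it impairs movement.
struct effect_type {
    std::string id;
    bool impairs_movement = false; // the proposed flag
};

struct active_effect {
    const effect_type *type;
};

// One pass over the effects actually present, instead of N has_effect()
// lookups for tied/snared/bound/gagged/webbed/etc.
bool movement_impaired( const std::vector<active_effect> &effects ) {
    for( const active_effect &e : effects ) {
        if( e.type->impairs_movement ) {
            return true;
        }
    }
    return false;
}
```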
```
[24]      2.7    0.00      1.31      16661    monster::scent_move() [24]
                 0.10      0.83   84627/1495886       monster::can_move_to(tripoint const&) const <cycle 1> [120]
                 0.00      0.36    16661/16661        monster::is_fleeing(player&) const [81]
                 0.01      0.01   270911/270911       scent_map::get(tripoint const&) const [472]
```
can_move_to may be expensive simply because it is structured wrongly. I don't think the compiler understands multi-compilation-unit side effects all that well, so it can't rearrange variables to where they're used.
```
[15]      4.9    0.00      2.39       4604    game::cleanup_dead() [15]
                 0.24      2.12  217376/1495886       Creature_tracker::find(int) const <cycle 1> [361]
                 0.02      0.00  217385/374437        monster::is_dead() const [351]
```
Now that is atrocious. Almost 10% of cycles (the percentages are across both cores, and DDA only uses one) go to a function that looks perfectly fine and non-threatening. Pointer overhead resulting in cache problems?
My newest profiling, this time in tiles build:
```
 34.24      45.15     45.15                               _mcount_private
 14.61      64.42     19.27                               exp
  2.51      67.73      3.31                               __tcf_32
  2.05      70.44      2.71  242751075     0.00     0.00  square_dist(tripoint const&, tripoint const&)
  1.62      72.58      2.14      11774     0.18     0.49  Font::draw_window(WINDOW*, int, int)
  1.27      74.25      1.68     487933     0.00     0.00  HandleDPad()
```
exp? That must be the lighting + vision calculation.
I'm pretty sure we don't need much precision in those, so we could approximate that exp somehow.
http://stackoverflow.com/questions/6984440/approximate-ex has this:
```java
public static double exp(double val) {
    final long tmp = (long) (1512775 * val) + (1072693248 - 60801);
    return Double.longBitsToDouble(tmp << 32);
}
```
EDIT: Also here https://github.com/whackashoe/fastapprox/blob/master/fastapprox/src/fastexp.h
This one is better because it's float. But it has a disadvantage: it explodes for arguments near ~88.8, which is dangerously close to the ranges we still expect to cover with shadowcasting.
Discussion stalled almost a year ago.