Neovim: -Wmaybe-initialized warnings in release mode are real errors

Created on 19 Nov 2017 · 3Comments · Source: neovim/neovim

Tl;dr: The warnings being spit out when compiling neovim in release mode are not bogus. They are bugs in gcc's optimizer. This effectively re-opens issue #5061.

Version info

nvim --version:

NVIM v0.2.2                                                                                                          
Build type: RelWithDebInfo                                                                                           
LuaJIT 2.0.5                                                                                                         
Compilation: /usr/lib64/ccache/cc -Wconversion -DNVIM_MSGPACK_HAS_FLOAT32 -DNVIM_UNIBI_HAS_VAR_FROM -O2 -g -DMIN_LOG_
LEVEL=3 -Og -g -Wall -Wextra -pedantic -Wno-unused-parameter -Wstrict-prototypes -std=gnu99 -Wvla -fstack-protector-s
trong -fdiagnostics-color=auto -DINCLUDE_GENERATED_DECLARATIONS -D_GNU_SOURCE <removed private -I directory stuff>

Vim (version: ) behaves differently? n/a
Operating system/version:

~/neovim> uname -a
Linux <snip> 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
~/neovim> lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.4 (Maipo)
Release:        7.4
Codename:       Maipo

Terminal name/version: n/a
$TERM: n/a
gcc --version: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
gdb --version:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
This GDB was configured as "x86_64-redhat-linux-gnu".

Bug Description

I decided to look into the compiler warnings generated while compiling neovim in release mode. When inspecting the code, I found it to be well-formed but not fully clear; the possible control paths were convoluted or not-obvious. Every control path led to a properly initialized value.

Digging deeper, I looked at the disassembly for a particular function: server_stop from here. I worked through it instruction by instruction, and I found the bug. Looking at the disassembly:

Dump of assembler code for function server_stop:
   0x0000000000530edc <+0>:     push   %r12
   0x0000000000530ede <+2>:     push   %rbp
   0x0000000000530edf <+3>:     push   %rbx
   0x0000000000530ee0 <+4>:     sub    $0x120,%rsp
   0x0000000000530ee7 <+11>:    mov    %rdi,%rsi
   0x0000000000530eea <+14>:    mov    %fs:0x28,%rax
   0x0000000000530ef3 <+23>:    mov    %rax,0x118(%rsp)
   0x0000000000530efb <+31>:    xor    %eax,%eax
   0x0000000000530efd <+33>:    mov    $0x100,%edx
   0x0000000000530f02 <+38>:    lea    0x10(%rsp),%rdi
   0x0000000000530f07 <+43>:    callq  0x51cf5f <xstrlcpy>
   0x0000000000530f0c <+48>:    mov    $0x0,%ebx
   0x0000000000530f11 <+53>:    jmp    0x530f38 <server_stop+92>
   0x0000000000530f13 <+55>:    movslq %ebx,%rax
   0x0000000000530f16 <+58>:    shl    $0x3,%rax
   0x0000000000530f1a <+62>:    add    0x43902f(%rip),%rax        # 0x969f50 <watchers+16>
   0x0000000000530f21 <+69>:    mov    (%rax),%r12
   0x0000000000530f24 <+72>:    mov    %r12,%rsi
   0x0000000000530f27 <+75>:    lea    0x10(%rsp),%rdi
   0x0000000000530f2c <+80>:    callq  0x436d30 <strcmp@plt>
   0x0000000000530f31 <+85>:    test   %eax,%eax
   0x0000000000530f33 <+87>:    je     0x530f42 <server_stop+102>
   0x0000000000530f35 <+89>:    add    $0x1,%ebx
   0x0000000000530f38 <+92>:    mov    0x439002(%rip),%ebp        # 0x969f40 <watchers>
   0x0000000000530f3e <+98>:    cmp    %ebx,%ebp
   0x0000000000530f40 <+100>:   jg     0x530f13 <server_stop+55>
   0x0000000000530f42 <+102>:   cmp    %ebp,%ebx
   0x0000000000530f44 <+104>:   jl     0x530f7e <server_stop+162>
   0x0000000000530f46 <+106>:   lea    0x10(%rsp),%rax
   0x0000000000530f4b <+111>:   mov    %rax,(%rsp)
   0x0000000000530f4f <+115>:   mov    $0x6d0c44,%r9d
   0x0000000000530f55 <+121>:   mov    $0x1,%r8d
   0x0000000000530f5b <+127>:   mov    $0xc5,%ecx
   0x0000000000530f60 <+132>:   mov    $0x6d0c58,%edx
   0x0000000000530f65 <+137>:   mov    $0x0,%esi
   0x0000000000530f6a <+142>:   mov    $0x3,%edi
   0x0000000000530f6f <+147>:   mov    $0x0,%eax
   0x0000000000530f74 <+152>:   callq  0x504d14 <do_log>
   0x0000000000530f79 <+157>:   jmpq   0x531003 <server_stop+295>
   0x0000000000530f7e <+162>:   mov    $0x6d0c30,%edi
   0x0000000000530f83 <+167>:   callq  0x555573 <os_getenv>
   0x0000000000530f88 <+172>:   test   %rax,%rax
   0x0000000000530f8b <+175>:   je     0x530fa8 <server_stop+204>
   0x0000000000530f8d <+177>:   mov    %rax,%rsi
etc...

The important lines are here:

<+98>:    cmp    %ebx,%ebp
<+100>:   jg     0x530f13 <server_stop+55>
<+102>:   cmp    %ebp,%ebx
<+104>:   jl     0x530f7e <server_stop+162>

These come from the original code:

SocketWatcher *watcher;
// <snip>
int i = 0;  // Index of the server whose address equals addr.
for (; i < watchers.ga_len; i++) {
    watcher = ((SocketWatcher **)watchers.ga_data)[i];
    // <snip>
  }

  if (i >= watchers.ga_len) {
    ELOG("Not listening on %s", addr);
    return;
  }

Normally, the if statement _should_ catch the only control path where watcher is not assigned: watchers.ga_len <= 0. When compiled, the assembly lines 98 and 100 correspond to checking if i < watchers.ga_len, and the lines 102 and 104 correspond to checking if i >= watchers.ga_len. The assembly seems to compare ebp (which is watchers.ga_len) with ebx (which is i), and jump if greater; then do the same comparison and jump if less. This is where gcc makes a mistake: it flips the order of the cmp instruction. This means that the REAL behavior is first check if i < watchers.ga_len and then check if i < watchers.ga_len. Which means the code inside the if statement is NEVER executed; no combination of i and watchers.ga_len will ever trigger ELOG().

So not only is this a use of an uninitialized value if watchers.ga_len == 0 (or technically, if it's less than zero too), it also clobbers any error detection if the for loop reaches the last entry (which would normally cause i == watchers.ga_len too).

It took me a few hours to go through this process, so I haven't looked at any of the other Wmaybe-uninitialized errors. I suspect they may be similar bugs, though.

What to do about it

The first thing I suggest is to put more test coverage around the warning areas. Gcc is clearly doing some weird stuff.

Since this is a bug in GCC itself, we should probably submit a bug report. I have no idea where to even begin, because it may have already been reported. Searching google for stuff like "gcc wmaybe-uninitialized bug" just comes up with people complaining about false positives, not actual code generation bugs.

I would suggest checking the fall-through logic on for loops. While that if statement (if (i >= watchers.ga_len)) technically handles the case where watchers.ga_len == 0, I don't think it was intended to do so?

I would also suggest to use unsigned variables where appropriate. It may help gcc figure out these not-obvious control flow issues, and helps readability. Though I understand that this is treated as a matter of style and not correctness in the open source community, for whatever reason, unless there's something I'm missing.

Steps to reproduce from root directory:

rm -r build
make clean
make CMAKE_BUILD_TYPE=RelWithDebInfo
gdb build/bin/nvim
disassemble server_stop

Actual behaviour

n/a

Expected behaviour

n/a

bug robustness

Source

Phlosioneer

👍4

Most helpful comment

I've fixed the bug for server_stop and submitted a PR. I'm going to investigate the other warnings and add their fixes to the PR. Should I create a tracking issue for them? They're likely to be as complicated and long-winded as this bug, so putting them in the comments here doesn't seem correct.

Phlosioneer on 19 Nov 2017

👍2

All 3 comments

I would also suggest to use unsigned variables where appropriate. It may help gcc figure out these not-obvious control flow issues, and helps readability. Though I understand that this is treated as a matter of style and not correctness in the open source community, for whatever reason, unless there's something I'm missing.

This is for whatever reason defined in the style guide, but our style guide suggests using size_t there and not int. If garray_T was not used throughout the whole code base where new code normally uses kvec (don’t know why it was not used here, it is not like msgpack_rpc/server.c could possibly be inherited from Vim; maybe that code just was merged in before merging in kvec.h), it could be refactored to use size_t, but there are simply too many places where garray_T is already used inherited from Vim. And the very same style guide suggests that iterator variables like i should have the same type as integer they are compared with.