After updating to core v2.3.3 in #4723, the Release configuration of the Mac unit tests is now crashing. The backtrace looks like so:
* thread #1: tid = 0x85e70f, 0x0000000104a3c7f3 Realm`realm::util::InterprocessCondVar::set_shared_part(realm::util::InterprocessCondVar::SharedPart&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 355, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000104a3c7f3 Realm`realm::util::InterprocessCondVar::set_shared_part(realm::util::InterprocessCondVar::SharedPart&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 355
frame #1: 0x0000000104b27c45 Realm`realm::SharedGroup::do_open(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, bool, realm::SharedGroupOptions) + 2901
frame #2: 0x00000001049e1582 Realm`realm::SharedGroup::open(realm::Replication&, realm::SharedGroupOptions) + 226
frame #3: 0x00000001049e1187 Realm`realm::SharedGroup::SharedGroup(realm::Replication&, realm::SharedGroupOptions) + 1527
* frame #4: 0x000000010499e1fe Realm`realm::Realm::open_with_config(realm::Realm::Config const&, std::__1::unique_ptr<realm::Replication, std::__1::default_delete<realm::Replication> >&, std::__1::unique_ptr<realm::SharedGroup, std::__1::default_delete<realm::SharedGroup> >&, std::__1::unique_ptr<realm::Group, std::__1::default_delete<realm::Group> >&, realm::Realm*) [inlined] realm::SharedGroup::SharedGroup(repl=0x0000600000120b40) + 898 at group_shared.hpp:800 [opt]
There is no debug information for the core symbols due to the release version of the core library being stripped.
The disassembly around the crashing instruction looks like so:
0x104a3c7a1 <+273>: leaq 0x23d3e3(%rip), %rsi ; ".cv"
0x104a3c7a8 <+280>: leaq -0x78(%rbp), %rdi
0x104a3c7ac <+284>: callq 0x104c135bc ; symbol stub for: std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::append(char const*)
0x104a3c7b1 <+289>: leaq 0x8(%r14), %rbx
0x104a3c7b5 <+293>: movq 0x10(%rax), %rcx
0x104a3c7b9 <+297>: movq %rcx, -0x50(%rbp)
0x104a3c7bd <+301>: movq (%rax), %rcx
0x104a3c7c0 <+304>: movq 0x8(%rax), %rdx
0x104a3c7c4 <+308>: movq %rdx, -0x58(%rbp)
0x104a3c7c8 <+312>: movq %rcx, -0x60(%rbp)
0x104a3c7cc <+316>: movq $0x0, 0x10(%rax)
0x104a3c7d4 <+324>: movq $0x0, 0x8(%rax)
0x104a3c7dc <+332>: movq $0x0, (%rax)
0x104a3c7e3 <+339>: testb $0x1, (%rbx)
0x104a3c7e6 <+342>: jne 0x104a3c7ef ; <+351>
0x104a3c7e8 <+344>: movw $0x0, (%rbx)
0x104a3c7ed <+349>: jmp 0x104a3c7fe ; <+366>
0x104a3c7ef <+351>: movq 0x18(%r14), %rax
-> 0x104a3c7f3 <+355>: movb $0x0, (%rax)
0x104a3c7f6 <+358>: movq $0x0, 0x10(%r14)
0x104a3c7fe <+366>: xorl %esi, %esi
The crashing instruction appears to be code inlined from std::string, corresponding to the expression m_resource_path = base_path + "." + condvar_name + ".cv";.
I'm not able to reproduce the crash when building against a local version of core v2.3.3 and sync v1.3.1. I've compared the disassembly of InterprocessCondVar::set_shared_part in a version of the framework that experiences the crash vs one that doesn't (built with local core and sync), and they're identical other than the addresses within the code. The code in SharedGroup::do_open that sets up the call to set_shared_part likewise appears to be identical.
I checked on CI and verified that the released bits of core v2.3.3 appear to have been built with Xcode 8.2 as expected, that the headers match those from my local build of core v2.3.3, and that the released bits contain all of the changes that went in to v2.3.3. That is to say, nothing looks obviously bogus about the released version of core that's being pulled down.
@finnschiermer and (mostly) I failed at building a local sync, but can confirm that a clean checkout of realm-cocoa indeed triggers this crash.
I did a fresh clone of the sync repo, checked out the tag corresponding to the latest release (v1.3.1), and built the cocoa package, which unzipped correctly into the realm-cocoa work area. I verified that sh build.sh download-sync did not modify the folder, but nonetheless ran into compilation issues.
Okay, after a bit of trial and error I managed to build core and sync correctly, and can also confirm that I cannot repro with a local build using Xcode 8.2.1.
What I've managed to determine so far is that the this pointer in InterprocessCondVar::set_shared_part doesn't point to a valid instance. The representation of the SharedGroup on the heap appears to have 8 bytes of padding between each InterprocessCondVar, while the code that populates the this pointer to m_new_commit_available doesn't account for this padding.
When we see the crash:
// The first condition variable, m_room_to_write, is 0x1c58 bytes into the SharedGroup.
// The four condition variables occupy 40 bytes each, with 8 bytes of padding between them.
(lldb) memory read -c `48 * 4` `$rdi + 0x1c58`
0x10281ae58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281ae68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281ae78: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ????????........
0x10281ae88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281ae98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281aea8: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ????????........
0x10281aeb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281aec8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281aed8: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ????????........
0x10281aee8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281aef8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281af08: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ????????........
// The code retrieves m_new_commit_available from 0x1cd0 bytes into the SharedGroup.
// That's not right.
(lldb) memory read -c 40 `$rdi + 0x1cd0`
0x10281aed0: 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ........????????
0x10281aee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x10281aef0: 00 00 00 00 00 00 00 00 ........
When using my local build of core, the representation of SharedGroup on the heap does _not_ have the 8 bytes of padding between the condition variables:
(lldb) memory read -c `40 * 4` `$rdi + 0x1c58`
0x101878c58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x101878c68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x101878c78: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ................
0x101878c88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x101878c98: 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ................
0x101878ca8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x101878cb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x101878cc8: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ................
0x101878cd8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x101878ce8: 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ................
The code for the SharedGroup constructor that's generated into shared_realm.o has the correct offsets. There's another (weak) copy of the constructor in client_file_access_cache-macosx.o within librealm-macosx.a that has the bogus member offsets. My suspicion is that the version of sync in question, v1.3.1, was not correctly built against the headers from core v2.3.3, but instead used core v2.3.2's headers in which InterprocessCondVar was a different size (prior to 89dbb1a3422fddeeaffc15358f66fd58d3ad5dc7).
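To make the offset arithmetic concrete, here is a minimal sketch of how two translation units that see different sizes for the same member type end up disagreeing about where the fourth condition variable lives. The types below are hypothetical stand-ins, not the real realm-core declarations; the only facts taken from the dumps above are the 0x1c58 starting offset, the 40- and 48-byte strides, and the 0x1cd0 offset the crashing code used.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for InterprocessCondVar as seen through two
// different sets of headers. Only the sizes (40 vs. 48 bytes) matter.
struct CondVar40 { unsigned char storage[40]; };
struct CondVar48 { unsigned char storage[48]; };

// Hypothetical stand-in for SharedGroup: 0x1c58 bytes of earlier members
// followed by four condition variables laid out back to back.
template <class CondVar>
struct GroupLayout {
    unsigned char earlier_members[0x1c58];
    CondVar room_to_write;
    CondVar second;   // placeholder name
    CondVar third;    // placeholder name
    CondVar new_commit_available;
};

using Layout40 = GroupLayout<CondVar40>;
using Layout48 = GroupLayout<CondVar48>;

// Offset of the fourth condition variable for a given per-variable stride.
constexpr std::size_t fourth_condvar_offset(std::size_t stride) {
    return 0x1c58 + 3 * stride;
}

// Code compiled against the 40-byte layout looks for the member at 0x1cd0.
static_assert(offsetof(Layout40, new_commit_available) == 0x1cd0,
              "40-byte stride puts the fourth condvar at 0x1cd0");

// An object constructed with the 48-byte layout actually has it at 0x1ce8.
static_assert(offsetof(Layout48, new_commit_available) == 0x1ce8,
              "48-byte stride puts the fourth condvar at 0x1ce8");

// With the 48-byte layout, 0x1cd0 falls this many bytes into the third slot.
constexpr std::size_t bytes_into_third_slot() {
    return 0x1cd0 - offsetof(Layout48, third);
}
static_assert(bytes_into_third_slot() == 24,
              "the stale offset lands inside the third condvar");
```

Under these assumptions, a constructor using one layout and a caller using the other would hand set_shared_part a this pointer 24 bytes into the third slot instead of at the start of the fourth, which is consistent with the memory dumps above. Because both definitions are weak symbols with the same name, the linker is free to pick either copy without a diagnostic.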
When downloading realm-sync-cocoa-1.3.1.tar.xz, I can see that the headers contain the changes from the commit you mentioned (https://github.com/realm/realm-core/commit/89dbb1a3422fddeeaffc15358f66fd58d3ad5dc7#diff-ea199165135792c1c963cd4d0de0e07eR113).
But it is possible that sync was not built correctly. I'm looking into it.
I have been able to repro locally. Here's what I did:
This matches our observations that the assembly code looks like core v2.3.2, while the headers are from v2.3.3.
If this is indeed what has happened for that release, the question then becomes _how_. Perhaps @radu-tutueanu is able to offer some insights here?
I was suspecting ccache, but I rebuilt it without ccache and got the same result. Here is what I have until now:
It builds fine on lv_host2_2. So the issue occurs when building sync on macos-cph-01.cph.realm, and it is not ccache. Maybe some job installed the core libs? I cannot ssh into the machine to figure out more, as my hotel's Wi-Fi blocks port 22. I can take a look on Monday.
@alebsack confirmed that there were (older) core headers installed on the machine. We need to figure out which job installed them.
@radu-tutueanu has rebuilt the sync v1.3.1 release on a different machine, and the tests now pass. Thanks, @radu-tutueanu!