I was trying to investigate some issues that were reported regarding build failures when doing:
$ mkdir build
$ cd build
build $ cmake ../engine
build $ make
I asked them to try using sh/vsbuild.sh instead as that is our primary method of building at the moment.
While investigating I found that the building that way seems to be dependent of the number of workers passed to make. If only 1 worker (make) then it would fail; if 4 workers (make -j4) then it would pass.
NOTE: sh/vsbuild.sh does make -j $(nproc) which on my system is make -j 4.
Watching the build it seems like there may be some kind of timing issue for why it may pass/fail - perhaps some there's a dependency in there that's not being detected by cmake but gets properly coerced with multiple workers.
While it would be nice to find these things and fix them in the current system, there's likely going to be a lot of changes between current master (0.7.x) and 0.9.x regarding the build tree which might end up solving this just through refactoring. Capturing here to make sure we fix it properly; however it may get solved.
Benjamin, this is weird.
It was always my impression that building with 1 thread is the more stable choice, and then as you add threads a poorly written Makefiles starts to get confused.
The AUR packages are basically build scripts, and the meat of mine looks like this:
build(){
cd build
cmake ../Vega-Strike-Engine-Source/engine/
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_INSTALL_PREFIX=/usr
-DENABLE_PIE=ON
-DUSE_PYTHON_3=ON
make
}
This has not failed one single time. As most of Arch is pre-compiled packages, the focus for the few custom packages from AUR is on correctness, not build times.
Before Arch Linux, my distro of choice was Sorcerer Linux, and every package in this distro was built from source. It's fair to say that I have built thousands of various bits of software from source, and I can't remember a single case where single thread building was unstable and multiple threads were stable.
Could you maybe provide us with a compile log of where it failed? A second pair of eyes might catch something.
@evertvorster yeah - in all the packages I've built I haven't seen that kind of behavior before. Building with 1 thread should be more stable.
I've done lots of vanilla source builds years back, moved to Gentoo (source builds), and then to Debian/Ubuntu as I haven't had time to continue doing so. So I'm pretty much in the same background/boat.
build-with-1-worker.txt
build-with-2-workers.txt
build-with-3-workers.txt
build-with-4-workers.txt
All of those failed... however, once I switched to Debug builds...they all passed.
NOTE: I applied the following diff after one of the builds failed. All builds above had this diff applied.
bmeyer@laptop:~/Devel/personal/vegastrike/Vega-Strike-Engine-Source/build-1$ git diff --cached
diff --git a/engine/src/python/python_class.h b/engine/src/python/python_class.h
index 2d796369f..e4cf4d30c 100644
--- a/engine/src/python/python_class.h
+++ b/engine/src/python/python_class.h
@@ -28,6 +28,7 @@
#include <eval.h>
#include "python/python_compile.h"
#include "cmd/ai/fire.h"
+#include "cmd/base.h"
#include <memory>
#include "init.h"
#define PYTHONCALLBACK(rtype, ptr, str) \
@BenjamenMeyer with 1 worker, you got 13% into the build, failing to build /engine_com.dir/src/python/unit_exports1.cpp
With each of the multiple worker builds, the build failed at 0%, failing to build src/cmd/collide2/Ice/IceAABB.cpp.o
The rest was just the other jobs finishing up. So, there is definitely a concurrency issue, and it is brought on by adding more than 1 worker, as expected. The world is normal, and the sky is not falling. :P
Looking in the CMakeLists.txt, there are a couple of things added to your CFLAGS with a Debug build.
-include config.h -pipe -g2 -std=c++11 -Wall -fvisibility=hidden
Hopefully this helps to dispel some mysteries?
@evertvorster yeah...I was coherent enough to run/capture the builds last night but not for much else. Thanks for look. We'll have to fix that up so all build models works properly.
Looking closer...
https://github.com/vegastrike/Vega-Strike-Engine-Source/blob/master/engine/CMakeLists.txt#L597 (Debug Flags) mirrors https://github.com/vegastrike/Vega-Strike-Engine-Source/blob/master/engine/CMakeLists.txt#L621 (Release Flags) and also https://github.com/vegastrike/Vega-Strike-Engine-Source/blob/master/engine/CMakeLists.txt#L645 (Profiler Flags); the main difference is that Debug adds the -g2 flag which gives us the debugger data.
We can remove https://github.com/vegastrike/Vega-Strike-Engine-Source/blob/master/engine/CMakeLists.txt#L512 as that specifies C++98 and we're adding C++11 in later flags. I can't find anything different other than enabling -g2 between the builds in CMakeLists.txt.
Going to start going through the headers and see if I can find something... at least checking for debug (case insensitive) doesn't reveal anything.
Ran a diff of the single worker builds between what failed (build-1) and what succeeded (build-1-debug). You can see the diff in this attachment:
standard-vs-debug-build.diff.gz
Reviewing it, the biggest thing that stands out is the following - which is in multiple chunks wherever the CXX flags are defined:
-CXX_FLAGS = -O2 -g -DNDEBUG -std=gnu++03 -std=gnu++11
+CXX_FLAGS = -O2 -DNV_CUBE_MAP -DBOOST_PYTHON_NO_PY_SIGNATURES -DBOOST_LOG_DYN_LINK -include config.h -pipe -g2 -std=c++11 -Wall -fvisibility=hidden -std=gnu++03 -std=gnu++11
Two things:
+) there are a lot more things defined than the broken one (in red - starting with the -). This delta seems quite significant.:one: is easy enough to do - there's a small block to remove from CMakeLists.txt.
:two: is a bit harder as will require a refactor of the CMakeLists.txt.
I'd like to get into :two: more once 0.6.x is released. So going to defer further work on this until then.
NOTE: Not sure it'll be 0.7.x or 0.9.x - really depends on how fast those come together.
@BenjamenMeyer Which of those compiler flags are system-wide, and which are from the Makefile?
From the looks of it, none of the CXXFLAGS in the Release build are actually debug flags.
Funnily enough, the "-g" in the release build is actually a debug flag.
https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
I am not sure which standard GCC will compile to if there are more than one defined. It is the higher one? The first one specified? The last one specified? It's been too long since I read up on that.
@BenjamenMeyer could you post here the log of a failed build please?
@evertvorster the -g2 is in the debug build version. I'm not sure which will win out; guessing the one specified later in the command args, but it really comes down to how GCC processes arguments.
@nabaco I'll take a look at doing so...I'll have to do another build though as I wiped those away last night...though I do have a TGZ of all the builds still...but it was large enough GitHub wouldn't let me drop it in here
@BenjamenMeyer, please clear up some confusion for me here. I was under the impression that your working build is the "debug" taget, and that the "release" target did not build. This belief was further strengthened by your naming convention of "standard" vs "debug" build. Since diff accepts the original as the first file, and usually make little "-" signs there, I assumed that the the line -CXX_FLAGS = -O2 -g -DNDEBUG -std=gnu++03 -std=gnu++11 came from the standard build, and that the line +CXX_FLAGS = -O2 -DNV_CUBE_MAP -DBOOST_PYTHON_NO_PY_SIGNATURES -DBOOST_LOG_DYN_LINK -include config.h -pipe -g2 -std=c++11 -Wall -fvisibility=hidden -std=gnu++03 -std=gnu++11 came from the debug build.
This is why I mentioned that in your standard build you have debug flags, and no debug flags in the debug build.
I'm going to setup a better accounting and methodology of reproducible this to help solve it.
@evertvorster to outline my original repro...
I started by creating some directories (build-1, build-2, build-3, build-4) and did as the description suggested, which reproduced the error. The build-1 directory was used on the left side of the diff that was described in https://github.com/vegastrike/Vega-Strike-Engine-Source/issues/155#issuecomment-651505401.
I later came back and did a build-1.debug and build-4.debug that followed what the vsbuild.sh did and passed successfully. The build-1.debug was used on the right side of the diff that was described in https://github.com/vegastrike/Vega-Strike-Engine-Source/issues/155#issuecomment-651505401.
Hopefully that clears up what I did prior... going forward...I'm thinking of creating a script to make something reproducible.
Timing bug. Hmm. I wonder if the fix is to put in an add_dependencies clause in CMakeLists.txt, like I did here
My €0.02:
-g specified and should strip the binaries (but we should use it only for profiling purposes, not releases)-g specified (this should be the default IMO)-g2 specified (this is the build we should use for CI and dev)AFAIK, the last flag on the command line overrides the preceding ones. This is how you are able to override upstream's flags in the packaging systems I've personally worked with. But I would need to read up on it to be 100% confident.
@ermo agree except that -g and -g2 are now equivalent. Debug should be -g3 now.
My €0.02:
* The release build should **not** have `-g` specified and should strip the binaries (but we should use it only for profiling purposes, not releases) * The RelwithDebugInfo build should have `-g` specified (this should be the default IMO) * The Debug build should should have `-g2` specified (this is the build we should use for CI and dev)AFAIK, the last flag on the command line overrides the preceding ones. This is how you are able to override upstream's flags in the packaging systems I've personally worked with. But I would need to read up on it to be 100% confident.
It turns out that commenting on stuff when you're both tired and out of the loop is not such a great idea. sigh
On 0.6.x, it has come to my attention that RelWithDebInfo was removed.
Hence, the list becomes:
-O2 -g specified (and not have -werror set) and allow distros to strip the binaries for debug symbols and essentially be the new RelWithDebInfo.-Og -g3 specified and should probably be the default build type (used in CI because it turns on -werror) because users compiling it locally probably want to help develop or troubleshoot the build.we haven't heard any more on this, so closing for now. We can re-open if someone finds it an issue again. Most likely this will be going away with CMake changes related to #56
Most helpful comment
It turns out that commenting on stuff when you're both tired and out of the loop is not such a great idea. sigh
On 0.6.x, it has come to my attention that RelWithDebInfo was removed.
Hence, the list becomes:
-O2 -gspecified (and not have-werrorset) and allow distros to strip the binaries for debug symbols and essentially be the new RelWithDebInfo.-Og -g3specified and should probably be the default build type (used in CI because it turns on-werror) because users compiling it locally probably want to help develop or troubleshoot the build.