After running into an odd JVM crash while testing datatypes branch changes, @raver119 and I tracked this back to a compiler issue. I was running an old gcc version - (6.x I think? not sure) and after upgrading, the JVM crash issue isn't present anymore.
To reduce the probability of running into issues like this in the future, we should consider enforcing a recent GCC version during the libnd4j build.
GCC 6.x is pretty recent, and pretty much the version we need to use to support CentOS 6, so if we have an issue with that we should find a workaround in libnd4j.
I'll let @raver119 comment, but the problem occurred in a trivial code segment using Open MP for pairwise ops. Removing the last pragma here is sufficient to 'fix' the issue with old GCC, so we're quite sure it's a compiler bug:
if (xEws == 1 && yEws == 1 && zEws == 1) {
#pragma omp parallel num_threads(info._numThreads) if (info._numThreads > 1) default(shared)
{
auto threadNum = omp_get_thread_num();
Nd4jLong threadOffset = info.getThreadOffset(threadNum);
auto xi = x + threadOffset;
auto yi = y + threadOffset;
auto zi = z + threadOffset;
#pragma omp simd
for (Nd4jLong i = 0; i < info.getItersPerThread(threadNum); i++)
zi[i] = OpType::op(xi[i], yi[i], extraParams);
}
}
Edit: I will add this was encountered on Windows, might be specific to gcc on windows then.
Actually, GCC 7 should work with devtoolset-7 on CentOS 6, but probably doesn't with CUDA 9.0. If we're OK with dropping support for CUDA 9.0, that's OK I guess.
Anyway, that's a documentation thing IMO, failing the build because the user doesn't have the compiler we want them to is not standard practice.
That's not exactly trivial, but we can put those in blocks such as:
#if __GNUC__ > 6
#pragma omp simd
#endif
We already need to do that for Clang anyway.
no guys, problem in that code block wasn't in pragma itself.
pragma caused troubles only in specific _sequence_ of operations, and highly likely on specific hardware.
in operation executed standalone - there were no troubles.
@raver119 so, back to the original issue - should we be doing anything (like enforcing gcc versions) to avoid running into this in the future? It's at least a few hours work to track down issues like this...
It's N+1 issue, and nobody is guaranteed from that happening, even with latest compilers. Like we had issues after Meltdown/Spectre fixes. So i believe best thing we can do here is tests that are executed in the same build mode that will be used for release. And, using latest compilers, where it's possible.
Luckily, such issues aren't happening THAT often.
P.s. if we need GCC 6 to be used for some specific reason - we should enforce its use in test & dev environment as well. If we can go with 7 or 8 - that would be better though.
P.s. if we need GCC 6 to be used for some specific reason - we should enforce its use in test & dev environment as well. If we can go with 7 or 8 - that would be better though.
Hm, I like this idea, we'd be more likely to catch stuff during normal dev, instead of in final pre-release testing (or not at all). Definitely worth considering IMO (though let's make it overrideable with build flag)
@sshepel Can you post a list of compilers/versions used on all platforms we ship to?
|Platform | Compiler/version |
|-------------|------------------|
| android-arm | Clang 5.0.300080 |
| android-arm64 | Clang 5.0.300080 |
| android-x86 | Clang 5.0.300080 |
| android-x86_64 | Clang 5.0.300080 |
| ios-arm64 | AppleClang 8.0.0.8000038 |
| ios-x86_64 | AppleClang 8.0.0.8000038 |
| linux-armhf | GNU 4.9.3 |
| linux-ppc64le | GNU 5.4.0 |
| linux-ppc64le-cuda* | GNU 5.4.0 |
| linux-x86_64-centos6 | GNU 7.3.1 |
| linux-x86_64-centos6-cpu-avx2 | GNU 7.3.1 |
| linux-x86_64-cpu-avx512 | GNU 7.3.1 |
| linux-x86_64-cuda* | GNU 6.3.1 |
| macosx-x86_64 | GNU 7.3.0 |
| macosx-x86_64-cpu-avx2 | GNU 5.5.0 |
| macosx-x86_64-cuda* | AppleClang 8.0.0.8000038 |
| windows-x86_64 | GNU 7.2.0 |
| windows-x86_64-cpu-avx2 | GNU 7.2.0 |
| windows-x86_64-cuda* | MSVC 19.0.24215.1 |
For PRs situation for Linux CPU builds a bit different:
|Platform | Compiler/version |
|-------------|------------------|
| linux-x86_64-centos7 | GNU 7.3.1 |
| linux-x86_64-centos7-cpu-avx2 | GNU 7.3.1 |
| linux-x86_64-cpu-avx512 | GNU 7.3.1 |
@raver119 ^^
macosx-x86_64 vs macosx-x86_64-cpu-avx2
Why different versions used there?
@raver119 good question, looks like one of the build agents has been updated by someone...
Which version do we need? I can fix that...
use 7.3 there then, as used on macosx-x86_64
use 7.3 there then, as used on macosx-x86_64
ok, will try to fix that asap...
@raver119 all macosx-x86_64 build agents now have 7.3 compiler version installed.
This issue is fairly outdated now.
Most helpful comment
@raver119 all
macosx-x86_64build agents now have7.3compiler version installed.