I came across a very interesting problem with (active) cell iterator being used as a member of a class in a user library
template <int dim>
class Atom
{
public:
Atom();
typename dealii::DoFHandler<dim>::active_cell_iterator parent_cell;
};
This leads to double free or corruption on GCC in Release mode only with the current master branch. The code runs fine when linked against 8.4.1 with GCC or current master with Clang both in Debug and Release modes.
The prerequisites are as follows (if any of those is not the case, problem disappears!)
Utilities::MPI::MPI_InitFinalize in .cc fileA one-dummy-class library with a dummy test which demonstrates the issue can be download from
https://groups.google.com/group/dealii-developers/attach/3b8b33a54794d/init_finalize.tar.gz?part=0.1&authuser=0 . Currently there is no way to attach this unit test to the library, unless @tamiko knows some tricks how to abuse unit tests to build a single class library.
By doing bisection between now and 8.4.0 I figure out that ec7d6ee1cf00b4dd074bd92e3efcca53e93331f3 by @bangerth is the first commit which demonstrates the issue. This commit is a part of https://github.com/dealii/dealii/pull/2179.
Previous commit 6461bb7092b2dc18745afadb29f248de434475f6 runs fine both in Debug and Release mode with GCC.
I don't understand how adding an Assert may result in different behaviour during Release build. @bangerth : any ideas (especially given strange prerequisite list)? I think it should still be possible to have an object of active_cell_iterator initialized without DoFHandler and alike. We have things like find_active_cell_around_point which return iterators. They may fail to find the cell in case of p::d::Tria, but this may be a correct behaviour to be processes by a user.
In reference to this post https://groups.google.com/forum/#!topic/dealii-developers/huPy80Fpra4
p.s. appears with deal.II build with Candi and Spack on Ubuntu 16 and 14 with GCC 4.8, 5.4, 6.2.
The assert dereferences dof_handler, which is a null pointer in your case. I think ec7d6ee is broken.
well, it's not only that. The point is that the error appears in release mode, where Assert should not matter.
It turns out this bug has been fixed in #2191 by myself and I can not recall doing that. :-)
Does your code still crash with #2191 applied?
yes, it crashes with master (see OP). The Assert is not an issue here, which is strange.
well, it's not only that. The point is that the error appears in release mode, where Assert should not matter.
I know, but if your code breaks with this patch, the assert does matter because it is the only change.
Can you try master and comment out the assert?
i will tomorrow when I have access to Ubuntu, but on the current master debug works, while release is broken with that
double free or corruption
error.
I can reproduce the bug with the current dev version. Here's the backtrace I get:
#0 0x00007ffff1092c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff1096028 in __GI_abort () at abort.c:89
#2 0x00007ffff10cf2a4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff11dd6b0 "*** Error in `%s': %s: 0x%s ***\n")
at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff10db55e in malloc_printerr (ptr=<optimized out>, str=0x7ffff11dd798 "double free or corruption (!prev)", action=1) at malloc.c:4996
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
#5 0x00007fffe4bbc653 in deallocate (__p=<optimized out>, this=0x7ffff7dcd5e0 <dealii::Polynomials::Hierarchical::recursive_coefficients>)
at /usr/include/c++/4.8/ext/new_allocator.h:110
#6 _M_deallocate (__n=<optimized out>, __p=<optimized out>, this=0x7ffff7dcd5e0 <dealii::Polynomials::Hierarchical::recursive_coefficients>)
at /usr/include/c++/4.8/bits/stl_vector.h:174
#7 ~_Vector_base (this=0x7ffff7dcd5e0 <dealii::Polynomials::Hierarchical::recursive_coefficients>, __in_chrg=<optimized out>)
at /usr/include/c++/4.8/bits/stl_vector.h:160
#8 std::vector<std::shared_ptr<std::vector<double, std::allocator<double> > const>, std::allocator<std::shared_ptr<std::vector<double, std::allocator<double> > const> > >::~vector (this=0x7ffff7dcd5e0 <dealii::Polynomials::Hierarchical::recursive_coefficients>, __in_chrg=<optimized out>)
at /usr/include/c++/4.8/bits/stl_vector.h:416
#9 0x00007ffff109853a in __cxa_finalize (d=0x7fffe764a640) at cxa_finalize.c:56
#10 0x00007fffe0cd96c3 in __do_global_dtors_aux () from /home/bangerth/p/deal.II/1/install/lib/libdeal_II.g.so.8.5.0-pre
#11 0x00007fffffffda30 in ?? ()
#12 0x00007ffff7dea70a in _dl_fini () at dl-fini.c:252
So this happens at the end of the run. But the location is not helpful. I suppose this is the second call to free when the first one may be to blame.
@davydden -- I suspect that that test happened to accidentally fail with the version you found, and was fixed by @tjhei later on. Can you go back one month and check again, two months and check again, etc, to avoid the intermittent stretch where it failed?
Also, one way to figure out better what is going on is to make sure that GCC also uses -ggdb in optimized mode. I think it should be enough to set CXXFLAGS=-ggdb before calling cmake. This way you should get a more useful backtrace than the one above which really is not particularly helpful.
@bangerth i was bisecting it from now untill 8.4.0, along the way I tried several hashes newer than the commit I found. I also did a few checks manually as I originally thought it has to do with MPI_Init_Finalize. None of the hashes I tried worked. @tjhei fix does not affect the bug in release mode, it only changes the behaviour from failing to passing in debug mode. In other words, it fixes the debug mode only.
Will try with extra flags tomorrow.
For the record, I ran the program under valgrind but it didn't provide anything useful.
Since you know what the patch is that apparently broke the library, what if you did as @tjhei suggested: remove the assertion in question altogether? Does that really restore your ability to run the code?
remove the assertion in question altogether? Does that really restore your ability to run the code?
tested on master, by commenting the Assert the error disappears in release mode. I am puzzled...
I tried with non-empty constructor, i.e. something dummy like
{
const double val = 1.0;
Assert (val > 0., ExcInternalError());
}
and it also works.
p.s. sorry, glitches of Ubuntu with closing/opening of the issue.
When running cmake in your example, I am seeing a warning
CMake Warning at /ssd/deal-build/build-gcc/share/deal.II/macros/macro_deal_ii_add_test.cmake:212 (ADD_
EXECUTABLE):
Cannot generate a safe runtime search path for target
atoms_to_cells_01.release because there is a cycle in the constraint graph:
dir 0 is [/ssd/deal-build/build-gcc/lib]
dir 1 is [/users/heister/Dropbox/work/deal.II/mailinglist/active_cell_iterator_crash]
dir 2 is [/ssd/libs-candi/p4est-1.1/FAST/lib]
dir 3 must precede it due to runtime library [libp4est-1.1.so]
dir 3 is [/ssd/libs-candi/p4est-1.1/DEBUG/lib]
dir 2 must precede it due to runtime library [libp4est-1.1.so]
so it looks like the project is messing up debug and release libraries. Could that be the related?
Otherwise add std::cout or run gdb in the constructor to check this, triangulation, and dof_handler for NULL.
so it looks like the project is messing up debug and release libraries.
could be, but then it's the fault of cmake scripts we provide within deal.II to setup user library.
Anyway, the point is that the same error happens with Spack, which does not build different p4est's (debug and release). So i doubt that this is the source of the problem.
could be, but then it's the fault of cmake scripts we provide within deal.II to setup user library.
yes.
So you are not seeing these warnings then?
So you are not seeing these warnings then?
i do, but only with Candi. When I use Spack's build deal.II, there are no warnings like that but the problem persist. So that's why I conclude that those warnings are not relevant here.
can you do an ldd tests/core/atoms_to_cells_01.release/atoms_to_cells_01.release? For me it links to libdeal_II.so.8.5.0-pre and libdeal_II.g.so.8.5.0-pre... :zap:
same issue here with Spack, links against both:
$ ldd tests/core/atoms_to_cells_01.release/atoms_to_cells_01.release | grep deal_II
libdeal_II.so.8.5.0-pre => /home/davydden/spack/opt/spack/linux-Ubuntu16-x86_64/gcc-5.4.0/dealii-develop-b4u3pdzih7qp3kymzkmtk2msfz6hjyfz/lib/libdeal_II.so.8.5.0-pre (0x00007f19b0988000)
libdeal_II.g.so.8.5.0-pre => /home/davydden/spack/opt/spack/linux-Ubuntu16-x86_64/gcc-5.4.0/dealii-develop-b4u3pdzih7qp3kymzkmtk2msfz6hjyfz/lib/libdeal_II.g.so.8.5.0-pre (0x00007f199a1a3000)
Timo, that was it :tada:
$ cmake ../ -DCMAKE_BUILD_TYPE=Release
-- Configuring done
-- Generating done
-- Build files have been written to: /home/davydden/libs/init_finalize/_build
davydden@davyddenubuntu:~/libs/init_finalize/_build$ make
Scanning dependencies of target qc_dealii
[ 50%] Building CXX object CMakeFiles/qc_dealii.dir/source/atom/atom.cc.o
[100%] Linking CXX shared library libqc_dealii.so
[100%] Built target qc_dealii
davydden@davyddenubuntu:~/libs/init_finalize/_build$ make test
Running tests...
Test project /home/davydden/libs/init_finalize/_build
Start 1: core/atoms_to_cells_01.release
1/1 Test #1: core/atoms_to_cells_01.release ... Passed 0.91 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 0.91 sec
davydden@davyddenubuntu:~/libs/init_finalize/_build$ cmake ../ -DCMAKE_BUILD_TYPE=Debug
-- Configuring done
-- Generating done
-- Build files have been written to: /home/davydden/libs/init_finalize/_build
davydden@davyddenubuntu:~/libs/init_finalize/_build$ make && make test
Scanning dependencies of target qc_dealii
[ 50%] Building CXX object CMakeFiles/qc_dealii.dir/source/atom/atom.cc.o
[100%] Linking CXX shared library libqc_dealii.so
[100%] Built target qc_dealii
Running tests...
Test project /home/davydden/libs/init_finalize/_build
Start 1: core/atoms_to_cells_01.debug
1/1 Test #1: core/atoms_to_cells_01.debug ..... Passed 0.93 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 0.93 sec
So how do we fix it in Cmake scripts so that Debug and Release got linked properly?
It just took almost one week to figure out the problem, well.., could be worse :smile:
And now it suddenly also all makes sense -- both libs create the same objects, in the same location; they probably run the constructor twice, creating a memory leak at times, but definitely mess up when they both destroy the same objects twice...
@tamiko : you know best CMake internals of deal.II. Can one make user projects and unit tests link correctly against Debug and Release versions of deal.II when CMAKE_BUILD_TYPE is not specified or it is set to DebugRelease?
Alternative is to add a note in https://www.dealii.org/developer/users/cmakelists.html#cmakesimple.build_type saying that CMAKE_BUILD_TYPE has to be set to either Debug or Release, end of story. Throw an error if it is something else.
p.s. DEAL_II_INITIALIZE_CACHED_VARIABLES suggests that DebugRelease is supported:
The macro will set an uninitialized CMAKE_BUILD_TYPE variable to Debug (or Release if the debug library is not available). If CMAKE_BUILD_TYPE is specified it will automatically be reset if the given value is unsupported by the deal.II installation (i.e., if it is not equal to Debug, Release, or DebugRelease).
Timo, that was it
Great! I guess the lesson here is that warnings (even coming from cmake) shouldn't be ignored. ;-)
Great! I guess the lesson here is that warnings (even coming from cmake) shouldn't be ignored. ;-)
i did not have them with Spack, when i originally came across the problem ;-)
@davydden This is much more subtle than I thought. I will have to revise quite a bit of logic. PR with detailed explanation incoming.
@tamiko thanks for looking into this!
Most helpful comment
can you do an
ldd tests/core/atoms_to_cells_01.release/atoms_to_cells_01.release? For me it links to libdeal_II.so.8.5.0-pre and libdeal_II.g.so.8.5.0-pre... :zap: