Dealii: FullMatrix<number>::gauss_jordan() returns incorrect inverse in multithreading.

Created on 6 Nov 2019  路  28Comments  路  Source: dealii/dealii

I am trying to run step-51 with the development version of dealii (installed by spack). It turns out that the solver fails to converge at a certain cycle and the resulting residual is above 1e18. One problem is that FullMatrix<number>::gauss_jordan() used the local assembling part may return incorrect inverse in multithreading and the correct error table can be obtained when running in one thread.

In step-51 with multithreading, I try to check whether the product of ll_matrix and its inverse gives the identity matrix. Here is one resulting matrix I got using Q1 element and global refinement at cycle 7:

 1.007e+00  7.910e-03  8.820e-03  6.692e-03  1.239e-16 -9.637e-16 -1.584e-16  1.195e-15 -8.769e-02 -8.769e-02  1.532e-01  1.532e-01 
-6.692e-03  9.921e-01 -8.820e-03 -6.692e-03  5.329e-16  2.790e-16 -5.540e-16 -3.443e-16  8.769e-02  8.769e-02 -1.532e-01 -1.532e-01 
-1.277e-03 -4.700e-03  9.934e-01 -5.338e-03  1.264e-16  2.130e-16 -3.206e-16 -3.083e-16  2.465e-01  2.465e-01 -9.193e-02 -9.193e-02 
 1.277e-03  4.700e-03  6.615e-03  1.005e+00 -5.921e-16 -2.007e-16  6.752e-16  2.404e-16 -2.465e-01 -2.465e-01  9.193e-02  9.193e-02 
 4.957e-04 -1.209e-04  6.226e-04  1.269e-04  1.000e+00  2.840e-16 -2.136e-16 -5.235e-16  4.440e-02  4.440e-02  3.569e-02  3.569e-02 
-1.166e-03  2.882e-04 -1.461e-03 -2.947e-04 -7.685e-16  1.000e+00  6.583e-16  3.824e-17 -1.047e-01 -1.047e-01 -8.394e-02 -8.394e-02 
-4.957e-04  1.209e-04 -6.226e-04 -1.269e-04  1.777e-16 -6.540e-16  1.000e+00  6.437e-16 -4.440e-02 -4.440e-02 -3.569e-02 -3.569e-02 
 1.166e-03 -2.882e-04  1.461e-03  2.947e-04  4.688e-17  2.307e-17  3.559e-17  1.000e+00  1.047e-01  1.047e-01  8.394e-02  8.394e-02 
 6.440e-04 -6.440e-04  1.288e-03 -1.288e-03  6.440e-04  1.288e-03 -6.440e-04 -1.288e-03  1.001e+00 -2.453e-02 -2.677e-02 -9.430e-03 
-4.549e-05  4.549e-05 -9.097e-05  9.097e-05  5.655e-04 -7.020e-04 -5.655e-04  7.020e-04 -6.155e-03  1.026e+00 -7.372e-03 -2.768e-03 
-2.188e-05  2.188e-05 -4.376e-05  4.376e-05  6.122e-04 -6.779e-04 -6.122e-04  6.779e-04 -6.318e-03  1.930e-02  9.995e-01 -2.465e-03 
-1.184e-03  1.184e-03 -5.176e-04  5.176e-04 -1.187e-03 -5.141e-04  1.187e-03  5.141e-04  1.381e-02  8.880e-03  7.487e-03  1.016e+00 
Bug High Priority Linear Algebra Parallel shared

All 28 comments

I can reproduce the issue. If multi threading is disabled, everything works as expected.

Are you using LAPACK? If so, can you try to disable the LAPACK path in FullMatrix::gauss_jordan() and see whether that avoids the problem? It may be that the LAPACK installation you have is just not thread-safe -- in which case we need to put a mutex around that call.

I don't think that this is related to LAPACK: we only switch to LAPACK if the matrix has at least 15 columns, but the one here only has twelve.

Then we must be using some other variable at the same time on multiple threads...

How do we want to attack this issue? @lwy5515 -- can you say how you modified the program to show that the matrix inversion really is the problem?

@bangerth @luca-heltai
Here is the convergence table (for Q3 elements with the global refinement) I got from the original step-51 code.

cells dofs      val L2          grad L2        val L2-post    
   16   160 3.613e-01     - 1.891e+00      - 3.020e-01      - 
   36   336 6.411e-02  4.26 5.081e-01   3.24 3.238e-02   5.51 
   64   576 3.480e-02  2.12 2.533e-01   2.42 5.276e-03   6.31 
  144  1248 8.297e-03  3.54 5.924e-02   3.58 6.330e-04   5.23 
  256  2176 2.254e-03  4.53 1.636e-02   4.47 1.403e-04   5.24 
  576  4800 4.558e-04  3.94 3.277e-03   3.96 1.844e-05   5.01 
 1024  8448 1.471e-04  3.93 1.052e-03   3.95 4.378e-06   5.00 
 2304 18816 2.956e-05  3.96 2.104e-04   3.97 5.750e-07   5.01 
 4096 33280 8.080e-05 -3.49 1.237e-02 -14.16 6.292e-05 -16.32 
 9216 74496 1.695e-05  3.85 3.162e-05  14.72 1.685e-05   3.25 

It seems that something behaves strange when n DoFs is large. If I guard the function call of gauss_jordan() (at line 926) with Threads::Mutex like

cell_inversion_lock.lock();
scratch.ll_matrix.gauss_jordan();
cell_inversion_lock.unlock();

I get the expected convergence table:

cells dofs      val L2        grad L2      val L2-post   
   16   160 3.613e-01    - 1.891e+00    - 3.020e-01    - 
   36   336 6.411e-02 4.26 5.081e-01 3.24 3.238e-02 5.51 
   64   576 3.480e-02 2.12 2.533e-01 2.42 5.277e-03 6.31 
  144  1248 8.297e-03 3.54 5.924e-02 3.58 6.330e-04 5.23 
  256  2176 2.254e-03 4.53 1.636e-02 4.47 1.403e-04 5.24 
  576  4800 4.558e-04 3.94 3.277e-03 3.96 1.844e-05 5.01 
 1024  8448 1.471e-04 3.93 1.052e-03 3.95 4.378e-06 5.00 
 2304 18816 2.956e-05 3.96 2.104e-04 3.97 5.750e-07 5.01 
 4096 33280 9.428e-06 3.97 6.697e-05 3.98 1.361e-07 5.01 
 9216 74496 1.876e-06 3.98 1.330e-05 3.99 1.789e-08 5.01 

I think we see this in testsuite failures like this one
https://cdash.43-1.org/testDetails.php?test=36993023&build=5013
as well. So this is something we should investigate. Apparently it both affects the case of small n where we use our implementation, and the larger one that goes through LAPACK.

I've tried to reproduce this, but it seems to run just fine on my laptop. How many cores do you have? Is deal.II configured with/without LAPACK?

(For reference: My laptop has 12 cores, and yes to LAPACK.)

@kronbichler One interesting aspect of the failing build you are referring to is the fact that libmkl (as lapack dependency) and libopenblas (pulled in by petsc and trilinos) are getting mixed. Maybe they clash with multithreading.
It is actually quite likely that with that the blas/lapack symbol that actually gets called is entirely up to the dynamic loader.

That's a good point; I thought I had updated the deal.II configuration to OpenBLAS but apparently forgot. Let me fire up the tests again.

But this happens also when lapack is not called....

I know, but it is nonetheless good to exclude flaws in my setup just to maximize chances for reproducing - it takes 5 more minutes, then my results should be online.

As suggested by @luca-heltai, we still fail on my machine (now without MKL and openblas uniformly):
https://cdash.43-1.org/viewTest.php?onlyfailed&buildid=5550
So this means I have a machine where it happens. It does not happen all the time (which is why cdash gives those step-51 tests the "unstable" label), but it does happen. I will try to attach with gdb later today.

I looked into the test bits/step-51.debug a bit more. I could reproduce the original problem of an incorrect matrix inversion after FullMatrix::gauss_jordan(), but I sometimes get wrong convergence tables also when the gauss_jordan call creates correct results. As the problem only happens sometimes (but more often when I increase the number of threads from 3 to say 6), this very much looks like a race condition in the WorkStream execution. It could be TBB, it could be our implementation of WorkStream, or it could be the step-51 code. I did look over the latter two but did not see anything suspicious.
I tried to attach with gdb to a failing case when I put an assert to check the matrix inversion. I looked into the scratch_data array pointers over here
https://github.com/dealii/dealii/blob/b6255edd8d3f0dd6e78088461dfcc55cddd8ce2a/include/deal.II/base/work_stream.h#L544-L546
which is where the code finally calls into the user function but they seem to be ok for all participating threads (ok = different addresses). The problem at this stage was that I could not access the thread-local copy of the scratch data list
https://github.com/dealii/dealii/blob/b6255edd8d3f0dd6e78088461dfcc55cddd8ce2a/include/deal.II/base/work_stream.h#L507-L508
because GDB does not allow me to access that. So I removed the scope braces to keep the scratch_data_list alive during the call, but without really seeing much more (I was not able to look into the list in a meaningful way). Anyone with ideas where to look into?

I also tried to debug with valgrind's drd tool (a thread checker) but TBB seems to do so many operations that are race conditions by drd's standards that I got lost. I obtained some messages while running the work stream, where drd complained that some items where destructed of the scratch_data_list by one thread while other threads were waiting for a lock in Subscriptor (via FEValues or something like that). But that should not disturb computations, and looks like many other false positives in the analysis.

So overall I'm running out of ideas here. Something is fundamentally broken but it's hard to say what.

So overall I'm running out of ideas here. Something is fundamentally broken but it's hard to say what.

There is other multithreaded code running in this example (FE/Mapping, output, etc.). I am worried that we are mixing different paradigms (std::thread, OMP, tbb:task, etc.) and they interfere with the workstream. Does gdb show anything interesting on your end?
I found a large number of Tasks created during setup and while running a work stream (FEFaceValues::initialize inside ScratchData inside workstream inside assemble_system for example). Maybe you could try disabling the task framework as a test? It doesn't strike me as a great idea to do this anyways.
Maybe threadlocal storage gets confused when we are spawning additional tasks? Does tbb guarantee that a task is not being migrated to a different thread?

I can check that in a minute. For the record, I have removed FEFaceValues and a number of other things for my minimized test case:

#include "../tests.h"

#include <deal.II/base/quadrature_lib.h>
#include <deal.II/base/work_stream.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/fe/fe_dgq.h>
#include <deal.II/fe/fe_system.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/grid/tria.h>

namespace Step51
{
  template <int dim>
  class HDG
  {
  public:
    HDG(const unsigned int degree);
    void
    run();

  private:
    void
    setup_system();
    void
    assemble_system();

    struct PerTaskData;
    struct ScratchData;

    void
    assemble_system_one_cell(
      const typename DoFHandler<dim>::active_cell_iterator &cell,
      ScratchData &                                         scratch,
      PerTaskData &                                         task_data);

    void
    copy_local_to_global(const PerTaskData &data);

    Triangulation<dim> triangulation;

    FESystem<dim>   fe_local;
    DoFHandler<dim> dof_handler_local;
  };


  template <int dim>
  HDG<dim>::HDG(const unsigned int degree)
    : fe_local(FE_DGQ<dim>(degree), dim + 1)
    , dof_handler_local(triangulation)
  {}



  template <int dim>
  void
  HDG<dim>::setup_system()
  {
    dof_handler_local.distribute_dofs(fe_local);
  }



  template <int dim>
  struct HDG<dim>::PerTaskData
  {
    PerTaskData() = default;
  };



  template <int dim>
  struct HDG<dim>::ScratchData
  {
    FEValues<dim> fe_values_local;

    FullMatrix<double> ll_matrix;
    FullMatrix<double> ll_matrix_cpy;
    FullMatrix<double> ll_matrix_test;

    std::vector<Tensor<1, dim>> q_phi;
    std::vector<double>         q_phi_div;
    std::vector<double>         u_phi;
    std::vector<Tensor<1, dim>> u_phi_grad;

    ScratchData(const FiniteElement<dim> &fe_local,
                const QGauss<dim> &       quadrature_formula,
                const UpdateFlags         local_flags)
      : fe_values_local(fe_local, quadrature_formula, local_flags)
      , ll_matrix(fe_local.dofs_per_cell, fe_local.dofs_per_cell)
      , ll_matrix_cpy(ll_matrix)
      , ll_matrix_test(ll_matrix)
      , q_phi(fe_local.dofs_per_cell)
      , q_phi_div(fe_local.dofs_per_cell)
      , u_phi(fe_local.dofs_per_cell)
      , u_phi_grad(fe_local.dofs_per_cell)
    {}

    ScratchData(const ScratchData &sd)
      : fe_values_local(sd.fe_values_local.get_fe(),
                        sd.fe_values_local.get_quadrature(),
                        sd.fe_values_local.get_update_flags())
      , ll_matrix(sd.ll_matrix)
      , ll_matrix_cpy(sd.ll_matrix_cpy)
      , ll_matrix_test(sd.ll_matrix_test)
      , q_phi(sd.q_phi)
      , q_phi_div(sd.q_phi_div)
      , u_phi(sd.u_phi)
      , u_phi_grad(sd.u_phi_grad)
    {}
  };



  template <int dim>
  void
  HDG<dim>::assemble_system()
  {
    const QGauss<dim> quadrature_formula(fe_local.degree + 1);
    const UpdateFlags local_flags(update_values | update_gradients |
                                  update_JxW_values | update_quadrature_points);

    PerTaskData task_data;
    ScratchData scratch(fe_local, quadrature_formula, local_flags);

    WorkStream::run(dof_handler_local.begin_active(),
                    dof_handler_local.end(),
                    *this,
                    &HDG<dim>::assemble_system_one_cell,
                    &HDG<dim>::copy_local_to_global,
                    scratch,
                    task_data);
  }



  template <int dim>
  void
  HDG<dim>::assemble_system_one_cell(
    const typename DoFHandler<dim>::active_cell_iterator &loc_cell,
    ScratchData &                                         scratch,
    PerTaskData &                                         task_data)
  {
    const unsigned int n_q_points =
      scratch.fe_values_local.get_quadrature().size();

    const unsigned int loc_dofs_per_cell =
      scratch.fe_values_local.get_fe().dofs_per_cell;

    const FEValuesExtractors::Vector fluxes(0);
    const FEValuesExtractors::Scalar scalar(dim);

    scratch.ll_matrix = 0;
    scratch.fe_values_local.reinit(loc_cell);

    for (unsigned int q = 0; q < n_q_points; ++q)
      {
        const Tensor<1, dim> convection = Point<dim>(1.0, -1.2);
        const double         JxW        = scratch.fe_values_local.JxW(q);
        for (unsigned int k = 0; k < loc_dofs_per_cell; ++k)
          {
            scratch.q_phi[k] = scratch.fe_values_local[fluxes].value(k, q);
            scratch.q_phi_div[k] =
              scratch.fe_values_local[fluxes].divergence(k, q);
            scratch.u_phi[k] = scratch.fe_values_local[scalar].value(k, q);
            scratch.u_phi_grad[k] =
              scratch.fe_values_local[scalar].gradient(k, q);
          }
        for (unsigned int i = 0; i < loc_dofs_per_cell; ++i)
          {
            for (unsigned int j = 0; j < loc_dofs_per_cell; ++j)
              scratch.ll_matrix(i, j) +=
                (scratch.q_phi[i] * scratch.q_phi[j] -
                 scratch.q_phi_div[i] * scratch.u_phi[j] +
                 scratch.u_phi[i] * scratch.q_phi_div[j] -
                 (scratch.u_phi_grad[i] * convection) * scratch.u_phi[j] +
                 scratch.u_phi[i] * scratch.u_phi[j]) *
                JxW;
          }
      }

    for (unsigned int i = 0; i < scratch.ll_matrix.m(); ++i)
      for (unsigned int j = 0; j < scratch.ll_matrix.n(); ++j)
        scratch.ll_matrix_cpy(i, j) = scratch.ll_matrix(i, j);

    scratch.ll_matrix.gauss_jordan();

    scratch.ll_matrix.mmult(scratch.ll_matrix_test, scratch.ll_matrix_cpy);
    for (unsigned int i = 0; i < scratch.ll_matrix.m(); ++i)
      for (unsigned int j = 0; j < scratch.ll_matrix.n(); ++j)
        AssertThrow(i != j && std::abs(scratch.ll_matrix_test(i, j)) < 1e-12 ||
                      i == j &&
                        std::abs(scratch.ll_matrix_test(i, j) - 1.) < 1e-12,
                    ExcMessage("Wrong result"));
  }



  template <int dim>
  void
  HDG<dim>::copy_local_to_global(const PerTaskData &data)
  {}


  template <int dim>
  void
  HDG<dim>::run()
  {
    GridGenerator::hyper_cube(triangulation);
    triangulation.refine_global(4);
    setup_system();
    assemble_system();
  }

} // end of namespace Step51



int
main()
{
  initlog();
  MultithreadInfo::set_thread_limit(20);

  Step51::HDG<2> hdg_problem(1);
  hdg_problem.run();

  return 0;
}

On a machine with 40 cores this fails in about 2 out of 10 cases - but there are also phases when I get 50 runs in a row without error.

Now I tried with this setting:

$ git diff include/deal.II/base/thread_management.h
diff --git a/include/deal.II/base/thread_management.h b/include/deal.II/base/thread_management.h
index 4568730..1c73ed1 100644
--- a/include/deal.II/base/thread_management.h
+++ b/include/deal.II/base/thread_management.h
@@ -33,6 +33,9 @@
 #  include <tuple>
 #  include <utility>
 #  include <vector>
+
+#undef DEAL_II_WITH_THREADS
+
 DEAL_II_DISABLE_EXTRA_DIAGNOSTICS
 #  ifdef DEAL_II_WITH_THREADS
 #    include <thread>
@@ -1868,6 +1871,9 @@ namespace Threads

 //---------------------------------------------------------------------------
 DEAL_II_NAMESPACE_CLOSE
+
+#define DEAL_II_WITH_THREADS
+
 // end of #ifndef dealii_thread_management_h
 #endif
 //---------------------------------------------------------------------------

which should disable tasks if I understand things correctly, but I still get the error 1-2 times per 10 runs.

Now I realize that I probably did a rookie mistake: I had verified that we do call FullMatrix::gauss_jordan() and do not use BLAS/LAPACK, but what I did not verify is that we do FullMatrix::mmult() just a few lines below. At least on my machine, that calls into dgemm_ due to the function here:
https://github.com/dealii/dealii/blob/33245a616fc9f7084fa9468c80589e1cc2417b01/include/deal.II/lac/full_matrix.templates.h#L559-L573
Now I changed this to a manual mat-mat multiplication for the inverse and I have been running 1000 instances without error. As this mat-mat multiplication also happens for some of the original code
https://github.com/dealii/dealii/blob/33245a616fc9f7084fa9468c80589e1cc2417b01/tests/bits/step-51.cc#L769-L773
it is likely BLAS we should blame after all at least for the failure on my machine. I guess the reason why the lock used by @lwy5515 around the gauss_jordan call helped was that as one serializes the inversion, the parts after that (which are quicker) are essentially serial as well. @lwy5515 for the sake of reproducing, can you try to disable BLAS/LAPACK completely and see if the error goes away?

Poking in the dark: You mentioning thread-local variables reminds me of a realization @tjhei pointed me to many years ago on a bike ride in College Station (somewhere on Rock Prairie or Birdpond) when I was very frustrated with a bug I couldn't find in the WorkStream implementation. Imagine you have code like this:

  ThreadLocalStorage<...> x;
  Vector<double> v;

  void f() {
     auto val = something_complicated();
     x = val;
     v = 0;
     assert (x == val);

If you run this on a threaded code, you'd expect the assertion to succeed. But that's not necessarily the code: v = 0 calls Vector::operator=, which internally does a parallel-for loop. So it spawns a bunch of other tasks and then waits for completion of these tasks. While waiting, the TBB may determine to run something else on the current thread, and this may again be a different call to f(). It may set the thread-local variable x's value to something different, the function exits, and then the original invocation of f() resumes after the v=0 call, at which point we run and fail the assertion.

The point I'm trying to make is that thread-local variables must be treated as volatile unless (i) one can be sure that between write and read there are no other calls that could possibly invoke parallel sections, or (ii) they are guarded by a mutex. Of course, a section using a mutex can also not call code that may itself be parallelized or you may end up with a deadlock.

Not sure anything of this is going on here, but maybe it spurs something with someone who has looked at the code recently. In any case, it's an educational lesson worth keeping in mind since it's just so incredibly difficult to debug.

From the changelog from OpenBLAS from Aug 2019:

  * a new option USE_LOCKING was added to ensure thread safety when
    OpenBLAS itself is built without multithreading but will be 
    called from multiple threads.

I will try that version on my machine.

At least for me a thread-safe BLAS/LAPACK seems to do the trick. Failing tests are down to 1:
https://cdash.43-1.org/viewTest.php?onlyfailed&buildid=5705
(and that is p4est-related).

It's a relief every time to know that the bug is in someone else's package :-)

The question then is: How do we detect and rule this version of BLAS out?

@lwy5515 @luca-heltai do we need to keep this open?

I think we can close this, but we should make a comment in step-51 that issues may occur if you use a non-threadsafe BLAS/LAPACK

By the way, wasn't there a comment above that said that this also happens for matrices that are so small that we don't even call BLAS?

We had this comment, yes, but what happened (at least for my experiments) is that while we don't use LAPACK for the matrix inversion, we use it for the subsequent matrix-matrix multiplication (dgemm), so it was the same context any way. And since mat-mat multiplication is faster than the inversion, serializing the inversion also serializes the mat-mat multiplication afterwards and makes the problem go away.

Oh, good analysis. Yes, thanks!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tamiko picture tamiko  路  5Comments

francesco-st picture francesco-st  路  5Comments

bangerth picture bangerth  路  6Comments

jppelteret picture jppelteret  路  4Comments

adamqc picture adamqc  路  3Comments