Kratos: [MPI][AMGCL] MPI version of AMGCL solver not performing as expected

Created on 17 Jan 2019  Â·  27Comments  Â·  Source: KratosMultiphysics/Kratos

The following observations were made for the AMGCL linear solver in cases using different solver types. In particular, I tested the "two_fluids" and "monolithic" solver. For comparison purposes, otherwise absolutely identical cases were computed with the serial and the mpi-parallel version of the solver.

AMGCL (for serial applications with 1 and more threads)

  • fast solution
  • quick convergence

amgcl (for MPI based applications)

  • frequent output of "AMGCL: Unknown parameter class" as already mentioned in #2989
  • unsatisfying convergence behaviour, sometimes even diverging into "+/- nan" for a case that quickly converges in serial version
  • extremely slow execution

If you are interested in reproducing the behaviour, please, find a small and well-known case (original case provided by @rubenzorrilla) in the archive folder I am attaching.

@philbucher Since I assume that several people may be affected and aslo for the sake of documentation, I finally created an issue for this observation.

AMGCLissue.tar.gz

Help Wanted Parallel-MPI

All 27 comments

Minor comment, just regarding the last point. Can you confirm that you are disabling the OpenMP parallelism in mpi runs? (for example with export OMP_NUM_THREADS=1 in your shell session before calling mpirun). This makes a great difference in performance.

Yes, this setting was chosen and the run was repeated to double-check. The wording "slow execution" may be a bad formulation: Due to only slowly approaching the termination criterion, the solution is often repeated and this takes a long time. Looking at a single solution step, the MPI version is faster.

Ok understood. Thanks for the clarification.
I believe that @RiccardoRossi is the expert regarding the use of the MPI AMGCL solver in Kratos.

I'll need @RiccardoRossi to help me reproduce this. May be chat later today?

yes i can defnitely help you.

@jcotela, this is just to acknowledge that @ddemidov is the author of amgcl...and definitely more expert than me 😉

@RiccardoRossi, I think @jcotela was correct, because you know more than me re actually using mpi amgcl solver in Kratos (I still have very little idea of how to actually run Kratos).

I managed to compile enough of Kratos to run serial part of @swenczowski example. When I export the matrix, I can solve it with standalone amgcl solver, both serial and mpi one (and I don't see too much of a discrepancy between the two).

I can not compile METIS_APPLICATION on my machine, it seems that metis headers included into Kratos source are in conflict with metis version installed in my system (I have metis v5.1 and parmetis v4.0 installed on Arch linux). The error I get is

FAILED: applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o 
/usr/bin/g++  -DADD_ -DAMGCL_GPGPU -DKRATOS_PYTHON -DKRATOS_USING_MPI -DKratosMetisApplication_EXPORTS -DVEXCL_BACKEND_CUDA -Iapplications/metis_application -I../applications/metis_application -I../external_libraries -I../kratos -I/usr/include/python3.7m -isystem /opt/cuda/include -isystem /home/demidov/work/vexcl -msse3 -std=c++11 -Wno-class-memaccess -funroll-loops -Wall -std=c++11 -Wsuggest-override -fopenmp -O3 -DNDEBUG -fPIC -fvisibility=hidden   -fPIC -std=c++14 -flto -fno-fat-lto-objects -Wall -Wno-missing-braces -Wno-deprecated-declarations -Wno-ignored-attributes -Wno-unused-local-typedefs -Wno-catch-value -std=gnu++11 -MD -MT applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o -MF applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o.d -o applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o -c ../applications/metis_application/custom_python/add_processes_to_python.cpp
In file included from ../applications/metis_application/custom_processes/metis_divide_input_to_partitions_process.h:63,
                 from ../applications/metis_application/custom_processes/metis_divide_heterogeneous_input_process.h:11,
                 from ../applications/metis_application/custom_python/add_processes_to_python.cpp:23:
../applications/metis_application/custom_processes/metis_graph_partitioning_process.h:31:18: error: conflicting declaration of C function ‘int METIS_PartMeshDual(int*, int*, int*, int*, int*, int*, int*, int*, int*)’
       extern int METIS_PartMeshDual(int*, int*, int*, int*, int*, int*, int*, int*, int*);
                  ^~~~~~~~~~~~~~~~~~
In file included from /usr/include/parmetis.h:18,
                 from ../applications/metis_application/custom_processes/metis_divide_heterogeneous_input_process.h:7,
                 from ../applications/metis_application/custom_python/add_processes_to_python.cpp:23:
/usr/include/metis.h:219:16: note: previous declaration ‘int METIS_PartMeshDual(idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, real_t*, idx_t*, idx_t*, idx_t*, idx_t*)’
 METIS_API(int) METIS_PartMeshDual(idx_t *ne, idx_t *nn, idx_t *eptr, idx_t *eind,
                ^~~~~~~~~~~~~~~~~~

@RiccardoRossi you are completely right on that. I did not intend to play down @ddemidov's role in this, I was just trying to solve what it looked (to me) as an integration issue in kratos internally before calling him in... (thanks @ddemidov by the way)

@ddemidov kratos "assumes" an older version of metis by default. Can you check that you have set the option -DUSE_METIS_5=ON in your cmake?

Thanks, that solved the metis issue! Now, when I run python MainKratos.py from the mpi example, I get

Traceback (most recent call last):
  File "MainKratos.py", line 32, in <module>
    simulation.Run()
  File "/home/demidov/work/Kratos/kratos/python_scripts/analysis_stage.py", line 47, in Run
    self.Initialize()
  File "/home/demidov/work/Kratos/kratos/python_scripts/analysis_stage.py", line 68, in Initialize
    self._GetSolver().ImportModelPart()
  File "/home/demidov/work/Kratos/applications/FluidDynamicsApplication/python_scripts/trilinos_navier_stokes_solver_vmsmonolithic.py", line 110, in ImportModelPart
    self.trilinos_model_part_importer = trilinos_import_model_part_utility.TrilinosImportModelPartUtility(self.main_model_part, self.settings)
  File "/home/demidov/work/Kratos/applications/trilinos_application/python_scripts/trilinos_import_model_part_utility.py", line 18, in __init__
    raise NameError("MPI number of processors is 1.")
NameError: MPI number of processors is 1.

Should I run it as mpirun -np 4 python MainKratos.py instead?

Yes, please. Also "-n 2" showed the described behavior for me.

@ddemidov see #3880 if you want to test with a single process. (edited-- wrong '@')

I can see that the solver returns nan as error. But, if I solve the same system (saved to MatrixMarket file by setting verbosity=4) with the same parameters, with the standalone amgcl solver, I can not reproduce this.

Serial solve:

~/work/amgcl/build/examples/solver -A A.mm -f b.mm.rhs \
-p precond.coarse_enough=500 \
   precond.relax.type=damped_jacobi \
   precond.coarsening.type=aggregation \
   precond.coarsening.aggr.eps_strong=0 \
   solver.type=lgmres -b3

Solver
======
Type:             LGMRES(30,3)
Unknowns:         3072
Memory footprint: 2.69 M

Preconditioner
==============
Number of levels:    2
Operator complexity: 1.12
Grid complexity:     1.13
Memory footprint:    3.38 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0         3072          20772      2.53 M (89.00%)
    1          403           2567    876.55 K (11.00%)

Iterations: 24
Error:      5.01826e-09

[Profile:       0.317 s] (100.00%)
[ self:         0.001 s] (  0.16%)
[  reading:     0.244 s] ( 77.14%)
[  setup:       0.012 s] (  3.85%)
[  solve:       0.060 s] ( 18.85%)

MPI solve:

mpirun -np 2 ~/work/amgcl/build/examples/mpi/mpi_amg -A A.mm -f b.mm.rhs \
-p precond.coarse_enough=500 \
   precond.relax.type=damped_jacobi \
   precond.coarsening.type=aggregation \
   precond.coarsening.aggr.eps_strong=0 \
   solver.type=lgmres -b3

World size: 2
Partitioning[PT-Scotch] 2 -> 2
Type:             LGMRES(30,3)
Unknowns:         1550
Memory footprint: 1.37 M

Number of levels:    2
Operator complexity: 1.12
Grid complexity:     1.13

level     unknowns       nonzeros
---------------------------------
    0         3072          20772 (89.16%) [2]
    1          397           2525 (10.84%) [2]

Iterations: 25
Error:      8.58271e-09

[Profile:         0.310 s] (100.00%)
[ self:           0.031 s] ( 10.14%)
[  partition:     0.007 s] (  2.26%)
[  read:          0.229 s] ( 73.91%)
[  setup:         0.007 s] (  2.22%)
[  solve:         0.036 s] ( 11.47%)

When I set verbosity=4 in mpi case, I still get single matrix file. Is there a way to get partitioned matrix parts? How is the system matrix partitioned here?

EDIT: In fact, I should be able to temporary add the matrix saving code myself.

Partitioning does not seem to be a problem. When I save matrix parts to separate files, I can still solve those with (modified) standalone solver:

mpirun -np 2 ~/work/amgcl/build/examples/mpi/mpi_amg \
-p precond.coarse_enough=500 \
   precond.relax.type=damped_jacobi \
   precond.coarsening.type=aggregation \
   precond.coarsening.aggr.eps_strong=0 \
   solver.type=lgmres -b3 

World size: 2
Type:             LGMRES(30,3)
Unknowns:         1541
Memory footprint: 1.36 M

Number of levels:    2
Operator complexity: 1.12
Grid complexity:     1.12

level     unknowns       nonzeros
---------------------------------
    0         3072          20772 (89.40%) [2]
    1          384           2464 (10.60%) [2]

Iterations: 26
Error:      5.3051e-09

[Profile:     0.204 s] (100.00%)
[ self:       0.033 s] ( 16.42%)
[  read:      0.123 s] ( 60.29%)
[  setup:     0.009 s] (  4.57%)
[  solve:     0.038 s] ( 18.73%)

just a workaround while we look for the problem:

please set

   "use_block_matrices_if_possible" : false

Some progress: I was able to reproduce the NaN error with standalone amgcl solver by switching from MatrixMarket to binary output format. So, the slight difference in precision between text and binary formats is important here, which makes me suspicious about the problem correctness (could it be singular?).

Still don't know what exactly is the problem, but at least we can be sure that Kratos code wrapping amgcl is working correctly.

Another possible workaround is to switch from damped_jacobi to spai0 as relaxation:

mpirun -n 2 ~/work/amgcl/build/examples/mpi/mpi_amg -P P.json -B -b3 \
-p precond.relax.type=spai0

World size: 2
Type:             LGMRES(500,3)
Unknowns:         1541
Memory footprint: 21.80 M

Number of levels:    3
Operator complexity: 1.13
Grid complexity:     1.14

level     unknowns       nonzeros
---------------------------------
    0         3072          20772 (88.34%) [2]
    1          387           2471 (10.51%) [2]
    2           49            272 ( 1.16%) [2]

Iterations: 129
Error:      7.84784e-09

[Profile:     0.339 s] (100.00%)
[ self:       0.040 s] ( 11.84%)
[  read:      0.001 s] (  0.44%)
[  setup:     0.018 s] (  5.29%)
[  solve:     0.280 s] ( 82.43%)

Ok, I think I know the reason for NaNs: 168-th diagonal block of the system matrix belonging to the second MPI process is singular (its inverse is 3x3 matrix of nans). The block in question is:

-0.00999503 0 0
 0 0 0
 0 0 0

In fact, the whole 168-th row looks bad (column number followed by 3x3 value):

159:  -0.00979576 -0.00400725 0.00310985
 0 0 0
 0 0 0

160:  -0.00648279 0.000788925 0.00395082
 0 0 0
 0 0 0

161:  -0.0217919 0 0
 0 0 0
 0 0 0

168:  -0.00999503 0 0
 0 0 0
 0 0 0

The rows below and above it look normal.

EDIT: this could be a problem with block matrix adapter, which expects matrix rows to be sorted by column numbers. It looks like this is not the case with the matrices I get here. Will look at this later.

Should be fixed by #3896.

@swenczowski, can you please confirm that #3896 helps?

Thank you very much for the update. The branch was checked out and compiled in Release mode. The implementation in #3896 was tested on my local machine in different cases meaning

  • different solvers
  • different (but still relatively small) mesh sizes
  • different geometries.

The repeating observation was, that the solver operates fast and (judging from the results) correctly when launched as "mpirun -n 2". However, when I use 3 or 4 processes on my machine, the previously described unexpected behaviour is still present. I am very sorry.

@ddemidov Can you, in return, reproduce this observation? For me, it could also be seen in the case attached to the issue. If you prefer I different case with certain specifications, please, just ask and I will try and generate one.

( Edit: Node of the above mentioned workarounds was applied at the same time )

What exactly is the unexpected behavior you observe with mpi=3 and mpi=4? I don't see any problems with your mpi example with #3896. As @jcotela said, remember to disable openmp parallelism, or it may seem that the solution is too slow.

EDIT: Also, you should not need the workarounds (disabling block solves or switching to spai0).

My apologies for the false alarm. I was convinced that I had export OMP_NUM_THREADS = 1 specified in my ~/.bashrc. However, I made a typo there and several threads were started. I am sorry and thank you very much for the reminder.

Yes, now the solver seems to perform fast and correctly on all locally tested cases. Thank you very much for your effort.

Edit: Now also successfully tested on CoolMUC2 and a larger case. Great difference in performance compared to the MultiLevel solver!

Hi,

I have again the same issue as @swenczowski had once. I am simulating a simple 2D fluid simulation using OMP 8 threads, 1 thread 1 process MPI, 1 thread 8 processes MPI with AMGCL solver. I observed the following:

  1. OMP 8 threads and 1 thread 1 process MPI produceses same "A.mm" file.
  2. OMP 8 threads and 1 thread 8 process MPI producesses different "A.mm" files.

The problem with 2nd observation is, it does not converge.

I used "use_block_matrices_if_possible" : false as well. It does not change the observations I made.
I checked example given by @swenczowski as well, it also produces different "A.mm" files in the case of "1" and "2" methods.

I have attached the case which I used to produce above mentioned observations.
Would you be able to look at it @ddemidov, @RiccardoRossi ?

Thanks a lot

caarc_90_2d_slip_omp_mpi_test.zip

can it be that the matrix is simply reordered?

I checked the entries like (1,1) in both files... they are totally different. I'm not sure how equation ids and dof ids are distributed in omp and mpi. I thought they are the same, dont they?

explainging a little more what i mean:

if you import the two A.mm in say python, and solve it with a direct solver, try to see if the norm(dx) is the same in the two cases.

there is no guarantee the ordering of the matix is the same (generally it will not be)

I checked it with direct solvers in python. They all give same norm(dx) value. But it still not solve the issue of non convergence. That is,

If i run the same case with MPI with 8 processes and 1 thread, it does not converge (but the OMP run converges without a problem). Am i doing something wrong with the settings?

Followings are the settings i am using in mpi case
'
"solver_type": "amgcl",
"gmres_krylov_space_dimension":300,
"coarse_enough": 5000,
"use_block_matrices_if_possible" : false
'

apart from the defaults in amgcl.
Is there a way to get around this?

Was this page helpful?
0 / 5 - 0 ratings