Problem:usage of STRUMPACK #80

Open
Never-settle opened this issue Nov 10, 2022 · 1 comment

Comments

@Never-settle

We have recently been using STRUMPACK to solve a sparse linear system whose matrix is about 300,000 x 300,000. The system is solved 200 times, once per iteration; from one iteration to the next, the matrix and right-hand side change only in their values, while the sparsity structure stays the same. We ran into the following problems.

Our code is adapted from the example “testMMdoubleMPIDist.cpp” provided with the STRUMPACK library and uses hybrid MPI + OpenMP parallelism.

In the first version, we create a new solver in each iteration: “StrumpackSparseSolverMPIDist<double,int> *spss = new StrumpackSparseSolverMPIDist<double,int>(MPI_COMM_WORLD);”, then call “(*spss).set_distributed_csr_matrix(local_n, local_row_ptr.data(), local_col_ind.data(), local_values.data(), dist /*, false*/);”. After reordering (“(*spss).reorder()”) and numerical factorization (“(*spss).factor()”), we solve the linear system (“(*spss).solve(local_b.data(), laocal_x.data())”). The first iteration is solved correctly, but in the second iteration the program aborts during reordering with the following error message: “Intel MKL BLACS fatal error: cannot allocate memory, aborted.”

【Full error output】

Initializing STRUMPACK
using 1 OpenMP thread(s)
using 24 MPI processes
matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
initial matrix:
- number of unknowns = 349,272
- number of nonzeros = 5,048,562
nested dissection reordering:
- Metis reordering
- used METIS_NodeNDP (iso METIS_NodeND)
- supernodal tree from METIS_NodeNDP is used
- strategy parameter = 8
- number of separators = 43,659
- number of levels = 12
- nd time = 4.54
- symmetrization time = 0.0249

Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
【End】
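
For reference, the `dist` argument passed to set_distributed_csr_matrix describes which rows each MPI rank owns. As far as I understand the STRUMPACK documentation, it is an array of P+1 row offsets, with rank p owning rows dist[p] up to dist[p+1]-1; please treat that as an assumption and check it against your version. A minimal sketch of building such an array for an even block-row split follows. The helper name `block_row_dist` is hypothetical and not part of STRUMPACK.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a STRUMPACK function): split n rows over nprocs
// ranks as evenly as possible. Returns nprocs+1 offsets; rank p is assumed
// to own rows dist[p] .. dist[p+1]-1.
std::vector<int> block_row_dist(int n, int nprocs) {
  assert(n >= 0 && nprocs > 0);
  std::vector<int> dist(nprocs + 1, 0);
  const int base = n / nprocs;   // rows that every rank gets
  const int extra = n % nprocs;  // the first `extra` ranks get one more row
  for (int p = 0; p < nprocs; ++p)
    dist[p + 1] = dist[p] + base + (p < extra ? 1 : 0);
  return dist;
}
```

With the 349,272 unknowns and 24 processes from the log above, this split gives every rank 14,553 rows, and local_n on rank p would be dist[p+1] - dist[p].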

In the second version, we create the solver only once and, in each iteration, update only the matrix values in it (since the structure of the matrix does not change). We call “(*spss).update_matrix_values(local_n, local_row_ptr.data(), local_col_ind.data(), local_values.data(), dist /*, false*/);” and then “(*spss).solve(local_b.data(), laocal_x.data())”. This avoids reordering in every iteration, but we hit a new error in the second iteration: “{ -1, -1}: On entry to \n DESCINIT parameter number 6 had an illegal value \n ERROR: Could not create DistributedMatrix descriptor!”

【Full error output】

multifrontal factorization:
- estimated memory usage (exact solver) = 2.64e+03 MB
- minimum pivot, sqrt(eps)*|A|_1 = 5.36e-07
- replacing of small pivots is not enabled

{ -1, -1}: On entry to
DESCINIT parameter number 6 had an illegal value
ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to
DESCINIT parameter number 6 had an illegal value
ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to
DESCINIT parameter number 6 had an illegal value
ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to
DESCINIT parameter number 6 had an illegal value
ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to
DESCINIT parameter number 6 had an illegal value
ERROR: Could not create DistributedMatrix descriptor!

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 227459 RUNNING AT ca0602
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 227460 RUNNING AT ca0602
= KILLED BY SIGNAL: 6 (Aborted)

【End】
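
Updating only the values presupposes that the local sparsity pattern is exactly the one given when the matrix was first set. A small check like the following could confirm that before each update. It is only a sketch: `first_pattern_mismatch` is a hypothetical helper, not a STRUMPACK call, and it assumes well-formed CSR arrays with `int` indices. A different ordering of the column indices within a row is also reported as a mismatch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of STRUMPACK).
// Returns -1 if the two CSR patterns are identical, -2 if the number of rows
// differs, otherwise the index of the first row whose pattern differs.
// Assumes well-formed CSR input: row_ptr has n+1 entries and
// row_ptr[n] == col_ind.size().
long first_pattern_mismatch(const std::vector<int>& row_ptr_a,
                            const std::vector<int>& col_ind_a,
                            const std::vector<int>& row_ptr_b,
                            const std::vector<int>& col_ind_b) {
  if (row_ptr_a.size() != row_ptr_b.size()) return -2;
  if (row_ptr_a.empty()) return -1;
  const std::size_t n = row_ptr_a.size() - 1;
  for (std::size_t i = 0; i < n; ++i) {
    // The row must start and end at the same offsets in both patterns.
    if (row_ptr_a[i] != row_ptr_b[i] || row_ptr_a[i + 1] != row_ptr_b[i + 1])
      return static_cast<long>(i);
    // Same extent, so compare the column indices entry by entry.
    for (int k = row_ptr_a[i]; k < row_ptr_a[i + 1]; ++k)
      if (col_ind_a[k] != col_ind_b[k]) return static_cast<long>(i);
  }
  return -1;
}
```

Each rank would call this on its own local_row_ptr and local_col_ind, comparing the arrays of the current iteration with copies saved from the first one.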

We have spent a long time investigating and have not found the cause. We also made a simple modification to the original example (changing the matrix values and solving in a loop many times), and it ran correctly without reporting any error. We do not know why the error occurs once the same approach is ported to our code.

@pghysels
Owner

Hi, I'm sorry for the delay.
Does this error also happen when you create a completely new solver each iteration?
Can you try with a different ordering algorithm? For instance:

spss->options().set_reordering_method(ReorderingStrategy::AND);

Do you properly delete the spss object?
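
To illustrate the last question: if the solver is created with `new` inside the loop, each iteration needs a matching `delete`, otherwise whatever the earlier solvers hold stays allocated. Whether that is what runs BLACS out of memory here is only a guess. One way to make the cleanup automatic is to scope the solver with std::unique_ptr. The sketch below uses a hypothetical stand-in class, `MockSolver`, in place of StrumpackSparseSolverMPIDist so that it runs without MPI; it only counts live instances.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>

// Hypothetical stand-in for a solver that owns scarce resources.
// It only tracks how many instances are currently alive.
struct MockSolver {
  static int live;
  MockSolver() { ++live; }
  ~MockSolver() { --live; }
  MockSolver(const MockSolver&) = delete;
  MockSolver& operator=(const MockSolver&) = delete;
};
int MockSolver::live = 0;

// Creates one solver per iteration and returns the largest number of
// solvers that were alive at the same time.
int peak_live_solvers(int iterations) {
  int peak = 0;
  for (int it = 0; it < iterations; ++it) {
    auto spss = std::make_unique<MockSolver>();  // requires C++14
    // ... set matrix, reorder, factor, solve would go here ...
    peak = std::max(peak, MockSolver::live);
  }  // spss goes out of scope here, so the solver is destroyed every iteration
  return peak;
}
```

With this scoping, peak_live_solvers(200) never sees more than one solver alive; a raw `new` without `delete` in the same loop would leave all 200 allocated.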
