
std::length_error when total nonzeros is higher than the maximum of integer4, Fortran use of STRUMPACK_MPIdist #94

Open
GuoqiMa opened this issue May 28, 2023 · 23 comments



GuoqiMa commented May 28, 2023

Hi, sorry to bother you again.

I would like to ask about the integer type. I didn't see any integer kind specified in SRC/fortran, so the default integer type is integer(4), right?

Now I need to use long integers, namely integer(8), together with float complex. I changed the integer type in my code, but I get a segmentation fault.

Do I have to both compile and link with the -i8 flag?

Also, could you please explain the difference between float complex and float complex_64?


GuoqiMa commented May 29, 2023

I found that integer(8) is not necessary, so I changed back and use float complex even though my model is very large. I split the matrix across different ranks so that integer(4) is fine locally, but the total number of nonzeros is above the integer(4) limit.

Now it seems I get stuck at this step:

Initializing STRUMPACK
using 2 OpenMP thread(s)
using 20 MPI processes
matching job: maximum matching with row and column scaling
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append

For a small model there is no vector::_M_default_append error. I looked for some information online; it seems to be related to C++.
I wonder if you know what is happening?

GuoqiMa changed the title from "long integer type in Fortran use of STRUMPACK_MPIdist" to "error: 'vector::_M_default_append' when running large model in Fortran use of STRUMPACK_MPIdist" on May 29, 2023

GuoqiMa commented May 29, 2023

This seems impossible to overcome, because STRUMPACK uses 32-bit indexing for BLAS, LAPACK and ScaLAPACK. However, my total number of nonzeros is over 3.5 billion, and a 32-bit integer can hold a maximum of about 2 billion.

pghysels (Owner) commented

I think the code is running out of memory in the column permutation phase. This uses the MC64 code, which is sequential, so the code needs to gather the whole input matrix on the root MPI process, call the MC64 code there, and then broadcast the result. The column permutation is done to maximize the diagonal entries, but if your problem is already diagonally dominant (or has nonzero diagonal entries), this step might not be necessary. You can try to disable it with

STRUMPACK_set_matching(S, STRUMPACK_MATCHING_NONE)
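
For context, a minimal Fortran sketch (following the calls that appear later in this thread; the solver handle S and the surrounding use statements are assumed to be set up as in fexample.f90) of where this call fits, right after initialization and before the matrix is passed in:

call STRUMPACK_init(S, MPI_COMM_WORLD, STRUMPACK_FLOATCOMPLEX, STRUMPACK_MPI_DIST, 0, c_null_ptr, 1)
call STRUMPACK_set_matching(S, STRUMPACK_MATCHING_NONE)   ! skip the sequential MC64 column permutation
! ... then STRUMPACK_set_distributed_csr_matrix and STRUMPACK_solve as usual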

pghysels (Owner) commented

If you want to use the 64bit interface, you need to specify:

STRUMPACK_FLOATCOMPLEX_64

or equivalent for the different floating point precisions.

There is a way to use 64-bit integers for the BLAS/LAPACK routines, but that should not be necessary.
The integer arguments to the BLAS/LAPACK routines only refer to rows or columns (not the total number of elements), and those should be below the 32-bit maximum. In case you really need it, the CMake option is STRUMPACK_USE_BLAS64. Then you need to link with a 64-bit BLAS/LAPACK, which you can specify through CMake via -DBLA_SIZEOF_INTEGER=8 or -DBLA_VENDOR=Intel10_64ilp (see https://cmake.org/cmake/help/latest/module/FindBLAS.html).
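
As a hedged illustration (the init call mirrors the one shown later in this thread; only the flag differs): the _64 suffix selects 64-bit indexing rather than a higher floating-point precision, so the index arrays handed to STRUMPACK must then be 8-byte integers while the values stay single-precision complex.

! 32-bit indexing (default): local_rows, row_ptr, col_ind, dist are integer(4)
call STRUMPACK_init(S, MPI_COMM_WORLD, STRUMPACK_FLOATCOMPLEX,    STRUMPACK_MPI_DIST, 0, c_null_ptr, 1)
! 64-bit indexing: the same arrays must be integer(8)
call STRUMPACK_init(S, MPI_COMM_WORLD, STRUMPACK_FLOATCOMPLEX_64, STRUMPACK_MPI_DIST, 0, c_null_ptr, 1)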


GuoqiMa commented May 30, 2023

Thanks for your reply.
I did try matching none, but it seems a similar problem still exists:

Initializing STRUMPACK
using 2 OpenMP thread(s)
using 20 MPI processes
matching job: none
matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()

pghysels (Owner) commented

Could this just be running out of memory?
It might help to run with fewer MPI ranks and fewer OpenMP threads. Then you leave some cores idle, and it will take longer, but it will take less memory.

Do you have an estimate for the required memory usage? Perhaps from some smaller runs you can extrapolate the memory usage to get an estimate.


GuoqiMa commented May 30, 2023

Yes, I have estimated the memory; usually the actual memory usage is more than 1.6 times the estimated value. Now I am testing it with enough memory.

In fact, in my previous tests I deliberately ran it with insufficient memory, but it stopped at later steps, not right after matrix equilibration. Anyhow, I will test this. Thanks very much.

GuoqiMa changed the title from "error: 'vector::_M_default_append' when running large model in Fortran use of STRUMPACK_MPIdist" to "std::length_error when total nonzeros is higher than the maximum of integer4, Fortran use of STRUMPACK_MPIdist" on Jun 3, 2023

GuoqiMa commented Jun 3, 2023

Hi, Pieter.

I have tried creating a banded matrix myself, and it seems it is not a problem of insufficient memory. I found that when the total number of nonzeros exceeds the maximum of integer(4) (2,147,483,647), the error below shows up immediately, before the 'initial matrix' step, while below the integer(4) maximum the code runs successfully. So, does this mean I have to use 64-bit STRUMPACK?

Initializing STRUMPACK
using 2 OpenMP thread(s)
using 20 MPI processes
matching job: none
matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()
forrtl: error (76): Abort trap signal


GuoqiMa commented Jun 3, 2023

Or should METIS be 64-bit? I use METIS for the reordering.


pghysels commented Jun 3, 2023

You can try with 64bit METIS. STRUMPACK can use either 32 or 64 bit METIS.


GuoqiMa commented Jun 3, 2023

For the 'initializing matrix' step, does it reach the matrix reordering? So now I should reconfigure with 64-bit METIS as well, besides what you mentioned above about STRUMPACK_FLOATCOMPLEX_64 and STRUMPACK_USE_BLAS64, right?



pghysels commented Jun 4, 2023

Yes, you can try METIS with 64 bit integers. STRUMPACK can use METIS with either 32 or 64 bit.


GuoqiMa commented Jun 11, 2023

Hi, Pieter. I find that each integer input of STRUMPACK_set_distributed_csr_matrix(S, c_loc(locN), c_loc(IA3), c_loc(JA3), c_loc(A2), c_loc(RowS), 1) must be of integer(4) type, otherwise a segmentation fault happens. So I think it is not a problem of BLAS/LAPACK as you said, just a problem of the data types in the STRUMPACK interface. In STRUMPACK there is a step that computes the total number of nonzeros, and this must be integer(8) in my case. So, I wonder if there is any solution for my problem?

So, I think you would have to provide a 64-bit integer STRUMPACK interface for calls like STRUMPACK_set_distributed_csr_matrix, so that it can call BLAS/LAPACK or METIS and still work.


GuoqiMa commented Jun 11, 2023

Hi, Pieter.

As in my test above: when the total number of nonzeros exceeds the integer(4) maximum, the std::length_error shows up immediately, before the 'initial matrix' step, while below that limit the code succeeds.

So, as I tested, the problem is in the initialization, where a 64-bit integer should be used. I guess the initialization does not need BLAS/LAPACK or METIS.

pghysels (Owner) commented

For

STRUMPACK_set_distributed_csr_matrix

the arguments local_rows, row_ptr, col_ind, dist should be 32 or 64 bit signed integer pointers (depending on whether you use STRUMPACK_FLOATCOMPLEX or STRUMPACK_FLOATCOMPLEX_64).
values is a pointer to complex float values; the last argument, symmetric_pattern, is a 32-bit integer.

It could be that the total number of nonzeros is larger than the 32-bit limit, while locally on each rank the number of nonzeros fits in 32 bits. In that case I think it will print a negative number for the total number of nonzeros, but it might still run correctly. I can try to rewrite that part of the code so that it uses 32-bit integers everywhere except for printing the total nnz, but I need some time to do that carefully. But you can run with 64-bit integers everywhere; I don't see why that wouldn't work.
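
To make the expected kinds concrete, here is a hedged Fortran fragment (declarations and the call only; the variable names are illustrative and the surrounding solver setup is assumed to follow fexample.f90) for the 64-bit interface:

use, intrinsic :: iso_c_binding
integer(c_int64_t), target              :: local_rows                      ! rows owned by this rank
integer(c_int64_t), allocatable, target :: row_ptr(:), col_ind(:), dist(:) ! 0-based CSR indices, 64-bit
complex(c_float_complex), allocatable, target :: values(:)                 ! single-precision complex values
! ... fill local_rows, row_ptr, col_ind, dist, values ...
! the last argument (symmetric_pattern) stays a plain 32-bit integer
call STRUMPACK_set_distributed_csr_matrix(S, c_loc(local_rows), c_loc(row_ptr), &
     c_loc(col_ind), c_loc(values), c_loc(dist), 1)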


GuoqiMa commented Jun 12, 2023

Thank you very much. Oh, I thought floatcomplex_64 meant a higher (double complex) precision. Now I know how to run it; I should use floatcomplex_64.


GuoqiMa commented Jun 12, 2023

Using floatcomplex_64, the error changed; I think the new error is just something else.

When under the 32-bit limit, it seems fine:

Initializing STRUMPACK
using 2 OpenMP thread(s)
using 10 MPI processes
matching job: none
matrix equilibration, r_cond = 5.00001e-08 , c_cond = 0.0192308 , type = B
initial matrix:

  • number of unknowns = 20,000,000
  • number of nonzeros = 2,059,997,348
    nested dissection reordering:
  • Natural reordering
  • strategy parameter = 8
  • number of separators = 1
  • number of levels = 1
  • nd time = 35.7113
  • symmetrization time = 9.53674e-07
    symbolic factorization:
  • nr of dense Frontal matrices = 1
  • symb-factor time = 48.2309
    multifrontal factorization:
  • estimated memory usage (exact solver) = 3.2e+09 MB
  • minimum pivot, sqrt(eps)*|A|_1 = 0.0251465
  • replacing of small pivots is not enabled
    STRUMPACK: out of memory!

However, above the 32-bit integer limit, the error is different from before:

Initializing STRUMPACK
using 2 OpenMP thread(s)
using 10 MPI processes
matching job: none
matrix equilibration, r_cond = 1 , c_cond = 1 , type = N

BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
RANK 0 PID 753361 RUNNING AT cn-08-21
KILLED BY SIGNAL: 9 (Killed)

The difference is in the matrix equilibration, from type B to type N. Do you have any idea why this difference happens?

pghysels (Owner) commented

That should be the same.
So it looks like the matrix isn't passed correctly.
Can you share some code I can look at?


GuoqiMa commented Jun 13, 2023

Actually, I didn't change anything but the number of nonzeros. Below the 32-bit limit it works; above the limit it does not. I do see some 'bus error' messages on some ranks.

call STRUMPACK_init(S, MPI_COMM_WORLD, STRUMPACK_FLOATCOMPLEX_64, STRUMPACK_MPI_DIST, 0, c_null_ptr, 1)
call STRUMPACK_set_matching(S, STRUMPACK_MATCHING_NONE)
call STRUMPACK_set_reordering_method(S, STRUMPACK_METIS)
locN = Iend - Istart                    ! number of rows owned by this rank
allocate(x(1:Nrows))
x(1:Nrows) = (0.d0, 0.d0)
IA2(1:Nrows) = IA2(1:Nrows) - IA2(1)    ! shift row pointers to 0-based indexing
JA2(1:Nzs) = JA2(1:Nzs) - 1             ! shift column indices to 0-based indexing
call STRUMPACK_set_distributed_csr_matrix(S, c_loc(locN), c_loc(IA2), c_loc(JA2), c_loc(A2), c_loc(RowS), 1)
ierr = STRUMPACK_solve(S, c_loc(B2), c_loc(x), 0)
B2(1:Nrows) = x(1:Nrows)

pghysels (Owner) commented

The equilibration is based on the xgeequ routine from LAPACK:
https://netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga56565ae06016954202aee24cdfc38257.html
The equilibration type (R=row, C=column, or B=both) is decided based on the matrix values (see r_cond and c_cond).
I would think that if row equilibration is required for the smaller problem, it should also be required for the larger one.
But for your larger problem r_cond = 1 and c_cond = 1 seems suspicious. It makes me believe the matrix is not passed correctly, but I can't find anything wrong in the code.
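
For reference, a small self-contained sketch (plain LAPACK, not STRUMPACK's internal code) of what zgeequ reports; the r_cond and c_cond printed by STRUMPACK play the role of ROWCND and COLCND here, and values close to 1 roughly mean the rows and columns already have comparable magnitudes, so no scaling (type N) is chosen:

program equilibration_demo
  implicit none
  integer, parameter :: n = 2
  complex(8) :: a(n,n)
  real(8)    :: r(n), c(n), rowcnd, colcnd, amax
  integer    :: info
  ! Badly scaled rows: the second row is about 1e8 times larger than the first.
  a(1,1) = (1.d0, 0.d0);  a(1,2) = (2.d0, 0.d0)
  a(2,1) = (1.d8, 0.d0);  a(2,2) = (3.d8, 0.d0)
  call zgeequ(n, n, a, n, r, c, rowcnd, colcnd, amax, info)
  ! Here rowcnd comes out tiny (around 1e-8), so row scaling (type R or B) would be applied;
  ! rowcnd and colcnd both close to 1 would mean the matrix needs no scaling (type N).
  print *, 'rowcnd =', rowcnd, ', colcnd =', colcnd, ', info =', info
end program equilibration_demo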


GuoqiMa commented Jun 14, 2023

Thank you. May I ask another question?

Recently, my HPC manager reconfigured STRUMPACK. However, my old code no longer gets correct solutions, even though the log looks correct. Before, I had checked that the solutions were correct; with this installation the solutions are all 0 every time. I have no idea what happened; it is very weird. I asked my manager to reinstall it, and I also wanted to ask if you have a clue.

Initializing STRUMPACK
using 2 OpenMP thread(s)
using 10 MPI processes
matching job: maximum matching with row and column scaling
matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
Matrix padded with zeros to get symmetric pattern.
Number of nonzeros increased from 18,741,393 to 31,796,591.
initial matrix:

  • number of unknowns = 176,505
  • number of nonzeros = 31,796,591
    nested dissection reordering:
  • Metis reordering
    • used METIS_NodeND (iso METIS_NodeNDP)
    • supernodal tree was built from etree
  • strategy parameter = 8
  • number of separators = 6,525
  • number of levels = 14
  • nd time = 9.78745
  • matching time = 14.2286
  • symmetrization time = 0.269281
    symbolic factorization:
  • nr of dense Frontal matrices = 6,525
  • symb-factor time = 0.190588
    multifrontal factorization:
  • estimated memory usage (exact solver) = 15869.5 MB
  • minimum pivot, sqrt(eps)*|A|_1 = 1.56934e-07
  • replacing of small pivots is not enabled
    DenseMPI factorization complete, no GPU support, P=2, T=2: 3.46566 seconds, 760.986 GFLOPS, 219.579 GFLOP/s, ds=1867, du=6183
    DenseMPI factorization complete, no GPU support, P=2, T=2: 3.5328 seconds, 862.828 GFLOPS, 244.233 GFLOP/s, ds=1986, du=6353
    DenseMPI factorization complete, no GPU support, P=2, T=2: 3.26074 seconds, 955.25 GFLOPS, 292.955 GFLOP/s, ds=1840, du=7117
    DenseMPI factorization complete, no GPU support, P=3, T=2: 5.00299 seconds, 1195.37 GFLOPS, 238.93 GFLOP/s, ds=1953, du=7751
    DenseMPI factorization complete, no GPU support, P=3, T=2: 4.90208 seconds, 1304.77 GFLOPS, 266.167 GFLOP/s, ds=1992, du=8033
    DenseMPI factorization complete, no GPU support, P=2, T=2: 4.97933 seconds, 1155.33 GFLOPS, 232.026 GFLOP/s, ds=1967, du=7565
    DenseMPI factorization complete, no GPU support, P=5, T=2: 5.37258 seconds, 2644.16 GFLOPS, 492.157 GFLOP/s, ds=3495, du=7924
    DenseMPI factorization complete, no GPU support, P=5, T=2: 5.65052 seconds, 2733.49 GFLOPS, 483.76 GFLOP/s, ds=3580, du=7924
    DenseMPI factorization complete, no GPU support, P=10, T=2: 1.77755 seconds, 1326.33 GFLOPS, 746.157 GFLOP/s, ds=7924, du=0
  • factor time = 22.4018
  • factor nonzeros = 991,844,543
  • factor memory = 15869.5 MB
    REFINEMENT it. 0 res = 2.27357e-11 rel.res = 1 bw.error = 1
    DIRECT/GMRES solve:
  • abs_tol = 1e-10, rel_tol = 1e-06, restart = 30, maxit = 5000
  • number of Krylov iterations = 0
  • solve time = 0.00257802


GuoqiMa commented Jun 14, 2023

And this is the configure script from my manager:

!#STRUMPACK 7.1.2
!#Introduction
Compiled on cn-09-32 by Sergio Martinez
Requested by Guoqi Ma
https://portal.nersc.gov/project/sparse/strumpack/v7.1.0/installation.html

!###Prepare modules
module purge
module load mkl/2021.3
module load intel/2021.3-gcc-9.3
module load impi/2021.3
module load scotch/6.1.1
module load metis/5.1.0
module load parmetis/4.0.3

!### Prepare directories
ROOT_DIR=/apps/ku
COMPILER_DEP=intel-2021_3-gcc-9_3
MPI_DEP=impi-2021_3
APP_NAME=strumpack
APP_VERSION=7.1.2
APPS=${ROOT_DIR}/${COMPILER_DEP}/${MPI_DEP}/${APP_NAME}/${APP_VERSION}
BUILD=${ROOT_DIR}/build/${APP_NAME}/${APP_VERSION}

mkdir -p ${BUILD}
mkdir -p ${APPS}

!### Download and extract
cd $BUILD
wget https://github.com/pghysels/STRUMPACK/archive/refs/tags/v7.1.2.tar.gz
tar -xvf v7.1.2.tar.gz
cd STRUMPACK-7.1.2

!## Configure, Build and Install
mkdir -p build
cd build
cmake ../ -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=$APPS \
  -DCMAKE_CXX_COMPILER=mpiicpc \
  -DCMAKE_C_COMPILER=mpiicc \
  -DCMAKE_Fortran_COMPILER=mpiifort \
  -DTPL_SCALAPACK_LIBRARIES="${MKLROOT}/lib/intel64/libmkl_scalapack_lp64.a;-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group;-liomp5;-lpthread;-lm;-ldl" \
  -DMETIS_INCLUDE_DIR=/apps/ku/intel-2021_3-gcc-9_3/metis/5.1.0/include \
  -DMETIS_LIBRARIES=/apps/ku/intel-2021_3-gcc-9_3/metis/5.1.0/lib/libmetis.a \
  -DSTRUMPACK_USE_PARMETIS=ON \
  -DTPL_PARMETIS_INCLUDE_DIRS=/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/parmetis/4.0.3/include \
  -DTPL_PARMETIS_LIBRARIES=/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/parmetis/4.0.3/lib/libparmetis.a \
  -DTPL_ENABLE_SCOTCH=ON \
  -DTPL_SCOTCH_INCLUDE_DIRS=/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/scotch/6.1.1/include \
  -DTPL_SCOTCH_LIBRARIES="/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/scotch/6.1.1/lib/libscotch.a;/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/scotch/6.1.1/lib/libscotcherr.a" \
  -DTPL_ENABLE_PTSCOTCH=ON \
  -DTPL_PTSCOTCH_INCLUDE_DIRS=/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/scotch/6.1.1/include \
  -DTPL_PTSCOTCH_LIBRARIES="/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/scotch/6.1.1/lib/libptscotch.a;/apps/ku/intel-2021_3-gcc-9_3/impi-2021_3/scotch/6.1.1/lib/libptscotcherr.a"

make
!#make test
!#https://portal.nersc.gov/project/sparse/strumpack/master/FAQ.html
!#All MPI tests will fail because Intel MPI doesn't recognize --oversubscribe option
make install

!## Modulefile
!### Location
/apps/ku/modulefiles/MPI/intel/2021.3-gcc-9.3/impi/2021.3/strumpack/7.1.2.lua
!### Content
local pkgName = myModuleName()
local pkgVersion = myModuleVersion()
local pkgNameVer = myModuleFullName()
local hierA = hierarchyA(pkgNameVer,2)
local mpiD = hierA[1]:gsub("/","-"):gsub("%.","")
local compilerD = hierA[2]:gsub("/","-"):gsub("%.","")
local base = pathJoin("/apps/ku", compilerD, mpiD, pkgNameVer)
whatis("Name: " ..pkgName)
whatis("Version: " .. pkgVersion)
whatis("Description: STRUMPACK is a software library providing linear algebra routines and linear system solvers for sparse and for dense rank-structured linear systems.")
whatis("URL: https://portal.nersc.gov/project/sparse/strumpack/index.html")

depends_on("mkl/2021.3")
depends_on("scotch/6.1.1")
depends_on("metis/5.1.0")
depends_on("parmetis/4.0.3")

prepend_path("CPATH",           pathJoin(base,"include"))
prepend_path("LD_LIBRARY_PATH", pathJoin(base,"lib"))
prepend_path("LIBRARY_PATH",    pathJoin(base,"lib"))

setenv("STRUMPACK_DIR",  base)

pghysels (Owner) commented

I checked fexample.f90, and that gives the correct results (see f5be648).
Do you have an MPI example, perhaps a modification of fexample.f90, that I can try?
Sorry, I could write this example myself, but I'm not very familiar with Fortran.
