SEAS problems are very expensive to solve. The most expensive part of the proposed method is the solution of a linear system of equations. Due to the high variability in time-step size and the potentially long simulated time of thousands of years, millions of time-steps may be required, and each time-step consists of several stages, i.e. multiple linear solves per time-step. A very efficient and scalable solver is therefore mandatory.


Flexible solvers using PETSc

Our implementation is MPI-parallel and interfaces with PETSc, which allows us to prototype solvers efficiently (Balay et al., 1997, 2021). We focus on the following three types of solvers:

  1. Direct sparse LU (e.g. MUMPS, PARDISO)
  2. GMRES with two-level deflation preconditioner, where the eigendecomposition is computed via Randomised Linear Algebra
  3. Conjugate gradients with P-multigrid preconditioner
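
For illustration, the three variants could be selected through the PETSc options database roughly as in the following petsc4py sketch. The 1D Laplacian is merely a stand-in for the assembled elasticity operator, the deflation level of the second variant as well as the P-level hierarchy of the third need additional, solver-specific setup that is not shown, and a PETSc build with MUMPS is assumed.

```python
# Sketch: selecting solver variants via the PETSc options database (petsc4py).
# Assumptions: PETSc built with MUMPS; run on a single process; the toy 1D
# Laplacian stands in for the assembled elasticity operator.
from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setPreallocationNNZ(3)
for i in range(n):                       # assemble a toy 1D Laplacian
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()
b = A.createVecRight()
b.set(1.0)
x = A.createVecRight()

opts = PETSc.Options()
# Variant 1: direct sparse LU via MUMPS
opts["ksp_type"] = "preonly"
opts["pc_type"] = "lu"
opts["pc_factor_mat_solver_type"] = "mumps"
# Variant 2 (GMRES + additive Schwarz; the deflation level needs custom setup):
#   opts["ksp_type"] = "gmres"; opts["pc_type"] = "asm"; opts["sub_pc_type"] = "lu"
# Variant 3 (CG + multigrid; the P-coarsened hierarchy must be supplied to PCMG):
#   opts["ksp_type"] = "cg"; opts["pc_type"] = "mg"

ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
ksp.setOperators(A)
ksp.setTolerances(rtol=1e-8)             # relative tolerance used in the experiments
ksp.setFromOptions()
ksp.solve(b, x)                          # called once per stage in a SEAS simulation
```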

We present small-scale numerical experiments for a 3D elasticity problem on the unit box. The polynomial degree is fixed to $P = 6$ in all experiments and the relative tolerance is set to $10^{-8}$. The experiments are executed on an AMD EPYC 7662 processor.

Sparse LU (MUMPS)

We obtained the following results on a single core:

| Resolution | DOFs | Setup time | Solve time |
|---|---|---|---|
| 1.000 | 1260 | 0.1505s | 0.001098s |
| 0.500 | 10080 | 3.649s | 0.02512s |
| 0.250 | 80640 | 159.9s | 0.5515s |
| 0.125 | 645120 | 6545s | 6.469s |

The LU decomposition requires a huge setup time. Moreover, we observe that the LU factors require more than 100 GB of memory on the finest grid. Nevertheless, the factorization can be reused millions of times in a SEAS simulation, such that LU remains competitive due to its low solve time.
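
This factor-once, solve-many pattern can be sketched with SciPy; the 1D Laplacian and the random right-hand sides below are only placeholders for the assembled operator and the stage-dependent right-hand sides.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
lu = splu(A)                      # expensive setup, performed once

for step in range(5):             # stands in for millions of time-steps/stages
    b = np.random.rand(n)         # the right-hand side changes every stage
    u = lu.solve(b)               # cheap triangular solves, reusing the factorization
```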

Two-level deflation

The LU decomposition will very likely not scale to very large problems due to its huge memory requirement and its serial nature. One way to scale direct methods is iterative domain decomposition, where a direct solver is used to solve the elasticity problem on each subdomain.

Here, we test a multiplicative two-level method: on the first level the smallest eigenvalues are deflated, and on the second level an additive Schwarz method is used (PC type asm in PETSc). In the following table of results, 64 eigenvalues are deflated, where the eigenvalues are obtained via randomised linear algebra.

| MPI ranks | Resolution | Setup time | Solve time | Iterations |
|---|---|---|---|---|
| 1 | 0.25 | 542.8s | 0.7187s | 1 |
| 2 | 0.25 | 573.8s | 2.643s | 6 |
| 4 | 0.25 | 1266s | 6.738s | 14 |
| 8 | 0.25 | 1666s | 25.98s | 16 |
| 16 | 0.25 | 2184s | 40.11s | 20 |
| 32 | 0.25 | 1489s | 28.67s | 23 |
| 64 | 0.25 | 1643s | 46.49s | 20 |

The solve times are not competitive with sparse LU, although the total time for a single solve (setup plus solve) is lower.
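
For reference, the structure of the multiplicative two-level preconditioner described above can be sketched in NumPy as follows. This is a generic sketch, not the actual implementation: the exact eigenvectors and the Jacobi second level are stand-ins for the randomised eigensolver and the additive Schwarz method.

```python
import numpy as np

def two_level_apply(r, A, Z, M1_solve):
    """Multiplicative two-level preconditioner (generic sketch).

    Level 1: coarse correction in the deflation space spanned by the columns of Z
             (approximate eigenvectors belonging to the smallest eigenvalues).
    Level 2: a one-level preconditioner applied to the residual that remains after
             the coarse correction (additive Schwarz in practice, Jacobi below).
    """
    E = Z.T @ A @ Z                       # small k x k coarse operator
    z = Z @ np.linalg.solve(E, Z.T @ r)   # deflation / coarse-space correction
    z += M1_solve(r - A @ z)              # second level on the updated residual
    return z

# Tiny demo on a synthetic SPD matrix; in practice Z comes from randomised
# linear algebra and a deflated GMRES iteration wraps this application.
rng = np.random.default_rng(0)
n, k = 200, 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1e-3, 1.0, n)) @ Q.T   # SPD test matrix
w, V = np.linalg.eigh(A)
Z = V[:, :k]                                       # deflate the k smallest modes
M1_solve = lambda r: r / np.diag(A)                # Jacobi as a crude second level
z = two_level_apply(rng.standard_normal(n), A, Z, M1_solve)
```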

P-multigrid

In the P-multigrid method, coarse grids are constructed by reducing the polynomial degree; a generic V-cycle is sketched below the results. In this experiment we use P-levels 6, 3, and 1, with the sparse LU solver on the coarsest level. On a single core we obtain the following results:

| Resolution | DOFs | Setup time | Solve time | Iterations |
|---|---|---|---|---|
| 1.0000 | 1260 | 0.2317s | 0.146s | 11 |
| 0.5000 | 10080 | 2.521s | 2.179s | 19 |
| 0.2500 | 80640 | 23.51s | 25.46s | 22 |
| 0.1250 | 645120 | 195.6s | 171.2s | 22 |
| 0.0625 | 5160960 | 1717s | 1111s | 21 |

We observe that the number of iterations stays almost constant (except for the smallest case, where fewer iterations are required).
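
The V-cycle underlying the preconditioner can be sketched generically as follows. The two-level toy hierarchy, the linear-interpolation prolongation, and the damped Jacobi smoother are placeholders; for P-multigrid the prolongation would embed the lower-degree polynomial space into the higher-degree one.

```python
import numpy as np

def jacobi(A, b, x, nu=3, omega=2.0 / 3.0):
    """Damped Jacobi smoothing (stand-in for the smoother used in practice)."""
    d = np.diag(A)
    for _ in range(nu):
        x = x + omega * (b - A @ x) / d
    return x

def v_cycle(lvl, A, P, b, x):
    """One V-cycle; A[lvl] is the operator on level lvl (lvl = 0 finest, e.g. P = 6),
    P[lvl] prolongates from level lvl + 1 (lower polynomial degree) to level lvl."""
    if lvl == len(A) - 1:
        return np.linalg.solve(A[lvl], b)          # direct solve on the coarsest level
    x = jacobi(A[lvl], b, x)                       # pre-smoothing
    r_c = P[lvl].T @ (b - A[lvl] @ x)              # restrict the residual (R = P^T)
    e_c = v_cycle(lvl + 1, A, P, r_c, np.zeros_like(r_c))
    x = x + P[lvl] @ e_c                           # prolongate the coarse correction
    return jacobi(A[lvl], b, x)                    # post-smoothing

# Toy two-level hierarchy: a 1D Laplacian with linear-interpolation prolongation
# stands in for the P = 6/3/1 hierarchy of the elasticity operator.
n = 63
A0 = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
P0 = np.zeros((n, (n - 1) // 2))
for j in range((n - 1) // 2):
    P0[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
A1 = P0.T @ A0 @ P0                                # Galerkin coarse operator
b, x = np.ones(n), np.zeros(n)
for _ in range(20):                                # stand-alone V-cycle iteration
    x = v_cycle(0, [A0, A1], [P0], b, x)
```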

The P-multigrid method is well-suited for parallelisation. In the following experiment we scale the $h = 0.125$ mesh from 1 to 32 cores for two different implementations: Assembled (AS) uses assembled matrices on every level, whereas Matrix-free (MF) uses a matrix-free operator application on the finest level.

| Type | Variable | 1 core | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores |
|---|---|---|---|---|---|---|---|
| AS | solve | 271.1s | 148.9s | 87.1s | 72.41s | 72.3s | 40.15s |
| AS | speed-up | 1x | 1.8x | 3.1x | 3.7x | 3.7x | 6.8x |
| MF | solve | 169.9s | 100.6s | 48.69s | 26.1s | 16.96s | 9.843s |
| MF | speed-up | 1x | 1.7x | 3.5x | 6.5x | 10x | 17x |
| MF | fine-grid GFLOPS | 21.5 | 42.6 | 83.7 | 162 | 328 | 635 |

We observe that the matrix-free variant outperforms the assembled variant. While an operator application in the matrix-free variant requires more floating-point operations than with an assembled matrix, it makes better use of modern hardware, which strongly favours compute-intensive workloads.
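
The principle can be illustrated with SciPy's LinearOperator: the Krylov iteration only needs the action $v \mapsto Av$, so the operator can be applied on the fly instead of being stored. The stencil kernel below is merely a placeholder for the element-local tensor-product kernels of the elasticity operator.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 10000

def apply_A(v):
    """Matrix-free application of a 1D Laplacian stencil (placeholder kernel)."""
    out = 2.0 * v
    out[:-1] -= v[1:]
    out[1:] -= v[:-1]
    return out

A = LinearOperator((n, n), matvec=apply_A, dtype=np.float64)
b = np.ones(n)
u, info = cg(A, b)          # CG never forms A, it only calls apply_A
```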

The performance of the matrix-free operator is measured in isolation by applying the operator a hundred times to random data. On a single core we measure 21.5 GFLOPS, which corresponds to 41–67% of the theoretical peak performance (for core frequencies between 2 and 3.3 GHz). On 32 cores we measure 635 GFLOPS or 38–64% of the theoretical peak. The matrix-free operator is implemented using Yet Another Tensor Toolbox (Uphoff & Bader, 2020).
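
The quoted fractions of peak follow from the nominal double-precision throughput of a Zen 2 core, assuming two 256-bit FMA units, i.e. 16 FLOP per cycle and core:

$$P_{\text{peak}} = 16\,\frac{\text{FLOP}}{\text{cycle}} \cdot f \cdot n_{\text{cores}}, \qquad \frac{21.5\ \text{GFLOPS}}{16 \times 3.3\ \text{GHz}} \approx 41\,\%, \qquad \frac{21.5\ \text{GFLOPS}}{16 \times 2.0\ \text{GHz}} \approx 67\,\%.$$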

The coarse-grid solve is likely to become the bottleneck of P-multigrid when scaling the method to more cores. In future work, we are going to consider other coarse-grid solvers, such as the two-level deflation method or algebraic multigrid.

Discrete Green’s function

The major cost stems from solving the linear system $A\bm{u} = \bm{b}$. In the time-stepping loop, the right-hand side depends linearly on time and on the slip vector, and may be written as

$$\bm{b} = \bm{b}_0 + \bm{b}_D t + B\bm{S}.$$

Note that the size of $\bm{S}$ is $(D-1)n$, with $D$ being the dimension and $n$ the number of on-fault basis functions.

Moreover, on-fault traction depends linearly on displacement and can be written as

$$[\sigma_n(\bm{u}), \tau(\bm{u})] = C\bm{u},$$

where $C$ is a $Dn \times DN$ matrix, $N$ being the number of basis functions in the domain. Therefore, we may write the on-fault traction as

$$[\sigma_n(\bm{u}), \tau(\bm{u})] = \underbrace{CA^{-1}\bm{b}_0}_{\bm{G}_0} + t\underbrace{CA^{-1}\bm{b}_D}_{\bm{G}_D} + \sum_{i=1}^{(D-1)n} \underbrace{CA^{-1}B\bm{e}_i}_{\bm{G}_i}S_i,$$

where $\bm{e}_i$ is the $i$-th unit vector. The vectors $\bm{G}_0, \bm{G}_D, \bm{G}_1, \dots, \bm{G}_{(D-1)n}$ can be precomputed by solving the linear system of equations $(D-1)n + 2$ times. As we only need the on-fault tractions in the time integrator, we can replace the solve of a sparse $DN \times DN$ system by a dense matrix-vector multiplication of size $(D-1)n \times (D-1)n$ and two AXPYs. Hence, this approach is attractive if $n \ll N$ and the number of time-steps is very large.
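
A NumPy/SciPy sketch of the pre-computation and of the resulting time loop follows; the operator $A$, the matrices $B$ and $C$, the vectors $\bm{b}_0$ and $\bm{b}_D$, and all sizes are random placeholders rather than the actual discretisation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

rng = np.random.default_rng(0)
D, n, N = 3, 15, 400                       # dimension, on-fault / domain basis functions
m = (D - 1) * n                            # number of slip components

# Random placeholders for the operators defined above.
A = (sp.eye(D * N) + 0.01 * sp.random(D * N, D * N, density=0.01, random_state=0)).tocsc()
B = rng.standard_normal((D * N, m))
C = rng.standard_normal((D * n, D * N))    # displacement -> on-fault tractions
b0 = rng.standard_normal(D * N)
bD = rng.standard_normal(D * N)

# Pre-computation: (D - 1) n + 2 solves with the same operator A.
lu = splu(A)
G0 = C @ lu.solve(b0)
GD = C @ lu.solve(bD)
G = np.column_stack([C @ lu.solve(B[:, i]) for i in range(m)])   # G_i = C A^{-1} B e_i

# Time loop: tractions from one dense matrix-vector product and two AXPYs,
# no sparse solve is needed any more.
for step in range(3):
    t = 0.1 * step
    S = rng.standard_normal(m)             # slip vector provided by the ODE integrator
    traction = G0 + t * GD + G @ S
```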

Discussion. On the one hand:

  1. The memory requirements of this approach are potentially huge, as a $(D-1)n \times (D-1)n$ matrix needs to be stored.
  2. The pre-computation time is huge if $n$ is large.

On the other hand:

  1. Pre-computation of the $\bm{G}$-vectors is embarrassingly parallel.
  2. The approach is simple to implement and orthogonal to solver or performance optimisations.
  3. After pre-computation of the $\bm{G}$-vectors, the complexity of this method is comparable to that of a boundary element method for non-planar fault geometries, but without the need for an analytic Green’s function. That is, there is no limitation to “simple” material variations.
  4. The time-to-solution for the full BP3-QD benchmark problem is reduced from about one week to about three hours using this method.

References

  1. Balay, S., Abhyankar, S., Adams, M. F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Dener, A., Eijkhout, V., Gropp, W. D., Karpeyev, D., Kaushik, D., Knepley, M. G., May, D. A., McInnes, L. C., Mills, R. T., Munson, T., Rupp, K., Sanan, P., … Zhang, H. (2021). PETSc Web page. https://www.mcs.anl.gov/petsc
  2. Balay, S., Abhyankar, S., Adams, M. F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Dener, A., Eijkhout, V., Gropp, W. D., Karpeyev, D., Kaushik, D., Knepley, M. G., May, D. A., McInnes, L. C., Mills, R. T., Munson, T., Rupp, K., Sanan, P., … Zhang, H. (2021). PETSc Users Manual (ANL-95/11 - Revision 3.15). Argonne National Laboratory. https://www.mcs.anl.gov/petsc
  3. Balay, S., Gropp, W. D., McInnes, L. C., & Smith, B. F. (1997). Efficient Management of Parallelism in Object Oriented Numerical Software Libraries. In E. Arge, A. M. Bruaset, & H. P. Langtangen (Eds.), Modern Software Tools in Scientific Computing (pp. 163–202). Birkhäuser Press.
  4. Uphoff, C., & Bader, M. (2020). Yet Another Tensor Toolbox for Discontinuous Galerkin Methods and Other Applications. ACM Trans. Math. Softw., 46(4). https://doi.org/10.1145/3406835