Parallel solvers for
discrete-time algebraic Riccati equations

Rafael Mayo 1, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Vicente Hernández


SUMMARY

We investigate the numerical solution of discrete-time algebraic Riccati equations on a parallel distributed architecture. Our solvers obtain an initial solution of the Riccati equation via the disc function method, and then refine this solution using Newton's method. The Smith iteration is employed to solve the Stein equation that arises at each step of Newton's method.

Numerical experiments on an Intel Pentium-II cluster, connected via a Myrinet switch, demonstrate the performance and scalability of the new algorithms.

1. INTRODUCTION

Consider the discrete-time algebraic Riccati equation (DARE)

0 = R_d(X) := Q + A^T X A - X - A^T X B (R + B^T X B)^{-1} B^T X A,     (1)

where A, Q, X ∈ R^{n×n}, Q = Q^T, B ∈ R^{n×m}, R ∈ R^{m×m}, R = R^T, and X = X^T is the sought-after solution.

Algebraic Riccati equations arise in many analysis and synthesis algorithms in control theory. In particular, model reduction for large-scale control systems has in recent years become one of the most important problems in systems and control theory. Many of the algorithms recently proposed for this problem require the solution of equations of the form (1); see, e.g., [,,].

In general, the desired solution of the DARE (hereafter X_d) is stabilizing in the sense that all the eigenvalues of A_d = A - B (R + B^T X_d B)^{-1} B^T X_d A lie inside the unit circle (i.e., A_d is d-stable, denoted ρ(A_d) < 1). Throughout this paper we assume that such a stabilizing solution exists. Note that if it exists, it is unique [].
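The d-stability condition ρ(A_d) < 1 is straightforward to check numerically. A minimal NumPy sketch (the function name and the test matrices below are our own, for illustration only):

```python
import numpy as np

def is_d_stable(Ad):
    """True if all eigenvalues of Ad lie strictly inside the unit circle,
    i.e., if the spectral radius rho(Ad) is below 1."""
    return float(np.max(np.abs(np.linalg.eigvals(Ad)))) < 1.0
```

For example, is_d_stable(np.array([[0.5]])) is True, while is_d_stable(np.array([[1.5]])) is False.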

The need for parallel computing in this area is clear: already for control systems with state-space dimension n = 1000, (1) represents a set of nonlinear equations with more than half a million unknowns (exploiting the symmetry of X). Systems of this dimension are common in chemical engineering applications, are standard for second-order systems, and represent rather coarse grids when derived from the discretization of a partial differential equation; see, e.g., [,,].

The inverse-free iteration for the matrix disc function [] provides an approximate numerical solution for the DARE []. The iterative procedure for the matrix disc function is composed of QR factorizations and matrix products which can be efficiently parallelized on a parallel distributed computer.

This approximate solution can then be refined using, e.g., Newton's method [,], which converges to the solution of the DARE []. At each iteration, the method requires a few basic matrix computations and the solution of a Stein equation (or discrete-time Lyapunov equation) of the form

0 = Ã^T X̃ Ã - X̃ + Q̃,     (2)

where Ã, Q̃, X̃ ∈ R^{n×n}, Ã is d-stable, and Q̃ = Q̃^T.

Numerical methods for the Stein equation are studied in [,]. In the initial stage of these methods, Ã is reduced to real Schur form by means of the QR algorithm []. This is followed by a much less expensive back-substitution process. The QR algorithm has been reported as difficult to parallelize on distributed memory architectures due to two factors: the algorithm is composed of fine-grain computations with a low ratio of computation to communication, and the use of traditional block-scattered data layouts leads to an unbalanced distribution of the computational load [,]. Different approaches have been proposed to avoid these drawbacks in [,,]. The experimental studies, however, report parallelism and scalability results which are still far from those of the kernels arising in linear systems (see, e.g., [,]).

In this paper we study the numerical solution of the DARE on a parallel distributed computer. Our Stein solver is based on the Smith iteration and only requires matrix products, which are well-known to be highly efficient on these architectures []. Thus, we avoid the parallelization difficulties in the QR algorithm.

In Sections 2 and 3 we summarize, respectively, the disc function and Newton's method for solving the DARE. In Section 4 we review our Stein solver based on the Smith iteration. A brief study of the computational and communication cost of the parallel implementations is presented in Section 5. In Section 6 we analyze the experimental performance and scalability of our solvers on a parallel distributed architecture, an Intel Pentium-II cluster, connected via a Myrinet switch. Concluding remarks are given in Section 7.

2. THE DISC FUNCTION METHOD FOR THE DARE

In [], Malyshev proposed an inverse-free iteration for the computation of a deflating subspace of a matrix pencil. Specifically, the iteration divides the spectrum of the matrix pencil along the unit circle. The iteration was made truly inverse free by Bai, Demmel, and Gu in [], and the subspace extraction stage was improved and the cost further reduced by Sun and Quintana in [].

The inverse-free iteration for solving the DARE can be formulated as follows []. We denote the identity matrix of order n by I_n.

The disc function method for the DARE.
Input: A, Q ∈ R^{n×n}, Q = Q^T, B ∈ R^{n×m}, R ∈ R^{m×m}, R = R^T, and a user-defined tolerance tol.
Output: Approximate solution X* ∈ R^{n×n} of (1). (Starting guess for Newton's method.)

  1. Set
         L_0 := [ A    0   ],     M_0 := [ I_n   -B R^{-1} B^T ],     j := 0.
                [ Q    I_n ]             [ 0      A^T          ]
  2. WHILE ||R_{j+1} - R_j||_F > tol · ||R_j||_F
     2.1. Compute the QR factorization
            [  L_j ]   [ U_11   U_12 ] [ R_j ]
            [ -M_j ] = [ U_21   U_22 ] [  0  ].
     2.2. L_{j+1} := U_22^T L_j.
     2.3. M_{j+1} := U_12^T M_j.
     2.4. j := j + 1.
     END WHILE
  3. Let M_j = [M_1, M_2], with M_1, M_2 ∈ R^{2n×n}; then M_1 X* = M_2.

The convergence of the previous iteration can be shown to be globally quadratic []. The matrix disc function is related to the iteration defined by steps 2.1-2.3 as

    disc(L_0, M_0) = lim_{j→∞} (L_j + M_j)^{-1} L_j;

see [].
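The kernel of the method, steps 2.1-2.3, implicitly squares the pencil quotient M_j^{-1} L_j using only a QR factorization and matrix products, with no explicit inverse. The following NumPy sketch verifies this squaring property for one step on a toy scalar problem (the data A = 0.5, B = Q = R = 1 is our own illustrative choice, not from the paper):

```python
import numpy as np

# Toy DARE data with n = m = 1 (illustrative values only).
n = 1
A = np.array([[0.5]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
I = np.eye(n); O = np.zeros((n, n))

# Pencil matrices of step 1.
L = np.block([[A, O], [Q, I]])
M = np.block([[I, -B @ np.linalg.solve(R, B.T)], [O, A.T]])

# One pass of step 2: full QR of the stacked matrix [L; -M].
U, _ = np.linalg.qr(np.vstack([L, -M]), mode='complete')
U12, U22 = U[:2 * n, 2 * n:], U[2 * n:, 2 * n:]
L1, M1 = U22.T @ L, U12.T @ M

# The step squares M^{-1} L without ever forming an inverse explicitly.
Z0 = np.linalg.solve(M, L)
Z1 = np.linalg.solve(M1, L1)
print(np.allclose(Z1, Z0 @ Z0))   # True
```

Repeating the step drives the eigenvalues of M_j^{-1} L_j to 0 or ∞ at a quadratic rate, which is the source of the global quadratic convergence noted above.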

3. NEWTON'S METHOD FOR THE DARE

Newton's method was first employed by Kleinman [] to solve the continuous-time version of the algebraic Riccati equation. Based on this formulation, Hewer extended Newton's method to the solution of the DARE in []. A discussion of its convergence properties can be found in [,].

Specifically, the iteration can be stated as follows.

Newton's method for the DARE.
Input: A, Q ∈ R^{n×n}, Q = Q^T, B ∈ R^{n×m}, R ∈ R^{m×m}, R = R^T, a user-defined tolerance tol, and an initial starting guess X_0 = X_0^T.
Output: Approximate solution X_{j+1} ∈ R^{n×n} of (1).

  1. j := 0.
  2. WHILE ||R_d(X_j)||_F > tol · ||X_j||_F
     2.1. A_j := A - B (R + B^T X_j B)^{-1} B^T X_j A.
     2.2. Solve for N_j the Stein equation
            0 = R_d(X_j) + A_j^T N_j A_j - N_j.
     2.3. X_{j+1} := X_j + N_j.
     2.4. j := j + 1.
     END WHILE

Under mild conditions, all the iterates X_j are stabilizing (i.e., ρ(A_j) < 1), lim_{j→∞} X_j = X_d, and the convergence is globally quadratic; see [].
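As a concrete serial illustration, the iteration above can be sketched in a few lines of NumPy. Here the Stein equation of step 2.2 is solved directly by Kronecker-product vectorization, which is viable only for small n; the parallel solver described later uses the Smith iteration instead. The function names, default tolerances, and test data are our own:

```python
import numpy as np

def solve_stein(At, Qt):
    """Solve At^T N At - N + Qt = 0 by vectorization:
    (At^T kron At^T - I) vec(N) = -vec(Qt).  O(n^6) flops: small n only."""
    n = At.shape[0]
    K = np.kron(At.T, At.T) - np.eye(n * n)
    N = np.linalg.solve(K, -Qt.flatten(order='F'))
    return N.reshape((n, n), order='F')

def dare_residual(X, A, B, Q, R):
    """R_d(X) = Q + A^T X A - X - A^T X B (R + B^T X B)^{-1} B^T X A."""
    W = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)
    return Q + A.T @ X @ A - X - (A.T @ X @ B) @ W

def newton_dare(A, B, Q, R, X0, tol=1e-12, maxit=50):
    """Newton's method for the DARE: X_{j+1} = X_j + N_j, where N_j
    solves the Stein equation A_j^T N A_j - N + R_d(X_j) = 0."""
    X = X0
    for _ in range(maxit):
        Rd = dare_residual(X, A, B, Q, R)
        if np.linalg.norm(Rd) <= tol * max(np.linalg.norm(X), 1.0):
            break
        Aj = A - B @ np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)
        X = X + solve_stein(Aj, Rd)
    return X
```

For the scalar data A = 0.5, B = Q = R = 1 and the stabilizing start X_0 = 0, the iterates converge to the positive root of x^2 - x/4 - 1 = 0, i.e., x = (1 + √65)/8 ≈ 1.1328.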

4. THE SMITH ITERATION FOR THE STEIN EQUATION

We can rewrite equation (2) in fixed-point form, X̃ = Ã^T X̃ Ã + Q̃, and form the fixed-point iteration

    X_0 := Q̃,     X_{j+1} := Q̃ + Ã^T X_j Ã,     j = 0, 1, 2, ....

This iteration converges to X̃ if ρ(Ã) < 1, and the convergence rate is linear. A quadratically convergent version of this fixed-point iteration is suggested in [,]. Specifically, the iteration can be written as follows.

The squared Smith iteration for the Stein equation.
Input: Ã, Q̃ ∈ R^{n×n}, Q̃ = Q̃^T, and a user-defined tolerance tol.
Output: Approximate solution X_{j+1} ∈ R^{n×n} of (2).

  1. Set A_0 := Ã, X_0 := Q̃, j := 0.
  2. WHILE ||A_j||_F > tol
     2.1. X_{j+1} := A_j^T X_j A_j + X_j.
     2.2. A_{j+1} := A_j^2.
     2.3. j := j + 1.
     END WHILE

The convergence theory of the Smith iteration derived in [] yields that for ρ(Ã) < 1 there exist real constants μ > 0 and 0 < ρ < 1 such that

    ||X̃ - X_j||_2 ≤ μ ||Q̃||_2 (1 - ρ)^{-1} ρ^{2^j}.     (3)

This shows that the method converges for all equations with ρ(Ã) < 1; i.e., convergence is guaranteed if we employ the Smith iteration to solve the Stein equations arising at each step of Newton's method. Nevertheless, if Ã is highly non-normal, so that ||Ã||_2 > 1, overflow may occur in the early stages of the iteration due to a growing ||A_j||_2, even though lim_{j→∞} A_j = 0 whenever ρ(Ã) < 1.
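A serial NumPy sketch of the squared Smith iteration (the function name and defaults are ours); for d-stable Ã the matrices A_j = Ã^{2^j} decay rapidly, so only a few iterations are needed:

```python
import numpy as np

def smith_stein(At, Qt, tol=1e-12, maxit=60):
    """Squared Smith iteration for the Stein equation
    At^T X At - X + Qt = 0, assuming rho(At) < 1."""
    Ak, X = At.copy(), Qt.copy()
    for _ in range(maxit):
        if np.linalg.norm(Ak) <= tol:   # stopping criterion ||A_j||_F <= tol
            break
        X = Ak.T @ X @ Ak + X           # X_{j+1} := A_j^T X_j A_j + X_j
        Ak = Ak @ Ak                    # A_{j+1} := A_j^2
    return X
```

For scalar Ã = 0.3, Q̃ = 1 the exact solution is X̃ = 1/(1 - 0.09) ≈ 1.0989, reached to machine precision after five squarings. Each pass costs just three matrix products, which is what makes the method attractive on distributed-memory machines.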

The parallelization of this approach was analyzed in [].

5. PARALLELIZATION AND IMPLEMENTATION DETAILS

The most appealing feature, regarding parallel implementation, of the numerical methods described in the three previous sections is that their entire computational cost comes from matrix products, linear system solves, and QR factorizations. These kernels are known to be highly parallelizable; see, e.g., [].

Our parallel algorithms are implemented using the libraries PB-BLAS (parallel block BLAS) and ScaLAPACK (scalable linear algebra package) []. These are public-domain parallel libraries for MIMD computers which provide scalable parallel distributed routines for many matrix algebra kernels. These libraries employ BLAS and LAPACK [] for serial computations, and BLACS (basic linear algebra communication subprograms) for communication. BLACS can be used on any machine which supports either PVM [] or MPI [], thus providing a highly portable environment.

For scalability purposes, we employ square-like topologies with one process per processor. The following problem and machine parameters are used:

n: Size of the problem. (For simplicity, we assume m = n.)
√p × √p: Dimensions of the square grid of p processors.
τ: Time per double-precision floating-point arithmetic operation.
α (latency): Time required to communicate a zero-length message between two processors.
β (reciprocal of bandwidth): Time required to communicate one double-precision floating-point number between two processors.

For simplicity, we assume the time required to communicate a message of length l is α + βl.

We have employed the following parallel (double-precision) routines from ScaLAPACK [] in our DARE and Stein solvers:

PDGETRF: LU factorization with partial pivoting.
PDTRSM: Solution of a triangular linear system.
PDGEMM: Matrix product.
PDGEQRF: QR factorization.
PDORMQR: Application of Householder reflectors to a matrix.

Following the performance model in [,], we present in Table 1 approximate computation and communication costs for these routines (lower-order terms have been neglected).

Parallel routine | Computation cost  | Latency                  | Bandwidth^{-1}
PDGETRF          | (2/3) n^3/p τ     | (6 + log p) n α          | (3 + (1/4) log p) (n^2/√p) β
PDTRSM           | n^3/p τ           | n α                      | (1 + (3/4) log p) (n^2/√p) β
PDGEMM           | 2 n^3/p τ         | (1 + (1/2) log p) √p α   | (1 + (1/2) log p) (n^2/√p) β
PDGEQRF          | (4/3) n^3/p τ     | 3 n log p α              | ((3/4) log p) (n^2/√p) β
PDORMQR          | 2 n^3/p τ         | 3 n log p α              | 2 log p (n^2/√p) β

Table 1: Approximate computation and communication costs for several parallel routines in PB-BLAS and ScaLAPACK.

A model can be easily constructed from the data in the above table if we consider the number of calls to each of these basic kernels in our parallel DARE and Stein solvers:

PDGEDRDC: the disc function method for the DARE.
PDGEDRNW: Newton's method for the DARE.
PDGEDLSM: the Smith iteration for the Stein equation.

Table 2 reports the number of calls to each kernel for the DARE and Stein solvers. In the table, iter_dc, iter_nw, and iter_sm stand for the number of iterations of the corresponding iterative method.

Kernel    | PDGEDRDC   | PDGEDRNW   | PDGEDLSM
PDGETRF   |            | iter_nw    |
PDTRSM    |            | 2·iter_nw  |
PDGEMM    | 2·iter_dc  | 7·iter_nw  | 3·iter_sm
PDGEQRF   | 1+iter_dc  |            |
PDORMQR   | 1+iter_dc  |            |
PDGEDLSM  |            | iter_nw    |

Table 2: Number of calls to different parallel kernels from PB-BLAS and ScaLAPACK required by the parallel DARE and Stein solvers.
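As an example of how such a model is assembled, the sketch below predicts the time of the PDGEMM term from Table 1 and combines it with the call count from Table 2 for the Smith iteration. The machine parameters (τ, α, β) passed in are hypothetical placeholders, and we take log as log2 (an assumption on our part):

```python
import math

def pdgemm_time(n, p, tau, alpha, beta):
    """Predicted PDGEMM runtime on a sqrt(p) x sqrt(p) grid,
    per the Table 1 model (lower-order terms neglected)."""
    comp = 2.0 * n**3 / p * tau
    lat = (1.0 + 0.5 * math.log2(p)) * math.sqrt(p) * alpha
    bw = (1.0 + 0.5 * math.log2(p)) * n**2 / math.sqrt(p) * beta
    return comp + lat + bw

def smith_time(n, p, iter_sm, tau, alpha, beta):
    """Stein solver PDGEDLSM: 3 PDGEMM calls per Smith iteration (Table 2)."""
    return 3 * iter_sm * pdgemm_time(n, p, tau, alpha, beta)
```

For compute-bound problem sizes the n^3/p term dominates, so the predicted time falls nearly linearly with p, which matches the speed-ups reported in Section 6.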

We observed moderate differences between our theoretical performance model and the experimental results. These are mostly due to simplifications and approximations in the model: a constant cost for every type of floating-point operation (i.e., ignoring the memory hierarchy), a linear model for communications, an assumed ideal implementation of collective communications, etc. For large problems, the theoretical execution times are within 13-17% of the measured execution times for routines PDGEDLSM and PDGEDRNW; a slightly larger difference is obtained for routine PDGEDRDC.

Efficient convergence criteria for the previous iterative algorithms have been proposed in [,,].

6. EXPERIMENTAL RESULTS

All our experiments were performed on a Beowulf cluster with 32 nodes connected via a Myrinet switch. Each node is an Intel Pentium-II processor at 300 MHz with 128 MBytes of RAM. We employ double-precision IEEE floating-point arithmetic (ε ≈ 2.2 × 10^{-16}), and the libraries BLACS (on top of MPI), BLAS, LAPACK, and ScaLAPACK [,]. Routine DGEMM achieved a performance of around 180 Mflops (millions of floating-point operations per second) on a single node. A simple ping-pong test with an implementation of MPI specially tuned for this architecture reported a latency α ≈ 18 µs and a bandwidth β^{-1} ≈ 32 MB/s.

Our first experiment analyzes the performance of the parallel algorithms. Figure 1 compares the execution time of one iteration of serial, specially tuned implementations of the DARE and Stein solvers (using the kernels in the BLAS and LAPACK libraries) with those of the corresponding parallel algorithms. The problem size is fixed at n = 700, and the number of processors is p = 2, ..., 12. When solving the Stein equation within the iteration of Newton's method (routine PDGEDRNW), the number of Smith iterations (iter_sm) was fixed at 9 in both the serial and the parallel implementations.

[Figure 1 about here]

Figure 1: Execution time of one iteration of the serial and parallel algorithms involved in the solution of the DARE (PDGEDRDC, PDGEDRNW, and PDGEDLSM).

The results of this experiment also show the speed-up of the parallel algorithms. Table 3 reports a speed-up of 7-9.5 for the largest number of processors. The higher speed-up of routine PDGEDRDC is due to the larger size of the matrices used in this case (2n × n).

 p | PDGEDRDC | PDGEDRNW | PDGEDLSM
 2 |   1.85   |   1.55   |   1.57
 4 |   3.45   |   2.84   |   2.91
 6 |   5.19   |   3.73   |   4.05
 8 |   6.56   |   4.68   |   4.81
10 |   7.94   |   6.09   |   6.34
12 |   9.47   |   7.22   |   7.57

Table 3: Speed-up of the parallel algorithms involved in the solution of the DARE.

Our next experiment reports the scalability of the parallel algorithms. Figure 2 shows the Mflops per node of the serial and the parallel implementations. We evaluate the parallel algorithms on p = 4, 9, 15, 25, and 30 processors, with n/√p = 1000. The results show the high scalability of the parallel routines: the performance per node remains almost constant as p is increased.

[Figure 2 about here]

Figure 2: Mflops per node for the serial and parallel algorithms involved in the solution of the DARE (PDGEDRDC, PDGEDRNW, and PDGEDLSM).

7. CONCLUDING REMARKS

We have analyzed the parallel solution of the discrete-time algebraic Riccati equation by means of the inverse-free iteration for the matrix disc function and Newton's method. Our implementation of Newton's method employs the Smith iteration to solve the corresponding Stein equations.

Our parallel solvers are composed of highly parallel, medium-grain computational kernels such as QR factorizations, linear system solves, and matrix products. The experimental results on a cluster of Intel Pentium-II processors, connected via a Myrinet switch, show the performance and scalability of our parallel solvers.

References

[]
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, second edition, 1995.

[]
Z. Bai, J. Demmel, J. Dongarra, A. Petitet, H. Robinson, and K. Stanley. The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers. SIAM J. Sci. Comput., 18:1446-1461, 1997.

[]
Z. Bai, J. Demmel, and M. Gu. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems. Numer. Math., 76(3):279-308, 1997.

[]
A.Y. Barraud. A numerical algorithm to solve A^T X A - X = Q. IEEE Trans. Automat. Control, AC-22:883-885, 1977.

[]
P. Benner. Contributions to the Numerical Solution of Algebraic Riccati Equations and Related Eigenvalue Problems. Logos-Verlag, Berlin, Germany, 1997. Also: Dissertation, Fakultät für Mathematik, TU Chemnitz-Zwickau, 1997.

[]
P. Benner, R. Byers, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving algebraic Riccati equations on parallel computers using Newton's method with exact line search. Berichte aus der Technomathematik, Report 98-05, FB3 - Mathematik und Informatik, Universität Bremen, 28334 Bremen (Germany), August 1998. Available from http://www.math.uni-bremen.de/zetem/berichte.html, to appear in Parallel Computing.

[]
P. Benner and E.S. Quintana-Ortí. Solving stable generalized Lyapunov equations with the matrix sign function. Numer. Algorithms, 20(1):75-100, 1999.

[]
P. Benner, E.S. Quintana-Ortí, and G. Quintana-Ortí. Solving stable Stein equations on distributed memory computers. In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, editors, EuroPar'99 Parallel Processing, number 1685 in Lecture Notes in Computer Science, pages 1120-1123. Springer-Verlag, Berlin, Heidelberg, New York, 1999.

[]
L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.

[]
E.J. Davison and F.T. Man. The numerical solution of A'Q + QA = -C. IEEE Trans. Automat. Control, AC-13:448-449, 1968.

[]
J.J. Dongarra, A. Sameh, and D. Sorensen. Implementation of some concurrent algorithms for matrix factorization. Parallel Comput., 3:25-34, 1986.

[]
J.D. Gardiner and A.J. Laub. Solving the algebraic Riccati equation on a hypercube multiprocessor. In G. Fox, editor, Hypercube Concurrent Computers and Applications, Vol. II, pages 1562-1568. ACM Press, New York, 1988.

[]
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, MA, 1994.

[]
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, second edition, 1989.

[]
W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1994.

[]
G. Henry and R. van de Geijn. Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: myths and reality. SIAM J. Sci. Comput., 17:870-883, 1997.

[]
G. Henry, D.S. Watkins, and J.J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. LAPACK Working Note 121, University of Tennessee at Knoxville, 1997.

[]
G.A. Hewer. An iterative technique for the computation of steady state gains for the discrete optimal regulator. IEEE Trans. Automat. Control, AC-16:382-384, 1971.

[]
D.L. Kleinman. On an iterative technique for Riccati equation computations. IEEE Trans. Automat. Control, AC-13:114-115, 1968.

[]
P. Lancaster and L. Rodman. The Algebraic Riccati Equation. Oxford University Press, Oxford, 1995.

[]
A.J. Laub and J.D. Gardiner. Hypercube implementation of some parallel algorithms in control. In M.J. Denham and A.J. Laub, editors, Advanced Computing Concepts and Techniques in Control Engineering, pages 361-390. Springer-Verlag, Berlin, 1988.

[]
A.N. Malyshev. Parallel algorithm for solving some spectral problems of linear algebra. Linear Algebra Appl., 188/189:489-520, 1993.

[]
V. Mehrmann. The Autonomous Linear Quadratic Control Problem, Theory and Numerical Solution. Number 163 in Lecture Notes in Control and Information Sciences. Springer-Verlag, Heidelberg, July 1991.

[]
E.S. Quintana-Ortí. Algoritmos Paralelos Para Resolver Ecuaciones Matriciales de Riccati en Problemas de Control. PhD thesis, Universidad Politécnica de Valencia, 1996.

[]
I.G. Rosen and C. Wang. A multi-level technique for the approximate solution of operator Lyapunov and algebraic Riccati equations. SIAM J. Numer. Anal., 32(2):514-541, 1995.

[]
M.G. Safonov and R.Y. Chiang. Model reduction for robust control: A Schur relative error method. Int. J. Adapt. Cont. and Sign. Proc., 2:259-272, 1988.

[]
G. Schelfhout. Model Reduction for Control Design. PhD thesis, Dept. Electrical Engineering, KU Leuven, 3001 Leuven-Heverlee, Belgium, 1996.

[]
V. Sima. Algorithms for Linear-Quadratic Optimization, volume 200 of Pure and Applied Mathematics. Marcel Dekker, Inc., New York, NY, 1996.

[]
R.A. Smith. Matrix equation XA + BX = C. SIAM J. Appl. Math., 16(1):198-201, 1968.

[]
X. Sun and E.S. Quintana-Ortí. Spectral division methods for block generalized Schur decompositions. PRISM Working Note #32, 1996. Available from http://www-c.mcs.anl.gov/Projects/PRISM.

[]
D.S. Watkins and L. Elsner. Chasing algorithms for the eigenvalue problem. SIAM J. Matrix Anal. Appl., 12:374-384, 1991.

[]
K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control. Prentice-Hall, Upper Saddle River, NJ, 1995.


Footnotes:

1 Correspondence to: Enrique S. Quintana-Ortí, Depto. de Informática, Universidad Jaume I, 12080-Castellón, Spain.
Contract/grant sponsor: Consellería de Cultura, Educación y Ciencia de la Generalidad Valenciana GV99-59-1-14 and the Fundació Caixa-Castelló Bancaixa.

