
Triangular Matrix Inversion

a survey of sequential approaches


Nicholas Knight
Submitted 9 December 2009
Abstract
In this paper, I explore sequential approaches to triangular matrix inversion (TMI). TMI is commonly performed when calculating the explicit inverse of a (dense) matrix from its LU factorization (cf. LAPACK xGETRI). Direct algorithms perform $\frac{1}{3}n^3 + O(n^2)$ flops, where the lower-order terms depend on the specific implementation. Because of the structural relationship between a triangular matrix and its inverse, the TMI algorithm design space is particularly diverse. I will explore unblocked algorithms, implemented using the second-level Basic Linear Algebra Subroutines (BLAS-2), and blocked and recursive algorithms, using the BLAS-3. Although these algorithms ultimately perform equal numbers of floating point operations (flops), at least in the Big-Oh sense, their relative performances (flops per second) still vary, due to the different ways the algorithms' data pass through the underlying computer memory hierarchy. The main contribution of this paper is a careful exploration of the sequential TMI algorithm design space, with the goal of finding the best algorithm for a given architecture. I will draw conclusions based on empirical performance measurements and ultimately compare my performance with existing DTRTRI implementations.

Motivation
There are several approaches to calculating the inverse of a dense square matrix, namely, Gauss-Jordan
elimination, blockwise inversion (using the Schur complement), and from the LU factorization of the matrix.
Since good LU implementations are expected from mathematical software libraries (solving linear systems via
GEPP being a very popular operation), the third option is the one chosen by LAPACK xGETRI and MATLAB inv. The method of xGETRI is to invert $U$ explicitly, and then solve the triangular system $XL = U^{-1}$ for $X = A^{-1}$. Equivalently, one could invert $L$ explicitly, and then solve the system $UX = L^{-1}$ for $X$. Or, one could invert both $L$ and $U$ explicitly, and compute the product $X = U^{-1}L^{-1}$.¹ Each of these procedures involves at least one inversion of a triangular matrix - thus, we wish the TMI kernel to be optimized.

Algorithm design
In this section, I will outline the parameters of the sequential TMI algorithm design space. The final
algorithms I implemented may be seen in the Algorithms section.

Structure of a triangular matrix and its inverse


For obvious reasons, I only consider nonsingular (square) triangular matrices. Consider, for example, the upper triangular matrix and its inverse, as expressed in a 2-by-2 block form in Figure 1.
Note the flow of information upward from the diagonal in the inverse: the diagonal blocks (which must be square, but may have different sizes) may be inverted independently, while the superdiagonal rectangular block of the inverse requires the information of the diagonal blocks' inverses. The transposed statement holds for lower triangular matrices, but for the sake of exposition, I will use upper triangular matrices as the primary example in this paper. This special structure² allows for a variety of techniques where the inverse may be calculated in place (that is, without extra memory required by the algorithm).

¹ Since the LU factorization uses partial pivoting, one must also take into account the permutation matrix P, which amounts to swapping rows or columns at different points in the algorithm.
² Really just a special case of the blockwise form of the matrix inverse.

Figure 1: An upper-triangular matrix $U$ and its inverse $U^{-1}$, expressed in block form. Note that the diagonal blocks are square (although possibly of different sizes), and the superdiagonal block is rectangular. If $U$ is nonsingular, then the diagonal blocks are also nonsingular.
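For reference, the block relationship that Figure 1 depicts is the standard identity

$$U = \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}, \qquad U^{-1} = \begin{pmatrix} U_{11}^{-1} & -U_{11}^{-1} U_{12} U_{22}^{-1} \\ 0 & U_{22}^{-1} \end{pmatrix},$$

where the diagonal blocks are square and nonsingular whenever $U$ is.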
Note that my notation is ambiguous: $U_{11}^{-1}$ and $U_{22}^{-1}$ are both the inverses of the diagonal blocks $U_{11}$ and $U_{22}$ of $U$, as well as the diagonal blocks of the inverse $U^{-1}$, that is, $(U^{-1})_{11}$ and $(U^{-1})_{22}$; while $U_{12}^{-1}$ is the superdiagonal block of $U^{-1}$, that is, $(U^{-1})_{12}$, but it is in general not the inverse of $U_{12}$, which may not even exist ($U_{12}$ may be rectangular and/or singular). For the rest of the discussion I will use the notation $U_{ij}^{-1}$ to mean $(U^{-1})_{ij}$.
In order to design the best TMI algorithm, I should define the parameters over which I will tune, and
those which I will ignore and/or hold constant.

Unblocked, blocked, and recursive algorithms


Unblocked
I consider unblocked algorithms as those which perform BLAS-2 operations such as multiplying a vector by a triangular matrix (xTRMV), solving a triangular system for a vector (xTRSV), and performing a rank-one update (xGER). Because these BLAS-2 operations only perform $O(n^2)$ work on $O(n^2)$ data, I do not expect good reuse of data. It is this reuse of data which reduces the (expensive) movement of data between levels of computer memory and allows more useful work (i.e., computation) to be performed per unit time.
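To make the flavor of these routines concrete, the following is a minimal sketch (not the benchmarked code itself) of an unblocked j-variant of type II (cf. Figure 14) for an upper-triangular, column-major matrix, built on cblas_dtrmv and cblas_dscal; this is essentially the update pattern of LAPACK's DTRTI2, but the function name and in-place calling convention here are my own.

    #include <cblas.h>

    /* Sketch: invert an n-by-n upper-triangular matrix A in place.
     * A is stored column-major with leading dimension lda.           */
    void utri_inv_unblocked_j2(int n, double *A, int lda)
    {
        for (int j = 0; j < n; ++j) {
            A[j + j*lda] = 1.0 / A[j + j*lda];     /* d <- 1/d */
            double ajj = -A[j + j*lda];
            /* y <- C*y: multiply column j (rows 0..j-1) by the already-
             * inverted leading j-by-j block, then y <- -d*y.           */
            cblas_dtrmv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit,
                        j, A, lda, &A[j*lda], 1);
            cblas_dscal(j, ajj, &A[j*lda], 1);
        }
    }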
Blocked
Blocked algorithms perform BLAS-3 operations such as multiplying a matrix by a triangular matrix (xTRMM), solving a triangular system for multiple solution vectors (xTRSM), and performing a rank-b update (xGEMM). Since these operations perform $O(n^3)$ work on $O(n^2)$ data, I expect better data reuse and performance, at least asymptotically. They are called blocked because they operate on multiple rows and columns of a matrix at each iteration.
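As an illustration of how such a routine can be assembled from xTRMM and xTRSM, here is a minimal sketch of a blocked j-variant of type IIb (cf. Figure 22), with the updates ordered as in the reference LAPACK DTRTRI; it reuses the unblocked sketch above as its base case, and the function name and fixed block size nb are my own.

    #include <cblas.h>

    /* Sketch: blocked, in-place inversion of an upper-triangular,
     * column-major matrix A, with block size nb.                    */
    void utri_inv_blocked_j2b(int n, double *A, int lda, int nb)
    {
        for (int j = 0; j < n; j += nb) {
            int jb = (j + nb <= n) ? nb : n - j;
            /* Y <- C*Y: apply the already-inverted leading j-by-j block. */
            cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans,
                        CblasNonUnit, j, jb, 1.0, A, lda, &A[j*lda], lda);
            /* Y <- -Y\D: solve against the not-yet-inverted diagonal block. */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                        CblasNonUnit, j, jb, -1.0, &A[j + j*lda], lda,
                        &A[j*lda], lda);
            /* D <- D^{-1}: invert the diagonal block with the unblocked base case. */
            utri_inv_unblocked_j2(jb, &A[j + j*lda], lda);
        }
    }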

I parametrize these blocked routines by block_size as well as the algorithm choice for the base case.
Here, the base case is the explicit inversion of the square diagonal block (see the Algorithms section for
examples). I limit this parameter space with the assumption that if blocked algorithms work better on
matrices above a certain size, then when the diagonal blocks (and the block_size) reach this size, they
too will be better served by a blocked routine. This observation in fact suggests the last approach: recursive
algorithms.
Recursive
Recursive algorithms are a variant of blocked algorithms where the block_size varies with the size of the
subproblem. The recursive, or divide-and-conquer, approach seems the most promising from the outset.
Recall that blocked routines should perform better than unblocked routines, based on the assumption about
data reuse (BLAS-2 vs. BLAS-3). Also note that if the subproblem fits in fast memory, then the blocked
and unblocked routines should have similar performance, since no data need be moved, and the amount
of computational work is equivalent for both types of routines. Thus, I expect my blocked routines to be
sensitive to fast memory size, primarily with respect to the block_size parameter (cache-aware). However,
my recursive algorithms should be cache-oblivious - that is, if I make the base_case parameter sufficiently
small, the recursion will always proceed until the subproblems fit in fast memory.
Besides the base_case parameter, I consider a split parameter. In general, a divide and conquer
algorithm splits execution into multiple subproblems at each step of execution. Typically, the number of
resulting subproblems is a power of two - I do not parametrize this, but simply assume I will always divide
into just two subproblems. However, I allow that these two subproblems need not be weighted evenly: that
is, via a split parameter, the dimensions of the matrix subproblems may be varied and the recursion may
evolve an unsymmetrical shape. I will motivate this choice further in the experimental description below.
The last parameter, as with blocked routines, is the choice of base-case algorithm. I will make the same
choice for this as I do for the blocked routines.
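A minimal sketch of the recursive idea, in the type I form (cf. Figure 29) with an even split (split = 2), is given below; it again reuses the unblocked sketch as its base case, and the names and the particular split are my own choices.

    #include <cblas.h>

    /* Sketch: recursive, in-place inversion of an upper-triangular,
     * column-major matrix A (type I: two xTRMMs, no xTRSM).          */
    void utri_inv_recursive_1(int n, double *A, int lda, int base_case)
    {
        if (n <= base_case) { utri_inv_unblocked_j2(n, A, lda); return; }
        int m = n / 2;                      /* split = 2: (nearly) even halves */
        double *W = A;                      /* m-by-m northwest block          */
        double *S = &A[m + m*lda];          /* (n-m)-by-(n-m) southeast block  */
        double *X = &A[m*lda];              /* m-by-(n-m) superdiagonal block  */

        utri_inv_recursive_1(n - m, S, lda, base_case);           /* recurse in S */
        cblas_dtrmm(CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                    CblasNonUnit, m, n - m, 1.0, S, lda, X, lda); /* X <- X*S    */
        utri_inv_recursive_1(m, W, lda, base_case);               /* recurse in W */
        cblas_dtrmm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans,
                    CblasNonUnit, m, n - m, -1.0, W, lda, X, lda);/* X <- -W*X   */
    }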

i-, j-, and k- variants


I also distinguish between three classes of algorithms: i-variant TMI algorithms, which form the triangular
inverse one row (of the matrix; may be a block row) at a time; j-variant algorithms, which form the inverse
one (block) column at a time; and k-variant algorithms, which traverse the diagonal of the matrix and update the superdiagonal block ($U_{12}^{-1}$ in Figure 1) at each step. The k-variant algorithms are also called
outer-product variants, since they update the superdiagonal block with the outer-product of the row and
column that border the block. Although any i-variant algorithm on a triangular matrix is equivalent to a
j-variant algorithm on the transpose of that matrix, I will consider the two cases separately since I anticipate
differences will arise depending on how the matrix is stored in a data structure (discussed further below). All three variants take n steps to complete, where n is the dimension of the matrix, measured in matrix entries
or blocks, depending on algorithm type. At a given step, the i- and j-variants read more (input) data
than data they output, whereas the k-variants have equal amounts of input and output data - in fact, they
overwrite every input with an output. This could suggest better data reuse for k-variants, except for the
fact that the outputs of the i- and j-variants are independent at each step, whereas the k-variants' outputs overlap. It is unclear which side of this trade-off has an advantage, if either.
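To illustrate the outer-product form, here is a minimal sketch of an unblocked k-variant (cf. Figure 15) for an upper-triangular, column-major matrix, built on cblas_dscal and cblas_dger; the function name and the sign placement are my own reconstruction.

    #include <cblas.h>

    /* Sketch: in-place k-variant (outer-product) inversion of an n-by-n
     * upper-triangular, column-major matrix A.                          */
    void utri_inv_unblocked_k1(int n, double *A, int lda)
    {
        for (int k = 0; k < n; ++k) {
            double d = 1.0 / A[k + k*lda];
            A[k + k*lda] = d;
            /* y <- -d*y: scale the column above the pivot.                */
            cblas_dscal(k, -d, &A[k*lda], 1);
            /* Z <- Z + y*x: rank-one update of the block above and right. */
            cblas_dger(CblasColMajor, k, n - k - 1, 1.0,
                       &A[k*lda], 1,            /* y: rows 0..k-1 of column k   */
                       &A[k + (k+1)*lda], lda,  /* x: columns k+1..n-1 of row k */
                       &A[(k+1)*lda], lda);     /* Z: rows 0..k-1, cols k+1..n-1 */
            /* x <- d*x: scale the row to the right of the pivot.          */
            cblas_dscal(n - k - 1, d, &A[k + (k+1)*lda], lda);
        }
    }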

Looking-ness and direction of progress


I have claimed from a 2-by-2 block example that information must pass from the diagonal upwards (or
northeast), in the case of an upper-triangular matrix. Thus, for any TMI algorithm on an upper-triangular
matrix, I expect that the portion of the inverse (besides the diagonal) being calculated must be receiving
information from the left and from below. Roughly speaking, I know all TMI algorithms on an upper-triangular matrix will be southwest-looking. However, I still have freedom in which way the algorithm
proceeds: in the upper-triangular example in Figure 1, note that I could either

1. explicitly invert $U_{11}$ and $U_{22}$, and then form the product $X = -U_{11}^{-1} U_{12} U_{22}^{-1}$, giving $X = U_{12}^{-1}$, or

2. explicitly invert $U_{11}$, form the product $Y = -U_{11}^{-1} U_{12}$, solve the triangular system $X U_{22} = Y$, giving $X = U_{12}^{-1}$, and then explicitly invert $U_{22}$, or

3. explicitly invert $U_{22}$, form the product $Y = -U_{12} U_{22}^{-1}$, solve the triangular system $U_{11} X = Y$, giving $X = U_{12}^{-1}$, and then explicitly invert $U_{11}$, or

4. solve the triangular system $U_{11} Y = -U_{12}$, solve the triangular system $X U_{22} = Y$, giving $X = U_{12}^{-1}$, and then explicitly invert $U_{11}$ and $U_{22}$.

In real arithmetic, these operations are equivalent and will all return the same matrix (inverse), assuming it exists. However, in floating-point arithmetic, numerical roundoff error will accumulate in different ways. I will not carry out an error analysis in this paper, other than to summarize previous results in the Related Work section.
It is desirable to reorganize linear algebra algorithms to use matrix multiply (matmul) whenever possible,
practically because matmul is usually highly optimized in modern mathematical software packages. Thus I
expect that the first choice above would give the best performance, all other parameters held constant.
The most visible difference in the four choices is the direction the algorithm moves. Choice 1 moves northeast, choice 2 moves southeast, choice 3 moves northwest, and choice 4 moves southwest. Choices 1 and 4, which move perpendicularly with respect to the diagonal, can thus be implemented to move either northwest or southeast, although I only considered the cases where the size of the explicit inversion is minimized. This difference will become more obvious in the algorithms below. However, I suspect the difference in direction of progress is insignificant compared to the fact that the different choices also perform different operations.

BLAS library choice


At the innermost levels of my TMI algorithms, there is some combination of matrix operations (BLAS routines). This is an external design choice I have made - ideally, in an effort to optimize my TMI algorithms I would optimize these BLAS routines as well. However, the lower-level my implementation efforts become, the more my design space must take the specific (computer hardware) architectural design into account. Rather than reimplement and tune the BLAS, I have chosen to use third-party BLAS packages in my algorithms.
Many such packages exist: I have chosen three. First, the reference BLAS (netlib) - which is unoptimized
and serves as a specification for others to implement and test their (optimized) BLAS packages against. I do
not expect good performance from the reference BLAS. Secondly, I will use the ATLAS BLAS. The ATLAS
BLAS is an example of autotuning - the ATLAS BLAS package (as downloaded) consists of multiple versions
(kernels) of the BLAS routines - while installing ATLAS, the installer performs a comprehensive parameter
search and selects and customizes from its database of possible routines a set that appeared to perform the
best on the given system. I note briefly that ATLAS also provides an autotuned version of xTRTRI, which
I will revisit later. Thirdly, I will use Intel's Math Kernel Library (MKL). This package is closed-source
and is advertised as optimized for Intel hardware, such as my test platform. MKL is presumably largely
hand-tuned. I suspect MKL will give the best performance, if only based on the relative financial investment,
although I am interested to see how close ATLAS approaches.
Both ATLAS and MKL provide FORTRAN-77 interfaces to the BLAS (F77BLAS) as well as C interfaces
(CBLAS). The reference BLAS is FORTRAN-only (I did not consider the reference CBLAS). I do not
suspect a significant difference in the choice of language interface - the main difference I am concerned with is the data storage. The FORTRAN language stores matrices in column-major order: that is, if there are m rows in matrix A, the element $a_{ij}$ would be found at position $i + j \cdot m$ in the one-dimensional array (0-based indices). The C language, on the other hand, stores matrices in row-major order: that is, if there are n columns in matrix A, the element $a_{ij}$ would be found at position $i \cdot n + j$ in the one-dimensional array. Although effectively one order is just the transpose of the other, I anticipate that the choice of order could become significant when an algorithm's matrix access pattern is row-based, like i-variants, or column-based, like j-variants - that is, if the order and the access pattern align (are parallel), I anticipate more sequential reads and writes, while if they are perpendicular, I anticipate fewer.
The ATLAS and MKL CBLAS interfaces provide both a column-major and a row-major interface - I will
try both (along with the F77BLAS column-major interface) to compare performance.
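A minimal sketch of the kind of storage-order mapping this implies (the macro names here are mine, not those of the benchmarked code):

    /* Map the (i,j) entry (0-based) of an m-by-n matrix into a flat array. */
    #define IDX_COLMAJOR(i, j, m)  ((i) + (j) * (m))   /* FORTRAN-style order */
    #define IDX_ROWMAJOR(i, j, n)  ((i) * (n) + (j))   /* C-style order       */

    /* Example: entry a_ij of a column-major matrix A with leading dimension
     * lda is A[IDX_COLMAJOR(i, j, lda)].                                    */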

Other parameters
I am only exploring the space of double-precision real (D) routines, whereas the BLAS and LAPACK provide
four versions of their routines: single- and double-precision, real and complex. I suspect that my findings
will still be valid for the other three types, but I leave this for future work.

Experimental measurements
The goal of my project is effectively to optimize DTRTRI, the standard LAPACK TMI routine. The reference
LAPACK DTRTRI is a j-variant blocked routine (type IIb, see Figure 22), with a base case DTRTI2, a
j-variant unblocked (i.e., BLAS-2) routine equivalent to my type II unblocked routine (Figure 14). The
ATLAS DTRTRI (one of the few LAPACK routines that ATLAS will optimize in addition to the BLAS) is
a recursive implementation, equivalent to my type III (Figure 31), with special closed-form base cases for
dimensions 1, 2, 3, and 4. I do not have any information on the implementation of MKL DTRTRI, although
I suspect it is recursive as well (NB: it appears otherwise, after experimentation). I will ultimately judge
the success of my design by comparing my final algorithm with these other three (all linked with the same
BLAS).

Experimental setup
I will briefly describe my experimental setup. I do this only for completeness, and I will not use these details
to draw any conclusions. I performed my experiments on an Intel Core 2 Duo P8800 processor, with 32KB
L1 cache, 3072KB L2 cache, and 4 GB DDR3 DRAM. Although there are two processor cores (each 2660
MHz), I am only running sequential (unthreaded) codes so I only expect to use one processor at a time and
do not expect any benefit from SMT. I am running the Linux operating system (Ubuntu 9.10), with a 64-bit
kernel. My routines are uniformly written in the C language, compiled with the GNU C compiler (gcc),
using the -O3 optimization flag (the highest level). I built the reference BLAS with the GNU FORTRAN compiler (gfortran), again with the -O3 flag. I configured ATLAS to optimize to the 64-bit architecture, and to use -O3 on all gcc and gfortran builds. I turned off CPU throttling and frequency scaling as well, forcing the processor to always run at its maximum frequency (of 2660 MHz). I calculated timings using the clock_gettime() glibc function, which measures CPU time with nanosecond precision, on a per-process granularity. I averaged every datapoint over at least four consecutive runs (or a threshold of 0.5 seconds, whichever takes longer) to reduce variation (noise) in benchmark timings. I coded my C routines in a uniform style with minimal argument checking and the use of preprocessor macros to swap row- and column-major ordering as well as to wrap the different BLAS interfaces (in an effort to reduce function-call overhead). When measuring performance, I ran my TMI routines on strictly diagonally dominant matrices (more precisely, the upper- and lower-triangular portions of these matrices) with off-diagonal entries drawn as random doubles between -1.0 and 1.0 (via the drand48() library function). My codes are available online
- see reference [10].
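The following sketch illustrates the measurement recipe described above (test-matrix generation with drand48() and CPU timing with clock_gettime()); the helper names and the diagonal value are mine, chosen only to make the matrix strictly diagonally dominant.

    #include <stdlib.h>
    #include <time.h>

    /* Fill the upper triangle of a column-major test matrix. */
    static void fill_upper_test_matrix(int n, double *A, int lda)
    {
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < j; ++i)
                A[i + j*lda] = 2.0 * drand48() - 1.0;  /* off-diagonal in (-1, 1)    */
            A[j + j*lda] = (double)n;                  /* strictly dominant diagonal */
        }
    }

    static double seconds_between(const struct timespec *t0, const struct timespec *t1)
    {
        return (double)(t1->tv_sec - t0->tv_sec) + 1e-9 * (t1->tv_nsec - t0->tv_nsec);
    }

    /* Usage sketch:
     *   struct timespec t0, t1;
     *   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
     *   utri_inv_recursive_1(n, A, lda, 32);
     *   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
     *   double gflops = ((double)n*n*n / 3.0) / seconds_between(&t0, &t1) / 1e9;   */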

Unblocked routines
I implemented and benchmarked six unblocked routines on both upper- and lower-triangular matrices, using:
reference F77BLAS (column-major storage), ATLAS F77BLAS (column-major storage), ATLAS CBLAS
(both column-major and row-major storage), MKL F77BLAS (column-major storage), and MKL CBLAS
(both column-major and row-major storage). I used 150 matrices of sizes

N ∈ {31, 32, 33, 61, 62, 63, . . . , 1599, 1600, 1601}

chosen to highlight any architectural or BLAS-related side effects. The reference data appears in Figure 2,
the ATLAS data appears in Figure 3, and the MKL data appears in Figure 4. There are twelve curves in
each graph: four red curves (i-variants, type I and II, upper- and lower-triangular matrices), four orange
curves (j-variants, type I and II, upper- and lower-triangular matrices), and four blue curves (k-variants,
type I and II, upper- and lower-triangular matrices). See the Algorithms section for a description of the routines (Figures 11 to 16), and the Results section for further explanation of the graphs.

Figure 2: Performance of unblocked routines using the reference BLAS. i-variants are red, j-variants are
orange, k-variants are blue.

Figure 3: Performance of unblocked routines using the ATLAS BLAS. i-variants are red, j-variants are
orange, k-variants are blue.

Figure 4: Performance of unblocked routines using the MKL BLAS. i-variants are red, j-variants are orange,
k-variants are blue.

Blocked routines
I implemented and benchmarked twelve blocked routines on both upper- and lower-triangular matrices, using
MKL F77BLAS (column-major storage), using the j-variant, type II (Figure 14, again, with MKL F77BLAS,
column-major storage) unblocked routine to invert the diagonal blocks. I chose this as the base case routine
because it was one of the best performing unblocked routines - see the discussion for further explanation of
this choice. I used block_sizes
block_size ∈ {2, 4, 8, 16, 32, 64, 128, 256}
and 150 matrices of size
N ∈ {31, 32, 33, 61, 62, 63, . . . , 1599, 1600, 1601}
For the cases where block_size > N , results were discarded since the blocked routine inverted the entire
matrix simply using the unblocked routine. The i-variant data appears in Figure 5, the j-variant data in
Figure 6, and the k-variant data in Figure 7. See the Algorithms section for discussion of the routines
(Figures 17 to 28), and the Results section for further explanation of the graphs.

Figure 5: Performance of the three i-variant blocked routines, using MKL F77BLAS. The colors represent
powers-of-two block_sizes: red=2, orange=4, yellow=8, green=16, teal=32, dark blue=64, purple=128,
gray=256.


Figure 6: Performance of the three j-variant blocked routines, using MKL F77BLAS. The colors represent
powers-of-two block_sizes: red=2, orange=4, yellow=8, green=16, teal=32, dark blue=64, purple=128,
gray=256.


Figure 7: Performance of the six k-variant blocked routines, using MKL F77BLAS. The colors represent
powers-of-two block_sizes: red=2, orange=4, yellow=8, green=16, teal=32, dark blue=64, purple=128,
gray=256.


Figure 8: Performance of the four recursive routines, using MKL F77BLAS. The colors represent powers-of-two base_case sizes: red=2, orange=4, yellow=8, green=16, teal=32, dark blue=64, purple=128, gray=256.

Recursive routines
I implemented and benchmarked four recursive routines on both upper- and lower-triangular matrices, with a
base case using the same unblocked routine as in the blocked routines above (Figure 14). I used base_cases
base_case ∈ {2, 4, 8, 16, 32, 64, 128, 256}
and 150 matrices of size
N ∈ {31, 32, 33, 61, 62, 63, . . . , 1599, 1600, 1601}
For the cases where base_case > N , results were discarded since the recursive routine performed the base
case immediately, inverting the entire matrix simply using the unblocked routine. Performance data are
presented in Figure 8. See the Algorithms section for a discussion of the routines (Figures 29 to 32).
I also tested the split parameter, for values split ∈ {2, 4, 8}, for upper-triangular matrices. I performed the splits so that the recursion was left-heavy, that is, the left subproblem was larger than the right subproblem. The motivation for this choice is explained in the Results.


Figure 9: Performance of the four recursive routines, using MKL F77BLAS. Each graph shows the performance of the specified routine, with the split values 2/equal splitting (red), 4 (orange), and 8 (blue).


DTRTRI
I compared four dierent versions of LAPACK DTRTRI: the reference LAPACK version, the ATLAS version,
the MKL version, and my recursive versions (base_case = 32, split = 2, i.e., equal splitting), on the
matrix sizes

N ∈ {31, 32, 33, 61, 62, 63, . . . , 1599, 1600, 1601}

and

N ∈ {127, 128, 129, 255, 256, 257, . . . , 6399, 6400, 6401}

I only tested upper-triangular matrices. Performance data are presented in Figure 10 (note that I only show
one of my recursive routines in the graph over the range of smaller matrices). All routines were built using
the MKL F77BLAS with column-major storage, in order to get a fair comparison.


Figure 10: Performance of DTRTRI: reference LAPACK version (blue), ATLAS version (orange), MKL version
(green), and my recursive algorithms (red) with base_case = 32 and split = 2. All were compiled with
the same MKL F77BLAS, column-major storage, to make the comparison as fair as possible. The top chart
is over a range of smaller matrices while the bottom plot is over a much wider range to show the splitting
between the recursive and blocked formulations. Also note that the bottom plot shows all four of my recursive
routines (in red) while the top plot only shows my type I recursive routine.


Discussion
Unblocked routines
In the reference and ATLAS runs (Figures 2 and 3), I found that the k-variants outperformed the other
routines for most of the matrix sizes. However, at about N = 800, the k-variant performance began to
decline precipitously, falling below the other performance curves at about N = 1200.
Note that the curves have spikes of increased or decreased performance at multiples of 32 (the granularity
of my matrix test sizes), as well as much larger dips. I chose 32 since it was a power-of-two, and computer
memory is often designed around multiples of powers of two. The performance drops (downward spikes)
probably highlight an increased frequency of (cache) conflict misses. The performance increases (upward
spikes) are harder to interpret: the huge (up to 1000 Mflops/s, reaching up to 4500 Mflops/s for some
sizes) upward spikes in the MKL data (Figure 4) are not likely due to the computer memory, but probably
are indicative that MKL's BLAS routines are optimized for multiples of 32 - the performance drops on each side (that is, at multiples of 32 ± 1) would then indicate the overhead of the fringe cases. I do not have an interpretation for the upward spikes in the reference BLAS data, especially on the k-variant curves. The much larger dips at multiples of 512 suggest TLB (translation lookaside buffer) misses, since 512 × 8-byte doubles = 4 KB, the same size as a virtual memory page. Ideally, smaller drops would also occur at multiples of
256, with half the amplitude - this is visible in the data, especially in the case of the ATLAS k-variants,
whose performance is absolutely obliterated at these sizes. My analysis here also applies to the blocked and
recursive sections, below.
The MKL data did not demonstrate that the k-variants performed best - in fact, the best performance
in the MKL case was from the i- or j-variant parallel to the storage - in the case of row-major storage, the
i-variants performed best, and in the case of column-major storage, the j-variants performed best - across
all test sizes.
All other variables held constant, the type II versions outperformed the type I versions. Upper-triangular
matrices generally outperformed lower-triangular matrices for column-major storage, while lower-triangular
matrices generally outperformed upper-triangular matrices for row-major storage, albeit sometimes the
curves were too close to tell, and sometimes the opposite occurred for smaller matrices (N ≤ 512). Within
ATLAS and MKL runs, the CBLAS and F77BLAS performances were almost identical (but swapping the
i- and j-variants, depending on storage).
I concluded that MKL was the best choice for a BLAS library, and I arbitrarily selected the F77BLAS
(column-major storage) as the best performing. Given the correlation between good i-variant performance
and row-major storage (and good j-variant performance and column-major storage), I selected the best
unblocked routine as the j-variant, type II (Figure 14), using MKL F77BLAS (column-major storage). The
fact that type II routines outperformed type I routines likely indicated that DTRMVs performed better than
DTRSVs.

Blocked routines
Since I chose an MKL F77BLAS (column-major storage) base case routine for the inversion of the diagonal
blocks, I tested all these routines using MKL F77BLAS (column-major storage) throughout. This choice was
necessary to limit my parameter space from exploding - although it doesn't address the possibility that some
combination of ATLAS and MKL, CBLAS and F77BLAS routines may indeed be optimal. The qualitative
behavior of all runs was quite similar (see Figures 5, 6, and 7). However, I observed that the i-variants (at
least, the best ones) performed worse than the (best) j- and k-variants. This reinforces that the storage
should be parallel (or at least, not perpendicular) to the access pattern, as discussed in the unblocked case.
All of the runs demonstrated decreased performance at even multiples of 256, and smaller drops at odd
multiples of 256 - as mentioned above, this is highly suggestive of TLB misses. These become less dramatic
as N increases, and are least obvious in the type IIa i- and j- variants, and in the k-variants.
The block_size = 32 version performed the best up to about N = 256, when it was passed by
the block_size = 64 version, which performed best up to about N = 768, when it was passed by the
block_size = 128 version, which appeared to be performing best up to the last matrices in the run, although the block_size = 256 version appears to be catching up. The trend seems to be that larger


block_sizes performed better asymptotically (as N increased), although at smaller N the pattern was less
clear.
Interestingly, the k-variants outperformed the j-variants. However, all curves seemed to approach the
same horizontal asymptote at about 8800 Mflops/s. For k-variant routines, lower-triangular matrices seemed
to perform slightly better, while upper-triangular matrices performed better for the i- and j-variants. These
gaps suggest that the best choice of algorithm (in this case, j- versus k-variant) depends not only on block
size but also on whether the matrix is upper- or lower-triangular - however, this effect is subtle and warrants
further examination. Overall, the best performing algorithms were the type I k-variants, which performed
two DTRMMs and no DTRSMs. A similar comparison held between the type IIa and IIb j-variants. Similar to the
insight from the unblocked routines, it seemed that DTRMM-based routines performed better than DTRSM-based
routines, but now both were outperformed by the k-variants, which perform DGEMMs.
The best routine from this experiment was the type I k-variant (Figures 23 and 24). However, this
designation also depends on choice of the block_size parameter, which I conclude should generally be
proportional to matrix size, as well as whether the matrix is upper- or lower-triangular.

Recursive routines
The observation that block_size should be chosen proportionally to matrix size suggests that a recursive
(divide-and-conquer) approach may be best. Indeed, the recursive data (Figure 8) behave qualitatively almost
identically to the best curves from the blocked data. A choice of a recursive base_case = 32 appears to
be optimal over the whole range of matrix dimensions. As N increases, however, the curves overlap and it
is difficult to distinguish which, if any, are actually the best choice. This perhaps indicates that the work of doing the initial DTRMMs and DTRSMs on the larger subproblems near the top of the recursion tree outweighs the difference in base-case size. This makes sense, considering the cubic cost of all these operations (more precisely, cubic in the matrix dimension N).
Between the four varieties of recursive algorithms, the top performer was the type I routine (Figure 29),
which performed two DTRMMs and no DTRSMs, which is no surprise given the trend I noted in the last two
sections. The splitting between upper- and lower-triangular matrix inversion performance appears to be
independent of block size for the recursive routines. This splitting is the greatest for the type I and the type
IIb versions, and the smallest for the type III versions.
The split parameter data (Figure 9) shows that uneven splitting leads to poor performance. These tests
were only performed with upper-triangular matrices and left-heavy recursion. This choice corresponds to a
column-heavy splitting in the upper-triangular case - that is, the superdiagonal rectangular blocks become
tall and skinny. I had hoped this would permit better sequential access given my choice of column-major
storage. More subtly, this splitting makes the west operations larger. That is, assuming DTRMM is always a better choice (performance-wise) than DTRSM, the type IIb routines (which perform a DTRMM to the west) would show an improvement, while the type IIa routines (which perform a DTRSM to the west) would show a decline
in performance. As it turned out, the performance deteriorated almost uniformly in the four cases, although
slightly less so in the case of the type I routine. This result suggests that it is best to maximize the size of
the problem in the first step of the recursion (i.e., the biggest two DTRMMs and/or DTRSMs) - and that this factor is much more important than the difference between DTRMM and DTRSM.

DTRTRI performance
As shown in Figure 10, the best performer for larger matrices appeared to be the ATLAS routine, which uses
a recursive, type III approach (like Figure 32) with a base_case size of 4, and special closed form base
cases for N = 1, 2, 3, 4 matrices, probably to reduce the overhead of the unblocked function calls. For larger
matrices, my recursive routines are a very close second place; for matrices smaller than N = 512 they actually beat the ATLAS performance by a significant amount (18% on average). This is likely due to the overhead
of the extra function calls from the smaller base_case - but for larger matrices, this cost is outweighed
by the work done on larger blocks. The MKL curve is almost exactly identical to the reference LAPACK
curve, suggesting that MKL uses a similar (or identical) blocked approach. LAPACK DTRTRI uses a blocked
j-variant type II routine (like Figure 22) that determines the block_size via another subroutine (ILAENV),
and uses an unblocked j-variant type II called DTRTI2 (like Figure 14) as the base case routine.

From these data, I conclude that a recursive formulation is the best choice for all matrix sizes in my
experimental range (N ∈ [31, 6401]). The fact that I beat ATLAS for smaller matrices perhaps suggests that my larger base_case size (base_case = 32) in the range N ∈ [31, 480] is advantageous (my base_case = 4 recursive routines performed almost identically to the ATLAS DTRTRI in this range).
Note the widening gap between my routines and ATLAS on the one hand, and MKL and reference LAPACK on the other. The recursive performances increase with matrix dimension much more strongly than those of the blocked routines. The theoretical peak of my machine is about 10.6 Gflops/s - this is determined by the clock speed (2.66 GHz) times the SIMD speedup (128-bit SIMD registers and 64-bit double precision, a 2x speedup) times the speedup of the independent multiplication and addition pipelines (another 2x speedup, assuming the computation has an equal number of multiplies and adds). Other testing has revealed an MKL DGEMM performance upwards of 9.7 Gflops/s, and the largest DTRTRI performance was close to 9 Gflops/s, and increasing.
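Written out, this peak estimate is

$$2.66\ \text{GHz} \times 2\ (\text{two doubles per 128-bit SIMD register}) \times 2\ (\text{independent multiply and add pipelines}) \approx 10.6\ \text{Gflops/s}.$$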

Related and future work


I did not discuss the numerical stability of my proposed algorithms. [3] performs an error analysis for the unblocked and blocked routines (although they only show their work for the j-variants), and shows that all of these routines except for the type IIb blocked routines satisfy componentwise bounds on the forward error. [4] demonstrates that a type I recursive formulation is stable, in a logarithmic sense. Stability properties for the other three recursive formulations, as well as explicit proofs for the unblocked i- and k-variants, should be developed.
This paper focused only on sequential methods. The special structure seen in Figure 1 indicated that the
diagonal inverses may be performed independently, and the divide-and-conquer algorithms took advantage
of this fact to split into subproblems. These divide-and-conquer algorithms are parallelizable - that is,
separate processing units (processor cores, nodes, etc.) may work independently on each subproblem, only
communicating at the beginning and end of each step. Two of the references [1, 2] discuss parallelization of
TMI.

Acknowledgments
I would like to thank Jim Demmel for his Numerical Linear Algebra class (for which this paper was a project),
as well as Andrew Waterman and Scott Beamer for their insights into the architectural effects in some of
my performance data. I would also like to thank the members of the BeBOP group for their helpful inputs.
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching
funding by U.C. Discovery (Award #DIG07-10227).

Appendix: Algorithms
Unblocked
Unblocked routines are presented in Figures 11, 12, 13, 14, 15, and 16. I only present upper-triangular
versions, although the actual implementations handle both upper- and lower-triangular cases.


1. d ← 1/d
2. x ← -d · x
3. x ← x \ C

Figure 11: Unblocked i-variant: type I


1. d ← 1/d
2. x ← x · C
3. x ← -d · x

Figure 12: Unblocked i-variant: type II

1. d ← 1/d
2. y ← -d · y
3. y ← C \ y

Figure 13: Unblocked j-variant: type I


1. d ← 1/d
2. y ← C · y
3. y ← -d · y

Figure 14: Unblocked j-variant: type II

1. d ← 1/d
2. y ← -d · y
3. Z ← Z + y · x
4. x ← d · x

Figure 15: Unblocked k-variant: type I


1. d ← 1/d
2. x ← -d · x
3. Z ← Z + y · x
4. y ← d · y

Figure 16: Unblocked k-variant: type II


1. D ← D⁻¹
2. X ← -D · X
3. X ← X \ C

Figure 17: Blocked i-variant: type I

Blocked
Blocked routines are presented in Figures 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, and 28. The k-variants (the last six routines) are probably better considered as just three routines, since the movement
in the opposite direction really only amounts to a simple reordering of operations. Again, I only present
upper-triangular versions, although the actual implementations handle both upper- and lower-triangular
cases.


1. D ← D⁻¹
2. X ← -D · X
3. X ← X · C

Figure 18: Blocked i-variant: type IIa

1. X ← X · C
2. X ← -D \ X
3. D ← D⁻¹

Figure 19: Blocked i-variant: type IIb


1. D ← D⁻¹
2. Y ← -Y · D
3. Y ← C \ Y

Figure 20: Blocked j-variant: type I

1. D ← D⁻¹
2. Y ← -Y · D
3. Y ← C · Y

Figure 21: Blocked j-variant: type IIa


1. Y ← -Y \ D
2. Y ← C · Y
3. D ← D⁻¹

Figure 22: Blocked j-variant: type IIb

1. D ← D⁻¹
2. Y ← -Y · D
3. Z ← Z + Y · X
4. X ← D · X

Figure 23: Blocked k-variant: type I (down)


1. D ← D⁻¹
2. X ← -D · X
3. Z ← Z + Y · X
4. Y ← Y · D

Figure 24: Blocked k-variant: type I (up)

1. Y ← -Y \ D
2. D ← D⁻¹
3. Z ← Z + Y · X
4. X ← D · X

Figure 25: Blocked k-variant: type II (down)


1. X ← -D \ X
2. D ← D⁻¹
3. Z ← Z + Y · X
4. Y ← Y · D

Figure 26: Blocked k-variant: type II (up)

1. Y ← -Y \ D
2. Z ← Z + Y · X
3. X ← D \ X
4. D ← D⁻¹

Figure 27: Blocked k-variant: type III (down)


1. X ← -D \ X
2. Z ← Z + Y · X
3. Y ← Y \ D
4. D ← D⁻¹

Figure 28: Blocked k-variant: type III (up)


1. Recurse in S
2. X ← X · S
3. Recurse in W
4. X ← -W · X

Figure 29: Recursive, type I

Recursive
Recursive routines are presented in Figures 29, 30, 31, and 32. Again, I only present upper-triangular
versions, although the actual implementations handle both upper- and lower-triangular cases. The letters m
and n indicate the subproblem weighting which is parametrized by split.


1. Recurse in S
2. X ← X · S
3. X ← -W \ X
4. Recurse in W

Figure 30: Recursive, type IIa

1. X ← X \ S
2. Recurse in S
3. Recurse in W
4. X ← -W · X

Figure 31: Recursive, type IIb


1. X ← -X \ S
2. X ← W \ X
3. Recurse in S
4. Recurse in W

Figure 32: Recursive, type III


References

[1] Karlsson, L. Computing Explicit Matrix Inverses by Recursion. Master's Thesis, Umeå University, (2006).

[2] Nasri, W. and Mahjoub, Z. Optimal parallelization of a recursive algorithm for triangular matrix inversion on MIMD computers. Parallel Computing 27, 1767-1782, (2001).

[3] Du Croz, J. J. and Higham, N. J. Stability of Methods for Matrix Inversion. IMA Journal of Numerical Analysis 12, 1-19, (1992).

[4] Demmel, J., Dumitriu, I., and Holtz, O. Fast linear algebra is stable. Numerische Mathematik 108, 59-91, (2007).

[5] Prokop, H. Cache-Oblivious Algorithms. Master's Thesis, Massachusetts Institute of Technology, (1999).

[6] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. LAPACK Users' Guide, Third Edition. Society for Industrial and Applied Mathematics, (1999).

[7] Whaley, C. ATLAS documentation, accessed online at http://math-atlas.sourceforge.net/faq.html

[8] MKL 10.2 Reference Manual, accessed online at http://software.intel.com/sites/products/documentation/hpc/mkl/mklman.pdf

[9] Demmel, J. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, (1997).

[10] My TMI routines (C implementation) and testing routines: http://www.cs.berkeley.edu/~knight/tmi/
