This is part three in a series on basic cache optimization. Please start with part 1.
The second optimization improved the speed of the overall calculation. But, like the first optimization, it required transposing the second matrix. Transposition requires time and memory. However, dividing the rows into blocks was a good start.
In order to explain the third optimization, we need a cache more like one used in a real CPU. The cache now holds eighteen elements. The memory manager now pulls in nine elements at a time. The example matrices are the same. On paper:
and in RAM:
In the second optimization, each row was divided into two halves. Each half was used to calculate a partial dot product. As the calculation continued, the partial dot products were summed. Once all of the partial dot products were calculated and summed, C = A * B was complete.
Dividing the rows into halves divided, or partitioned, them into blocks. In this case, each block was one row high by three columns wide. While BT wasn’t explicitly partitioned by the code, the memory manager effectively partitioned BT by loading three elements of each row at a time.
A larger cache can hold more elements. The 18 element cache could be used to hold an entire row of A and BT. But that still requires transposing B to create BT. It would be better if B didn’t need to be transposed in order to use the cache effectively. The two matrices are once again A and B, both on paper:
and in RAM:
The second optimization divided each row into three element blocks. This was possible because a dot product can be calculated in parts. However, a row divided into blocks is also a matrix. A three element block is a 1 x 3 matrix.
This means A and B can be divided into sizes other than 1 x 3. The formula for multiplying a matrix that has been divided into blocks is the same as the normal matrix-matrix multiplication formula. The only difference is that the elements of A, B, and C are now matrices. Partitioning A and B into 3x3 submatrices:
turns them into a 3 x 2 matrix and a 2 x 2 matrix:
Since multiplying two matrices that are composed of submatrices follows the normal multiplication rules, A * B = C looks like:
Following the rules, C[0,0] = A[0,0] * B[0,0] + A[0,1] * B[1,0]. Matrix multiplication has already been defined, so calculating A[0,0] * B[0,0] and A[0,1] * B[1,0] is simple. Adding two matrices requires adding corresponding elements, which is also trivial.
One advantage of this is the submatrix size can be tailored to the cache size. Two 3 x 3 matrices will fit into an eighteen element cache. This makes multiplying two 3 x 3 matrices very fast, much like dividing rows into three element pieces made calculating dot products very fast.
Another advantage is the order of operations can be tailored to use the cache efficiently. Just like the second optimization’s dot products, the submatrix multiplications can be partially calculated. Instead of loading A[0,0] and B[0,0] into cache and multiplying them, then loading A[0,1] and B[0,1] and multiplying them, the code can load A[0,0] into cache and keep it there. It multiplies A[0,0] by B[0,0] and stores a temporary result. Then it multiplies A[0,0] by B[0,1] and stores a temporary result. Only then does it move on to A[0,1]. This minimizes how many times each submatrix is loaded into cache.
The final advantage is this simplifies writing a fast matrix-matrix multiplication routine. Instead of trying to come up with code that is fast for any size matrix, time can be spent coming up with an insanely fast routine for multiplying 3 x 3 matrices. Or 10 x 10. Or whatever size fits the cache best. Once this insanely fast routine is complete, any pair of matrices can be partitioned into blocks of this optimal size.
Unfortunately, getting the data in a block from RAM into cache is not trivial. If the data is stored in a standard row-major or column-major array, special code is needed to copy the data from the main matrix array in RAM to specialized submatrix arrays that can be loaded into cache. If this code is slow, block multiplication will be slower than transposing the second matrix. Fast array copy code often requires programming language features unavailable in most recent programming languages.
The other way of handling this is to store the data in submatrix-major order. This has its own drawbacks that are covered in the conclusion. For an implementation that works in most languages, please see Implementing a High Performance Matrix-Matrix Block Multiply.
Conclusion and concerns are in part four.