I have written a matrix-matrix multiplication testbed. The source code is available in my public repository. It explores far more than what was covered in this series, including the impact of accumulators, indexers, parallelism, underlying data structures, and so forth. The system does manage maximum theoretical, non-SIMD CPU performance. The code is in C# but easily converts to C++ and the concepts are applicable to most languages. If there is enough demand, a C++ version of the test bed will be released. The solution loads and compiles under Xamarin Studio and runs under Mono.
The test system is a 3.4 GHz Core i7-2600K “Sandy Bridge” CPU with 16 GB of RAM. Turbo Boost was disabled to allow a more accurate estimate of the potential performance of the system. Hyper-Threading was also disabled. Only the operations required for multiplication were measured. Creating and filling test matrices with random data was not included in the performance figures. For methods that required a matrix transposition, the transposition time was included. All matrix elements are double-precision (64-bit) floating point values.
The performance figures compare a straightforward matrix-matrix multiplication routine to a transposition routine and a block multiplication routine. The latter two correspond to the first and third optimizations. All routines use accumulators.
I have not written a version of the second optimization for reasons beyond the scope of this article. The first two routines use an array of arrays to hold the matrix elements. The third uses an array of arrays of 1024 element one-dimensional arrays. This is the equivalent of a 32 x 32 block size.
Performance figures shown are a percentage of the test system’s theoretical maximum possible MFLOPS without SIMD. The theoretical maximum is approximately 3400 MFLOPS for single-core routines and 13600 MFLOPS for four-core. Both values were treated as exact for calculation purposes.
The MFLOPS of each operation was calculated using the average time from 100 runs for 1000 x 1000 matrices, 20 runs for 5000 x 5000, and 5 runs for 10000 x 10000 and 15000 x 15000. The number of floating point operations was calculated using the minimum possible for a square matrix of size N x N: 2 * N^3 - N^2. This slightly understates the actual number of operations performed, thus understating the actual MFLOPS.
1000 x 1000
Given the performance improvement, the complexity of the block algorithm compared to the transpose algorithm may not seem to justify the time investment. However, once larger matrices are multiplied, the benefits become apparent.
For comparison purposes, here is Math.Net’s Intel MKL-based matrix-matrix multiplication vs the block algorithm. This chart shows MFLOPS, not percentage of maximum performance. Both algorithms use all four cores. The theoretical maximum of the test machine using SIMD on four cores is 109 GFLOPS.