In Implementing a High Performance Matrix-Matrix Block Multiply (IaHPM), I listed the basic block multiplication code and said:
While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly.
To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora Core 18), I compiled the C++ version using GNU’s G++ 4.7 and Intel’s ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.
Both operating systems were fully updated when I ran the benchmarks and were run on the same bare hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag and architecture-specific optimizations were not specified. C# had optimization on. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent. Given the test results, I suspect ICPC was using SIMD in a couple test cases. I may revisit these tests with SIMD explicitly turned off in the future.
I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: one-dimensional array (1D), array of arrays (AA), and array of arrays of 1024 element one-dimensional arrays (FM).
For 1D and AA, I ran three multiplication methods. Transpose, where the second matrix is transposed and a modified multiplication algorithm is used. Block, using the basic block multiply code presented in IaHPM. Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I only ran the fastest method for each of the Windows compilers.
All algorithms used accumulators. Block routines that use MIN had the MIN calculated outside of the for loop test. The FM routines had the inner loop unrolled. Values slightly over 100% may be due to inaccurate CPU speed values. Values significantly over 100% may be due to SIMD instruction use.
1000 x 1000
The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product will fit into the L1 cache. ICPC does very well at optimizing the transpose methods. Note that C# did not optimize the block methods very well. G++ and VC++ recognize the block code: the block routines are nearly the same speed as the basic transpose method.
5000 x 5000
ICPC optimized the transpose block methods well. However, G++ and VC++ take the lead using the specialized FM data structure and routine.
15000 x 15000
ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.
Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not. ICPC did a great job on the transpose block algorithm. This allows the use of a one-dimensional array, which allows many standard libraries to be used.
Note: Please do not use your own matrix-matrix multiplication routines in a production system without having an amazingly good reason to do so. Use ATLAS, Intel’s MKL, cuBLAS, Math.Net, or some other BLAS implementation.
The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize your code. Compilers only optimize situations they recognize. If it optimizes your code as written, that’s great. If it doesn’t, is it better to spend time working with the compiler or creating better data structures and code?
There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is the competitive advantage for your research or business, it may be best to do both.
To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, just putting an existing piece of code into an optimizing compiler using a “normal” set of compiler flags does not guarantee good results.