Optimizing VASP performance by tuning the compilation using the GNU compilers

2014-07-25

Introduction

VASP is distributed as source code, which enables end users and system administrators to build their own VASP executable. In this post, I would like to share some of my findings from optimizing the performance of that executable.

For the tests in this post, I have used my four-year-old Debian Linux desktop computer, which contains an Intel Core 2 Q9550 CPU with 4 cores. Newer processors can successfully employ certain compile flags that boost the performance even more; I will discuss this in a later post.

Effect of optimization flags and compiler version

In this post, I have only tested the GNU compilers, using the latest release of each of the last four major branches, namely

  • 4.6.4
  • 4.7.4
  • 4.8.3
  • 4.9.1

all in conjunction with OpenMPI 1.8.1. For the optimization flags, I used -O0 (no optimizations), -O1, -O2 and -O3. The results are depicted in the two graphs below, which show the average calculation time in seconds based on 7 calculations; the upper and lower bounds indicate the statistical error.


Figure 1: Graph depicting elapsed execution time for a small benchmark calculation. There is a huge performance increase from enabling optimization (i.e. -O1 and higher).


Figure 2: Zoom of Figure 1. Although there is not much difference in execution time for -O2 and -O3, we found that GCC-4.8.3 at -O3 gave the best results.

For the discussion below, we will continue with only the GCC-4.8.3 compiler at the -O3 optimization level.
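For reference, the relevant lines of the makefile I used looked roughly as follows. This is only a sketch in the style of the old VASP 5.x makefiles; the compiler path is specific to my machine and the exact variable names may differ for your VASP version.

    # Compiler and optimization settings (sketch; adjust paths to your own system)
    FC     = /opt/openmpi-1.8.1/bin/mpif90   # OpenMPI wrapper around gfortran 4.8.3
    FFLAGS = -ffree-form -ffree-line-length-none
    OFLAG  = -O3                             # optimization level selected above
    DEBUG  = -O0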

Profiling

In order to further enhance the performance, it is interesting to investigate what actually determines the execution time; in other words, what percentage of the computational time is actually spent in which functions. To this end, we have constructed a call graph using gprof and gprof2dot.py. From this graph, we learn that the major part of the execution time is spent in the BLAS(/LAPACK) and FFTW routines. Of course, this is also mentioned in the VASP manual.
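If you want to reproduce this, the procedure is roughly the following: rebuild VASP with -pg added to the compile and link flags, run the benchmark, and feed the resulting gmon.out profile to gprof and gprof2dot.py. The commands below are a sketch with placeholder file names; for an MPI run you may need to set GMON_OUT_PREFIX (or profile a serial build) so that the ranks do not overwrite each other's profiles.

    # 1. rebuild VASP with -pg added to the compile and link flags
    # 2. run the benchmark; this writes gmon.out in the working directory
    ./vasp
    # 3. convert the profile into a call graph (gprof2dot.py reads gprof output by default)
    gprof ./vasp gmon.out > profile.txt
    gprof2dot.py profile.txt | dot -Tpng -o callgraph.png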


Figure 3: Call graph of a VASP run. The execution time is mainly determined by the BLAS(/LAPACK) and FFTW routines.

Investigating ATLAS and OpenBLAS

We have compiled VASP against ATLAS 3.10.1 (which was the default BLAS library on my system) and OpenBLAS 1.13. Both packages provide optimized BLAS as well as LAPACK routines. Furthermore, we have used ScaLAPACK 2.0.2 for the parallelisation. The average computation times and statistical errors, based on 5 runs for each library, were as follows:

BLAS/LAPACK library    Average computation time    Statistical error
ATLAS 3.10.1           95.539 s                    +/- 4.857 s
OpenBLAS 1.13          88.844 s                    +/- 1.477 s

To summarize: we found that OpenBLAS is about 7% faster than ATLAS on this machine.
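Switching between the two libraries only requires changing the link lines in the makefile and rebuilding. Schematically it looked as follows; the library paths are placeholders for wherever the libraries are installed on your system, and the ATLAS library names are as packaged on Debian.

    # BLAS/LAPACK selection in the VASP makefile (sketch; paths are placeholders)
    # ATLAS:
    # BLAS   = -L/usr/lib/atlas-base -lf77blas -lcblas -latlas
    # LAPACK = -llapack_atlas
    # OpenBLAS (ships its own LAPACK routines):
    BLAS   = -L/opt/openblas/lib -lopenblas
    LAPACK =
    # ScaLAPACK for the parallel diagonalization (also requires -DscaLAPACK in CPP)
    SCA    = -L/opt/scalapack/lib -lscalapack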

Tuning FFTW

Finally, an additional performance boost can be obtained by tuning the cache size used in the FFT algorithm. A cache size very close to the L1 cache of the processor is recommended. Here, we have simply tested a range of values to see what the effect is.
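In VASP, this cache size is a compile-time setting: it is passed through the CACHE_SIZE preprocessor define in the makefile, after which the code has to be rebuilt. A sketch is given below; I am assuming here that the value maps directly onto the cache sizes quoted in this section, so check the convention used in your own makefile.

    # FFT cache size, set on the preprocessor line of the makefile (sketch)
    CPP = $(CPP_) -DMPI -DHOST=\"LinuxGfortran\" \
          -DCACHE_SIZE=4000   # tested 2000 (default), 4000 and 5000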


Figure 4: Normalized execution times for varying FFTW cache sizes; a lower value means a faster run. The best results are found with either a 4 kB or 5 kB cache, although the default value of 2 kB is already very close to the optimal setting.

From the graph we note that the effect of the cache size is very limited; possibly, the FFT algorithm is already highly optimized. Nevertheless, we do see a very small improvement over the default value of 2 kB when using 4 kB or 5 kB.

Conclusion

To summarize, this extensive study has taught us that the VASP code and the underlying libraries are already optimized to a large extent. By tweaking the default compilation parameters, we were able to obtain an additional performance gain of slightly below 10%, but this was mainly because ATLAS happened to be the default BLAS library on my system: simply replacing it with OpenBLAS already gave a 7% boost. Changing from -O2 to -O3, using a newer version of the GNU compiler (4.8 instead of the default 4.7 on Debian) and changing the FFT cache size only yielded minor improvements. In the future, I am going to test the Intel compilers on this system and conduct further tests on a newer-generation Intel processor, because processor architecture has changed a lot over the past four years.

