2017-12-27

# Introduction

I was recently involved in a research endeavor where we had to generate and analyze a substantial amount of CHGCAR and LOCPOT files. Each of these are relatively big in size (~150MB) and were consuming a lot of free space on our hard drives. Since we employ periodic back-upping of our data; you can imagine that things started to pile up and clutter. We looked into BZIP2 and ZIP to reduce the size, but found out early on that we only obtained compression ratios of about 3 (i.e. the zip-archive is only three times smaller than the original file). In this blogpost, I will show you the steps we took to get improved compression ratios. Using lossy compression, we were able to get compression ratios of over 30 with decent accuracy of the original data.

As mentioned in the intro, we did not gain much by simply storing the CHGCAR or LOCPOT files in an archive. Since these files are human-readable, we can easily get better compression results by storing the values in binary format.

If we for instance consider the following lines inside the CHGCAR file:


112  128  288
0.46419088767E-03 0.22525899949E-02 0.39391537405E-03 -.28562888829E-02 0.43623947025E-02
0.50580398663E-02 0.31440260409E-02 -.60090684201E-04 0.66497361263E-03 0.25302278248E-02
0.52120528002E-02 0.50812471621E-02 -.19041084105E-02 -.10359924871E-02 0.23382124189E-02



we can see that each line contains 5 floating point values. Each of these floating point values takes about 18 bytes of storage. If we would store the human-readable number as a float, it would only take up 4 bytes of storage, a reduction of almost a factor of 5. In turn, if we do this for all the numbers inside the CHGCAR file and then compress the binary file, we were able to obtain compression ratios of ~10. But can we do better?

# Discrete Cosine Transform

So far, we have only considered storing the density files in a lossless format. This means that we do not loose any information upon compression and decompression of the data. The opposite is to store data in a lossy fashion, wherein the decompressed data is no longer identical to the original. JPEG is an example of such a lossy compression technique. The underlying algorithm behind JPEG is a Discrete Cosine Transformation (DCT), which is really well explained in this Computerphile YouTube movie.

What the DCT does is fit the finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. For example, if we have a set of 64 data points, then the DCT gives us a set of 64 coefficients corresponding to the cosines at different frequencies. The DCT can be done one-dimensional, two-dimensional in the case of images, three-dimensional in our case and in theory even beyond three-dimensional. The DCT itself is lossless, that is, from the set of coefficients, using a inverse discrete cosine transformation, the original data set can be obtained. The way we can employ this algorithm to obtain increased compression, is to drop the higher order frequencies in the DCT, similarly to how higher order frequencies are dropped in an MP3.

Let's exemplify this. Assume that we have 3D dataset of 2x2x2 grid points (8 grid points in total) with the values


d[0][0][0] = 0.84
d[0][0][1] = 0.39
d[0][1][0] = 0.78
d[0][1][1] = 0.80
d[1][0][0] = 0.91
d[1][0][1] = 0.20
d[1][1][0] = 0.34
d[1][1][1] = 0.77



The discrete cosine transformation in three dimensions of this dataset gives the following coefficients:


c[0][0][0] =  0.63
c[0][0][1] =  0.13
c[0][1][0] = -0.06
c[0][1][1] =  0.40
c[1][0][0] =  0.11
c[1][0][1] =  0.04
c[1][1][0] = -0.09
c[1][1][1] = -0.24



Let us now set the last coefficient to zero and perform a inverse DCT. We obtain the following values:


d[0][0][0] = 0.93
d[0][0][1] = 0.31
d[0][1][0] = 0.70
d[0][1][1] = 0.88
d[1][0][0] = 0.83
d[1][0][1] = 0.28
d[1][1][0] = 0.42
d[1][1][1] = 0.68



Despite that these values are off, they qualitatively resemble the original data. This example teaches illustrates the trade-off: we are able to increase compression performance but we have to pay in terms of accuracy.

# The mathematics

Given a cubic block of NxNxN grid points, then each DCT coefficient is calculated using the formula below

$$c_{i,j,k} = s_{i,j,k} \sum_{l=0}^{N-1} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} d_{l,m,n} \cdot \cos \left( \frac{\pi}{N} (i + \frac{1}{2}) l \right) \cdot \cos \left( \frac{\pi}{N} (j + \frac{1}{2}) m \right) \cdot \cos \left( \frac{\pi}{N} (k + \frac{1}{2}) n \right)$$

wherein $s_{i,j,k}$ is a scaling constant and $d_{l,m,n}$ is the value at grid point (l,m,n). Thus, for each coefficient (of which there are $N^{3}$), we have to loop over all the $N^{3}$ data points. The scaling constant is given by

$$s_{i,j,k} = s_{i} \cdot s_{j} \cdot s_{k}$$

where

$$s_{i} = \begin{cases} \frac{1}{N},& \text{if } i\geq 0 \\ \frac{2}{N},& \text{otherwise} \end{cases} .$$

This scaling is in principle arbitrary, but we introduce it to ensure that the inverse discrete cosine transformation is given by

$$d_{i,j,k} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \sum_{k=0}^{N-1} c_{l,m,n} \cdot \cos \left( \frac{\pi}{N} (l + \frac{1}{2}) i \right) \cdot \cos \left( \frac{\pi}{N} (m + \frac{1}{2}) j \right) \cdot \cos \left( \frac{\pi}{N} (n+ \frac{1}{2}) k \right) .$$

# Block size and quality

The above DCT transformation can be done for any size $N$, but it turns out that performing the transformation on typical data sets of 300x300x300 grid points is very expensive. So instead of calculating the DCT for the whole grid, we divide the grid into smaller chunks. This strategy is similar to that employed in JPEG compression, but instead of having 2D chunks, we have 3D chunks. These chunks are called blocks and it turns out that for CHGCAR and LOCPOT files, blocks of 4x4x4 turn out to give the best results (see below for the results). Each (4x4x4) block thus contains 64 values and the corresponding DCT contains 64 coefficients. We can now drop a number of coefficients corresponding to the higher frequencies of the DCT.

To formalize this, let us introduce the concept of a quality value. A quality value of q means that all coefficients corresponding to those triplet of cosines which frequencies where i+j+k <= q holds, are kept and all other coefficients are dropped. This means that the quality values are not transferable between block sizes as the block size determines the maximum values of i,j,k.

# The algorithm and the code

The whole procedure can be summarized in the following six steps:

1. Read in the CHGCAR / LOCPOT
2. Divide the large grid into smaller chunks
3. Perform the DCT on each of these chunks
4. Drop the higher order coefficients based on the quality parameter
5. Store the other coefficients in binary format
6. Compress the final result

I have written a small C++ program called den2bin that allows you to perform compression and decompression on CHGCAR and LOCPOT files. The program depends on the following development packages: boost, bz2, glm and tclap, which are typically readily available by your package manager. Also, you need to have CMake and the GNU compiler installed. On Ubuntu or Debian, you can get these as follows:


sudo apt-get install build-essential cmake libboost-all-dev libbz2-dev libglm-dev libtclap-dev



To compile the code, simply clone the repository and compile it using CMake:


git clone https://github.com/ifilot/den2bin
cd den2bin
mkdir build
cd build
cmake ../src
make -j5



The code checks if your compiler supports OpenMP for shared memory multiprocessing. If so, the program automatically makes use of all the available cores in your machine.

Use the following commands to compress and decompress using the DCT:


# to pack (note the "-l")
./den2bin -i *DENSITY_FILENAME* -o *BINARY_FILENAME* -m *MESSAGE_HEADER* -l -b *BLOCKSIZE* -q *QUALITY*

# to unpack (same as for the lossless version)
./den2bin -x -i *BINARY_FILENAME* -o *DENSITY_FILENAME*



In the README.md in the Github repository, detailed information about all the program option and directives is provided.

If you do not like to compile the program yourself, there is also the option to download the binary from the repository by following these instructions.

# Results

To get an idea of the capabilities of our tool, I have compressed a CHGCAR file corresponding to the 1PI orbital of CO based on my previous blog post on visualizing the electron density of the binding orbital of CO.

We are going to compare the compression ratios and data loss (purely qualitatively by looking at a contour plot) as a function of blocksize and quality. In this comparison, we have used cubic blocks of 2x2x2, 4x4x4, 8x8x8 and 16x16x16 grid points. The results are visualized in Figure 1. A high resolution PNG version of this image can be found here.

From Figure 1, it can be seen that when using a quality of 1 (i.e. only a single DCT coefficient), we obtain a kind of down-scaling of the image as a single DCT coefficient only contains the average value for the whole data block. Increasing the quality factor decreases the compression ratio but increases the data quality (and hence the contour plot). Similar to JPEG, there is an optimum for the block size. Whereas this optimum is 8x8 for JPEG, I would say the sweet spot for VASP density files is using a block size of 4x4x4 and a quality of 4. Nevertheless, do not take my word for it but test it for yourself.

Figure 1: Compression factor versus Blocksize and quality for the 1PI orbital of CO.

# Closing remarks

If you like the program, feel free to star the repository on Github. Bug reports or other comments to improve the program are always highly appreciated. Please use the Issues functionality of Github for that. I am also interested in hearing your experience with the program. Feel free to share in the comments where you have used the program for and at what block size and quality settings you found the compression to be ideal for your branch of work.

# Bonus: more examples

Below are some additional examples to illustrate the correlation between compression efficiency and data set accuracy.

Figure 2: Compression factor versus Blocksize and quality for a CHGCAR file.

Figure 3: Compression factor versus Blocksize and quality for a LOCPOT file.

#### Drop a line

Question:
What is the answer to Seven + Ten?

45306
9
28-07-2014
3916
1
22-03-2015
587
0
29-12-2017