Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing

Large-scale electronic structure simulations coupled to an empirical modeling approach are critical as they present a robust way to predict various quantum phe-nomena in realistically sized nanoscale structures that are hard to be handled with density functional theory. For tight-binding (TB) simulations of electronic structures that normally involve multimillion atomic systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphical processing unit (GPU) devices help in saving computing costs in terms of time and energy consumption. With a short introduction of the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description for the strategies to drive performance enhancement with GPU devices against traditional clusters of multicore processors. While this work only uses TB electronic structure simulations for benchmark tests, it can be also utilized as a practical guideline to enhance performance of numerical operations that involve large-scale sparse matrices.


Introduction
As the dimension of functional semiconductor devices are scaled down to decananometer (nm) sizes, the underlying material can no longer be considered continuous. The number of atoms in the active device region becomes countable in the range of ~50 K to a few millions, and their local arrangements in interfaces, alloys, and strained systems give non-negligible effects on device characteristics [1][2][3]. Also, most experimentally realized structures are not infinitely periodic, but are finite in sizes; such geometries call for a local orbital basis, rather than a plane wave basis which implies infinite periodicity. As full ab-initio methods such as density functional theory are in principle hard to simulate electronic structures of such a huge and discrete atomic structures [4,5], the necessity of atomistic approaches based on an empirical modeling method is quite huge.
The spds* 10-band tight-binding (TB) approach, which employs a set of 10 localized orbital bases to describe a single atom, has been extensively used to explain

Methodology
Electronic structures of target nanostructures are described with a sp 3 d 5 s * TB model [6,9,10], which employs 10 orthogonal orbital bases to represent a single atom assuming nearest-neighbor couplings. As shown in Figure 1 (top), simulation domains are decomposed in a multidimensional manner with MPI and OpenMP. Hamiltonian matrices, which are stored in compressed sparse row (CSR) format [14], are then decomposed in a row-wise manner. The Schrödinger equation solver, which computes normal eigenvalue problems in a numerical perspective, is developed with the Lanczos method [15] whose computational bottleneck comes from sparse matrix-vector multiplication (SpMV) [12,15]. The basic idea for the performance improvement would thus be to perform SpMV with a simultaneous utilization of host CPUs and GPU devices, where Figure 1 (bottom) illustrates this idea. Here, each GPU device has a block matrix belonging to an MPI process, which sends/receives input (Vin)/output (Vout) vectors to/from the associated GPU device. Each MPI process does not need to send the whole Vin since multiplication in an MPI process can be done with only three block vectors of Vin (1 in itself, 2 in its neighbor processes) as our TB model assumes nearest-neighbor couplings. Upon the completion of multiplication, a GPU device just needs to send 1 block of Vout) back to its associated MPI process. Host CPUs and GPU devices can thus perform multiplication simultaneously with no heavy overhead of data transfer. In the next subsections, we present further detailed methodologies for (1) the simultaneous utilization of host CPUs and GPU devices and (2) the implementation of efficient SpMV CUDA kernels.

Simultaneous utilization of both CPU and GPU
The following are two ways of how to efficiently utilize the resources of both CPUs and GPUs. One takes the pageable memory when transferring data between CPU and GPU, whereas the other uses the pinned memory. The memory can be allocated in the pinned memory with the cudaMallocHost function of CUDA library, which prevents the memory from being paged out and therefore improves the speed of data transfer between host and GPU devices. The pageable (non-pinned) memory can be used with the malloc function of standard C library. This subsession will discuss in detail how SpMV can be done with the abovementioned two ways.
Computations of SpMV in host CPU and GPU devices are overlapped in default since the GPU kernel function is called in a non-blocking manner such that its execution can be done in parallel with the execution in CPU code. As shown in Figure 2(a), however, data transfer between host and (GPU) device memory cannot be overlapped with the CPU computation when the pageable memory is used. As depicted in Figure 2(b), the pinned memory enables the CPU calculation to be overlapped with data transfer [16]. Another merit that can be obtained with the pinned memory is that the effective bandwidth of data transfer itself is increased by ~3 times compared to the one obtained with the pageable memory, because the bandwidth of the PCI-E bus to connect CPU and GPU is not fully exploited with the pageable memory [16]. The pinned memory can be used with the cudaMallocHost (CUDA library) instead of malloc function. As memory offload is much faster and communication hiding is possible, the pinned memory would be superior to the page memory for saving the wall time of SpMV calculations with GPU devices.

CUDA SpMV kernels
We have developed three different CUDA SpMV kernels as illustrated in Figure 3: (i) a basic kernel that allocates a single CUDA thread per a row in the matrix (naïve), (ii) a kernel that always allocates a single WARP to the SPMV operation for a single row in the matrix (warp1) [17], and (iii) a kernel that dynamically allocates multiple WARPs to the SPMV operation for a single row in the matrix (warp2) [18].
Firstly, the naive kernel is a straightforward approach to map a single CUDA thread to a single row in the matrix. Because the Q-AND code uses the CSR format to describe the Hamiltonian matrix, SpMV operations need an indirect addressing step for every single scalar operation needed for multiplications. Consecutive threads therefore have to access irregularly strided memory as illustrated in Figure 3(a). As noted by Harris [19], such access patterns would degrade performance, because then successive threads may not be able to access the global memory simultaneously to read non-zero elements of the matrix (Figure 3(a)).
Secondly, the warp1 kernel uses the maximum number of CUDA threads of a single WARP for multiplications of a single row in the matrix. A WARP is defined as the group of threads and consists of contiguous threads (32 threads for Tesla K40 devices) [19,20]. Since threads in a single WARP can access the global memory simultaneously, we can reduce the number of transactions that are required to access the global memory, and therefore we expect non-negligible performance enhancement for SpMV operations against the naïve kernel (Figure 3(b)) [18]. However, this solution may not be the best one, since we always use a single WARP for a single matrix row, and therefore we may have idle threads (or WARPs) if the maximum number of WARPs supported by a single GPU device is larger than the number of rows of a block matrix that are belonging to that GPU device.  Thirdly, the warp2 kernel dynamically allocates WARPs to a single matrix row as depicted in Figure 3(c). Here, the number of WARPs allocated to a single matrix row is dynamically determined by counting the number of corresponding non-zero elements, i.e., an integer value that is closest to "the number of non-zero elements in one matrix row/the number of threads per a single WARP. "We note the warp2 kernel would be optimal for our problem since the TB Hamiltonian matrix normally has non-zero elements that are larger than 32.
In addition, we can increase the performance by utilizing the texture memory for the vector data retrieval, where texture memory, which is a read-only memory, is cached on-chip and provides higher effective bandwidth by reducing memory requests toward off-chip DRAM. Since the in/out vectors are irregularly accessed by threads from the global memory of GPU devices, the performance improvement could be driven by applying the texture memory. The following section reveals the result of the evaluation. Conceptual scheme of three SpMV CUDA kernels. (a) A basic kernel that maps a single thread to a single row in the matrix for SpMV (naive). Here, consecutive threads (t 0 , t 1 ,…,t n − 1 ) access nonconsecutive words. (b) A kernel that uses WARP statically (warp1). It always allocates a single WARP (32 threads in Tesla K40) to a single row in the matrix. Consecutive threads (t 0 , t 1 ,…,t n − 1 ) access consecutive words. (c) A kernel that uses WARP dynamically (warp2). It dynamically allocates WARPs to a single row considering the sparsity of matrix.

Results and discussions
This section discusses the performance evaluation of our solver from following perspectives: (i) the strong/weak scalability of end-to-end simulations and the optimal GPU load (i.e., the portion of SpMV calculation allocated to GPU devices that shows the best speed), (ii) impact of the pinned memory on computing performance, (iii) performances of the three different CUDA SpMV kernels, and (iv) energy efficiency and economic benefit of GPU computing for electronic structure simulations against the results obtained with CPU computing only. All the benchmark tests are performed on the two test-bed machines: one is the K40 test-bed including three computing nodes connected with an infinite-band 4 × FDR (56 Gbps) network, where each node has two Intel Xeon CPUs E5-2650 v3 [21], two NVIDIA Tesla K40 GPU cards [20], and one 128G DDR3 1867 MHz memory, and another is a P100 test-bed including one node with same configuration except two NVIDIA Tesla P100 cards [22]. The codes are compiled with CUDA 8.0 library, Intel C++ compiler 16.0, and OpenMPI 1.10.2. Si:P quantum dots (QDs), which are defined to be huge silicon (Si) layers encapsulating a single phosphorus (P) atom and are studied with a 10-band TB model for designs of Si-based quantum computers [8,9,23], are used as target devices for all the benchmark tests.

Evaluation of utilization of both GPUs and CPUs
Using the pinned memory and the warp2 kernel with texture memory, simulations are performed for Si:P QDs with a convergence criterion of 10 −8 eV and are completed when 10 4 Lanczos iterations are reached or 10 lowest energy levels in conduction band are found. Every bar graph of Figure 4 has the following six elements: MPI communication (Comm), data transfer from host to GPU device (CopyIn), SpMV + data transfer from GPU device to host (SpMV + CopyOut), dot product (VVDot), memory operations (MemOp), and other portions (Others). Note that SpMV + CopyOut includes the time required for SpMV on GPU devices and data transfer from GPU device to host memory, while SpMV on CPUs is performed at the same time. Figure 4(a) and (b) presents the strong and weak scalability of the end-to-end simulation at the K40 and P100 test-beds, respectively. The Si:P QD tested for the strong scalability has a cuboid Si layer that consists of a total of 30 × 80 × 80 [100] unit cells and has a dimension of ~16 nm × 43 nm × 43 nm (about 1.5 million atoms). The problem size for the weak scalability test is 15n × 80 × 80 [100] unit cells (n × 7.5 × 10 5 atoms), where n denotes the number of MPI ranks. As we use 10 bases to describe a single atom, the degrees of freedom (DOF) of corresponding Hamiltonian matrices are ~15 million (for strong scalability) and ~7.5 million/rank (for weak scalability), respectively. Here, we see that the strong scalability is generally quite good, where each MPI rank is mapped to 10 CPU cores and one GPU card. The job using six MPI ranks is 2.34 times faster than the one executed with two ranks for the 30 × 80 × 80 unitcells at the K40 test-bed. It shows nice scales according to the number of cores, because a significant portion of the wall time is taken by SpMV that would give a nice scalability as the matrix has a block-tridiagonal shape, and therefore the burden of MPI communications would not become a serious problem. The weak scalability also shows good since the wall time is not significantly affected by MPI communications.
In addition, Figure 4(c) and (d) illustrates the performance comparison in terms of the wall time according to the GPU load at the K40 and P100 test-beds, respectively. The QD considered for the experiment here has 30 × 80 × 80 unit cells. The elapsed time is described as a function of computing load for SpMV on GPU devices (GPU load). As described in the previous section, SpMV is the most time-consuming part such that it takes about 56% of the wall time when CPUs perform all the multiplications (GPU load is zero) at the K40 test-bed. However, as the GPU load increases, SpMV takes less time and shows the best speed when the GPU load is ~70%. This optimal GPU load depends on the hardware performance of GPU devices such that, at the P100 test-bed (with same host CPUs), it is ~90%. The speedup becomes 1.44× and 1.7× for the target simulation at the K40 and P100 test-beds, respectively, against the case when GPU load is zero (only CPUs are used for simulations).
Then let us discuss why this optimal GPU load becomes about 70 and 90% at the K40 and P100 test-beds, respectively. Since the performance of SpMV depends on various factors such as computing units, memory bandwidth and latency, network speed, and PCI-E bandwidth between host and GPU device, it is not easy to clearly extract the exact value of the optimal GPU load. However, the "ideal value" of the GPU load could be approximately calculated using only the theoretical peak performance of computing units, because the performance of SpMV would be maximized when both CPUs and GPUs complete computing operations at the same time. If we denote the peak performance (in the unit of floating point operations per second (FLOPS)) of host CPUs and PCI-E

unit cells at the K40 test-bed. (b) Weak scalability of computing 15 × 80 × 80 unit cells per single MPI process at the K40 test-bed. (c) Performance of computing 30 × 80 × 80 unit cells as a function of the GPU load at the K40 test-bed. (d) Performance comparison of computing 30 × 80 × 80 unit cells as a function of the GPU load at the P100 test-bed.
connected devices by P H and P D , respectively, the optimal GPU load (x) can be calculated as the following equation (Eq. (1)): Since a single computing node of the K40 test-bed has a P H of about 0.736 × 10 12 FLOPS for twenty CPU cores of Xeon E5-2650 v3 [21], and a P D of about 2.620 × 10 12 FLOPS for two Tesla K40 devices [20], x is derived to about 78.1%, which it is a little higher than the measured value (70%) due to the ignorance of other factors (memory bandwidth, etc.). For the P100 test-bed (P D of about 10.600× 10 12 FLOPS) [22], x is also evaluated to about 93.5%, while we find it at ~90%. Even though the derived values are not strictly accurate, we can still explain why the optimal GPU load of the K40 and P100 test-beds turns out to be higher than the one (~65%) measured with Xeon Phi Knights Corner coprocessors [12].

Effects of pinned memory on performance
As explained in the previous section, the pinned memory may make a non-negligible impact on the overall performance of large-scale simulations. Figure 5(a) shows the performance measured with the pinned and pageable memory at a 70% (K40) and 90% (P100) GPU load, where a single computing node is used with the warp2 kernel and texture memory. The Si:P QD has 30 × 80 × 80 unit cells. Results indicate that the case with pinned memory shows better performance than the one with pageable memory due to the following two points: (1) The reduction of CopyIn time due to the increased bandwidth of PCI-E bus with pinned memory and (2) SpMV + CopyOut time as communication hiding behind the computation. We observed the effective bandwidth of PCI-E communication is ~3.31 GB/s with the pageable memory on every test-bed, while it reaches ~10.40 GB/s with the pinned memory, driving ~3.14 speedup in data transfer. The effective speed of SpMV operations increases by a factor of 1.36 and 1.19 with pinned memory compared to the speed with pageable memory at the K40 and P100 test-beds, respectively, since utilization of the pinned can overlap computation and data transfer. The performance for end-to-end simulations therefore becomes 1.27 and 1.21 times faster with pinned memory against the performance obtained with pageable memory at 70% GPU load (K40) and 90% GPU load (P100), respectively.  Though here we only focused on the performance of PCI-E communications, it is possible to estimate the performance benefit that may be obtained with NVLink communications. For this purpose, we investigate how the bandwidth of communications between CPU and GPU affects the overall performance, where we find that the overall speedup is ~1.21× due to the ~3.1× enhancement of PCI-E bandwidth on the effects of pinned memory in PCI-E add-in P100 devices. Because the bandwidth improvement with NVLink connectivity between CPU and GPU is ~3× compared to the PCI-E [24], we may roughly expect that there will be another ~1.06× speedup for the end-to-end simulation with NVLink add-in P100 devices.

Performance analysis of SpMV CUDA kernels
Here we investigate the performance of three different SpMV CUDA kernels and present a short discussion about the effects of the texture memory on the performance. Figure 5(b) shows the performance of the three SpMV implementations at the single computing node of the K40 and P100 test-beds, where the pinned memory is utilized with a 70% (K40) and 90% (P100) GPU load. The Si:P QD for target simulations has 30 × 80 × 80 unit cells. The grid/block size is set to 21,000/256 and 672,000/256 of the naive and warp1 kernel, respectively. For the warp2 kernel, the grid/block size is set to 30/1024 and 112/1024 at the K40 and P100 test-beds, respectively, since the number of streaming multiprocessors is 56 for P100 devices, while it is 15 for K40 (the grid size is set to an integer multiple of the number of available streaming multiprocessors).
Among the naive, warp1, and warp2 kernel, the warp2 outperforms as expected. The speedup of the warp2 kernel is the 7.96/6.73 (K40/P100) and 1.12/1.24 compared to the naive and the warp1 kernel, respectively. The huge performance enhancement that is particularly achieved against the naive kernel reflects the importance of coalescing global memory access as Liu et al. also reported that the effective bandwidth is poor for large strided memory access [18]. The warp2 kernel also works faster than the warp1 kernel since less threads would be idle with dynamic allocations as discussed in the "Methodology" section. While multiple WARPs can be involved to process a single row in the matrix (and threads in a single WARP can concurrently access the global memory), there is an inter-WARP time lag (only a single WARP can process multiplications at a time). The performance gain, however, is remarkable such that 15-20% of the wall time is reduced with a dynamic allocation of WARPs.
All the three kernels show faster operations in P100 devices than in K40 devices, and the speedup in P100 devices turns out to be ~2.77 on average. Although the peak performance of P100 devices is 4.05× higher than that of K40 devices (in FLOPS for double precision), the measured average performance gain (2.77) is much lower than this value (4.05), since the performance of SpMV is mainly limited by the bandwidth of global memory rather than the core clock of GPU devices [25]. Figure 5(b) also shows the performance difference created by utilization of the texture memory for retrieval of vector data retrieval. With the texture memory, the speed of the warp2 kernel improves by a factor of 1.06 (K40) and 1.19 (P100) against the case without the texture memory, since the texture memory enables fast random accesses to vector data and uses a cache to provide broad bandwidth.

Energy efficiency and economic benefits of GPU computing
Not only is the elapsed time an important metric, but also the energy efficiency is a significant one to explore. The power usage of host and the two PCI-E connected devices is evaluated as a function of elapsed time (Figure 6), where we consider the power consumed by host (CPU packages with off-chip DRAMs) and Tesla GPU devices. The power usage in host and GPU devices is measured with Intel Running Average Power Limit (RAPL) library [26] and NVIDIA Management Library (NVML) [27], respectively. Figure 6(a), (b) and (c) shows the real-time power consumption of a single computing node at a 0% GPU load (CPU only), 70% GPU load with K40, and 90% GPU load with P100 GPU devices, respectively. A Si:P QD consisting of 30 × 80 × 80 unit cells is simulated with the warp2 kernel where pinned memory and texture memory are used. Here, all the results show similar patterns such that (i) the power consumption starts to increase during the initial processes of electronic structure simulations, i.e., matrix construction that requires memory access to store non-zero elements, row/column indices. (ii) The power usage then shows a rapid oscillation during the process of Lanczos iterations, and (iii) it finally returns to the normal (standby) value when all the calculation is finished. Figure 6(d) informs that the average instantaneous power consumption of a single computing node with K40 and P100 devices is 157.58 and 117.55 Watt, whereas the host of test-beds uses 279.76 and 270.05 Watt, respectively. Figure 6(e) shows the total energy consumed by the end-to-end simulation, which can be calculated by multiplying the time-averaged power usage by the wall time. During the execution in a single computing node of the K40 test-bed, CPUs and GPUs consume about 542.32 and 305.40 KJ, respectively, while corresponding values with P100 devices become 331.44 and 144.33 KJ, respectively. ~614.18KJ is consumed for the CPU-only case. Compared to the results measured with K40 GPU devices, a single computing node with P100 devices consumes ~1.34× less energy, while it finishes the target simulation ~2.88× faster. Figure 6(f) shows the total energy consumed by the three SpMV kernels in the single computing node of the K40 and P100 test-beds, where the pinned memory is utilized with a 70% (K40) and 90% (P100) GPU load for simulations of a Si:P QD consisting of 30 × 80 × 80 unit cells. Coalescence of global memory access (Figure 3(c)) drives a significant performance improvement, such that the warp2 kernel not only shows the smallest energy consumption but  has the best wall-time performance among the three kernels. Reduction in energy consumption of the warp2 kernel turns out to be 2.99×/1.59× (K40/P100) and 1.12×/1.09× compared to the naive and the warp1 kernel, respectively. Now let us talk about the energy efficiency of our code for electronic structure simulations. Without losing generality, we roughly can define the energy efficiency as the rate of SpMV operations performed for the unit power consumption (1 W). The rate of SpMV operations can be estimated by the ratio of the total number of floating point operations that a single simulation performs (NF) and the total wall time taken until the simulation is completed. Therefore, the energy efficiency of simulations (η) can be approximated with Eq. (2): where E total and T total represent the total energy consumption and wall time required to complete a single simulation, respectively. From the derived equation, the energy efficiency could be calculated for K40 and P100 devices and host CPUs. Although it is not easy to exactly quantify NF, we can at least compare the energy efficiency of different computing devices by assuming that NF would be linearly proportional to the workload of SpMV allocated to specific computing platforms. By setting the energy efficiency of CPU-only computing to 1, the energy efficiency of K40 devices (70% GPU load) and P100 devices (90% GPU load) would be ~1.43 and 3.81, respectively. We conclude the energy efficiency of P100 devices is ~2.66× and 3.81× better than that of K40 and CPU devices for the target simulation. Note that these quantities are hard to be obtained just with officially known hardware specifications [20][21][22].
Finally, let us close this section with a short discussion about the economic benefits that can be delivered by GPU computing for TB simulations. Since GPU devices are not cheap [28,29], it would be interesting to compare the "time saving" achieved by a single US dollar spent for additional GPU devices. As we already have shown in Figure 4, the CPU-only simulation of 30 × 80 × 80 unit cells is finished in ~2476 s, which is average of the results measured with K40 and P100 devices (Figure 5(c) and (d)). While the simulation is finished in ~1701 s with K40 at the optimal GPU load (70%), we must additionally pay ~4.6 K US dollars to buy two K40 GPU devices [28]. With P100 devices, the simulation takes ~1472 s at the optimal GPU load (90%) and requires ~14.7 K US dollars to buy two P100 GPU devices [29]. Thereby, we get ~0.17 and ~0.07 s/USD for K40 and P100 devices, respectively. While the performance enhancement driven by GPU computing may be impressive in the perspective of computing time, we claim more expensive devices may not always deliver better economic benefits. Readers are therefore strongly encouraged to build a careful budget plan whether they are thinking to buy new GPU devices.

Conclusion
The cost efficiency of general-purpose graphical processing unit (GPU) devices for tight-binding (TB) simulations of extremely large-scale electronic structures has been examined with a focus on the speed and the amount of energy consumption. Technical strategies used to exploit the strength of GPU-coupled offload computing have been elaborated in detail with a short but clear description of the main numerical method employed to tackle large-scale Schrödinger equations. Benchmark tests have been performed against realistically sized solid Si:P quantum dot devices that contain several million atoms. Tesla K40 and latest P100 GPU devices are © 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. considered as the test platform. The technics we employed for the efficient offload computing of large-scale TB simulations drive a non-negligible enhancement of the computing speed. Compared to the performance tested with Intel Xeon V3 host CPU only, K40 and P100 devices can achieve up to ~2× and ~6× speedup for sparse matrix-vector multiplication (SpMV), which is the numerical operation needed to solve electronic structures. In terms of the amount of total energy consumption, however, K40 shows worse performance compared to the CPU-only case, while P100 still holds the strength.