Parallel Algorithm Analysis in Reactor Core Calculation

The reactor core system consists of many materials, involving multi-physics processes, and can be analyzed and simulated at multi-scales. With the evolution of cluster computer, traditional programs and models could be translated into new program structure and modified in detail, so that more complex problems can be solved. Based on existing theory, programs of sub-channels analysis, two-dimensional (2D) method of characteristic (MOC), fuel temperature approximation, and three-dimensional (3D) discrete ordinate method (SN) are developed and analyzed. The different approach is that the reusable software structure of core calculation is established, with more well-defined storage of nuclear data, control layers, and more effective parallel algorithm for computation. The features of parallel algorithm for these programs are listed succinctly in the discussion. Additionally, the corresponding testing on parallel algorithm and computing results are given.


Introduction
There are various heterogeneous phenomena in reactor core, which are related to multi-physics and multi-scales modeling and simulation [1]. The core program or software is the important approximation for the reactor core; therefore, mathematics models, numerical algorithms, preprocessing, post-processing, and visualization are sometimes managed in the similar software structure [2]. Based on the existing conditions of calculation, high-performance-computing technology is an important mean to accomplish the simulation of core programs by the cluster computer. For instance, there are several supercomputers (cluster) in ANL and ORNL that provide high-capacity resources for reactor engineering calculation and nuclear-related simulation [3]. Once the numerical algorithm is fixed, the time-to-solution becomes the necessary consideration during the research, and the approach pattern can be summed up in three steps [4] in this field: 1. Develop the single discipline program or multidisciplinary coupling code for the reactor core, which enable higher accuracy.
2. Propose and implement parallel algorithm for the program.
3. Use parallel hardware to run and carry out numerical experiments within acceptable requirements, and then match the reference solution or experimental data.
According to the modeling and simulation for reactor and the features of multiphysics and multi-scale models, the reactor calculation is classified into core calculation and out-core calculation [5]. The core calculation covers considerable subjects, for instance, reactor physics, thermal-hydraulics, dynamics, and fuel performance, for example. For the purpose of research and engineering, different discrete systems and solvers are implemented so as to reveal different phenomena in the core [6][7][8]. As an example, the DENOVO transport software of ORNL utilizes 160,000 processors to run the benchmark core example in Jaguar supercomputer; up to now the computer has been upgraded to TITAN [9].
On the one hand, the simulation could be accelerated through effective parallelization in different models [10]. On the other hand, the parallel computing makes researchers focus on more research fields except mathematics and physics [11], which constructs one view from nuclear data to top-level application. With the attempt of the M&S content, the core software could be ascribed as the expression and translation of the set of state variables and data structure [12]. Thus, the conjunction of the core problem and parallel computing is from the new requirements of application; furthermore it is driven by the underlying hardware innovation. The Berkeley Parallel Computing Laboratory reviews parallel computing and points out that the parallel elements of different algorithms are the methodology of application driven and reuse. Since various features exist in different layers of software, the parallel algorithm should be designed specifically so as to adapt the parallel hardware [13]. The chapter describes and explains the software structure, underlying nuclear data, and parallel algorithm from a systematic view in the domain of reactor core calculation. Section 2 abstracts the software structure for core calculation. Then, Section 3 explains the basic elements of nuclear data and its association according to the software structure. After that, Section 4 lists the concept and steps of parallel algorithm analysis for core problem in practice. Sample programs are analyzed and discussed concisely in Section 4, and conclusions are given in Section 5.

Software structure
As the complex system, the reactor core could be understood by various parts, such as concept models, numerical methods, model uncertainty, and model reuse. Algorithms and software are the center of digital simulation so that the calculation rule is regulated by the program technologies. Sometimes the fixed software structure could be abstracted, and common algorithms could be extracted to be close to the trend of reactor core simulation, which forms one united manner (also known as framework method [14]).
According to the description [12], one general software structure could be divided into seven layers, which covers all aspects from the application to the computer operating system by Table 1.
Then, the numerical calculation system is abstracted into multiple layers, which are implemented through controls, modularity, and dynamic interaction. So participants from different professional backgrounds pay more attention to their own business [15]. The core calculation learns from the above contents, and the software structure could be simplified in Table 2.
It is effective to develop numerical software based on reusable components or framework, so the discrete model could be modeled through fixed process. On the one hand, the usage leads to reuse. On the other hand, new programming pattern could be merged. For instance, with the help of the underlying framework, MC neutron transport combines the discrete model and algorithm analysis process and then uses numerical components and structure to accomplish the transport task easily in the reactor core [16]. Since this integrated software structure method restricts the programming manner of the core calculation and provides reuse mechanism, then in practical, the prototype software NAC4R is designed and constructed that contains basic data operation, basic algebra operation and linear system solver, and so on. As an example of the above data, NAC4R can be used for the parallel analysis process of sample programs that explains the effectiveness and efficiency of the corresponding software structure.

Nuclear data process
The underlying nuclear data is the basis of core calculation and also the fundamental elements in the proposed software structure. The reactor physics data is divided into four categories by A. Trkov, namely, (1) basic nuclear data, (2) evaluated nuclear data files, (3) processed nuclear data, and (4) averaged nuclear data [17]. They are the similar data in the thermal-hydraulics analysis, which correspond to the physical property information of the fluid, such as the light water property data in the core calculation [18]. These different types of data involve representation, storage, and computer algorithms. For example, the cross-sectional library of deterministic reactor physics calculation needs group-wise data, such as few-group homogenized cross section in diffusion calculation, with the affection of multiple state variables that include temperature, density, and so on [19]. The class of used nuclear data can be directly computed in the core calculation or be generated in advance. The pre-generated nuclear data (database) is commonly used in a two-step method of core calculation, one is multidimensional interpolation table, the other is parameterized library, and they have different effects on computation and parallel algorithm [20].

Multidimensional interpolation table
The table is the indirect format of cross-sectional parameters, with the grids inside the value range for each state variable. The core calculation uses interpolation algorithm to obtain the nuclear data.
Assume there is only one state variable ρ (such as coolant density), and its value range is a, b ½ , then the grid of a single state variable is divided as in Eq. (1): where ρ i is not necessarily uniform distribution. The corresponding parameters at each discrete grid can be obtained by the lattice program as follows in Eq. (2): In this way, the continuous parameter values can be calculated by polynomial interpolation method in the range of state variable. The coefficients of interpolation polynomials are stored in the table:

Parameterized library
The library method is the complex function model of nuclear data that is related to various state variables. To build an accurate model, many factors need to be considered, and the range of each state variable is also defined separately: Generally, the tabulation method uses space for time, and the method is relatively simple, but the storage mode of nuclear data has a greater impact on the parallel algorithm of the core calculation. The method consumes more computing time, but the data storage is smaller, and it is easy to improve and modify the model so that there is less impact on the parallel algorithm.
The software implementation of the nuclear data corresponds to the model layer and model framework layer mentioned in Section 2. Figure 1 shows two abstract forms of the nuclear data. The left side that represents the processing of each kind of nuclear data is independent, and the right side explains the design of consistent data attributes, and formats are independent underneath and dependent on top in the fixed software structure so that each type of nuclear data only needs format conversion and minor programming. This understanding of the underlying nuclear database focuses on software reuse and performance, which can fully reuse various known nuclear data that have fine readability and correctness. This association also enables independent design and programming of core calculation at all levels, which makes the following parallel algorithm analysis match it.

Parallel algorithm analysis
Amdahl law stipulates the speedup ratio limit for parallelism, and Gustafson law quantifies the acceleration effect [21]. To be equivalent to Amdahl and Gustafson, measurability of calculation models the performance. Assume that the running time of one computing problem is T N, x ð Þ, which represents the running time for the scale of problem x. Then the running efficiency could be utilized to prove the effectiveness of parallel core program from two categories.

Strong measurability
The change of computing time varies with the increase of processor cores, also is known as speedup ratio:

Weak measurability
The problem size of each processor or process is fixed, and then the statistic of the running time varies as the processors change: The deterministic calculation of the reactor core mainly makes the S strong,N as the variation of processor cores. In practical deterministic core, computation has features that the problem scale is related to the physical form. For instance, when the characteristic rays are fixed on every processor, the MOC transport would increase new characteristic rays and achieve new arrangement if the processors are added. The data expression and computational content may change with different program patterns. On the contrary, nondeterministic calculation of reactor core, such as the MC method, the computational features would remain unchanged so that the problem takes more emphasis on the statistics of weak measurability.
The steps proposed by Ian Foster is the most typical method for the parallel algorithm design and programming [22] (also known as PCAM method). 1. Partition identifies computing content that can be executed in parallel.
2. Communication determines the communication data of each parallel task.

Aggregation merges and adjusts parallel tasks and communication.
4. Mapping assigns the aggregated task to entities on which actual parallel programming techniques depend (process or thread).
At present, the cluster computers are often in heterogeneous development environment, and parallel algorithms are divided into many kinds such as pipeline parallelism, mater-slave parallelism, and so on. To facilitate engineering specifications, the usually analysis steps of parallel algorithm are divided into two cases in Table 3.
It can be concluded that the parallel analysis includes interpretation, design, and performance verification. The parallel analysis in reactor core calculation summarized here emphasizes to meet the application requirements, and then the common features of the algorithm are usually abstracted and recorded concisely, so that the analysis can be understood clearly and simply.

Samples research
Interconnections exist in the multi-physics process and software structure in reactor core. For solving according to physics-mathematics model, various core software has been developed. There are two main ideas for developing these procedures. One is to make full use of the existing programs in the world. The other is to redesign and program the program structure and function. The key parallel algorithms of single discipline sample program CORTH, KYLIN2, K-MOD, and Hydra are identified and programmed, which come from the typical examples of the reactor core system, computing work through the research career of the author.

Designed and implemented algorithm
Algorithm is not designed and implemented Working step S1: executes and obtains the statistical parameters of parallel performance on a fixed running platform S2: judges whether the requirements are met. If not, go to step 3. Otherwise, ends the analysis S3: designs or improves the existing parallel algorithm and then goes to step 1 S1: executes and obtains the statistical parameters of parallel performance on a fixed running platform S2: identifies the parallel features of the algorithm S3: determines whether the features have reusable patterns in the software structure (here refers to NAC4R). If so, reuses them directly, turns to step 5, if not, turns to step 4 S4: the new parallel algorithm is designed and programmed S5: runs on the fixed platform and obtains the statistical parameters S6: judges whether the requirements are met. If not, goes to step 3. Otherwise, ends the analysis and merges the new parallel algorithm into NAC4R as a reusable pattern Key issues 1. The performance is closely related to the problem 2. The features depend on the abstraction of actual core numerical calculation that is often distinguished by the knowledge and experience of developers Table 3.
Two different analysis steps.

Sub-channels analysis
The CORTH program utilizes the uniform flow model of four equations to process two-phase flow in the reactor core. The energy conservative equation could be abbreviated as the manner of structured grids, which is easy to use in finite difference and form the linear system. For the description, CORTH could be divided as components in Table 4.
The features of the parallel algorithm are listed summarily to make the program refinement practical.
1. The reactor core is divided along the radial direction so that each process stores the data of some sub-channels. The overlap method could also be used to reduce data communication.
2. The solution of linear system solving can be quick in the model. The coefficient matrix of the conservative system is symmetric, and it can be constructed simultaneously together with the right-hand vector of the system.

The PETSc and Aztec parallel solver can be ported and optimized inside NAC4R
so that solvers could be used and hybrid programmed directly and easily.
The bottleneck of the parallel algorithm design is related to the components of the coefficient update, linear system solve, and mesh matching. Parallel scheme was realized in NAC4R with the features mentioned above. The ACP1000 model (177 fuel assemblies, there are 264 fuels and 25 guide tubes in each assembly, axial segments are 54, and there are 50868 sub-channels in radial direction) is used for performance validation. Cluster processors are chosen as 4, 6, 18, 27, 36, 54, and 108, which is the divisor of 50868 as the Aztec solver needs that the number of processes matches the tasks. The expected performance of strong measurability is improved by more than five times.
Performance conclusion: The trend of processor change shows that using about 30 processors, it has met the expectation in Figure 2, with the speedup ration about eight times. Not only the conservation system can be constructed in parallel, but the Aztec solver can also solve the matrix with the size of 50868 efficiently.

2D MOC
The KYLIN2 program carries out the 2D neutron transport calculation by MOC over the assembly region. In the condition of steady situation, multigroup neutron transport equation could be expressed by the ODE along the characteristic ray. We can obtain the direct analytical solution of the ODE as in Eq. (7): where P i and Q i denote the macro-transport cross section and the neutron source, respectively. The length S m,i,k is the segment that is truncated with the position m-th azimuth angle, k-th characteristic ray, and the i-th mesh. When source iterative method is used, the iterative procedure will repeat the above calculation process after assigning initial guess values to the flux. As shown in Table 5, sample program KYLIN2 could be abstracted into different parts.
Sweep computation follows the loop structure, with each characteristic ray, neutron groups, azimuth angles, and polar angles. Moreover, the intermediate solution of KYLIN2 is related to the angular direction so that the structure can be described as follows. set current 0 set mesh_angular_flux 0 for (g, ray, p) in (Groups, Rays, Polars): current computeForwardCurrent(g, ray, p) mesh_angular_flux sweepForward(g, ray, p, current) current computeBackwardCurrent(g, ray, p) mesh_angular_flux sweepBackward(g, ray, p, current) end for  The above components are redeveloped in NAC4R, and the features of the algorithm are listed here:

Model equation Function
1. Iterate that each ray could be in parallel if some critical operations are added, so both MPI and OpenMP can be used for the sweep process.
2. If rays are non-modular arranged, the differences of working load of each ray are not obvious. Since the assembly problem sometimes uses the nuclear data with some energy groups, parallelism on neutron groups could be a better choice, and OpenMP can avoid data communication during the sweep procedure.
3. The callings of exponential function will affect the efficiency. 4. In order to adapt coarse mesh acceleration, reserved interfaces should be kept inside NAC4R so that parallel linear system solvers are used.
Here OpenMP guide sentences are utilized to implement parallelism based on loop structure. The critical operation exists in the cumulative sum of forward and backward sweep when each ray is accessed. C5G7-MOX assembly and ACP1000-UO2 assembly examples are used to test, and more than five times better performance are expected. The statistical calculation time is shown in Table 6.
Performance conclusion: OpenMP parallelism (multi-threads) can achieve the rational accelerated effect and meet the expectation. There are more speedup ratios for the ACP1000-UO2 as the neutron group parallelism strategy is used, and some fluctuation exists in the performance gain of shared memory parallelism since there are some management overheads by multi-threads.

Fuel temperature approximation
The calculation of the fuel temperature usually involves the temperature of the outer surface and the center, and then the effective temperature of the fuel could be obtained for rod-like geometry. The geometry is divided into different sections to solve the heat conduction equation in radial direction, with different approximations are taken account. The fuel rod of PWR is usually delimited as three sections in some programs, and the 1D conduction equation is constructed at each area. After knowing the boundary temperature, the Rowland's equation could be used to compute the effective temperature.
Fuel temperature approximation is necessary for the reactor core coupling calculation, but reasonable boundaries come from other discipline procedure. K-MOD coupling program has been realized, and according to the above description, fuel temperature approximation could be abstracted in Table 7.
Thermal feedback controller carries out the function of management and as an external module that will be called repeatedly in K-MOD. The features of the parallel algorithm are less obvious and can be listed in two points: (1) the computing granularity is small and the single cyclic time is short and (2) the computing pattern could be abstracted and reused if lower level parallel method is researched; thus, SIMD technology is suitable for programming. In the sample program K-MOD, fuel temperature calculation could be abstracted into three classes, such as op_basic represents the basic arithmetic, but op_math stands for the compound operation, such as square, logarithm, and so forth. The former deals with integer data, while the latter two deal with floating-point operations. The problem of rod inserting and lifting in real PWR core is tested and compared. The similar imitated structure solves 50,000 times the same as the K-MOD program, with the SIMD hand optimization, which is based on data structure named __m128i and __m128. The time results are shown in Table 8.
Performance conclusion: The statistical results of multiple runs show that SIMD scheme has performance bonus for the third class of calculation content; however, other classes get no gain. Looking up Intel-related documents, it is found that SIMD programming depends on the specific hardware and instruction set usage and depends on the compiler behavior, which is not suitable for manual SIMD rewriting with care in practice.

3D SN
The sweep operation of discrete ordinate method iterates every angle direction in each octant, which has famous KBA parallel sweep algorithm for structured grids. 3D SN method translates Boltzmann transport equation into the matrix system of flux moment as in Eqs. (8) and (9) in the sample program Hydra:

Model equation Function
Thermal feedback controller where Q and L denote the source term and the difference operator, respectively. Discrete operator of flux moment matrix M need to multiply the cross-sectional data, such as the scattering matrix S. The total meshes are I, J, K ð Þin each angle direction, and then every section I a , J b , K ð Þthat are divided could be mapped into corresponding processes in parallel. According to the above description, SN transport calculation could be abstracted as in Table 9.
There are multiple sweeping levels in the SN method, and the section I a , J b , K ð Þ could be represented by the loop structure in every octant, and the sweeping procedure can be described as follows: I,J,K 0,1,2 for k in K: Since there is a lack of parallel programming for this loop structure in Hydra, the features of the algorithm are listed.
1. The extrapolation model, such as the diamond weight difference model, has an independent subroutine T, which is only related to mesh i, j, k ð Þ. Flux moment variables ϕ up and ϕ psi depend on the calculation order, so the sweep is one pipeline structure on the I a , J b ð Þ.
2. When the scale of loop structure is small, shared memory parallelism is fit for pipeline structure to improve efficiency, which has theoretical gain ij iþjÀ1 .
According to the identified features, OpenMP guide sentences are utilized, and lock array done [I][J] is used to ensure data consistency.  Table 10 collects the running time of two benchmark example, which utilizes the optimization level -O0, and only one MPI process is fixed during the experiment.

Model equation Function
Performance conclusion: The speedup ratio is almost 1.8 and 1.6, which illuminates that the pipeline structure can gain performance bonus with small scale I a , J b ð Þ for the SN method, such as the sample program Hydra. It can be further expanded to the parallel strategy of MPI + OpenMP.

Conclusion
Digital computers are used to simulate and analyze reactor core. For introducing parallel computing and promoting the efficiency, the core calculation with the integrated software structure is explored, which can provide the methodology of nuclear data, control layers, and parallel algorithms. The main software structure is discussed, and multiple parallel algorithms are analyzed concisely for sample core programs, with the corresponding conclusions as follows: 1. Integrated software structure (framework method) is benefited for the development of core software, and the similarity achieves the goal of reuse. For example, the underlying nuclear data could be managed as the effective manner and mapped to application. With the development of prototype software NAC4R, the research could be carried out in different control layers and demonstrate validity.
2. Parallel algorithms are refined as well as identify the parallel features that exist in the core model based on the existing theory. The features of four sample programs are described succinctly, which work well with the software structure. Parallel programming enables the expected efficiency improvement if some rules are obeyed, and the relevant performance results are given summarily.
It is necessary to establish systematic knowledge and suggestions for software approach and parallel algorithm development in reactor core calculation. On the one hand, it can integrate reusable patterns into computation, such as the instance of prototype software NAC4R. On the other hand, it can explore and record more specific parallel algorithms in this field.

Acknowledgements
The author would like to thank relevant scholars for their suggestions in NPIC, and also thank for the sample programs and computing resources provided by the 9th professional group in the design institute of NPIC. The research is supported by the discovery fund of NPIC (Number: 2019-06).