Numerical Simulation of the Frank-Kamenetskii PDE: GPU vs. CPU Computing

The efficient solution of the Frank-Kamenetskii partial differential equation through the implementation of parallelized numerical algorithms or GPUs (Graphics Processing Units) in MATLAB is a natural progressionof the workwhich has been conducted in an area of practical import. There is an on-going interest in the mathematics describing thermal explosions due to the significance of the applications of such models one example is the chemical processes which occur in grain silos. Solutions which pertain to the different geometries of such a physical process have different physical interpretations, however in this chapter we will consider the Frank-Kamenetskii partial differential equation within the context of the mathematical theory of combustion which according to Frank-Kamenetskii [16] deals with the combined systems of equations of chemical kinetics and of heat transfer and diffusion. A physical explanation of such a system is often a gas confined within a vessel which then reacts chemically, heating up until it either attains a steady state or explodes.

parallelized architecture as we shall later discuss. We will also consider Rosenbrock methods which are iterative in nature and as was indicated in Harley [22] showed drastically increased running times as the time period over which the problem was considered increased. Further pitfalls which arise when trying to obtain a solution for the partial differential equation in question when using numerical techniques is firstly the singularity which exists at x = 0. This complexity may be dealt with through the use of a Maclaurin expansion which splits the problem into two cases: x = 0 and x = 0.
The second hurdle is the nonlinear source term which may be dealt with using different techniques. In this chapter we will implement the Newton method which acts as an updating mechanism for the nonlinear source term and in so doing maintains the implicit nature of the scheme in a consistent fashion. While the incorporation of the Newton method leads to an increase in the computation time for the Crank-Nicolson difference scheme (see [22]) there is also an increase in the accuracy and stability of the solution. As such we find that the algorithms we are attempting to employ in the solution of this partial differential equation would benefit from the processing power of a GPU.
In this chapter we will focus on the implementation of the Crank-Nicolson implicit method, employed with and without the Newton method, and two Rosenbrock methods, namely ROS2 and ROWDA3. We consider the effectiveness of running the algorithms on the GPU rather than the CPU and discuss whether these algorithms can in fact be parallelized effectively.

Model
The steady state formulation of the equation to be considered in this chapter was described by Frank-Kamenetskii [16] who later also considered the time development of such a reaction. The reaction rate depends on the temperature in a nonlinear fashion, generally given by Arrhenius' law. This nonlinearity is an important characteristic of the combustion phenomena since without it the critical condition for inflammation would disappear causing the idea of combustion to lose its meaning [16]. Thus, in the case of a thermal explosion, the Arrhenius law is maintained by the introduction of the exponential term which acts as a source for the heat generated by the chemical reaction. As such we are able to write an equation modelling the dimensionless temperature distribution in a vessel as where u is a function of the spatial variable x and time t and the Frank-Kamenetskii parameter δ is given by The value of the Frank-Kamanetskii parameter [16] δ is related to the critical temperature at which ignition of a thermal explosion takes place and is thus also referred to as the critical value. At values below its critical value δ cr a steady state is reached for a given geometry and set of boundary conditions whereas an explosion ensues for values above it. The Laplacian operator takes the form where k is indicative of the shape of the vessel within which the chemical reaction takes place: k = 0, 1 and 2 represent an infinite slab, infinite circular cylinder and sphere, respectively. The dimensionless parameter = RT 0 E is introduced as the critical activation parameter. To be able to speak of combustion the condition 1 must be satisfied due to the fact that the ambient temperature can normally be seen as much smaller in magnitude than the ignition temperature [16]. Equation (1) for = 0 was derived by Frank-Kamenetskii [16]. Further work was done by Steggerda [31] on Frank-Kamenetskii's original criterion for a thermal explosion showing that a more detailed consideration of the situation is possible. For small x a solution was derived for the cylindrical system by Rice [27], Bodington and Gray [6] and Chambré [10]. While the steady state case -often termed the Lane-Emden equation of the second kind -has been considered extensively, the time dependent case is also of import and has been studied in [2], [32] and [33].
In this chapter we consider numerical solutions for equation (1) modelling a thermal explosion within a cylindrical vessel, i.e. k = 1. A thermal explosion occurs when the heat generated by a chemical reaction is far greater than the heat lost to the walls of the vessel in which the reaction is taking place. As such this equation is subject to certain boundary conditions given at the walls of the vessel. The appropriate boundary conditions for this problem are where R = 1 is the radius of the cylinder. The boundary conditions (4) and (5) imply that the temperature at the vessel walls is kept fixed and the solution is symmetric about the origin.
Frank-Kamenetskii [16] obtained a steady state solution to this problem with = 0. Zeldovich et al. [34] considered similarity solutions admitted by (1) for k = 1 that exhibit blow-up in finite time. These kinds of solutions, while noteworthy, have limited significance due to the restricted form of the initial profiles compatible with the similarity solutions. These solutions correspond to very special initial conditions for the temperature evolution profile, limiting the degree to which results obtained in this manner are applicable. This disadvantage has been noted by Anderson et al. [3] while analytically investigating the time evolution of the one-dimensional temperature profile in a fusion reactor plasma. A solution which also models blow-up in finite time has been obtained by Harley and Momoniat [18] via nonlocal symmetries of the steady-state equation.
In Harley [21] a Crank-Nicolson-and hopscotch scheme were implemented for equation (1) subject to (4) and (5) where δ = 1 and = 0. The nonlinear source term was kept explicit when the Crank-Nicolson method was employed, as commented on by Britz et al. [9] in whose work the nonlinear term was incorporated in an implicit manner in a style more consistent with the Crank-Nicolson method. Britz et al. [9] implemented the Crank-Nicolson scheme with the Newton iteration and showed that it outperformed the explicit implementation of the nonlinearity as in [21] in terms of accuracy. However it does require more computer time as would be expected.
In recent work (see [22]) the Crank-Nicolson method was implemented with the Newton iteration as done by Britz et al. [9] by computing a correction set in each iteration to obtain approximate values of the dependent variable at the next time step. The efficiency of the Crank-Nicolson scheme, hopscotch scheme (both of these methods were implemented with an explicit and then an implicit discretisation of the source term) and two versions of the Rosenbrock method were compared [22]. Using the pdepe function in MATLAB and the steady state solution obtained by Frank-Kamenetskii [16] as a means of comparison, it was found that the incorporation of the Newton method for the Crank-Nicolson-and hopscotch scheme led to increased running times as T, where 0 ≤ t ≤ T, increased.
Furthermore, it was shown that while the Crank-Nicolson-and hopscotch method (with or without the implementation of the Newton method) performed well in terms of accuracy for T = 0.3 and 0.5, they were in fact able to outperform pdepe at T = 4. The Rosenbrock methods employed (ROS2 and ROWDA3) performed similarly with regards to accuracy, however showed almost an exponential increase in their running times as T increased, indicating that using the Crank-Nicolson-or hopscotch scheme may be more efficient. Thus, given that the Rosenbrock methods performed even poorer with regards to running time, it seems reasonable to suggest that implementing the Crank-Nicolson-or hopscotch scheme with a Newton iteration is most ideal. The Crank-Nicolson method using the Newton method as a means of maintaining the implicit nature of the source term in the difference scheme has been used by Anderson and Zienkiewicz [2]. In Harley [21] and Abd El-Salam and Shehata [1] the discretisation of the exponential terms were kept explicit, thereby removing the nonlinearity.
As a consequence of these findings and due to the complexity created by the nonlinear source term which serves a critical function in the model, further work regarding faster algorithms for the solution of such an equation are of interest. This chapter will not consider the hopscotch scheme directly as an appropriate method for the solution of the Frank-Kamenetskii partial differential equation due to work done by Feldberg [15] which indicated that for large values of β = t x 2 the algorithm produces the problem of propagational inadequacy which leads to inaccuracies -similar results were obtained in [22]. Given the improved accuracy of the Crank-Nicolson method incorporating the Newton method [22] -the order of the error for this method is O( t 2 ) which is only approximately the case for the Crank-Nicolson method without the Newton iteration incorporated [9] -it seems more fitting to consider an improvement in the computing time of this method. Hence a consideration of such an improvement on the algorithm's current running time will be the focus of this chapter. The means by which we wish to accomplish this is through the use of the the Parallel Computing Toolbox in MATLAB. It is hoped that this is the next step towards creating fast and effective numerical algorithms for the solution of a partial differential equation such as the one originating from the work of Frank-Kamenetskii [16].

Executing MATLAB on a GPU
The advantage of using the Parallel Computing Toolbox in MATLAB is the fact that it allows one to solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. In this manner one can parallelize numerical algorithms, and in so doing MATLAB applications, without CUDA or MPI programming. Parallelized algorithms such as parfor, used within the context of what is usually a for loop, allows you to offload work from one MATLAB session (the client) to other MATLAB sessions, called workers. You can use multiple workers to take advantage of parallel processing and in this way improve the performance of such loop execution by allowing several MATLAB workers to execute individual loop iterations simultaneously. In this context however we are not able to implement in-built MATLAB functions such as parfor due to the numerical algorithms which we have chosen to consider. The CN-and CN N method, both implicit, loop through the index m until t 0 + m t = T. These iterative steps are not independent of each other, i.e to obtain data at the m + 1 th step the data at the m th step is required. In a similar fashion the ROS2 and ROWDA3 methods also iterate through dependent loops to obtain a solution. As such we attempt to run the code directly on the GPU instead of the CPU in order to decrease the running time of the algorithms.
The Parallel Computing Toolbox in MATLAB allows one to create data on and transfer it to the GPU so that the resulting GPU array can then be used as an input to enhance built-in functions that support them. The first thing to consider when implementing computations on the GPU is keeping the data on the GPU so that we do not have to transfer it back and forth for each operation -this can be done through the use of the gpuArray command. In this manner computations with such input arguments run on the GPU because the input arguments are already in the GPU memory. One then retrieves the results from the GPU to the MATLAB workspace via the gather command. Having to recall the results from the GPU is costly in terms of computing time and can in certain instances make the implementation of an algorithm on the GPU less efficient than one would expect. Furthermore, the manner in which one codes algorithms for GPUs is of vital importance given certain limitations to the manner in which functions of the Toolbox may be implemented (see [25]). More importantly however, is whether the method employed can allow for the necessary adjustments in order to improve its performance. In this chapter we will see that there are some problems with implementing the kind of algorithms considered here on the GPU.
In this chapter we are employing MATLAB under Windows 7 (64 bits) on a PC equipped with an i7 2.2 GHz processor with 32 GB of RAM.

Crank-Nicolson implicit scheme
We will implement the Crank-Nicolson method while maintaing the explicit nature of the nonlinear source term and also apply the method by computing a correction set in each iteration to obtain approximate values of the dependent variable at the next time step through the use of the Newton method [26]. The methodology will be explained briefly here; the reader is referred to [7][8][9] for clarification.
When implementing the Crank-Nicolson method we employ the following central-difference approximations for the second-and first-order spatial derivatives respectively while a forward-difference approximation is used for the time derivative. We implement a Crank-Nicolson scheme by approximating the second-derivative on the right-hand side of (1) by the implicit Crank-Nicolson [12] approximation In a similar fashion the first-derivative on the right-hand side becomes To impose zero-shear boundary conditions at the edges we approximate the spatial first-derivative by the central-difference approximation (7) which leads to the following condition u m −1 = u m 1 .
As mentioned before the boundary condition (5a) at x 0 = 0 can pose a problem for the solution of equation (1). One could discretise it directly as a forward difference formula, such as the three-point approximation −3u m 0 + 4u m 1 − u m 2 = 0, and add this to the set of equations to solve when using the Crank-Nicolson scheme. Alternatively one could use the more accurate symmetric approximation, u m −1 = u m 1 , which introduces a 'fictitious point' at x = − x. This however, would lead to another problem due to the singularity in the differential equation at x 0 = 0. Instead we choose to overcome this difficulty by using the Maclaurin expansion which simplifies the equation for the case x 0 = 0. It has been noted by Britz et al. [9] that using (12) turns out to be more convenient and accurate. Due to the fact that the point x 0 = 0 would lead to a singularity in equation (1) we structure the code to account for two instances: x = 0 and x = 0. Using (12) for equation (1) we attain the following approximation to equation (1) at x 0 = 0. This approximation has been taken into account in the system given by (16) below. Such an approximation has been used in many numerical algorithms. In Crank and Furzeland [13], for instance, they presented a modified finite-difference method which eliminates inaccuracies that occur in the standard numerical solution near singularities. The approximation has also been used by Harley and Momoniat [19] to generate a consistency criteria for initial values at x 0 = 0 for a Lane-Emden equation of the second-kind. From the equation under consideration (1) an initial condition for u(x, t) is obtained at x 0 = 0 giving the following as the initial difference scheme with β = t x 2 . Implementing the difference approximations discussed above we obtain the general numerical scheme where x n = n x and β = t x 2 such that γ n = β 1 − 1 2n and λ n = β 1 + 1 2n . This difference scheme (15), including the initial difference condition (14), form a system of equations which are to be solved iteratively.

As indicated by the boundary conditions (5a) and (5b) we consider the problem for
In a similar fashion the domain [0, T] is sub-divided into M intervals of equal length, t, through which the scheme iterates. The system will iterate until t m + t = T, i.e. for M = T/ t steps. The system generated by (15) can be written in compact form as and is solved as follows The inverse of A is calculated using the \ operator in MATLAB which is more efficient than the inv function. The nonlinear term on the m + 1 th level is dealt with through an implementation of the Newton method [26] in an iterative fashion as done by Britz et al. [9] and discussed in [8]. The system Jδu = −F(u) is solved where F is the set of difference equations created as per (16) such that F(u) = 0. The starting vector at t = 0 is chosen as per the initial condition (4) such that u = 0. The Newton iteration converges within 2-3 steps given that changes are usually relatively small.

Rosenbrock method
We now consider two particular Rosenbrock methods, ROS2 and ROWDA3, as a means of comparison for the effectiveness of the methods discussed in the previous section. The Rosenbrock methods belong to the class of linearly implicit Runge -Kutta methods [11,17]. They were used successfully for the numerical solution of non-electrochemical stiff partial differential equations, including equations of interest to electrochemistry. For further information regarding the particulars of such methods interested readers are referred to the numerical literature of [17,[28][29][30].
The reason for the use of the Rosenbrock methods in this paper is the ease with which they are able to deal with the nonlinear source term and the fact that no Newton iterations are necessary. The advantages of these methods are great efficiency, stability and a smooth error response if ROS2 or ROWDA3 are used (see [4] for instance) and the ease with which they are able to handle time-dependent and/or nonlinear systems.
We consider equation (1) as the following system which along with 0 = u N can be written in the compact form where S = diag(1, 1, 1, ..., 1, 0) is the selection matrix containing zeros in those positions where the set of differential algebraic equations has an algebraic equation (i.e. zero on the left-hand side of (18)) and unity in those positions corresponding to the ordinary differential equations. The function F(t, u) can be written as: F(t, u) = Ju + s where the matrix J is the Jacobian and the vector s arises from the constant terms of the set of differential algebraic equations. We can thus write F(t, u) = Ju + s as 20) such that In order to implement the Rosenbrock methods a number s of k i vectors are computed with s the order chosen. The general equation given by (22) is solved iteratively to obtain each vector k i for all i specified which will then be used to update the vector u for the next time step. We use the notation employed in [7] for the general equation to be used to obtain the values for where we define M = S κ − tF u were the function F is applied at partly augmented t and u values and the time derivative F t is zero in this case since the system does not include functions of time. Having calculated the s k i vectors the solution is obtained from where the m i are weighting factors included in the tables of constants specified for each method (see [4] and [7]).
In this chapter we implement the ROS2 and ROWDA3 methods though there are other variants of the Rosenbrock methods. Lang [24] described a L-stable second-order formula called ROS2. A third-order variant thereof is called ROWDA3 and described by Roche [28] and later made more efficient by Lang [23]. The latter is a method favoured by Bieniasz who introduced Rosenbrock methods to electrochemical digital simulation [4,5]. For a more detailed explanation and discussion regarding the method and its advantages refer to [7]. The focus of the work done here is with regards to whether the Rosenbrock algorithms lend themselves toward parallelized implementation. It has already been noted that functions such as the parfor command cannot be used in this instance. It now remains to consider the method's performance when run on a GPU via the MATLAB Parallel Computing Toolbox.

Discussion of numerical results
The results noticed, as per Table 1, indicate the extent to which implementing the code on the GPU slows down overall performance of the CN, CN N , ROS2 and ROWDA3 methods. The question is why this would be the case. In Table 1 the results for the different methods run on a CPU were obtained by running the code on one CPU only instead of all of those available to MATLAB on the computer employed. This was done to get a better understanding of the one-on-one performance between the processing units, and yet implementing the code on the GPU still led to poor performance.
To gain a better understanding of these results we consider the baseline structure for our CN code: A = gpuArray(A); B = gpuArray(B); u0 = gpuArray(u0); for m = 1 : T b = delta. * dt. * exp((1 + eps. * u0).\(u0)); u0 = mldivide(A, (B * u0 + b)); end In doing so, we realise that the main components thereof are matrix and elementwise vector operations. In order to understand why we are achieving the results we do (see Table 1) we run a few simple 'test'-codes to consider the speed taken by the CPU vs. the GPU to perform such elementary operations as C\d and d. * f where C is a matrix and d and f are vectors. In Figure 1 we see the speed of the CPU over the GPU computed as CPU running time GPU running time . You will notice that as the size of the matrix and corresponding vector increases so too does the speed at which the GPU is able to compute C\d allowing it to overtake the CPU. This is what one would expect given that the GPU will only 'kick in' once the CPU is overloaded with data structures too large for it to compute effectively. Thus the efficiency in terms of running time of the code provided above is heavily dependent upon the size of the matrices A and B. At this juncture it is important to remember that we are considering the range x ∈ [0, 1] with x = 0.1 which means that our A matrix is a 10 × 10 matrix and as such not large enough to give the GPU the chance to expose its ability to improve the performance of the algorithm. The reason for the choice in the size of the matrix for the problem considered is twofold: (1) it is due to the influence of the ratio β = t/ x 2 which one usually tries to keep close to 1 for reasons of stability, and (2) the limitations of memory of the PC being used. The next step in this evaluative process is to now consider the speed at which vector operations are performed. This was done in a similar fashion to the previous case by considering the speed taken by the CPU and GPU to perform the elementwise operation d. * f where d and f are vectors. The ratios of the speeds CPU running time GPU running time were also considered for the in-built function arrayfun which performs elementwise operations on all its inputs. It can clearly be seen in Figure 2 that the in-built function outperforms the normal . * operation. What is interesting in this case is that the size of the vector required for the GPU to outperform the CPU is very large -we considered vectors of sizes between 200 000 and 201 000 as indicated. For smaller vector lengths the GPU is completely outperformed by the speed at which calculations are done on the CPU. As such, to improve the speed at which these vector calculations are performed we would either (1) have to diminish x to the degree needed to obtain vectors of the required length (2) or be required to move the vectors from the GPU memory to the CPU memory every time calculations need to be made. The first approach would require a memory capacity beyond that of the computer used here and the second would greatly increase the running time of the algorithm and as such is not worth implementing.
As a means of further investigation we consider the CN code as a test case for the use of the arrayfun function. Obviously implementing this in-built function as follows A = gpuArray(A); B = gpuArray(B); u0 = gpuArray(u0); u0 = arrayfun(@myCrank, u0, A, B)     where the @myCrank function performs the loop through m, instead of the code presented previously produces incorrect results. The results obtained do however support our findings that the arrayfun function is able to increase the speed with which elementwise operations are performed. In this instance arrayfun is computing on the GPU since the inputs are all GPU array objects. We found for T = 10 with t = 0.0001 as we decreased x that computing on the GPU was faster than doing so on the CPU: for x = 0.1 and 0.01 the ratios were CPU running time GPU running time = 1.5709 and CPU running time GPU running time = 6.5906 respectively. This makes sense given that smaller values of x would increase the sizes of the matrices A and B and the vectors b and u0. As such, it seems likely that using a PC with a greater memory capacity would lead to the GPU outperforming the CPU by a large margin as x decreases.

Influence of changing parameter values on the running time of the algorithms
Just a few brief comments upon the results obtained for CN, CN N , ROS2 and ROWDA3 for varying values of and δ will be made in this section. Firstly we considered the schemes for δ = 1 and = 0.01, 0.05, 0.1 and 0.25 and then we also considered the case for = 0 with δ = 0.1, 1, 2 and 3. The reader will notice considering Tables 2 and 3 that there does not seem to be any noticeable trend to the results obtained. As such the values of and δ do not seem to have a meaningful impact on the speed at which the algorithms compute.

Concluding remarks
The implementation of numerical algorithms such as those considered in this chapter are widely used for the solution of many differential equations which model physical processes and applications. As such it is of vital importance that we be able to perform such calculations at high speed given the requirement of fine grids to improve accuracy. It is in this context that the use of GPUs becomes of prime importance. However it is not simply a matter of running the algorithm on the GPU -the method employed needs to lend itself to being adjusted in the required manner so that the parallel processing power of the GPU may be taken advantage of. Though we found that the numerical methods considered here were not entirely suited to being implemented on the GPU as we would have hoped we were able to explain why this was the case. This work has investigated the effectiveness of matrix and elementwise operations when run on a GPU vs. a CPU and found that the speed taken to do such operations heavily relies on the choice of x. It was discovered that the introduction of the nonlinear source term is problematic due to the length of time taken to do elementwise calculations on the GPU. While matrix operations were also shown to be slow it was more specifically this aspect of the code which increased the running time.
We also discovered the power of the in-built function arrayfun which was able to improve upon the performance of the GPU with regards to computing time to the degree that it outperformed the CPU even for a grid with 'large' x, i.e. small matrices and vectors within the computations. As the grid became finer the performance of the GPU over the CPU improved, indicating the impact of the size of the matrices upon which computations are being performed and the degree to which arrayfun is able to improve computations occurring on the GPU. Thus, the manner in which arrayfun computes elementwise is extremely efficient and if such a structure could be developed for matrix operations then that would truly allow the performance of the GPU to overtake that of CPU computing.
What the work in this chapter has shown is that the structures of the GPU and the Parallel Computing Toolbox in MATLAB are such that while certain algorithms have the ability to be adjusted for improved performance not all methods do. In particular it seems clear that implicit methods with matrix and vector operations will in fact run much slower on the GPU than the CPU. Thus whether GPU computing is able to improve the performance of a numerical scheme is very much dependent upon the type of computations which need to be done. In our case we discovered that the implicit and nonlinear nature of our numerical schemes do not lend themselves towards improved performance via the implementation of the parallel processing power of a GPU.