Running times per iteration of for the relevant methods implemented for, , and.
The efficient solution of the Frank-Kamenetskii partial differential equation through the implementation of parallelized numerical algorithms or GPUs (Graphics Processing Units) in MATLAB is a natural progression of the work which has been conducted in an area of practical import. There is an on-going interest in the mathematics describing thermal explosions due to the significance of the applications of such models - one example is the chemical processes which occur in grain silos. Solutions which pertain to the different geometries of such a physical process have different physical interpretations, however in this chapter we will consider the Frank-Kamenetskii partial differential equation within the context of the mathematical theory of combustion which according to Frank-Kamenetskii  deals with the combined systems of equations of chemical kinetics and of heat transfer and diffusion. A physical explanation of such a system is often a gas confined within a vessel which then reacts chemically, heating up until it either attains a steady state or explodes.
The focus of this chapter is to investigate the performance of the parallelization power of the GPU vs. the computing power of the CPU within the context of the solution of the Frank-Kamenetskii partial differential equation. GPU computing is the use of a GPU as a co-processor to accelerate CPUs (Central Processing Units) for general purpose scientific and engineering computing. The GPU accelerates applications running on the CPU by offloading some of the compute-intensive and time consuming portions of the code. The rest of the application still runs on the CPU. The reason why the application is seen to run faster is because it is using the extreme parallel processing power of the GPU to boost performance. A CPU consists of 4 to 8 CPU cores while the GPU consists of 100s of smaller cores. Together they operate to crunch through the data in the application and as such it is this massive parallel architecture which gives the GPU its high compute performance.
The methods which will be investigated in this research are implicit methods, such as the Crank-Nicolson method () and the Crank-Nicolson method incorporating the Newton method () . These algorithms pose a serious challenge to the implementation of parallelized architecture as we shall later discuss. We will also consider Rosenbrock methods which are iterative in nature and as was indicated in Harley  showed drastically increased running times as the time period over which the problem was considered increased. Further pitfalls which arise when trying to obtain a solution for the partial differential equation in question when using numerical techniques is firstly the singularity which exists at. This complexity may be dealt with through the use of a Maclaurin expansion which splits the problem into two cases: and.
The second hurdle is the nonlinear source term which may be dealt with using different techniques. In this chapter we will implement the Newton method which acts as an updating mechanism for the nonlinear source term and in so doing maintains the implicit nature of the scheme in a consistent fashion. While the incorporation of the Newton method leads to an increase in the computation time for the Crank-Nicolson difference scheme (see ) there is also an increase in the accuracy and stability of the solution. As such we find that the algorithms we are attempting to employ in the solution of this partial differential equation would benefit from the processing power of a GPU.
In this chapter we will focus on the implementation of the Crank-Nicolson implicit method, employed with and without the Newton method, and two Rosenbrock methods, namely ROS2 and ROWDA3. We consider the effectiveness of running the algorithms on the GPU rather than the CPU and discuss whether these algorithms can in fact be parallelized effectively.
The steady state formulation of the equation to be considered in this chapter was described by Frank-Kamenetskii  who later also considered the time development of such a reaction. The reaction rate depends on the temperature in a nonlinear fashion, generally given by Arrhenius’ law. This nonlinearity is an important characteristic of the combustion phenomena since without it the critical condition for inflammation would disappear causing the idea of combustion to lose its meaning . Thus, in the case of a thermal explosion, the Arrhenius law is maintained by the introduction of the exponential term which acts as a source for the heat generated by the chemical reaction. As such we are able to write an equation modelling the dimensionless temperature distribution in a vessel as
where is a function of the spatial variable and time and the Frank-Kamenetskii parameter is given by
The value of the Frank-Kamanetskii parameter  is related to the critical temperature at which ignition of a thermal explosion takes place and is thus also referred to as the critical value. At values below its critical value a steady state is reached for a given geometry and set of boundary conditions whereas an explosion ensues for values above it. The Laplacian operator takes the form
where is indicative of the shape of the vessel within which the chemical reaction takes place: and represent an infinite slab, infinite circular cylinder and sphere, respectively. The dimensionless parameter is introduced as the critical activation parameter. To be able to speak of combustion the condition must be satisfied due to the fact that the ambient temperature can normally be seen as much smaller in magnitude than the ignition temperature . Equation (1) for was derived by Frank-Kamenetskii . Further work was done by Steggerda  on Frank-Kamenetskii’s original criterion for a thermal explosion showing that a more detailed consideration of the situation is possible. For small a solution was derived for the cylindrical system by Rice , Bodington and Gray  and Chambr. While the steady state case - often termed the Lane-Emden equation of the second kind - has been considered extensively, the time dependent case is also of import and has been studied in ,  and .
In this chapter we consider numerical solutions for equation (1) modelling a thermal explosion within a cylindrical vessel, i.e.. A thermal explosion occurs when the heat generated by a chemical reaction is far greater than the heat lost to the walls of the vessel in which the reaction is taking place. As such this equation is subject to certain boundary conditions given at the walls of the vessel. The appropriate boundary conditions for this problem are
where is the radius of the cylinder. The boundary conditions (4) and (5) imply that the temperature at the vessel walls is kept fixed and the solution is symmetric about the origin.
Frank-Kamenetskii  obtained a steady state solution to this problem with. Zeldovich et al.  considered similarity solutions admitted by (1) for that exhibit blow-up in finite time. These kinds of solutions, while noteworthy, have limited significance due to the restricted form of the initial profiles compatible with the similarity solutions. These solutions correspond to very special initial conditions for the temperature evolution profile, limiting the degree to which results obtained in this manner are applicable. This disadvantage has been noted by Anderson et al.  while analytically investigating the time evolution of the one-dimensional temperature profile in a fusion reactor plasma. A solution which also models blow-up in finite time has been obtained by Harley and Momoniat  via nonlocal symmetries of the steady-state equation.
In Harley  a Crank-Nicolson- and hopscotch scheme were implemented for equation (1) subject to (4) and (5) where and. The nonlinear source term was kept explicit when the Crank-Nicolson method was employed, as commented on by Britz et al.  in whose work the nonlinear term was incorporated in an implicit manner in a style more consistent with the Crank-Nicolson method. Britz et al.  implemented the Crank-Nicolson scheme with the Newton iteration and showed that it outperformed the explicit implementation of the nonlinearity as in  in terms of accuracy. However it does require more computer time as would be expected.
In recent work (see ) the Crank-Nicolson method was implemented with the Newton iteration as done by Britz et al.  by computing a correction set in each iteration to obtain approximate values of the dependent variable at the next time step. The efficiency of the Crank-Nicolson scheme, hopscotch scheme (both of these methods were implemented with an explicit and then an implicit discretisation of the source term) and two versions of the Rosenbrock method were compared . Using the pdepe function in MATLAB and the steady state solution obtained by Frank-Kamenetskii  as a means of comparison, it was found that the incorporation of the Newton method for the Crank-Nicolson- and hopscotch scheme led to increased running times as, where, increased.
Furthermore, it was shown that while the Crank-Nicolson- and hopscotch method (with or without the implementation of the Newton method) performed well in terms of accuracy for and, they were in fact able to outperform pdepe at. The Rosenbrock methods employed (ROS2 and ROWDA3) performed similarly with regards to accuracy, however showed almost an exponential increase in their running times as increased, indicating that using the Crank-Nicolson- or hopscotch scheme may be more efficient. Thus, given that the Rosenbrock methods performed even poorer with regards to running time, it seems reasonable to suggest that implementing the Crank-Nicolson- or hopscotch scheme with a Newton iteration is most ideal. The Crank-Nicolson method using the Newton method as a means of maintaining the implicit nature of the source term in the difference scheme has been used by Anderson and Zienkiewicz . In Harley  and Abd El-Salam and Shehata  the discretisation of the exponential terms were kept explicit, thereby removing the nonlinearity.
As a consequence of these findings and due to the complexity created by the nonlinear source term which serves a critical function in the model, further work regarding faster algorithms for the solution of such an equation are of interest. This chapter will not consider the hopscotch scheme directly as an appropriate method for the solution of the Frank-Kamenetskii partial differential equation due to work done by Feldberg  which indicated that for large values of the algorithm produces the problem of propagational inadequacy which leads to inaccuracies - similar results were obtained in . Given the improved accuracy of the Crank-Nicolson method incorporating the Newton method  - the order of the error for this method is which is only approximately the case for the Crank-Nicolson method without the Newton iteration incorporated  - it seems more fitting to consider an improvement in the computing time of this method. Hence a consideration of such an improvement on the algorithm’s current running time will be the focus of this chapter. The means by which we wish to accomplish this is through the use of the the Parallel Computing Toolbox in MATLAB. It is hoped that this is the next step towards creating fast and effective numerical algorithms for the solution of a partial differential equation such as the one originating from the work of Frank-Kamenetskii .
3. Executing MATLAB on a GPU
The advantage of using the Parallel Computing Toolbox in MATLAB is the fact that it allows one to solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. In this manner one can parallelize numerical algorithms, and in so doing MATLAB applications, without CUDA or MPI programming. Parallelized algorithms such as parfor, used within the context of what is usually a for loop, allows you to offload work from one MATLAB session (the client) to other MATLAB sessions, called workers. You can use multiple workers to take advantage of parallel processing and in this way improve the performance of such loop execution by allowing several MATLAB workers to execute individual loop iterations simultaneously. In this context however we are not able to implement in-built MATLAB functions such as parfor due to the numerical algorithms which we have chosen to consider. The - and method, both implicit, loop through the index until. These iterative steps are not independent of each other, i.e to obtain data at the step the data at the step is required. In a similar fashion the ROS2 and ROWDA3 methods also iterate through dependent loops to obtain a solution. As such we attempt to run the code directly on the GPU instead of the CPU in order to decrease the running time of the algorithms.
The Parallel Computing Toolbox in MATLAB allows one to create data on and transfer it to the GPU so that the resulting GPU array can then be used as an input to enhance built-in functions that support them. The first thing to consider when implementing computations on the GPU is keeping the data on the GPU so that we do not have to transfer it back and forth for each operation - this can be done through the use of the gpuArray command. In this manner computations with such input arguments run on the GPU because the input arguments are already in the GPU memory. One then retrieves the results from the GPU to the MATLAB workspace via the gather command. Having to recall the results from the GPU is costly in terms of computing time and can in certain instances make the implementation of an algorithm on the GPU less efficient than one would expect. Furthermore, the manner in which one codes algorithms for GPUs is of vital importance given certain limitations to the manner in which functions of the Toolbox may be implemented (see ). More importantly however, is whether the method employed can allow for the necessary adjustments in order to improve its performance. In this chapter we will see that there are some problems with implementing the kind of algorithms considered here on the GPU.
In this chapter we are employing MATLAB under Windows 7 (64 bits) on a PC equipped with an i7 2.2 GHz processor with 32 GB of RAM.
3.1. Crank-Nicolson implicit scheme
We will implement the Crank-Nicolson method while maintaing the explicit nature of the nonlinear source term and also apply the method by computing a correction set in each iteration to obtain approximate values of the dependent variable at the next time step through the use of the Newton method . The methodology will be explained briefly here; the reader is referred to [7, 8, 9] for clarification.
When implementing the Crank-Nicolson method we employ the following central-difference approximations for the second-and first-order spatial derivatives respectively
while a forward-difference approximation
is used for the time derivative. We implement a Crank-Nicolson scheme by approximating the second-derivative on the right-hand side of (1) by the implicit Crank-Nicolson  approximation
In a similar fashion the first-derivative on the right-hand side becomes
To impose zero-shear boundary conditions at the edges we approximate the spatial first-derivative by the central-difference approximation (7) which leads to the following condition
As mentioned before the boundary condition (5a) at can pose a problem for the solution of equation (1). One could discretise it directly as a forward difference formula, such as the three-point approximation, and add this to the set of equations to solve when using the Crank-Nicolson scheme. Alternatively one could use the more accurate symmetric approximation, , which introduces a ’fictitious point’ at. This however, would lead to another problem due to the singularity in the differential equation at. Instead we choose to overcome this difficulty by using the Maclaurin expansion
which simplifies the equation for the case. It has been noted by Britz et al.  that using (12) turns out to be more convenient and accurate. Due to the fact that the point would lead to a singularity in equation (1) we structure the code to account for two instances: and. Using (12) for equation (1) we attain the following approximation
to equation (1) at. This approximation has been taken into account in the system given by (16) below. Such an approximation has been used in many numerical algorithms. In Crank and Furzeland , for instance, they presented a modified finite-difference method which eliminates inaccuracies that occur in the standard numerical solution near singularities. The approximation has also been used by Harley and Momoniat  to generate a consistency criteria for initial values at for a Lane-Emden equation of the second-kind. From the equation under consideration (1) an initial condition for is obtained at giving the following
as the initial difference scheme with. Implementing the difference approximations discussed above we obtain the general numerical scheme
where and such that and. This difference scheme (15), including the initial difference condition (14), form a system of equations which are to be solved iteratively.
As indicated by the boundary conditions (5a) and (5b) we consider the problem for and. The domain is sub-divided into equidistant intervals termed, i.e. where. In a similar fashion the domain is sub-divided into intervals of equal length, , through which the scheme iterates. The system will iterate until, i.e. for steps. The system generated by (15) can be written in compact form as
and is solved as follows
The inverse of is calculated using the operator in MATLAB which is more efficient than the inv function. The nonlinear term on the level is dealt with through an implementation of the Newton method  in an iterative fashion as done by Britz et al.  and discussed in . The system is solved where is the set of difference equations created as per (16) such that. The starting vector at is chosen as per the initial condition (4) such that. The Newton iteration converges within 2-3 steps given that changes are usually relatively small.
3.2. Rosenbrock method
We now consider two particular Rosenbrock methods, ROS2 and ROWDA3, as a means of comparison for the effectiveness of the methods discussed in the previous section. The Rosenbrock methods belong to the class of linearly implicit Runge - Kutta methods [11, 17]. They were used successfully for the numerical solution of non-electrochemical stiff partial differential equations, including equations of interest to electrochemistry. For further information regarding the particulars of such methods interested readers are referred to the numerical literature of [17, 28, 29, 30].
The reason for the use of the Rosenbrock methods in this paper is the ease with which they are able to deal with the nonlinear source term and the fact that no Newton iterations are necessary. The advantages of these methods are great efficiency, stability and a smooth error response if ROS2 or ROWDA3 are used (see  for instance) and the ease with which they are able to handle time-dependent and/or nonlinear systems.
We consider equation (1) as the following system
which along with can be written in the compact form
where is the selection matrix containing zeros in those positions where the set of differential algebraic equations has an algebraic equation (i.e. zero on the left-hand side of (18)) and unity in those positions corresponding to the ordinary differential equations. The function can be written as: where the matrix is the Jacobian and the vector arises from the constant terms of the set of differential algebraic equations. We can thus write as
In order to implement the Rosenbrock methods a number of vectors are computed with the order chosen. The general equation given by (22) is solved iteratively to obtain each vector for all specified which will then be used to update the vector for the next time step. We use the notation employed in  for the general equation to be used to obtain the values for
where we define were the function is applied at partly augmented and values and the time derivative is zero in this case since the system does not include functions of time. Having calculated the vectors the solution is obtained from
In this chapter we implement the ROS2 and ROWDA3 methods though there are other variants of the Rosenbrock methods. Lang  described a L-stable second-order formula called ROS2. A third-order variant thereof is called ROWDA3 and described by Roche  and later made more efficient by Lang . The latter is a method favoured by Bieniasz who introduced Rosenbrock methods to electrochemical digital simulation [4, 5]. For a more detailed explanation and discussion regarding the method and its advantages refer to . The focus of the work done here is with regards to whether the Rosenbrock algorithms lend themselves toward parallelized implementation. It has already been noted that functions such as the parfor command cannot be used in this instance. It now remains to consider the method’s performance when run on a GPU via the MATLAB Parallel Computing Toolbox.
4. Discussion of numerical results
The results noticed, as per Table 1, indicate the extent to which implementing the code on the GPU slows down overall performance of the, , ROS2 and ROWDA3 methods. The question is why this would be the case. In Table 1 the results for the different methods run on a CPU were obtained by running the code on one CPU only instead of all of those available to MATLAB on the computer employed. This was done to get a better understanding of the one-on-one performance between the processing units, and yet implementing the code on the GPU still led to poor performance.
To gain a better understanding of these results we consider the baseline structure for our code:
In doing so, we realise that the main components thereof are matrix and elementwise vector operations. In order to understand why we are achieving the results we do (see Table 1) we run a few simple ’test’-codes to consider the speed taken by the CPU vs. the GPU to perform such elementary operations as and where is a matrix and and are vectors. In Figure 1 we see the speed of the CPU over the GPU computed as. You will notice that as the size of the matrix and corresponding vector increases so too does the speed at which the GPU is able to compute allowing it to overtake the CPU. This is what one would expect given that the GPU will only ’kick in’ once the CPU is overloaded with data structures too large for it to compute effectively. Thus the efficiency in terms of running time of the code provided above is heavily dependent upon the size of the matrices and. At this juncture it is important to remember that we are considering the range with which means that our matrix is a matrix and as such not large enough to give the GPU the chance to expose its ability to improve the performance of the algorithm. The reason for the choice in the size of the matrix for the problem considered is twofold: (1) it is due to the influence of the ratio which one usually tries to keep close to for reasons of stability, and (2) the limitations of memory of the PC being used.
The next step in this evaluative process is to now consider the speed at which vector operations are performed. This was done in a similar fashion to the previous case by considering the speed taken by the CPU and GPU to perform the elementwise operation where and are vectors. The ratios of the speeds were also considered for the in-built function arrayfun which performs elementwise operations on all its inputs. It can clearly be seen in Figure 2 that the in-built function outperforms the normal operation. What is interesting in this case is that the size of the vector required for the GPU to outperform the CPU is very large - we considered vectors of sizes between 200 000 and 201 000 as indicated. For smaller vector lengths the GPU is completely outperformed by the speed at which calculations are done on the CPU. As such, to improve the speed at which these vector calculations are performed we would either (1) have to diminish to the degree needed to obtain vectors of the required length (2) or be required to move the vectors from the GPU memory to the CPU memory every time calculations need to be made. The first approach would require a memory capacity beyond that of the computer used here and the second would greatly increase the running time of the algorithm and as such is not worth implementing.
As a means of further investigation we consider the code as a test case for the use of the arrayfun function. Obviously implementing this in-built function as followswhere the @myCrank function performs the loop through, instead of the code presented previously produces incorrect results. The results obtained do however support our findings that the arrayfun function is able to increase the speed with which elementwise operations are performed. In this instance arrayfun is computing on the GPU since the inputs are all GPU array objects. We found for with as we decreased that computing on the GPU was faster than doing so on the CPU: for and the ratios were and respectively. This makes sense given that smaller values of would increase the sizes of the matrices and and the vectors and. As such, it seems likely that using a PC with a greater memory capacity would lead to the GPU outperforming the CPU by a large margin as decreases.
4.1. Influence of changing parameter values on the running time of the algorithms
Just a few brief comments upon the results obtained for, , ROS2 and ROWDA3 for varying values of and will be made in this section. Firstly we considered the schemes for and and and then we also considered the case for with and. The reader will notice considering Tables 2 and 3 that there does not seem to be any noticeable trend to the results obtained. As such the values of and do not seem to have a meaningful impact on the speed at which the algorithms compute.
5. Concluding remarks
The implementation of numerical algorithms such as those considered in this chapter are widely used for the solution of many differential equations which model physical processes and applications. As such it is of vital importance that we be able to perform such calculations at high speed given the requirement of fine grids to improve accuracy. It is in this context that the use of GPUs becomes of prime importance. However it is not simply a matter of running the algorithm on the GPU - the method employed needs to lend itself to being adjusted in the required manner so that the parallel processing power of the GPU may be taken advantage of. Though we found that the numerical methods considered here were not entirely suited to being implemented on the GPU as we would have hoped we were able to explain why this was the case.
This work has investigated the effectiveness of matrix and elementwise operations when run on a GPU vs. a CPU and found that the speed taken to do such operations heavily relies on the choice of. It was discovered that the introduction of the nonlinear source term is problematic due to the length of time taken to do elementwise calculations on the GPU. While matrix operations were also shown to be slow it was more specifically this aspect of the code which increased the running time.
We also discovered the power of the in-built function arrayfun which was able to improve upon the performance of the GPU with regards to computing time to the degree that it outperformed the CPU even for a grid with ’large’, i.e. small matrices and vectors within the computations. As the grid became finer the performance of the GPU over the CPU improved, indicating the impact of the size of the matrices upon which computations are being performed and the degree to which arrayfun is able to improve computations occurring on the GPU. Thus, the manner in which arrayfun computes elementwise is extremely efficient and if such a structure could be developed for matrix operations then that would truly allow the performance of the GPU to overtake that of CPU computing.
What the work in this chapter has shown is that the structures of the GPU and the Parallel Computing Toolbox in MATLAB are such that while certain algorithms have the ability to be adjusted for improved performance not all methods do. In particular it seems clear that implicit methods with matrix and vector operations will in fact run much slower on the GPU than the CPU. Thus whether GPU computing is able to improve the performance of a numerical scheme is very much dependent upon the type of computations which need to be done. In our case we discovered that the implicit and nonlinear nature of our numerical schemes do not lend themselves towards improved performance via the implementation of the parallel processing power of a GPU.
I would like to thank Mr. Dario Fanucchi for invaluable discussions.