A Robust and Flexible Control System to Reduce Environmental Effects of Thermal Power Plants

Introduction
Regulations on environmental effects due to such issues as nitrogen oxide (NOx) and carbon monoxide (CO) emissions from thermal power plants have become stricter [1]; hence the need for compliance with these regulations has been increasing. To meet this need, several technologies with respect to fuel combustion, exhaust gas treatment and operational control have been developed [2-4]. The technologies for fuel combustion and exhaust gas treatment include the low-NOx burner and the air quality control system, which reduce the impact on the environment by physical and chemical means. The operational control technology for thermal power plants, in contrast, must constantly accommodate changes in operational conditions, and it is difficult to realize operational control that responds to the resulting combustion properties. To overcome this issue, the operational control must be able to reduce NOx and CO emissions flexibly in accordance with such changes. Robustness is also required in such control because the measured NOx and CO data often include noise. Therefore, a robust and flexible plant control system is strongly desired to reduce environmental effects from thermal power plants efficiently.

Several studies have proposed plant control technologies to reduce the environmental effects [4-10]. These technologies are classified into two types: model-based and non-model-based methods. The former include an optimization algorithm and a numerical model that estimates plant properties using neural networks (NNs) [11,12] and multivariable model predictive control [13]; the optimization algorithm searches for optimal control signals to reduce NOx and CO emissions using the numerical model. The latter have no models and generate the optimal control signals by fuzzy logic [14]. A fuzzy logic controller outputs the optimal control signals for multivariable inputs using fuzzy rule bases.
The fuzzy rule bases are based on a priori knowledge of plant control, and they can be tuned by parameters. These technologies require measured plant data for initial tuning of the model properties and rule parameters when they are installed in plants, and it usually takes some time to collect enough plant data. In addition, the search for control […] estimation accuracy. This method adjusts the radii parameters considering the distances among the learning data. Consequently, the Gaussian basis functions can cover the input space properly, and both high estimation accuracy and practical computational speed are achieved. The second problem is to improve the flexibility of the learning algorithm. The performance of the RL depends on the definition of a reward function, which is equivalent to an evaluation function. The reward function has to be defined so that the RL algorithm can reach the desired goal for the problem. When the RL is applied to thermal power plant control, the properties of the model change in accordance with operational changes, so the reward function has to be changed flexibly for these operational changes. However, it is quite difficult to prepare reward functions for all patterns of operational conditions in advance. To overcome this second problem, the authors introduce a reward function with variable parameters and propose an automatic reward adjustment method [22]. The proposed method adjusts the variable parameters of the reward function automatically based on the NOx and CO emissions obtained in the learning process. As a result, the proposed method can obtain proper reward functions for all kinds of operational conditions. The following sections outline the proposed control system and its newly proposed methods.
Simulations clarify the advantages of the proposed system with respect to the following points: the estimation accuracy and computational time of the RBF network, the flexibility of the control logic, and the robustness of control against noise in the data.

Figure 1 shows the basic structure of the proposed control system. This system consists of a plant property estimation part and an operation optimizing part. The plant property estimation part includes a statistical model together with measurement and numerical calculation DBs. The statistical model estimates the NOx and CO emission properties of thermal power plants. It is difficult to express these properties as mathematical equations because they have strong nonlinearities. The proposed system therefore employs the RBF network as the statistical model, which can estimate NOx and CO emissions for the control variables using data stored in the DBs. The measurement DBs store the measured NOx and CO data for some control variables, and the numerical calculation DB stores NOx and CO values for control variables calculated by combustion analysis [15]. The control variables correspond to the input of the statistical model, and the estimated NOx and CO emissions correspond to its output. The statistical model can be modified with measured data obtained during plant operation. Previous studies have addressed model-based control technologies to reduce environmental effects from thermal power plants [4,6-8], but none of them employed numerical calculation DBs in addition to the measured-data DBs. As the model can be tuned using the calculation DBs in advance, no time needs to be spent on initial tuning at installation. In addition, the model can be tuned after installation using the data in the measurement DB. The operation optimizing part includes an RL agent, a reward calculation module, a reward adjustment module and a learning result DB. The learning procedure is as follows.
First, the statistical model calculates and outputs the model outputs for the model inputs changed by the RL agent. Secondly, the reward calculation module calculates a reward using the model outputs and gives it to the RL agent. Thirdly, the RL agent learns its control logic. Learning results are stored in the learning result DB, and they are converted into modification signals. The control signals of the plant are generated by adding the modification signals to the original control signals of the basic controller. The reward adjustment module adjusts the reward parameters using the model outputs and the calculated reward. A normalized Gaussian network (NGnet) [23] is employed as the structure of the RL agent. The learning algorithm of the NGnet is an actor-critic method [18], which is appropriate for learning in a continuous environment.

RBF network
The basic structure of the RBF network is shown in Fig. 2. The RBF network has three layers: an input layer, a hidden layer with Gaussian functions, and an output layer. First, the J-dimensional input vector enters the input layer. Secondly, the Gaussian function values are calculated from the input in the hidden layer. Finally, the P-dimensional output vector is calculated from the Gaussian function values and the weight parameters in the output layer. The RBF network is preferred for constructing a response surface due to the following properties.

- The RBF network avoids overfitting through the weight-decay parameter [16], which reduces the influence of noise included in the learning data.
- The RBF network does not need iterative calculations for learning of weight parameters, unlike back propagation [12].

Here, the input and output of the RBF network are denoted as $\mathbf{x} = (x_1, \dots, x_J)$ and $\mathbf{y} = (y_1, \dots, y_P)$, and the $p$-th output is calculated as

$$y_p(\mathbf{x}) = \sum_{l=1}^{N_M} u_{lp} h_l(\mathbf{x}), \qquad h_l(\mathbf{x}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{c}_l \rVert^2}{r_l^2}\right)$$

where $h_l(\mathbf{x})$ is the Gaussian function value of the $l$-th basis function, $N_M$ is the number of basis functions, $u_{lp}$ is the weight parameter between the hidden layer and the output layer, and $\mathbf{c}_l$ and $r_l$ are the center coordinates and radius of the $l$-th basis function, respectively. The parameters $\mathbf{c}_l$ and $r_l$ should be determined appropriately because they strongly influence estimation accuracy. In this chapter, the center coordinates are set to the learning data, and the radii are adjusted by the proposed radius adjustment method described later.
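As a minimal illustration of this forward pass (the variable names and toy numbers below are our own, not from the chapter), the output can be computed as:

```python
import numpy as np

def rbf_forward(x, centers, radii, U):
    """Forward pass of an RBF network.

    x:       (J,) input vector
    centers: (N_M, J) Gaussian centers c_l
    radii:   (N_M,) radii r_l
    U:       (N_M, P) weights u_lp between hidden and output layer
    Returns the (P,) output vector.
    """
    # Gaussian function values h_l(x) = exp(-||x - c_l||^2 / r_l^2)
    d2 = np.sum((centers - x) ** 2, axis=1)
    h = np.exp(-d2 / radii ** 2)
    # Output layer: weighted sum of the basis activations
    return h @ U

# Tiny example: two basis functions, one scalar output
centers = np.array([[0.0], [1.0]])
radii = np.array([1.0, 1.0])
U = np.array([[1.0], [2.0]])
y = rbf_forward(np.array([0.0]), centers, radii, U)
```

Here the first basis is fully activated (h = 1) and the second contributes exp(-1), so the output is 1 + 2e^{-1}.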
Here, $N_D$ is the number of learning data and $\lambda$ is a weight-decay parameter that reduces the influence of noise included in the learning data. The proposed control system can realize robust control by tuning this parameter in accordance with the learning data. Partially differentiating both sides of the error function with respect to $u_{lp}$ and substituting the defined matrices yields the solution [16]

$$U = \left(H^{\mathsf{T}} H + \lambda I\right)^{-1} H^{\mathsf{T}} T \qquad (8)$$

where $H$ is the $N_D \times N_M$ matrix of Gaussian function values $h_l(\mathbf{x}_d)$, $T$ is the $N_D \times P$ matrix of teacher data, and $I$ is the identity matrix. The learning of the RBF network can be described as the calculation of the weight matrix $U$ given by Eq. (8).
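This weight calculation is a standard ridge-type linear solve, which can be sketched as follows (the helper names and the toy sine data are illustrative, not from the chapter):

```python
import numpy as np

def rbf_design_matrix(X, centers, radii):
    """H[d, l] = h_l(x_d), the Gaussian value of basis l at datum d."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / radii[None, :] ** 2)

def fit_weights(X, T, centers, radii, lam):
    """Regularized least-squares solve in the spirit of Eq. (8):
    U = (H^T H + lam * I)^{-1} H^T T.
    lam is the weight-decay parameter; lam > 0 damps fitting to noise."""
    H = rbf_design_matrix(X, centers, radii)
    N_M = centers.shape[0]
    return np.linalg.solve(H.T @ H + lam * np.eye(N_M), H.T @ T)

# Example: centers placed on the learning data, as in the chapter
X = np.linspace(-1.0, 1.0, 9).reshape(-1, 1)
T = np.sin(np.pi * X)
U = fit_weights(X, T, X, np.full(9, 0.5), lam=1e-3)
```

Increasing `lam` shrinks the weight norm, which is the mechanism behind the robustness claim above: a larger weight decay yields a smoother response surface that does not chase noisy data points.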

Reinforcement learning

Basic algorithm
The NGnet for learning of the RL agent learns its action, i.e., its control logic, and the state value by placing Gaussian basis functions on its state space. Here, the state space is a mapping space used to identify the agent's status in the learning environment, and the state value evaluates how desirable the agent's current state is. NGnet is known to learn faster than other RL algorithms such as tile coding [18]. The actor output of the NGnet is calculated as

$$a_k(\mathbf{x}) = \sum_{i=1}^{N_L} w_{ki}\, b_i(\mathbf{x}) + \varepsilon n_k \qquad (14)$$

where $b_i(\mathbf{x})$ is the normalized Gaussian basis function value, and $i$, $j$ and $k$ denote the subscripts of the basis functions of the agent, the inputs and the actor outputs, respectively. $J$ and $K$ denote the dimensions of the inputs and the actor outputs. In this chapter, the input of the statistical model is defined to be equal to that of the RL agent; in other words, the RL agent outputs the control bias for the input condition $\mathbf{x}$. The reward is calculated based on the results of control, i.e., the outputs of the statistical model obtained after the control. $n_k$ is normalized noise with mean 0 and variance 1, and $\varepsilon$ is a noise ratio.

Learning algorithm
Learning of the NGnet is executed by the following procedures: updating the weight parameters $w_{ki}$ and $v_i$, adding/deleting basis functions, and tuning their parameters. First, the temporal difference (TD) error $\delta$ is calculated as

$$\delta = r + \gamma V(\mathbf{x}') - V(\mathbf{x}) \qquad (15)$$

where $\gamma$ is a discount ratio for the future reward and $V(\mathbf{x}) = \sum_i v_i b_i(\mathbf{x})$ is the state value estimated by the critic. The actor of the NGnet learns its actions to improve $V(\mathbf{x})$, and the critic of the NGnet also learns to estimate $V(\mathbf{x})$ appropriately. $w_{ki}$ and $v_i$ are updated by Eqs. (16) and (17) using $\delta$:

$$w_{ki} \leftarrow w_{ki} + \alpha_A\, \delta\, \varepsilon n_k\, b_i(\mathbf{x}) \qquad (16)$$

$$v_i \leftarrow v_i + \alpha_C\, \delta\, b_i(\mathbf{x}) \qquad (17)$$
Here, A  and C  denote the learning rates of ki w and i v , respectively. The other learning procedures execute adding/deleting the Gaussian basis functions and tuning of i μ , 2 i σ so that the NGnet can obtain enough resolutions to learn its state space.
The proposed control system employs the following algorithm: the basis function set of the NGnet is initialized to size 0, and new basis functions are added adaptively during learning.

Basis Addition Algorithm
Step 1. If the current basis function size $N_L$ satisfies $N_L < N_L^{\max}$, the algorithm goes to Step 2. Otherwise, it terminates.
Step 2. The activations $b_i(\mathbf{x})$ of the existing basis functions are calculated for the current state $\mathbf{x}$.
Step 3. If the maximum activation is smaller than $a_{\min}$, the algorithm goes to Step 4. Otherwise, it terminates.
Step 4. If $|\delta| > \delta_{\min}$ is satisfied, the algorithm goes to Step 5. Otherwise, it terminates.
Step 5. A basis function whose center and radius are set to $\mathbf{x}$ and $\boldsymbol{\sigma}_i$ is added to the NGnet, then the algorithm terminates.

Here, $N_L^{\max}$, $a_{\min}$ and $\delta_{\min}$ denote the maximum basis function size, the threshold value of activation and the threshold value of the TD error, respectively. This algorithm adds new basis functions in the regions of the state space which are not sufficiently covered by learned basis functions. In addition, the maximum basis function size $N_L^{\max}$ is set because unnecessary basis functions might otherwise be added owing to the increased variation of the TD error caused by the proposed automatic reward adjustment method described later. Therefore, the agent places only the necessary basis functions in its state space.
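The addition test above can be sketched as follows (the function and argument names are our own; the three guards mirror the size, activation and TD-error checks of the algorithm):

```python
import numpy as np

def maybe_add_basis(centers, sigmas, x, activations, delta,
                    n_max, a_min, delta_min, new_sigma):
    """Decide whether to add a basis function at state x.

    Add only if (1) the current size is below n_max, (2) no existing
    basis is sufficiently activated at x, and (3) the magnitude of
    the TD error exceeds delta_min.
    Returns (centers, sigmas, added_flag)."""
    if len(centers) >= n_max:                           # size check
        return centers, sigmas, False
    if len(activations) and max(activations) >= a_min:  # coverage check
        return centers, sigmas, False
    if abs(delta) <= delta_min:                         # TD-error check
        return centers, sigmas, False
    centers = centers + [np.asarray(x)]                 # add at x
    sigmas = sigmas + [new_sigma]
    return centers, sigmas, True

# Empty net: the first visited state passes all three guards
c, s, added = maybe_add_basis([], [], [0.5], [], 1.0,
                              n_max=10, a_min=0.4, delta_min=0.1,
                              new_sigma=0.2)
```

Once a basis sits near a state, its activation exceeds `a_min` there and no duplicate is added, which is how the net keeps only the necessary basis functions.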

Learning flow of the proposed control system
The learning algorithm flow of the proposed control system consists of the following steps.

Learning Algorithm of the Proposed Control System
Step 1. Initialize learning parameters of the RBF network and RL.

Step 2. Adjust radii of the RBF network.
Step 3. Calculate weight parameters of the RBF network.
Step 4. Determine initial control variables.
Step 5. Change control variables by the RL agent.
Step 6. Calculate model outputs by the RBF network.
Step 7. Calculate the reward.
Step 8. Calculate the TD error.
Step 9. Update weight parameters of the RL agent.
Step 10. Add new basis functions of the RL agent.
Step 11. If the terminal condition of the episode is reached, go to Step 12. Otherwise, return to Step 5.
Step 12. Adjust the reward parameters.
Step 13. If the terminal condition of learning is reached, terminate the algorithm. Otherwise, return to Step 4.

In the above algorithm, an episode terminates after the processes between Steps 5 and 10 have been executed S times, and a trial of learning terminates after the processes between Steps 4 and 12 have been executed T times.
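The flow can be condensed into a loop skeleton as below; the callables stand in for the system's modules, all names are illustrative, and Steps 1-3 (the RBF setup) are assumed done beforehand:

```python
def run_learning(initial_controls, act, estimate, reward, update,
                 adjust_reward, n_trials, n_steps):
    """Skeleton of Steps 4-13: each trial runs one episode of
    n_steps agent-model interactions, then adjusts the reward
    parameters."""
    for _ in range(n_trials):
        x = initial_controls()            # Step 4: initial controls
        for _ in range(n_steps):
            x = act(x)                    # Step 5: agent changes controls
            y = estimate(x)               # Step 6: statistical model output
            update(x, y, reward(y))       # reward and agent updates
        adjust_reward()                   # Step 12: adjust reward params

# Dummy wiring just to show the call pattern (T = 3 trials, S = 11 steps)
calls = {"steps": 0, "trials": 0}
run_learning(initial_controls=lambda: 0.0,
             act=lambda x: x + 0.1,
             estimate=lambda x: -x,
             reward=lambda y: y,
             update=lambda x, y, r: calls.__setitem__("steps", calls["steps"] + 1),
             adjust_reward=lambda: calls.__setitem__("trials", calls["trials"] + 1),
             n_trials=3, n_steps=11)
```

The nesting makes the episode/trial bookkeeping of Steps 11 and 13 explicit: the inner loop body runs S times per episode and the reward adjustment runs once per trial.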

Basic concepts
In the proposed control system, the outputs of the RBF network are calculated by the Gaussian basis functions covering the input space. To obtain high estimation accuracy, the radii should be adjusted so that the basis functions cover the space sufficiently. The proposed method focuses on the covering rate of the basis functions over the input space: it adjusts the radii based on the distances between a randomly generated input and the centers of the basis functions selected to surround that input, the centers being located at the learning data. As a result, the radii of basis functions whose distances to other data are short become small, and vice versa.

Algorithm of the proposed method
The algorithm of the proposed method consists of the following steps.

Algorithm of the Radius Adjustment Method
Step 1. Initialize the radii and adjusting parameters.
Step 2. Generate an input randomly.
Step 3. Select pairs of learning data by the k-SN (k-surrounded neighbor) method [24].
Step 4. Exclude the selected data from the data candidates for selection.
Step 5. If there are no data candidates, go to Step 6. Otherwise, return to Step 3.
Step 6. If there are no selected data, go to Step 8. Otherwise, go to Step 7.
Step 7. Update the radii of the selected data.
Step 8. If $n$ reaches $N$, terminate the algorithm. Otherwise, increment $n$ and return to Step 2.

In Step 1, the radii are initialized to a small value. In Steps 3 and 4, the pairs of learning data surrounding the generated input are selected, and the selected data are excluded from the candidates for further selection. In this way, the radii of basis functions in an interpolative relation with inputs are adjusted, and the basis functions can cover the input space sufficiently. This selection continues until all the data candidates have been selected. In Step 7, the radii ($r_{m1}$, $r_{m2}$) set at the selected pair of data are adjusted by Eqs. (19) and (20), whose update step is $\alpha_{rad}\,\eta^{n}$. Here, $\alpha_{rad}$ is an initial step-size parameter of the radius, and $\eta$ is a decay rate of the step size ($0 < \eta < 1$). The second term on the right-hand side of Eqs. (19) and (20) decays as the iteration $n$ increases, so the radii finally converge to certain values. These steps are iterated until $n$ reaches $N$, by which time the radii have been adjusted according to the distribution of the learning data.
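Since Eqs. (19) and (20) are not reproduced here, the following sketch uses an assumed update in which a selected radius moves toward its distance from the random input with a step that decays geometrically over iterations; the k-SN pair selection is also replaced by a simple nearest-neighbour stand-in:

```python
import numpy as np

def adjust_radii(data, n_iters, alpha_rad, eta, r0=0.01, seed=0):
    """Sketch of the radius adjustment loop (assumed update rule).

    data: (N_D, J) learning data, one basis center per datum
    Each iteration draws a random input (Step 2), picks the nearest
    datum as a stand-in for the k-SN selection (Step 3), and moves
    its radius toward the input distance with step alpha_rad*eta**n,
    which decays as described for Eqs. (19)-(20)."""
    rng = np.random.default_rng(seed)
    radii = np.full(len(data), r0)       # Step 1: small initial radii
    lo, hi = data.min(axis=0), data.max(axis=0)
    for n in range(n_iters):
        x = rng.uniform(lo, hi)          # Step 2: random input
        m = int(np.argmin(np.linalg.norm(data - x, axis=1)))
        d = np.linalg.norm(data[m] - x)
        radii[m] += alpha_rad * eta ** n * (d - radii[m])
    return radii

radii = adjust_radii(np.linspace(0.0, 1.0, 5).reshape(-1, 1),
                     n_iters=200, alpha_rad=0.5, eta=0.99)
```

Because the step size decays, the radii settle to values reflecting how far each datum typically is from the random inputs it "wins", i.e., to small radii in crowded regions and larger radii in sparse ones.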

Simulations
In this section, some simulations are executed in order to evaluate the performances of the proposed radius adjustment method. The proposed method is compared with two conventional radius adjustment methods with respect to estimation accuracy and computational time using the test function data.

Simulation conditions
Simulations are executed in the following steps: a) determination of radii, b) calculation of weight parameters, and c) evaluation of estimation error. In step a), the proposed method, the Cross Validation (CV) method [11] and the radius equation method [20] are used to determine radii. The CV method adjusts radii with regression, and the radius equation method adjusts radii without regression. (See appendix). In step b), the weight parameters of the RBF network are calculated by Eq. (8). In step c), the estimation errors between the outputs of the RBF network and the test data are evaluated.
In the case of plant control, the shape of the response surface changes according to the plant properties, the input dimensions and the number of learning data. In order to simulate various response surfaces, the learning data are created for different test functions, input dimensions and numbers of data. The test functions $F_1(\mathbf{x})$ and $F_2(\mathbf{x})$ ($\mathbf{x} \in [-5, 5]$), described as Eqs. (21) and (22), are used in the simulations. These functions are often used as benchmark problems for RBF networks [20].

Figure 6 shows the adjustment history of 10 typical radii selected from those of the 300 Gaussian basis functions corresponding to the number of learning data in case 5. In this figure, the radii increase quickly with iteration but converge to different values. The adjusted radii converge to different values because the proposed method adjusts the radii based on the distribution of the learning data. For data whose distances to other data are short, the distances between the learning data and $\mathbf{x}_n$ also become short; consequently, the radii of the data in such regions become smaller than those in regions where the distances are long. Comparing Figs. 5 and 6 also confirms that the convergence of the radii due to the decay of the step size contributes to the convergence of the RMSE.
Here, $h_i(\mathbf{c}_t)$ is the Gaussian function value of the basis function whose center is $\mathbf{c}_t$, and $r_{ci}$ is the radius, set to a certain constant value. In this simulation, $r_{ci}$ is set to 3.89 considering the range of input values.
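The crowded index can be sketched as follows, under our assumption (the defining equation is not fully recovered here) that it sums Gaussian function values with the fixed radius over the other learning data:

```python
import numpy as np

def crowded_index(data, r_c):
    """Assumed crowded index: for each datum, sum the Gaussian values
    exp(-||x_i - x_t||^2 / r_c^2) over the other data. Large values
    indicate dense regions; small values indicate sparse regions."""
    d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-d2 / r_c ** 2)
    np.fill_diagonal(h, 0.0)   # exclude the datum itself
    return h.sum(axis=1)

# Two clustered points and one isolated point
ci = crowded_index(np.array([[0.0], [0.1], [5.0]]), r_c=1.0)
```

Under this reading, the isolated point receives a near-zero index while the clustered points score highly, matching the relation between crowded index and radius reported for Fig. 7.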

Fig. 7. Relation between the crowded index and radial values
In Fig. 7, the radial values with a low crowded index are larger than those with a high crowded index because the basis functions in regions where the distances to data are long need to cover a wider input space. This result indicates that the proposed radius adjustment algorithm works properly. […] [20]. Therefore, it is difficult to apply the radius equation method to plant control, where the learning data usually have deviations of crowded index as in Fig. 7. The proposed method can adjust the radii considering the distribution of the learning data; thus its RMSEs are on average 33.9% better than those of the radius equation method. The proposed method also performs comparably to the CV method.

Basic concepts
When the RL is applied to thermal power plant control, it is necessary to design the reward so that it can be given to the agent instantly, in order to adapt to plant properties which change from hour to hour. Studies on designing rewards for RL have reported [25,26] that high flexibility can be realized by switching or adjusting the reward in accordance with changes of the agent's objectives and situations. However, it is difficult to apply this to thermal power plant control, which needs instant reward design for changes of plant properties, because the reward design and its switching or adjusting depend on a priori knowledge. The proposed control system defines a reward function which does not depend on the learning object, together with an automatic reward adjustment method which adjusts the parameters of the reward function adaptively based on the plant property information obtained during learning. With this method, the same reward function can be used for different operating conditions and control objectives, and the reward function is adjusted in accordance with learning progress. Therefore, a flexible plant control system is expected to be constructed without manual reward design.

Definition of reward
The statistical model in the proposed control system has a unique characteristic due to specifications of applied plants, kinds of environmental effects and operating conditions. In case such a model is used for learning, the reward function should be generalized because it is difficult to design unique reward functions for various plant properties in real time. Thus the authors have defined the reward function as Eq. (26).

Algorithm of the proposed reward adjustment method
The proposed reward adjustment method adjusts the reward parameters using the model outputs obtained during learning so that the agent can get a proper reward for (1) the characteristics of the learning object and (2) the progress of learning. Here, (1) means that this method can adjust the reward properly for statistical models whose optimal control conditions and NOx/CO properties differ. (2) means that this method makes it easier for the agent to get the reward, accelerating learning at the early stage, while later making the conditions to get the reward stricter, improving the agent's learning accuracy. The reward parameters are updated based on the sum of weighted model outputs $f$ obtained in each episode and the best $f$ value obtained during the past episodes. Hereafter, the sum of weighted model outputs and the reward parameters at episode $t$ are denoted with the subscript $t$.

Simulations
In this section, simulations are described to evaluate the performance of the proposed control system with the automatic reward adjustment method when it is applied to virtual plant models configured on the basis of experimental data. The simulations incorporate several changes of the plant operations, and the data for the RBF network include noise. The evaluations focus on the flexibility in control of the proposed reward adjustment method with respect to changes of the operational conditions. In addition, the robustness in control for a statistical model including noise, achieved by tuning the weight-decay parameter of the RBF network, is also studied.

Figure 10 shows the basic structure of the simulation. The objective of the simulation is to reduce NOx and CO emissions from a virtual coal-fired boiler model (statistical model) constructed with three numerical calculation DBs. The RL agent learns how to control three operational parameters with respect to the air mass flow supplied to the boiler. Therefore, the input and output dimensions ($J$, $P$) of the control system are 3 and 2, respectively. The input values are normalized into the range of [0, 1].

Simulation conditions
The three numerical calculation DBs have different operational conditions, and each DB has 63 data whose input-output conditions are different. These data include some noise similar to actual plant data. The simulations are executed with two reward settings: the variable reward of the proposed reward adjustment method (proposed method) and a fixed reward (conventional method). Both reward settings are run under two conditions in which the weight decay $\lambda$ of the RBF network is set to 0 and 0.01, to evaluate the robustness of control with respect to the $\lambda$ setting. The RL agent learns at the times when the operational conditions or control goals change (0:00, 2:00 and 4:00), and the control interval is 10 minutes; hence it is possible to control the boiler 11 times in each period. The parameter conditions of learning are shown in Table 6. These conditions are set using prior experimental results. The parameter conditions of the reward are shown in Table 7. The parameters of the proposed method are also set appropriately using prior experiments. In the conventional method, the reward parameters are fixed to their initial values, which are optimal for the first operational condition in Table 5. Figure 11 shows the time series of normalized $f$ as a result of control by the two methods.

Results and discussion
The initial value at 0:00 is taken as the base. There are four graphs in Fig. 11, corresponding to the combinations of the two simulation objectives and the two $\lambda$ settings. The optimal $f$ value in each period is shown as well. First, the results of the proposed and conventional methods in the case of $\lambda = 0.01$ are discussed. The initial $f$ values at 0:00 of these methods have offsets from the optimal values, but they decrease under control and finally converge near the optimal values. This is because the reward functions used in each method are appropriate for learning the optimal control logic. The RL agent relearns its control logic when the statistical model and its optimal $f$ values are changed at 2:00 by the change of operational conditions or control goals. However, the $f$ values of the conventional method after 11 control steps still have offsets from the optimal values, while the proposed method reaches the optimal values within 11 steps. The initial reward setting of the conventional method is evidently inappropriate for the next operational condition. Similar control results are obtained, for the same reason, after the statistical model changes at 4:00. As discussed above, a plant control system using the conventional method may deteriorate the control performance in thermal power plants whose operational conditions and control goals change frequently. Therefore, the proposed reward adjustment method, which can adjust the reward function flexibly for such changes, is effective for plant control.
Next, the robustness of the proposed control system achieved by weight-decay ($\lambda$) tuning is discussed. In Fig. 11, every $f$ value of the proposed method reaches nearly the optimal value when $\lambda$ is 0.01, whereas $f$ converges to values larger than the optimal ones when $\lambda$ is 0, for 2:00-6:00 in (a) and 2:00-4:00 in (b). When $\lambda$ is 0, the RBF network cannot take into account the influence of noise included in the learning data [16]. The response surface is created to fit the noisy data closely, and many local minima are generated in it compared with the response surface for $\lambda = 0.01$; the learned control logic then converges to such local minima. The above results show that the RBF network can avoid overfitting by tuning $\lambda$ properly and that the proposed control system can control thermal power plants robustly.

Finally, the learning processes of $f$ and the reward parameters of the proposed method are studied. Fig. 12 shows the $f$ and reward-parameter values over the episodes of learning at the operational changes at 0:00 and 2:00 when $\lambda$ is 0.01. In the early stage of learning (episodes 1-500), one reward parameter in each case increases to around 0.9 because the $f$ value does not decrease […] Eqs. (29) and (30). As a result, these parameters converge to different values. These adjustment results of the reward parameters for the different statistical models can be discussed as follows. From the characteristics of the statistical models, the gradient of $f$ in operation A appears larger than that in operation B, because operation A has a larger difference between the maximum and minimum values of $f$ than operation B. When the gradient of $f$ is larger, $f$ varies significantly at each control step, so the corresponding reward parameter must be set larger so that the agent can get the reward easily. On the other hand, it is useless to set this parameter larger for the statistical model in operation B, for which the gradient of $f$ is small. As for the results of the parameter adjustment in Fig. 12, the reward function of operation A indeed becomes easier to give the reward than that of operation B, owing to its larger parameter value. Therefore, the above results show that the proposed method can obtain an appropriate reward function flexibly in accordance with the properties of the statistical models.

Conclusions
This chapter presented a plant control system to reduce NOx and CO emissions exhausted by thermal power plants. The proposed control system generates optimal control signals by an RL agent which learns the optimal control logic using a statistical model that estimates the NOx and CO properties. The proposed control system requires flexibility against changes of plant operation conditions and robustness against noise in the measured data. In addition, the statistical model should be tunable with the measured data within a practical computational time. To overcome these problems, the authors proposed two novel methods: the adaptive radius adjustment method for the RBF network and the automatic reward adjustment method. The simulations clarified that the proposed methods provide high estimation accuracy of the statistical model within a practical computational time, flexible control by RL for various changes of plant properties, and robustness for plant data with noise. These advantages lead to the conclusion that the proposed plant control system would be effective for reducing environmental effects.

A.1 Cross Validation (CV) method
The cross validation (CV) method is one of the conventional radius adjustment methods for the RBF network with regression and it adjusts radii by error evaluations. In this method, a datum is excluded from the learning data and the estimation error at the excluded datum is