
Computer and Information Science » Artificial Intelligence » "Reinforcement Learning", book edited by Cornelius Weber, Mark Elshaw and Norbert Michael Mayer, ISBN 978-3-902613-14-1, Published: January 1, 2008 under CC BY-NC-SA 3.0 license. © The Author(s).

Chapter 17

Reinforcement Learning-Based Supervisory Control Strategy for a Rotary Kiln Process

By Xiaojie Zhou, Heng Yue and Tianyou Chai
DOI: 10.5772/5288


1. Introduction

The rotary kiln is a large-scale sintering device widely used in the metallurgical, cement, refractory materials, chemical and environmental protection industries. Its complicated working mechanism involves physical changes and chemical reactions of the material, the combustion procedure, and thermal transmission among the gaseous flow, the solid material flow and the liner. The automation problem of such processes remains unsolved because of the following inherent complexities. A rotary kiln is a typical distributed parameter system with correlated temperature distributions of the gaseous phase and the solid phase along its axial direction. Limited by the rotation of the device and by the technical design, sensors and actuators can be installed only at the kiln head and the kiln tail, so lumped parameter control strategies must be employed to deal with distributed parameter problems. The rotary kiln process is thus a multivariable nonlinear system with strong coupling, large lag and uncertain disturbances. Moreover, the measurement of the key controlled variable, the burning zone temperature, is seriously disturbed. Most rotary kilns are still under manual control, with a human operator observing the burning status. As a result, product quality is hard to keep consistent, energy consumption remains high, the kiln liner wears out easily, and the kiln running rate and yield are low.

Although several advanced control strategies, including fuzzy control (Holmblad & Østergaard, 1995), intelligent control (Jarvensivu et al., 2001a; Jarvensivu et al., 2001b) and predictive control (Zanovello & Budman, 1999), have been introduced into rotary kiln process control, these studies focused on stabilizing some key controlled variables and are valid only when the boundary conditions do not change frequently. In fact, the boundary conditions of a rotary kiln often change. For example, the material load, water content and components of the raw material slurry vary frequently and severely. Moreover, the offline analysis data on the components of the raw material slurry reach the operator with a large time delay. Conventional control strategies therefore cannot achieve automatic control and keep the product quality consistent. To deal with the complexity of the operating conditions, the authors proposed an intelligent control system based on human-machine interaction for an alumina rotary kiln (Zhou et al., 2004; Zhou et al., 2006), in which a human intervention function was designed so that, when the operating conditions change significantly, the human operator observing the burning status can intervene in the control actions while the system remains in automatic control mode, enhancing the adaptability of the control system.

This chapter develops a supervisory control approach for the burning zone temperature based on Q-learning, in which the signals of human intervention are viewed as the reinforcement learning signals. Section 2 briefly describes the process and the supervisory control system architecture. Section 3 discusses the detailed methodology of the Q-learning-based supervisory control approach. The implementation and industrial application are shown in Section 4. Finally, Section 5 draws conclusions.

2. Process description and supervisory control system architecture

The alumina rotary kiln process is described as follows. Raw material slurry is sprayed into the rotary kiln from the upper end (the kiln tail). At the lower end (the kiln head), the coal powders from the coal injector and the primary air from the air blower are mixed into a bi-phase fuel flow, which is sprayed into the kiln head hood and combusts with the secondary air coming from the cooler. The heated gas is drawn to the kiln tail by the induced draft fan, while the material moves toward the kiln head via the rotation of the kiln and its own weight, counter to the direction of the gas. After the material passes through the drying zone, pre-heating zone, decomposing zone, burning zone and cooling zone in sequence, soluble sodium aluminate is generated in the clinker, which is the product of the kiln process. This process aims to reach a high digesting rate of alumina in the subsequent digestion procedure.

Figure 1. Schematic diagram of the alumina rotary kiln.

The quality index control problem of kiln production is how to keep the liter weight of clinker qualified under fluctuating boundary conditions and operating conditions. The liter weight of clinker is hard to measure online and cannot be controlled directly. This chapter employs the following strategy to deal with the problem: some online measurable technological parameters closely related to the final quality index are chosen and controlled within ranges governed by the technical requirements, so that the quality index is controlled indirectly.

In the sintering process, the normal range of the sintering temperature T sinter of the raw material depends on the components of the raw material slurry. Variations in the slurry components require corresponding variations of the sintering temperature. If the actual sintering temperature range is inconsistent with the requirement of the raw material, over burning or under burning will result, and the clinker quality will be unsatisfactory. We therefore conclude that the components of the raw material slurry and the sintering temperature are the main factors influencing clinker quality. Other factors include the particle size of the raw material and the residence time at T sinter. The relationship between the desired T sinter and the components of the raw material slurry can be viewed as an unknown nonlinear function

T sinter = f([A/S], [N/R], [C/S], [F/A])
(1)

where [A/S] is the alumina-silica ratio of the raw material slurry, [N/R] the alkali ratio, [C/S] the calcium-silica ratio, and [F/A] the iron-alumina ratio. Among them, the alumina-silica ratio has the strongest influence on T sinter: the higher the alumina-silica ratio, the higher the required sintering temperature.

From the above analysis, one may conclude that there are two key issues in the quality index control problem of kiln production. One is how to keep the kiln temperature distribution satisfying the technical requirements under fluctuating boundary and operating conditions, i.e. how to keep the burning zone temperature, the kiln tail temperature and the residual oxygen content in the combustion gas within their technically required ranges. The other is how to adjust the setpoint range of the burning zone temperature so that the liter weight of clinker is kept qualified under fluctuating boundary and operating conditions.

Figure 2. General structure of the supervisory control system for the rotary kiln process.

We constructed a supervisory control system consisting of a supervisory level and a process control level; its general structure is shown in Fig. 2. The final target of this supervisory control system is to keep the production quality index, i.e. the liter weight of clinker, acceptable even if the boundary conditions change. The related process control strategies in the process control level include: 1) a hybrid intelligent temperature controller, which coordinates the coal feeding u 1, the damper position of the induced draft fan u 2, and the primary air flow u 3 so that the burning zone temperature T BZ, the kiln tail temperature T BE, and the residual oxygen content in the combustion gas OX satisfy the technical requirements; T BZ is indirectly measured by an infrared pyrometer located at the kiln head hood, and T BE is obtained from a thermocouple; 2) individual PI controllers assigned to the basic loops of primary air flow, primary air pressure and flow rate of raw material slurry; and 3) a human-machine interaction (HMI) mechanism designed so that, when the operating conditions change significantly, certain human interventions in the coal feeding control by an experienced operator can be introduced while in automatic control mode. These process control strategies were described in our previous study (Zhou et al., 2004).

The main part of the supervisory level is an intelligent setting model of T BZ , which adjusts the setpoint range of T BZ according to the variations of components of raw material slurry. The setpoints of T BE , OX, primary air pressure, flow rate of raw material slurry and the kiln rotary speed n are given by the operators according to production scheduling and production experiences.

The intelligent setting model of the burning zone temperature consists of a pre-setting model, a compensation model and a setting selector mechanism. The pre-setting model gives the upper and lower limits of the setpoint range of the burning zone temperature, denoted by T0 BZ_SPHI and T0 BZ_SPLO, calculated from the offline analysis data of the components of the raw material slurry. Fuzzy clustering analysis combined with case-based reasoning is employed to build the pre-setting model. Its core is a case base containing different upper and lower limits of the setpoint range of the burning zone temperature corresponding to different slurry components. The case base is established through fuzzy-clustering-based data mining from a vast number of process data samples under various slurry components. The details are not described in this chapter.

In fact, the main problem we face is that the components of the raw material slurry often change due to an unstable raw material mixing process, and the offline analysis data reach the operator with a large time delay, so that neither the operator nor the pre-setting model can adjust the setpoint of T BZ in time. As a result, a single intelligent temperature controller and a single pre-setting model of T BZ cannot maintain satisfactory performance. In such a case, a human operator usually rectifies the output of the temperature controller, i.e. the coal feeding, based on experience of observing the burning status, through the HMI embedded in the control system. Such interventions can adapt to the variation of operating conditions to a certain degree and sustain product quality.

To deal with this problem, a compensation model and a setting selector are appended. When the offline analysis data of the slurry components are known and input into the system, i.e. at the lth sampled time, the setting selector mechanism triggers the pre-setting model to calculate the proper setpoint range of T BZ. When the slurry components are unknown, the compensation model is triggered to calculate the proper upper and lower limits of the setpoint range of the burning zone temperature, denoted by T1 BZ_SPHI and T1 BZ_SPLO respectively. In the following section, a Q-learning strategy is employed to construct the compensation model, which learns the self-adjusting knowledge about the setpoint of T BZ through online self-learning from the human intervention signals.
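As a rough sketch, the selector logic described above might look like the following. The callables `presetting_model` and `compensation_model` and their signatures are illustrative assumptions, not part of the original system:

```python
# Hypothetical sketch of the setting selector: when offline slurry analysis
# data are available, the pre-setting model is triggered; otherwise the
# Q-learning compensation model supplies the setpoint range of T_BZ.

def select_setpoint_range(slurry_analysis, presetting_model, compensation_model, state):
    """Return (T_BZ_SPHI, T_BZ_SPLO) from whichever model applies."""
    if slurry_analysis is not None:
        # l-th sampled time: offline analysis data have just arrived.
        return presetting_model(slurry_analysis)
    # Slurry components unknown: fall back on the learned compensation model.
    return compensation_model(state)
```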

3. Setpoint adjustment approach based on Q-learning

3.1. Bases of Q-learning

Reinforcement learning is learning with a critic instead of a teacher. The only feedback provided by the critic is a scalar signal r called reinforcement, which can be regarded as a reward or a punishment. Reinforcement learning performs an online search to find an optimal decision policy in multi-stage decision problems.

Q-learning (Watkins & Dayan, 1992) is a reinforcement learning method in which the learner incrementally builds a Q-function that estimates the discounted future rewards for taking actions from given states. The output of the Q-function for state x and action a is denoted by Q(x, a). When action a has been chosen and applied, the environment moves to a new state x', and a reinforcement signal r is received. Q(x, a) is updated by

Qk(x,a) ← Qk−1(x,a) + αk { r + γ max a'∈A(x') Qk−1(x',a') − Qk−1(x,a) }
(2)

where

αk = 1 / (1 + visitsk(x,a))
(3)

where A(x') is the set of possible actions in state x', γ is the discount factor, αk is the learning rate, and visitsk(x,a) is the total number of times the state-action pair (x,a) has been visited up to and including the kth iteration.
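A minimal tabular implementation of the update rule (2) with the visit-count learning rate (3) might look as follows; this is a generic sketch, not the authors' code:

```python
from collections import defaultdict

class TabularQ:
    """Tabular Q-learning with the decaying learning rate of Eq. (3)."""

    def __init__(self, actions, gamma=0.9):
        self.Q = defaultdict(float)     # Q(x, a), zero-initialised by default
        self.visits = defaultdict(int)  # visit counts per state-action pair
        self.actions = actions
        self.gamma = gamma

    def update(self, x, a, r, x_next):
        # Eq. (3): alpha_k = 1 / (1 + visits_k(x, a))
        self.visits[(x, a)] += 1
        alpha = 1.0 / (1.0 + self.visits[(x, a)])
        # Eq. (2): move Q(x, a) toward r + gamma * max_a' Q(x', a')
        best_next = max(self.Q[(x_next, b)] for b in self.actions)
        self.Q[(x, a)] += alpha * (r + self.gamma * best_next - self.Q[(x, a)])
```

In practice the Q table would be initialised from expert experience rather than zeros, as Section 3.6 describes.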

3.2. Principle of setpoint adjustment approach based on Q-learning

In this section, we design an online self-learning system based on reinforcement learning to gradually establish the optimal policy for adjusting the setpoint of T BZ. Although the offline analysis data cannot reach the operator in time, changes in the components of the raw material slurry are indirectly reflected in certain measurements of the rotary kiln process. These measurements can be used to construct the environment state set of the learning system. Moreover, information about human interventions can be regarded as an evaluation of whether the setpoint of T BZ is proper, since human interventions usually occur when the performance is unsatisfactory. This kind of information can therefore be defined as the reward signal from the environment.

For the learning system, the environment includes the rotary kiln process, the temperature controller and the operator. The environment provides current states and reinforcement payoffs to the learning system, which in turn outputs the compensated upper and lower limits of the setpoint range of T BZ to the temperature controller. The learning system consists of a state perceptron, a critic, a learner and an action selector, as shown in Fig. 3. The state perceptron first samples and processes selected measurements to construct the original state vector, and then converts this continuous state vector into a discrete feature vector x using a defined feature extraction function. The action selector employs an ε-greedy action selection strategy to produce an amendment ΔT BZ_SP of the setpoint of T BZ, and the critic calculates an internal reward r relying on some heuristic rules. The learner updates the value function of the state-action pair based on tabular Q-learning. The final outputs of the learning system are the compensated upper and lower limits of the setpoint range of T BZ, calculated respectively by

T1 BZ_SPHI(k) = ΔT BZ_SP(k) + T1 BZ_SPHI(k−1)
(4)
T1 BZ_SPLO(k) = ΔT BZ_SP(k) + T1 BZ_SPLO(k−1)
(5)
Figure 3. Schematic diagram of the setpoint adjustment approach for T BZ based on Q-learning.

In a Markov decision process (MDP), only the sequential nature of the decision process is relevant, not the amount of time that passes between decision stages. A generalization is the semi-Markov decision process (SMDP), in which the amount of time between one decision and the next is a random variable. For the learning process, we define τs as the state perception time span, during which the perceptron obtains the state of the environment, and τr as the reward calculation time span (also called the action execution time span), during which the critic calculates the internal reward. The shortest time from one decision to the next is τ = τs + τr.

The design of the learning system concerns the following key issues:

Construction of the environment perception state set;

Determination of the action set;

Determination of the immediate reward function;

Determination of the learning algorithm.

3.3. Construction of the state set

When the components of the raw material slurry fluctuate and the related offline analysis data are unavailable, we hope that the learning system can estimate the changes in the slurry components from the perceived information about the environment state. With this idea, some related variables are selected from the online measurable variables of the kiln process based on human experience, with which the state vector s is defined to build up the original state space S of the learning system, where s = [s1, s2, s3, s4, s5], s ∈ S. Here s1 is the averaged burning zone temperature T¯BZ, s2 is the averaged flow rate of raw material slurry G¯, s3 is the averaged coal feeding u¯1, and s4 and s5 are the averaged upper and lower limits of the setpoint range of T BZ, denoted T¯BZ_SPHI and T¯BZ_SPLO respectively, all averaged over τs. They are calculated from

T¯BZ = (1/J) Σ j=1..J TBZ(j)
(6)
G¯ = (1/J) Σ j=1..J G(j)
(7)
u¯1 = (1/J) Σ j=1..J u1(j)
(8)
T¯BZ_SPHI = (1/J) Σ j=1..J TBZ_SPHI(j)
(9)
T¯BZ_SPLO = (1/J) Σ j=1..J TBZ_SPLO(j)
(10)

where TBZ(j), G(j), u1(j), TBZ_SPHI(j) and TBZ_SPLO(j) denote the jth sampled values of T BZ, the flow rate of raw material slurry, the coal feeding, and the upper and lower limits of the setpoint range of T BZ during τs, respectively, and J is the total number of samples taken during τs.
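The averaging of Eqs. (6)-(10) is straightforward; a minimal sketch, where the per-sample tuple layout (T BZ, G, u1, upper limit, lower limit) is an assumption:

```python
# Each state component is the mean of the J samples collected during tau_s.
# A sample here is an assumed tuple (T_BZ, G, u1, T_SPHI, T_SPLO).

def average_samples(samples):
    """Build the original state vector s = [s1, ..., s5] from J samples."""
    J = len(samples)
    return [sum(column) / J for column in zip(*samples)]
```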

Since the state space S defined above is continuous, it is impossible to compute and store value functions for every possible state or state-action pair due to the curse of dimensionality. This issue is often addressed by generating a compact parametric representation, such as an artificial neural network, that approximates the value function and can guide future actions. In practice we choose a feature extraction method (Tsitsiklis & Van Roy, 1996) to map the original continuous state space into a finite feature space, so that tabular Q-learning can be employed.

By identifying one partition per possible feature vector, the feature extraction mapping F(s) = [f1(s1,s4,s5), f2(s1), f3(s2), f4(s3)] defines a partitioning of the original state space. The burning zone temperature biasing level feature f1 (bias from the setpoint range), the temperature level feature f2, the flow rate of raw material slurry level feature f3 and the coal feeding level feature f4 are defined respectively by

f1(s1,s4,s5) =
  −2, (T¯BZ − T¯BZ_SPLO) ≤ −L2
  −1, −L2 < (T¯BZ − T¯BZ_SPLO) ≤ −L1
  0, (T¯BZ − T¯BZ_SPHI) ≤ L1 and (T¯BZ − T¯BZ_SPLO) ≥ −L1
  1, L1 < (T¯BZ − T¯BZ_SPHI) ≤ L2
  2, (T¯BZ − T¯BZ_SPHI) > L2
(11)
f2(s1) =
  0, T¯BZ < 1250
  1, 1250 ≤ T¯BZ < 1280
  2, T¯BZ ≥ 1280
(12)
f3(s2) =
  0, 70 ≤ G¯ < 75
  1, 75 ≤ G¯ < 80
  2, G¯ ≥ 80
  3, else
(13)
f4(s3) =
  0, 800 ≤ u¯1 < 1000
  1, 1000 ≤ u¯1 < 1200
  2, 1200 ≤ u¯1 < 1400
  3, else
(14)

where L1 and L2 (L1 < L2) are the thresholds scaling the level of burning zone temperature bias from the setpoint range.

Each feature function maps the state space S to a finite set Pm, m = 1,2,3,4. We then associate the feature vector x = [x1,x2,x3,x4] = F(s) with each state s ∈ S. The resulting set of all possible feature vectors, defined as the feature space X, is the Cartesian product of the sets Pm.

Because the compensation model for the setpoint of the burning zone temperature needs to be applicable only under normal kiln operating conditions, the design of the state set requires certain filtration of the feature space X. The appearance of x3 = 3 or x4 = 3 may indicate abnormal operating conditions, such as a low flow rate of raw material slurry during the kiln starting phase or abnormal coal components. The state set excludes such feature vectors.
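The feature extraction and the filtration of abnormal states can be sketched as follows; the thresholds L1 and L2 are plant-specific assumptions, and the piecewise boundaries follow the reconstructed forms of Eqs. (11)-(14):

```python
# Feature extraction F(s) = [f1, f2, f3, f4] mapping the continuous state
# vector s = [s1, ..., s5] to a discrete feature vector x.

def f1(T, sphi, splo, L1, L2):
    """Burning zone temperature biasing level relative to [splo, sphi]."""
    if T - splo <= -L2:
        return -2
    if T - splo <= -L1:
        return -1
    if T - sphi <= L1:
        return 0
    if T - sphi <= L2:
        return 1
    return 2

def f2(T):
    """Temperature level feature."""
    return 0 if T < 1250 else (1 if T < 1280 else 2)

def f3(G):
    """Slurry flow rate level feature (3 marks abnormal conditions)."""
    if 70 <= G < 75: return 0
    if 75 <= G < 80: return 1
    if G >= 80: return 2
    return 3

def f4(u1):
    """Coal feeding level feature (3 marks abnormal conditions)."""
    if 800 <= u1 < 1000: return 0
    if 1000 <= u1 < 1200: return 1
    if 1200 <= u1 < 1400: return 2
    return 3

def extract_features(s, L1, L2):
    """Map state s to feature vector x, or None for excluded abnormal states."""
    s1, s2, s3, s4, s5 = s
    x = (f1(s1, s4, s5, L1, L2), f2(s1), f3(s2), f4(s3))
    return None if x[2] == 3 or x[3] == 3 else x
```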

3.4. Action set

The learning system aims to deduce the proper or best setpoint adjustment actions for T BZ from a specified environment state. The problem to be handled is how to choose ΔT BZ_SP according to the changes of the environment state. The action set is defined as A = {a1, a2, a3, a4, a5} = {−30, −15, 0, 15, 30}.
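The action selector described in Section 3.2 chooses from this set with an ε-greedy strategy; a generic sketch, in which the dictionary representation of the Q table is an assumption:

```python
import random

# Candidate setpoint amendments Delta_T_BZ_SP (assumed to be in deg C).
ACTIONS = [-30, -15, 0, 15, 30]

def epsilon_greedy(Q, x, epsilon=0.1, rng=random):
    """With probability epsilon explore; otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((x, a), 0.0))
```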

3.5. Immediate reward signal

During τr, after the action selection based on the current state judgment, the learning system determines the immediate reward signal r = R(Δu1 MAN, Δu1 AUTO), which represents the degree of satisfaction of the environment with the action executed under the current state, using the human intervention regulation of coal feeding Δu1 MAN and the temperature controller regulation Δu1 AUTO. The reward signal r is determined according to Table 1.

r                  | |ΔCoalAUTO| ≤ L3 | ΔCoalAUTO > L3 | ΔCoalAUTO < −L3
|ΔCoalMAN| ≤ L3    | 0.4              | 0.4            | 0.4
ΔCoalMAN > L3      | −0.2             | 0.2            | −0.4
ΔCoalMAN < −L3     | −0.2             | −0.4           | 0.2

Table 1.

Definition of immediate reward function R.

where L3 is a threshold constant, and ΔCoalMAN denotes the total regulation of coal feeding from human intervention during τr, calculated by

ΔCoalMAN = Σ over τr of Δu1MAN
(15)

and ΔCoalAUTO denotes the total regulation from the temperature controller during τr, calculated by

ΔCoalAUTO = Σ over τr of Δu1AUTO
(16)

The immediate reward function R in Table 1 is from the following heuristic rules:

During τr, if |ΔCoalMAN| ≤ L3, meaning the operator is satisfied with the regulation action of the control system and little human intervention occurs, a positive reward r = 0.4 is returned. If ΔCoalMAN and ΔCoalAUTO have the same regulation direction, meaning the direction of the control system's regulation fits the operator's expectation but its amplitude falls short, a positive reward r = 0.2 is returned. If ΔCoalMAN > L3 or ΔCoalMAN < −L3 while |ΔCoalAUTO| ≤ L3, meaning little regulation action of the control system occurs while a large human intervention occurs, then r = −0.2. If ΔCoalMAN and ΔCoalAUTO have contrary regulation directions, meaning the operator is not satisfied with the regulation action of the control system, a negative reward r = −0.4 is returned.
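These heuristic rules, equivalent to Table 1, can be written compactly; a sketch:

```python
# Immediate reward R(Delta_Coal_MAN, Delta_Coal_AUTO) following Table 1:
# classify each total regulation as small (0), positive (+1) or negative (-1)
# relative to the threshold L3, then apply the heuristic rules above.

def _level(delta, L3):
    if abs(delta) <= L3:
        return 0
    return 1 if delta > 0 else -1

def reward(coal_man, coal_auto, L3):
    m, a = _level(coal_man, L3), _level(coal_auto, L3)
    if m == 0:
        return 0.4    # little human intervention: operator satisfied
    if m == a:
        return 0.2    # same direction: controller acted, amplitude short
    if a == 0:
        return -0.2   # controller idle while operator intervened strongly
    return -0.4       # contrary directions: operator dissatisfied
```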

3.6. Algorithm summary

The whole learning algorithm of the learning system under learning mode is summarized as follows:

Step 1: At initialization, the Q value table of state-action pairs is initialized according to expert experience; otherwise go to Step 2 directly;

Step 2: During τs, the state perceptron obtains and saves the measured burning zone temperature, flow rate of raw material slurry, coal feeding, and upper and lower limits of the setpoint range of the burning zone temperature, calculates the related averaged values using (6)-(10), and then transforms them into level features to construct the feature vector x using (11)-(14).

Step 3: Search the Q table for a matching state; if unsuccessful, go to Step 2 to make the state judgment again; if successful, go ahead;

Step 4: The action selector chooses an amendment of setpoint of TBZ as its output according to ε-greedy action selection strategy (Sutton & Barto, 1998), where ε=0.1;

Step 5: During τr , the critic determines the reward signal r of this state-action pair according to Table 1.

Step 6: When the current τr finishes and the next τs begins, the state perceptron judges the next state x', and state matching is performed in the Q table. If unsuccessful, go to Step 2 to start the next learning round; if successful, the learner uses the reward signal r to calculate and update the Q value of the last state-action pair using (2)-(3), where γ = 0.9.

Step 7: Judge whether the learning should finish. When the evaluation values of the state-action pairs in the Q table no longer change appreciably, the Q-function has converged and the compensation model is well trained.
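Putting Steps 1-7 together, one learning round might be sketched as follows. Plant I/O is abstracted behind the callables `perceive_state` and `collect_regulations`, both assumed names; the reward and Q-update follow Table 1 and Eqs. (2)-(3):

```python
import random
from collections import defaultdict

ACTIONS = [-30, -15, 0, 15, 30]   # candidate setpoint amendments
GAMMA = 0.9                       # discount factor used in Step 6

def learning_round(Q, visits, x, perceive_state, collect_regulations,
                   reward_fn, epsilon=0.1, rng=random):
    """One round: select action (Step 4), observe reward (Step 5), update Q (Step 6)."""
    # Step 4: epsilon-greedy amendment of the T_BZ setpoint.
    if rng.random() < epsilon:
        a = rng.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda b: Q[(x, b)])
    # Step 5: during tau_r, the critic computes r from the observed regulations.
    coal_man, coal_auto = collect_regulations()
    r = reward_fn(coal_man, coal_auto)
    # Step 6: perceive the next state during the next tau_s and update Q
    # with the visit-count learning rate of Eq. (3).
    x_next = perceive_state()
    visits[(x, a)] += 1
    alpha = 1.0 / (1.0 + visits[(x, a)])
    Q[(x, a)] += alpha * (r + GAMMA * max(Q[(x_next, b)] for b in ACTIONS) - Q[(x, a)])
    return x_next
```

In the real system this round would only run when the temperature controller reports smooth control, as Section 3.7 explains.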

On the problem of Q table initialization: there is no explicit tutor signal in reinforcement learning; the learning procedure is carried out through constant interaction with the environment to obtain reward signals. Usually, less information from the environment results in low learning efficiency. In this work, different initial evaluation values are given to different actions under the same state based on expert experience, so that convergence of the algorithm is sped up and online learning efficiency is enhanced.

3.7. Technical issues

The main task of the learning system is to estimate the variations of the kiln operating conditions continuously and to adjust the setpoint range of the burning zone temperature accordingly. Such adjustments should be made only when the burning zone temperature is controlled fairly smoothly by the temperature controller. This judgment signal is given out by the hybrid intelligent temperature controller. If the temperature control is in an abnormal condition, the learning procedure must be postponed, and the setpoint range of the burning zone temperature is kept constant.

Moreover, setpoint adjustments should be made only when the learning system makes an accurate judgment about the kiln operating conditions. Because of the complexity and fluctuation of the kiln operating conditions, an accurate judgment of the current state usually needs a long time, and the time span between two setpoint adjustments cannot be too short; otherwise the calculated immediate reward cannot reflect the real influence of the preceding adjustment on the behaviour and performance of the control system. Special attention should therefore be paid to the selection of τs and τr; this lays a solid foundation for the effectiveness of the obtained environmental states and reinforcement payoffs.

After long-term running, large changes in the characteristics of the raw material slurry components, the coal and the kiln device may appear. The previously optimized compensation model for the setpoint of the burning zone temperature may become invalid under the new operating conditions, so a new optimal design is needed to keep the control system performing well in the long term. In this case, the reinforcement learning system should be switched into learning mode, and the above models can be re-established through new learning to improve performance, so that the control system has strong adaptability for long-term running. This is an important issue that has drawn the attention of the enterprise.

4. Industrial application

Shanxi Alumina Plant is the largest alumina plant in Asia, with a megaton production capacity. It has six oversize rotary kilns of φ4.5 m × 110 m. Its production employs the series-parallel combination of the Bayer and sintering processes. This production technology makes the components of the raw material fed to the rotary kilns vary over a large range, so it is more difficult to keep kiln operation stable than in an ordinary rotary kiln.

A supervisory control system based on the proposed structure and the setpoint adjustment approach for the burning zone temperature has been developed for the #4 rotary kiln of Shanxi Alumina Plant. It is implemented in the Foxboro I/A Series 51 DCS, and the Q-learning-based strategy has been realized in the FoxDraw and ICC configuration environments of the DCS. The related parameters are chosen as τs = 30 min and τr = 120 min.

Figure 4. The setpoint of the burning zone temperature is properly adjusted after learning.

Fig. 4 shows that, after a period of learning, a set of relatively stable setpoint adjustment strategies has been established, so that the setpoint range of T BZ can be adjusted automatically to satisfy the sintering temperature requirement, according to the levels of raw material slurry flow, coal feeding, T BZ and temperature bias. It can be seen that setpoint adjustments happen only when T BZ is controlled smoothly. The judgment signal, denoted "control parameter" in Fig. 4, takes the value 0 when the burning zone temperature is controlled fairly smoothly, and a nonzero value otherwise.

The adjustment actions of the reinforcement learning system result in satisfactory performance of the kiln temperature controller, with reasonable and acceptable coal feeding regulation amplitude and rhythm, so that the adaptability to variations of operating conditions is significantly enhanced and the production quality index, the liter weight of clinker, can be kept within the technical requirement even when the boundary and operating conditions change. Meanwhile, human interventions have become weaker and weaker as the model application has improved the system performance.

During the test run, the running rate of the supervisory control system reached 90%. Negative influences on the heating and operating conditions from human factors have been avoided, rational and stable clinker production has been maintained, and the operational life span of the kiln liner has been prolonged remarkably. The qualification rate of the clinker liter weight has been enhanced from 78.67% to 84.77%; the production capacity per kiln in unit time has been increased from 52.95 t/h to 55 t/h, a 3.9% increment. The kiln running rate has been raised by 1.5%. From a calculation based on an average 10 ℃ reduction of the kiln tail temperature and an average 2% decrease of the residual oxygen content in the combustion gas, it can be concluded that 1.5% of the energy consumption has been saved.

5. Conclusion

In this chapter, we have discussed an implementation strategy for employing reinforcement learning in the control of a typical complex industrial process, in order to enhance the control performance and the adaptability of the automatic control system to variations in operating conditions.

Because of their inherent complexities, the operation of large rotary kilns is difficult and relies on experienced human operators observing the burning status. The problem of human-machine coordination is therefore addressed in the design of the rotary kiln control system, into which human intervention and adjustment can be introduced. Except for emergency conditions that need urgent human operation for system safety, it is observed that human interventions in the automatic control system usually imply the operator's dissatisfaction with the performance of the control system when the boundary conditions vary. From this idea, an online reinforcement-learning-based supervisory control system is designed, in which the human interventions are defined as the environmental reward signals. The optimal mapping between the rotary kiln operating conditions and the adjustment of important controller setpoint parameters can be established gradually. The successful application of this strategy to an alumina rotary kiln has shown that the adaptability and performance of the control system are effectively improved.

Further research will focus on improving the setting model of the burning zone temperature by introducing the offline analysis data of the clinker liter weight, in order to reject the remaining uncertain disturbances in the quality control of kiln production.

References

1 - L. Holmblad & J. Østergaard (1995). The FLS application of fuzzy logic. Fuzzy Sets and Systems, Vol. 70, No. 2-3, (March 1995) pp. 135-146, ISSN 0165-0114
2 - M. Jarvensivu, K. Saari & S. Jamsa-Jounela (2001a). Intelligent control system of an industrial lime kiln process. Control Engineering Practice, Vol. 9, No. 6, (June 2001) pp. 589-606, ISSN 0967-0661
3 - M. Jarvensivu, E. Juuso & O. Ahava (2001b). Intelligent control of a rotary kiln fired with producer gas generated from biomass. Engineering Applications of Artificial Intelligence, Vol. 14, No. 5, (October 2001) pp. 629-653, ISSN 0952-1976
4 - R. Sutton & A. Barto (1998). Reinforcement Learning: An Introduction, MIT Press, ISBN 0-262-19398-1, Cambridge, MA
5 - J. Tsitsiklis & B. Van Roy (1996). Feature-based methods for large scale dynamic programming. Machine Learning, Vol. 22, No. 1-3, (1996) pp. 59-94, ISSN 0885-6125
6 - C. Watkins & P. Dayan (1992). Q-learning. Machine Learning, Vol. 8, No. 3-4, (May 1992) pp. 279-292, ISSN 0885-6125
7 - R. Zanovello & H. Budman (1999). Model predictive control with soft constraints with application to lime kiln control. Computers and Chemical Engineering, Vol. 23, No. 6, (June 1999) pp. 791-806, ISSN 0098-1354
8 - X. Zhou, D. Xu, L. Zhang & T. Chai (2004). Integrated automation system of a rotary kiln process for alumina production. Journal of Jilin University (Engineering and Technology Edition), Vol. 34, Sup., (August 2004) pp. 350-353, ISSN 1671-5497 (in Chinese)
9 - X. Zhou, H. Yue, T. Chai & B. Fang (2006). Supervisory control for rotary kiln temperature based on reinforcement learning. Proceedings of the 2006 International Conference on Intelligent Computing, pp. 428-437, ISBN 3-540-37255-5, Kunming, China, August 2006, Springer-Verlag, Berlin