Online Gaming: Real Time Solution of Nonlinear Two-Player Zero-Sum Games Using Synchronous Policy Iteration

Games provide an ideal environment in which to study computational intelligence, offering a range of challenging and engaging problems. Game theory (Tijs, 2003) captures the behavior in which a player’s success in selecting strategies depends on the choices of other players. One goal of game theory techniques is to find (saddle point) equilibria, in which each player has an outcome that cannot be improved by unilaterally changing his strategy (e.g. Nash equilibrium). The H∞ control problem is a minimax optimization problem, and hence a zero-sum game where the controller is a minimizing player and the disturbance a maximizing one. Since the work of George Zames in the early 1980s, H∞ techniques have been used in control systems for sensitivity reduction and disturbance rejection. This chapter is concerned with 2-player zero-sum games that are related to the H∞ control problem, as formulated by (Basar & Olsder, 1999; Basar & Bernard, 1995; Van Der Shaft, 1992).

Game theory and H∞ solutions rely on solving the Hamilton-Jacobi-Isaacs (HJI) equations, which in the zero-sum linear quadratic case reduce to the generalized game algebraic Riccati equation (GARE). In the nonlinear case the HJI equations are difficult or impossible to solve, and may not have global analytic solutions even in simple cases (e.g. a scalar system, bilinear in input and state). Solution methods are generally offline and generate fixed control policies that are then implemented in online controllers in real time. In this chapter we provide methods for online gaming, that is, for solution of 2-player zero-sum infinite horizon games online, through learning the saddle point strategies in real time. The dynamics may be nonlinear in continuous time and are assumed known.
A novel neural network adaptive control technique is given that is based on reinforcement learning techniques, whereby the control and disturbance policies are tuned online using data generated in real time along the system trajectories. Also tuned is a ‘critic’ approximator structure whose function is to identify the value or outcome of the current control and disturbance policies. Based on this value estimate, the policies are continuously updated. This is a sort of indirect adaptive control algorithm, yet, due to the direct dependence of the policies on the learned value, it is effected online as direct (‘optimal’) adaptive control.

Reinforcement learning (RL) is a class of methods used in machine learning to methodically modify the actions of an agent based on observed responses from its environment (Doya, 2001). The RL methods have been developed starting from learning mechanisms observed in mammals. Every decision-making organism interacts with its environment and uses those interactions to improve its own actions in order to maximize the positive effect of its limited available resources; this in turn leads to better survival chances. RL is a means of learning optimal behaviors by observing the response from the environment to non-optimal control policies. In engineering terms, RL refers to the learning approach of an actor or agent which modifies its actions, or control policies, based on stimuli received in response to its interaction with its environment. This learning can be extended along two dimensions: i) the nature of the interaction (competitive or collaborative) and ii) the number of decision makers (single or multi agent).

In view of the advantages offered by the RL methods, a recent objective of control systems researchers is to introduce and develop RL techniques which result in optimal feedback controllers for dynamical systems that can be described in terms of ordinary differential or difference equations. These involve a computational intelligence technique known as Policy Iteration (PI) (Howard, 1960; Sutton & Barto, 1998), which refers to a class of algorithms built as a two-step iteration: policy evaluation and policy improvement. PI provides effective means of learning solutions to HJ equations online. In control theoretic terms, the PI algorithm amounts to learning the solution to a nonlinear Lyapunov equation, and then updating the policy through minimizing a Hamiltonian function. PI has primarily been developed for discrete-time systems, and online implementation for control systems has been developed through approximation of the value function based on work by (Bertsekas & Tsitsiklis, 1996) and (Werbos, 1974; Werbos, 1992). Recently, online policy iteration methods for continuous-time systems have been developed by (D. ).
In recent work (Vamvoudakis & Lewis, 2010), we developed an online approximate solution method based on PI for the (1-player) infinite horizon optimal control problem for continuous-time nonlinear systems with known dynamics. This is an optimal adaptive controller that uses two adaptive structures, one for the value (cost) function and one for the control policy. The two structures are tuned simultaneously online to learn the solution of the HJ equation and the optimal policy. This chapter presents an optimal adaptive control method that converges online to the solution to the 2-player differential game (and hence the solution of the bounded $L_2$-gain problem). Three approximator structures are used. Parameter update laws are given to tune critic, actor, and disturbance neural networks simultaneously online to converge to the solution to the HJI equation and the saddle point policies, while also guaranteeing closed-loop stability. Rigorous proofs of performance and convergence are given.

The chapter is organized as follows. Section 2 reviews the formulation of the two-player zero-sum differential game. A policy iteration algorithm is given to solve the HJI equation by successive solutions of nonlinear Lyapunov-like equations. This essentially extends Kleinman's algorithm to nonlinear zero-sum differential games. Section 3 develops the synchronous zero-sum game PI algorithm. Care is needed to develop suitable approximator structures for online solution of zero-sum games. First a suitable 'critic' approximator structure is developed for the value function and its tuning method is pinned down. A persistence of excitation condition is needed to guarantee proper convergence. Next, suitable 'actor' approximator structures are developed for the control and disturbance policies.
Finally, in Section 4, the main result is presented in Theorem 2, which shows how to tune all three approximators simultaneously by using measurements along the system trajectories in real time, and in Theorem 3, which proves exponential convergence of the critic neural network and convergence to the approximate Nash solution. Proofs using Lyapunov techniques guarantee convergence and closed-loop stability. Section 5 presents simulation examples that show the effectiveness of the online synchronous zero-sum game CT PI algorithm in learning the optimal value, control and disturbance for both linear and nonlinear systems. Interestingly, a simulation example shows that the two-player online game converges faster than an equivalent online 1-player (optimal control) problem when all the neural networks are tuned simultaneously in real time. This suggests that one learns faster if one has an opponent and uses synchronous policy iteration techniques.

Background: Two player differential game, and policy iteration
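This section presents a background review of 2-player zero-sum differential games. The objective is to lay a foundation for the structure needed in subsequent sections for online solution of these problems in real time. In this regard, the Policy Iteration Algorithm for 2-player games presented at the end of this section is key.

Consider the nonlinear time-invariant dynamical system, affine in the input, given by

$$\dot{x} = f(x) + g(x)u(x) + k(x)d(x) \tag{1}$$

with state $x(t) \in \mathbb{R}^n$, control $u(x)$ and disturbance $d(x)$. Assume that $f(x)$ is locally Lipschitz and $f(0) = 0$, so that $x = 0$ is an equilibrium point of the system. Furthermore take $g(x), k(x)$ as continuous. Define the performance index (Lewis & Syrmos, 1995)

$$J(x(0),u,d) = \int_0^\infty \left( Q(x) + u^T R u - \gamma^2 \|d\|^2 \right) dt \equiv \int_0^\infty r(x,u,d)\, dt \tag{2}$$

where $R = R^T > 0$ and $r(x,u,d) = Q(x) + u^T R u - \gamma^2 \|d\|^2$. For feedback policies $u(x)$ and disturbance policies $d(x)$, define the value of the policies as

$$V(x(t),u,d) = \int_t^\infty \left( Q(x) + u^T R u - \gamma^2 \|d\|^2 \right) d\tau \tag{3}$$

When the value is finite, a differential equivalent to this is the nonlinear Lyapunov-like equation

$$0 = r(x,u,d) + (\nabla V)^T \left( f(x) + g(x)u(x) + k(x)d(x) \right), \quad V(0) = 0 \tag{4}$$

where $\nabla V = \partial V / \partial x \in \mathbb{R}^n$ is the (transposed) gradient and the Hamiltonian is

$$H(x, \nabla V, u, d) = r(x,u,d) + (\nabla V)^T \left( f(x) + g(x)u(x) + k(x)d(x) \right) \tag{5}$$

For feedback policies (Basar & Bernard, 1995), a solution $V(x) \ge 0$ to (4) is the value (3) for given feedback policy $u(x)$ and disturbance policy $d(x)$.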
Two player zero-sum differential games and Nash equilibrium
Define the 2-player zero-sum differential game (Basar & Bernard, 1995; Basar & Olsder, 1999)

$$V^*(x(0)) = \min_u \max_d J(x(0),u,d) = \min_u \max_d \int_0^\infty \left( Q(x) + u^T R u - \gamma^2 \|d\|^2 \right) dt \tag{6}$$

subject to the dynamical constraints (1). Thus, $u$ is the minimizing player and $d$ is the maximizing one. This 2-player optimal control problem has a unique solution if a game theoretic saddle point exists, i.e., if the Nash condition holds

$$\min_u \max_d J(x(0),u,d) = \max_d \min_u J(x(0),u,d) \tag{7}$$

To this game is associated the Hamilton-Jacobi-Isaacs (HJI) equation

$$0 = Q(x) + \nabla V^T f(x) - \tfrac{1}{4} \nabla V^T g(x) R^{-1} g^T(x) \nabla V + \tfrac{1}{4\gamma^2} \nabla V^T k(x) k^T(x) \nabla V, \quad V(0) = 0 \tag{8}$$

Given a solution $V^*(x) \ge 0 : \mathbb{R}^n \to \mathbb{R}$ to the HJI (8), denote the associated control and disturbance as

$$u^* = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^* \tag{9}$$

$$d^* = \tfrac{1}{2\gamma^2} k^T(x) \nabla V^* \tag{10}$$

and write

$$0 = H(x, \nabla V^*, u^*, d^*) = Q(x) + \nabla V^{*T} f(x) - \tfrac{1}{4} \nabla V^{*T} g(x) R^{-1} g^T(x) \nabla V^* + \tfrac{1}{4\gamma^2} \nabla V^{*T} k(x) k^T(x) \nabla V^* \tag{11}$$

Note that global solutions to the HJI (11) may not exist. Moreover, if they do, they may not be smooth. For a discussion on viscosity solutions to the HJI, see (Ball & Helton, 1996; Bardi & Capuzzo-Dolcetta, 1997; Basar & Bernard, 1995). The HJI equation (11) may have more than one nonnegative local smooth solution $V(x) \ge 0$. A minimal nonnegative solution $V_a(x) \ge 0$ is one such that there exists no other nonnegative solution $V(x) \ge 0$ such that $V_a(x) \ge V(x) \ge 0$. Linearize the system (1) about the origin to obtain the Generalized ARE (see Section IV.A). Of the nonnegative solutions to the GARE, select the one corresponding to the stable invariant manifold of the Hamiltonian matrix. Then, the minimum nonnegative solution of the HJI is the one having this stabilizing GARE solution as its Hessian matrix evaluated at the origin (Van Der Shaft, 1992). It is shown in (Basar & Bernard, 1995) that if $V^*(x)$ is the minimum nonnegative solution to the HJI (11) and (1) is locally detectable, then (9), (10) given in terms of $V^*(x)$ are a Nash equilibrium solution to the zero-sum game and $V^*(x)$ is its value.
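The saddle-point relations can be checked on a small example. The following is a minimal sketch, assuming a hypothetical scalar linear system $\dot x = ax + bu + kd$ with cost integrand $qx^2 + ru^2 - \gamma^2 d^2$; the system, the numerical values and all function names are illustrative and are not taken from the chapter. For $V^*(x) = px^2$ the HJI equation reduces to the scalar GARE $2ap + q - p^2(b^2/r - k^2/\gamma^2) = 0$, and the associated policies are $u^* = -(bp/r)x$ and $d^* = (kp/\gamma^2)x$.

```python
import math


def solve_scalar_gare(a, b, k, q, r, gamma):
    """Stabilizing root p > 0 of 2*a*p + q - p**2*(b**2/r - k**2/gamma**2) = 0.

    Assumption of this sketch: b**2/r > k**2/gamma**2, so the quadratic has
    one positive root and that root makes the closed loop stable.
    """
    s = b**2 / r - k**2 / gamma**2
    if s <= 0:
        raise ValueError("this sketch needs b^2/r > k^2/gamma^2")
    return (a + math.sqrt(a**2 + s * q)) / s


def saddle_policies(x, p, b, k, r, gamma):
    """Control and disturbance obtained from the gradient of V(x) = p*x**2."""
    grad_v = 2.0 * p * x
    u_star = -0.5 * (1.0 / r) * b * grad_v      # minimizing player
    d_star = 0.5 / gamma**2 * k * grad_v        # maximizing player
    return u_star, d_star


def hamiltonian(x, grad_v, u, d, a, b, k, q, r, gamma):
    """Scalar Hamiltonian: running cost plus grad_v times the dynamics."""
    return q * x**2 + r * u**2 - gamma**2 * d**2 + grad_v * (a * x + b * u + k * d)
```

Evaluating `hamiltonian` at the pair returned by `saddle_policies` gives zero up to rounding, while perturbing only `d` lowers it and perturbing only `u` raises it, which is the saddle-point property in this scalar setting.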

Policy iteration solution of the HJI equation
The HJI equation (11) is usually intractable to solve directly. One can solve the HJI iteratively using one of several algorithms that are built on iterative solutions of the Lyapunov equation (4). Included are (Feng et al., 2009), which uses an inner loop with iterations on the control, and (Abu-Khalaf & Lewis, 2008; Abu-Khalaf et al., 2006; Van Der Shaft, 1992), which use an inner loop with iterations on the disturbance. These are in effect extensions of Kleinman's algorithm (Kleinman, 1968) to nonlinear 2-player games. The complementarity of these algorithms is shown in (Vrabie, 2009). Here, we shall use the latter algorithm (e.g. (Abu-Khalaf & Lewis, 2008; Abu-Khalaf et al., 2006; Van Der Shaft, 1992)).

Policy Iteration (PI) Algorithm for 2-Player Zero-Sum Differential Games (Van Der Shaft, 1992)
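Initialization: Start with a stabilizing feedback control policy $u_0$.
1. For $j = 0, 1, \ldots$ given $u_j$:
2. For $i = 0, 1, \ldots$ set $d^0 = 0$, and solve for $V_j^i(x(t))$ and $d^{i+1}$ using

$$0 = Q(x) + \nabla V_j^{iT}(x) \left( f + g u_j + k d^i \right) + u_j^T R u_j - \gamma^2 \|d^i\|^2 \tag{12}$$

$$d^{i+1} = \arg\max_d \left[ H(x, \nabla V_j^i, u_j, d) \right] = \tfrac{1}{2\gamma^2} k^T(x) \nabla V_j^i \tag{13}$$

On convergence, set $V_{j+1}(x) = V_j^i(x)$.
3. Update the control policy using

$$u_{j+1} = \arg\min_u \left[ H(x, \nabla V_{j+1}, u, d) \right] = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V_{j+1} \tag{14}$$

Go to 1. ■

Nota Bene: In practice, the iterations in $i$ and $j$ are continued until some convergence criterion is met, e.g. $\|V_j^{i+1} - V_j^i\|$ or, respectively, $\|V_{j+1} - V_j\|$ is small enough in some suitable norm.

Given a feedback policy $u(x)$, write the Hamilton-Jacobi (HJ) equation

$$0 = Q(x) + \nabla V^T(x) \left( f(x) + g(x)u(x) \right) + u^T(x) R u(x) + \tfrac{1}{4\gamma^2} \nabla V^T(x) k k^T \nabla V(x), \quad V(0) = 0 \tag{15}$$

for fixed $u(x)$. The minimal nonnegative solution $V(x)$ to this equation is the so-called available storage for the given $u(x)$ (Van Der Shaft, 1992). Note that the inner loop of this algorithm finds the available storage for $u_j$, where it exists. Assuming that the available storage at each index $j$ is smooth on a local domain of validity, the convergence of this algorithm to the minimal nonnegative solution to the HJI equation is shown in (Abu-Khalaf & Lewis, 2008; Van Der Shaft, 1992). Under these assumptions, the existence of smooth solutions at each step to the Lyapunov-like equation (12) was further shown in (Abu-Khalaf et al., 2006). Also shown was the asymptotic stability of the closed-loop system under the iterated policies.

Note that this algorithm relies on successive solutions of nonlinear Lyapunov-like equations (12). As such, the discussion surrounding (4) shows that the algorithm finds the value $V_j^i(x(t))$ of successive control policy/disturbance policy pairs.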
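The two nested loops of the PI algorithm can be illustrated on a scalar linear-quadratic game. This is a hedged sketch under assumptions that are not from the chapter: a hypothetical system $\dot x = ax + bu + kd$, policies $u = -Kx$ and $d = Lx$, and value $V = px^2$. In that case the Lyapunov-like equation (12) has the closed-form solution $p = (q + rK^2 - \gamma^2 L^2) / (-2(a - bK + kL))$, and the updates (13), (14) become $L = kp/\gamma^2$ and $K = bp/r$. All names and tolerances are illustrative.

```python
def lyapunov_value(a, b, k, q, r, gamma, K, L):
    """Scalar analogue of (12): value coefficient p for u = -K*x, d = L*x."""
    a_cl = a - b * K + k * L                    # closed-loop pole
    if a_cl >= 0:
        raise ValueError("closed loop is not stable, so the value is not finite")
    return (q + r * K**2 - gamma**2 * L**2) / (-2.0 * a_cl)


def policy_iteration(a, b, k, q, r, gamma, K0, tol=1e-12, max_iter=200):
    """Outer loop on the control gain K, inner loop on the disturbance gain L."""
    K = K0
    p = None
    p_outer = None
    L = 0.0
    for _ in range(max_iter):                   # outer loop, index j
        L = 0.0                                 # d^0 = 0
        p_inner = None
        for _ in range(max_iter):               # inner loop, index i
            p = lyapunov_value(a, b, k, q, r, gamma, K, L)
            L = k * p / gamma**2                # disturbance update, cf. (13)
            if p_inner is not None and abs(p - p_inner) < tol:
                break
            p_inner = p
        K_new = b * p / r                       # control update, cf. (14)
        if p_outer is not None and abs(p - p_outer) < tol:
            return p, K_new, L
        p_outer, K = p, K_new
    return p, K, L
```

For the illustrative values $a = -1$, $b = k = q = r = 1$, $\gamma = 2$ and the stabilizing initial gain $K_0 = 0$, the iteration settles at the stabilizing root of the scalar GARE $2ap + q - p^2(b^2/r - k^2/\gamma^2) = 0$.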

Approximator structure and solution of the Lyapunov equation
The PI Algorithm is a sequential algorithm that solves the HJI equation (11) and finds the Nash solution $(u^*, d^*)$ based on sequential solutions of the nonlinear Lyapunov equation (12). That is, while the disturbance policy is being updated, the feedback policy is held constant. In this section, we use PI to lay a rigorous foundation for the NN approximator structure required for online solution of the 2-player zero-sum differential game in real time. In the next section, this structure will be used to develop an adaptive control algorithm of novel form that converges to the ZS game solution. It is important to define the neural network structures and the NN estimation errors properly, or such an adaptive algorithm cannot be developed. The PI algorithm itself is not implemented in this chapter. Instead, here one implements both loops, the outer feedback control update loop and the inner disturbance update loop, simultaneously using neural network learning implemented as differential equations for tuning the weights, while simultaneously keeping track of and learning the value $V(x(t), u, d)$ (3) of the current control and disturbance by solution of the Lyapunov equation (4)/(12). We call this synchronous PI for zero-sum games.

Value function approximation: Critic Neural Network Structure
This chapter uses nonlinear approximator structures (e.g. neural networks) for Value Function Approximation (VFA) (Bertsekas & Tsitsiklis, 1996; Werbos, 1974; Werbos, 1992), therefore sacrificing some representational accuracy in order to make the representation manageable in practice. Sacrificing accuracy in the representation of the value function is not so critical, since the ultimate goal is to find a good policy and not necessarily an accurate value function. Based on the structure of the PI algorithm in Section IIB, VFA for online 2-player games requires three approximators, which are taken as neural networks (NN): one for the value function, one for the feedback control policy, and one for the disturbance policy. These are motivated respectively by the need to solve equations (12), (14), and (13). To solve equation (12), we use VFA, which here requires approximation in Sobolev norm (Adams & Fournier, 2003), that is, approximation of the value $V(x)$ as well as its gradient $\nabla V(x)$. The following definition describes uniform convergence that is needed later.

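Definition 2. (uniform convergence). A sequence of functions $\{p_n\}$ converges uniformly to $p$ on a set $\Omega$ if $\forall \epsilon > 0, \exists N(\epsilon) : \sup_{x \in \Omega} \|p_n(x) - p(x)\| < \epsilon$ for all $n > N(\epsilon)$.

Assumption 1. For each feedback control and disturbance policy the nonlinear Lyapunov equation (12) has a smooth local solution $V(x) \ge 0$.

According to the Weierstrass higher-order approximation Theorem (Abu-Khalaf & Lewis, 2005; Finlayson, 1990; Hornik et al., 1990), there exists a complete independent basis set $\{\varphi_i(x)\}$ such that the solution $V(x)$ to (4) and its gradient are uniformly approximated, that is, there exist coefficients $c_i$ such that

$$V(x) = \sum_{i=1}^{\infty} c_i \varphi_i(x) = \sum_{i=1}^{N} c_i \varphi_i(x) + \sum_{i=N+1}^{\infty} c_i \varphi_i(x)$$

$$\frac{\partial V(x)}{\partial x} = \sum_{i=1}^{\infty} c_i \frac{\partial \varphi_i(x)}{\partial x} = \sum_{i=1}^{N} c_i \frac{\partial \varphi_i(x)}{\partial x} + \sum_{i=N+1}^{\infty} c_i \frac{\partial \varphi_i(x)}{\partial x}$$

and the second terms in these equations converge uniformly to zero as $N \to \infty$. Specifically, the linear subspace generated by the basis set is dense in the Sobolev norm $W^{1,\infty}$ (Adams & Fournier, 2003). Therefore, assume there exist NN weights $W_1$ such that the value function $V(x)$ is approximated as

$$V(x) = W_1^T \phi_1(x) + \varepsilon(x) \tag{16}$$

where $\phi_1(x) : \mathbb{R}^n \to \mathbb{R}^N$ is the NN activation function vector, $N$ is the number of hidden-layer neurons, and $\varepsilon(x)$ is the NN approximation error. The activation functions are selected so that $\{\varphi_i(x)\}$ provides a complete independent basis set such that $V(x)$ and its derivative are uniformly approximated, e.g., additionally

$$\frac{\partial V}{\partial x} = \left( \frac{\partial \phi_1(x)}{\partial x} \right)^T W_1 + \frac{\partial \varepsilon}{\partial x} = \nabla \phi_1^T W_1 + \nabla \varepsilon \tag{17}$$

Then, as the number of hidden-layer neurons $N \to \infty$, the approximation errors $\varepsilon \to 0$, $\nabla \varepsilon \to 0$ uniformly (Abu-Khalaf & Lewis, 2005; Finlayson, 1990). In addition, for fixed $N$, the NN approximation errors $\varepsilon(x)$ and $\nabla \varepsilon$ are bounded by constants locally (Hornik et al., 1990). We refer to the NN with weights $W_1$ that performs VFA as the critic NN.

Standard usage of the Weierstrass high-order approximation Theorem uses polynomial approximation. However, non-polynomial basis sets have been considered in the literature (e.g. (Hornik et al., 1990; Sandberg, 1997)). The NN approximation literature has considered a variety of activation functions including sigmoids, tanh, radial basis functions, etc.

Using the NN VFA, considering fixed feedback and disturbance policies $u(x(t)), d(x(t))$, equation (4) becomes

$$H(x, W_1, u, d) = W_1^T \nabla \phi_1 (f + gu + kd) + Q(x) + u^T R u - \gamma^2 \|d\|^2 = \varepsilon_H \tag{20}$$

where the residual error is

$$\varepsilon_H = -(\nabla \varepsilon)^T (f + gu + kd) \tag{21}$$

Under the Lipschitz assumption on the dynamics, this residual error is bounded locally. The following Proposition has been shown in (Abu-Khalaf & Lewis, 2005; Abu-Khalaf & Lewis, 2008).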
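Define $|v|$ as the magnitude of a scalar $v$, $\|x\|$ as the vector norm of a vector $x$, and $\|\cdot\|_2$ as the induced matrix 2-norm.

Proposition 1. For any policies $u(x(t)), d(x(t))$ the least-squares solution to (20) exists and is unique for each $N$. Denote this solution as $W_1$ and define $V_1(x) = W_1^T \phi_1(x)$. Then, as $N \to \infty$, $V_1(x)$ converges uniformly in Sobolev norm $W^{1,\infty}$ (Adams & Fournier, 2003) to the exact solution $V(x)$ to (4), and the weights $W_1$ converge to the first $N$ of the weights, $C_1$, which exactly solve (4).

The effect of the approximation error on the HJI equation (8) is

$$Q(x) + W_1^T \nabla \phi_1 f - \tfrac{1}{4} W_1^T \nabla \phi_1 g R^{-1} g^T \nabla \phi_1^T W_1 + \tfrac{1}{4\gamma^2} W_1^T \nabla \phi_1 k k^T \nabla \phi_1^T W_1 = \varepsilon_{HJI}$$

where the residual error due to the function approximation error is

$$\varepsilon_{HJI} \equiv -\nabla \varepsilon^T f + \tfrac{1}{2} W_1^T \nabla \phi_1 g R^{-1} g^T \nabla \varepsilon + \tfrac{1}{4} \nabla \varepsilon^T g R^{-1} g^T \nabla \varepsilon - \tfrac{1}{2\gamma^2} W_1^T \nabla \phi_1 k k^T \nabla \varepsilon - \tfrac{1}{4\gamma^2} \nabla \varepsilon^T k k^T \nabla \varepsilon$$

It was also shown in (Abu-Khalaf & Lewis, 2005; Abu-Khalaf & Lewis, 2008) that this error converges uniformly to zero as the number of hidden layer units $N$ increases. That is, $\forall \epsilon > 0, \exists N(\epsilon) : \sup \|\varepsilon_{HJI}\| < \epsilon$.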
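The critic parameterization, a weight vector multiplying a basis vector together with the gradient needed for approximation in Sobolev norm, can be sketched as follows. The quadratic polynomial basis for a two-dimensional state is an assumption made for illustration, and the function names are hypothetical; the chapter leaves the choice of activation functions open.

```python
def phi(x):
    """Quadratic basis vector for a two-dimensional state x = (x1, x2)."""
    x1, x2 = x
    return [x1 * x1, x1 * x2, x2 * x2]


def grad_phi(x):
    """Jacobian of phi with respect to x: one row per basis function."""
    x1, x2 = x
    return [[2.0 * x1, 0.0],
            [x2, x1],
            [0.0, 2.0 * x2]]


def critic_value(W, x):
    """Approximate value: inner product of the weights with the basis vector."""
    return sum(w * p for w, p in zip(W, phi(x)))


def critic_gradient(W, x):
    """Approximate value gradient: transposed Jacobian times the weights."""
    G = grad_phi(x)
    return [sum(W[i] * G[i][j] for i in range(len(W))) for j in range(len(x))]
```

With this basis, a quadratic value $x^T P x$ with symmetric $P$ is represented exactly by the weights $[P_{11}, 2P_{12}, P_{22}]$, so the approximation error vanishes in the linear-quadratic case.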

Tuning and convergence of the critic neural network
In this section are addressed the tuning and convergence of the critic NN weights when fixed feedback control and disturbance policies are prescribed. Therefore, the focus is on solving the nonlinear Lyapunov-like equation (4) (e.g. (12)) for a fixed feedback policy u and fixed disturbance policy d.
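In fact, this amounts to the design of an observer for the value function. Therefore, this algorithm is consistent with adaptive control approaches which first design an observer for the system state and unknown dynamics, and then use this observer in the design of a feedback control.

The ideal weights of the critic NN, $W_1$, which provide the best approximate solution for (20), are unknown. Therefore, the output of the critic neural network is

$$\hat{V}(x) = \hat{W}_1^T \phi_1(x) \tag{25}$$

where $\hat{W}_1$ are the current estimated values of $W_1$. The critic weights are tuned by the normalized gradient descent law

$$\dot{\hat{W}}_1 = -a_1 \frac{\sigma_1}{(1 + \sigma_1^T \sigma_1)^2} \left[ \sigma_1^T \hat{W}_1 + Q(x) + u^T R u - \gamma^2 \|d\|^2 \right] \tag{27}$$

where $\sigma_1 = \nabla \phi_1 (f + gu + kd)$ and $a_1 > 0$ is the learning rate. In this law, $(1 + \sigma_1^T \sigma_1)^2$ is used for normalization instead of $(1 + \sigma_1^T \sigma_1)$. This is required in the theorem proofs, where one needs both appearances of $\sigma_1 / (1 + \sigma_1^T \sigma_1)$ in (27) to be bounded (Ioannou & Fidan, 2006; Tao, 2003). Note that, from (20),

$$Q(x) + u^T R u - \gamma^2 \|d\|^2 = -W_1^T \nabla \phi_1 (f + gu + kd) + \varepsilon_H \tag{28}$$

Substituting (28) in (27) and, with the notation

$$\bar{\sigma}_1 = \frac{\sigma_1}{\sigma_1^T \sigma_1 + 1}, \quad m_s = 1 + \sigma_1^T \sigma_1, \quad \tilde{W}_1 = W_1 - \hat{W}_1$$

we obtain the dynamics of the critic weight estimation error as

$$\dot{\tilde{W}}_1 = -a_1 \bar{\sigma}_1 \bar{\sigma}_1^T \tilde{W}_1 + a_1 \bar{\sigma}_1 \frac{\varepsilon_H}{m_s}$$

Persistence of Excitation (PE) Assumption. Let the signal $\bar{\sigma}_1$ be persistently exciting over the interval $[t, t+T]$, i.e. there exist constants $\beta_1 > 0$, $\beta_2 > 0$, $T > 0$ such that, for all $t$,

$$\beta_1 \mathrm{I} \le \int_t^{t+T} \bar{\sigma}_1(\tau) \bar{\sigma}_1^T(\tau)\, d\tau \le \beta_2 \mathrm{I} \tag{31}$$

with $\mathrm{I}$ the identity matrix of appropriate dimensions. The PE assumption is needed in adaptive control if one desires to perform system identification using e.g. RLS (Ioannou & Fidan, 2006; Tao, 2003). It is needed here because one effectively desires to identify the critic parameters to approximate $V(x)$.

The properties of tuning algorithm (27) rest on two technical lemmas. The PE condition (31) is equivalent to the uniform complete observability (UCO) (Lewis, Jagannathan, Yesildirek, 1999) of the critic weight error dynamics, and the associated bounds involve a positive constant $\delta$ of the order of 1.

Theorem 1. Let $u(x(t)), d(x(t))$ be any bounded policies. Let tuning for the critic NN be provided by (27) and assume that $\bar{\sigma}_1$ is persistently exciting. Let the residual error in (20) be bounded, $\|\varepsilon_H\| < \varepsilon_{\max}$. Then the critic parameter error converges exponentially to a residual set whose size is determined by $\varepsilon_{\max}$.

Remark 1. As $N \to \infty$, $\varepsilon_H \to 0$ uniformly (Abu-Khalaf & Lewis, 2005; Abu-Khalaf & Lewis, 2008). This means that $\varepsilon_{\max}$ decreases as the number of hidden layer neurons in (25) increases.

Remark 2. This theorem requires the assumption that the feedback policy $u(x(t))$ and the disturbance policy $d(x(t))$ are bounded, since the policies appear in (21). In the upcoming Theorems 2 and 3 this restriction is removed.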
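The behavior of the normalized gradient descent tuning law can be illustrated numerically. The sketch below rests on several assumptions that are not from the chapter: the critic law is taken in the form $\dot{\hat W}_1 = -a_1 \sigma_1 (1 + \sigma_1^2)^{-2} [\sigma_1 \hat W_1 + r(x,u,d)]$; the system is a hypothetical scalar plant $\dot x = ax + bu + kd$ with fixed policies $u = -Kx$, $d = Lx$ and a single basis function $x^2$; a sinusoidal probing signal keeps the state excited; and forward Euler integration is used. The regressor and the cost are evaluated for the nominal policies.

```python
import math


def simulate_critic(a=-1.0, b=1.0, k=1.0, q=1.0, r=1.0, gamma=2.0,
                    K=0.5, L=0.1, a1=10.0, dt=1e-3, t_final=20.0):
    """Tune one critic weight for fixed policies u = -K*x and d = L*x."""
    a_cl = a - b * K + k * L                          # closed-loop pole
    cost_coeff = q + r * K**2 - gamma**2 * L**2       # r(x,u,d) = cost_coeff * x^2
    x, w_hat, t = 1.0, 0.0, 0.0
    for _ in range(int(round(t_final / dt))):
        sigma1 = 2.0 * x * (a_cl * x)                 # d(x^2)/dx times (f + g*u + k*d)
        residual = sigma1 * w_hat + cost_coeff * x**2
        # normalized gradient step with the squared normalization term
        w_hat += dt * (-a1 * sigma1 / (1.0 + sigma1**2) ** 2 * residual)
        # probing signal (an assumption of this sketch) keeps x persistently excited
        x += dt * (a_cl * x + 2.0 * math.sin(t))
        t += dt
    w_ideal = cost_coeff / (-2.0 * a_cl)              # exact Lyapunov solution
    return w_hat, w_ideal
```

Because the single basis function represents the value exactly in this example, the residual error is zero and the estimate approaches the exact Lyapunov solution rather than a residual set. Without the probing term the state decays, the regressor vanishes, and learning stalls, which is the role of the PE condition.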

Action and disturbance neural network
It is important to define the neural network structure and the NN estimation errors properly for the control and disturbance, or an adaptive algorithm cannot be developed. To determine a rigorously justified form for the actor and the disturbance NN, consider one step of the Policy Iteration algorithm (12)-(14). Suppose that the solution $V(x)$ to the nonlinear Lyapunov equation (12) for given control and disturbance policies is smooth and given by (16). Then, according to (17) and (13), (14), one has for the policy and the disturbance updates:

$$u = -\tfrac{1}{2} R^{-1} g^T(x) \left( \nabla \phi_1^T W_1 + \nabla \varepsilon \right)$$

$$d = \tfrac{1}{2\gamma^2} k^T(x) \left( \nabla \phi_1^T W_1 + \nabla \varepsilon \right)$$

Online solution of 2-player zero-sum games using neural networks
This section presents our main results. An online adaptive PI algorithm is given for online solution of the zero-sum game problem which involves simultaneous, or synchronous, tuning of critic, actor, and disturbance neural networks. That is, the weights of all three neural networks are tuned at the same time. This approach is a version of Generalized Policy Iteration (GPI), as introduced in (Sutton & Barto, 1998). In the standard Policy Iteration algorithm (12)-(14), the critic and actor NNs are tuned sequentially, i.e. one at a time, with the weights of the other NNs being held constant. By contrast, we tune all NNs simultaneously in real time. The next definition and facts complete the machinery required for the main results.
The main Theorems are now given, which provide the tuning laws for the actor, critic and disturbance neural networks that guarantee convergence of the synchronous online zero-sum game PI algorithm in real time to the game saddle point solution, while guaranteeing closed-loop stability.

Theorem 2. System stability and convergence of NN weights. Let the dynamics be given by (1), the critic NN be given by (25), the control input be given by actor NN (41) ... In the proof it is seen that the theorem holds for $N > N_0$.

Remark 4. The theorem shows that PE is needed for proper identification of the value function by the critic NN, and that nonstandard tuning algorithms are required for the actor and the disturbance NN to guarantee stability.

Remark 5. The assumption $Q(x) > 0$ is sufficient but not necessary for this result. If this condition is replaced by zero state observability, the proof still goes through; however, it is tedious and does not add insight. The method used would be the technique used in the proof of technical Lemma 2 Part a in (Vamvoudakis & Lewis), or the standard methods of (Ioannou & Fidan, 2006; Tao, 2003).

Simulations
Here we present simulations of a linear and a nonlinear system to show that the game can be solved ONLINE by learning in real time, using the method of this chapter. We also present Simulation B to show that one learns FASTER if one has an opponent. That is, the two-player online game converges faster than an equivalent online 1-player (optimal control) problem when all the NNs are tuned online in real time.

Linear system
Consider the continuous-time F16 aircraft plant with quadratic cost function used in (Stevens & Lewis, 2003). The synchronous zero-sum game PI algorithm is implemented as in Theorem 2. PE was ensured by adding a small probing noise to the control and the disturbance input. Figure 1 shows the critic parameters. The evolution of the system states is presented in Figure 2. One can see that after 300 s convergence of the NN weights in critic, actor and disturbance has occurred.

Single player linear system
The purpose of this example is to show that one learns FASTER if one has an opponent. That is, the online two-player game converges faster than an equivalent online 1-player (optimal control) problem. In this example, we use the method for online solution of the optimal control problem presented in (Vamvoudakis & Lewis, 2010). That is, Theorem 2 without the disturbance NN (47).
Consider the continuous-time F16 aircraft plant described before but with $d = 0$. Solving the ARE with $Q$ and $R$ identity matrices of appropriate dimensions gives the parameters of the optimal critic. In comparison with part A, it is very clear that the two-player zero-sum game algorithm converges faster than the single-player game (i.e. the optimal control problem) by a factor of two. In conclusion, the critic NN learns faster when there is an opponent for the control input, namely a disturbance.

Fig. 3. Convergence of the critic parameters to the parameters of the optimal critic.

Nonlinear system
Consider the following affine in control input nonlinear system, with a quadratic cost constructed as in (Nevistic & Primbs, 1996; D. Vrabie, Vamvoudakis & Lewis, 2009). Figure 4 shows the critic parameters. The disturbance NN also converged to the optimal disturbance. The evolution of the system states is presented in Figure 5. Figure 6 shows the optimal value function. The identified value function is virtually indistinguishable from the exact solution and so is not plotted. In fact, Figure 7 shows the 3-D plot of the difference between the approximated value function and the optimal one. This error is close to zero, so a good approximation of the actual value function is obtained. Figure 8 shows the 3-D plot of the difference between the approximated control, obtained using the online algorithm, and the optimal one. This error is close to zero. Finally, Figure 9 shows the 3-D plot of the difference between the approximated disturbance, obtained using the online algorithm, and the optimal one. This error is also close to zero.

It is now straightforward to demonstrate that if $L$ exceeds a certain bound, then $\dot{L}$ is negative. Therefore, according to the standard Lyapunov extension theorem (Lewis, Jagannathan, Yesildirek, 1999; Khalil, 1996), the quantities entering $L$ are uniformly ultimately bounded. Now consider the error dynamics and the output as in Technical Lemmas 1, 2 and assume $\bar{\sigma}_2$ is persistently exciting