Online Adaptive Learning Solution of Multi-agent Differential Graphical Games


Introduction
Distributed networks have received much attention in recent years because of their flexibility and computational performance. The ability to coordinate agents is important in many real-world tasks where it is necessary for agents to exchange information with each other. Synchronization behavior among agents is found in flocking of birds, schooling of fish, and other natural systems. Work has been done to develop cooperative control methods for consensus and synchronization (Fax and Murray, 2004; Jadbabaie, Lin and Morse, 2003; Olfati-Saber, and Murray, 2004; Qu, 2009; Ren, Beard, and Atkins, 2005; Ren, and Beard, 2005; Ren, and Beard, 2008; Tsitsiklis, 1984). See (Olfati-Saber, Fax, and Murray, 2007; Ren, Beard, and Atkins, 2005) for surveys. Leaderless consensus results in all nodes converging to a common value that cannot generally be controlled. We call this the cooperative regulator problem. On the other hand, the problem of cooperative tracking requires that all nodes synchronize to a leader or control node (Hong, Hu, and Gao, 2006; Li, Wang, and Chen, 2004; Ren, Moore, and Chen, 2007; Wang, and Chen, 2002). This has been called pinning control or control with a virtual leader. Consensus has been studied for systems on communication graphs with fixed or varying topologies and communication delays.
Game theory provides an ideal environment in which to study multi-player decision and control problems, and offers a wide range of challenging and engaging problems. Game theory (Tijs, 2003) has been successful in modeling strategic behavior, where the outcome for each player depends on his own actions and those of all the other players. Every player chooses a control to minimize his own performance objective independently of the others. Multi-player cooperative games rely on solving coupled Hamilton-Jacobi (HJ) equations, which in the linear quadratic case reduce to the coupled algebraic Riccati equations (Basar, and Olsder, 1999; Freiling, Jank, and Abou-Kandil, 2002; Gajic, and Li, 1988). These coupled equations are difficult to solve. Solution methods are generally offline and generate fixed control policies that are then implemented in online controllers in real time.

Synchronization and node error dynamics
Graphs
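Consider a graph $G = (V, \mathcal{E})$ with a nonempty finite set of $N$ nodes $V = \{v_1, \ldots, v_N\}$ and a set of edges or arcs $\mathcal{E} \subseteq V \times V$. We assume the graph is simple, i.e. it has no repeated edges and no self-loops, $(v_i, v_i) \notin \mathcal{E}$, $\forall i$. Denote the connectivity matrix as $E = [e_{ij}]$, with $e_{ij} > 0$ if $(v_j, v_i) \in \mathcal{E}$ and $e_{ij} = 0$ otherwise. The set of neighbors of node $v_i$ is $N_i = \{v_j : (v_j, v_i) \in \mathcal{E}\}$, i.e. the set of nodes with arcs incoming to $v_i$. Define the in-degree matrix as the diagonal matrix $D = \mathrm{diag}(d_i)$, with $d_i = \sum_{j \in N_i} e_{ij}$ the weighted in-degree of node $i$ (i.e. the $i$-th row sum of $E$). Define the graph Laplacian matrix as $L = D - E$, which has all row sums equal to zero.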
A directed path is a sequence of nodes $v_0, v_1, \ldots, v_r$ such that $(v_i, v_{i+1}) \in \mathcal{E}$, $i \in \{0, 1, \ldots, r-1\}$. A directed graph is strongly connected if there is a directed path from $v_i$ to $v_j$ for all distinct nodes $v_i, v_j \in V$. A (directed) tree is a connected digraph where every node except one, called the root, has in-degree equal to one. A graph is said to have a spanning tree if a subset of the edges forms a directed tree. A strongly connected digraph contains a spanning tree.
General directed graphs with fixed topology are considered in this chapter.
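The graph quantities above can be illustrated with a minimal sketch. It assumes the usual convention that entry $e_{ij}$ of the connectivity matrix weights the arc from node $j$ into node $i$; the three-node cycle and the function name `laplacian` are hypothetical and chosen only for illustration.

```python
def laplacian(E):
    """Return (D, L) for connectivity matrix E, where E[i][j] > 0 is an arc j -> i."""
    n = len(E)
    d = [sum(row) for row in E]  # weighted in-degrees: row sums of E
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - E[i][j] for j in range(n)] for i in range(n)]  # L = D - E
    return D, L


# Hypothetical directed cycle 1 -> 2 -> 3 -> 1 with unit edge weights.
E = [[0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
D, L = laplacian(E)
row_sums = [sum(row) for row in L]  # every row of L sums to zero
```

Because each diagonal entry of $L$ is the corresponding row sum of $E$, the zero-row-sum property holds for any nonnegative connectivity matrix, not only for this example.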

Synchronization and node error dynamics
Consider the $N$ systems or agents distributed on communication graph $G$ with node dynamics
$$\dot{x}_i = A x_i + B_i u_i \tag{1}$$
where $x_i(t) \in \mathbb{R}^n$ is the state of node $i$ and $u_i(t) \in \mathbb{R}^{m_i}$ its control input. Cooperative team objectives may be prescribed in terms of the local neighborhood tracking error $\delta_i \in \mathbb{R}^n$ (Khoo, Xie, and Man, 2009) as
$$\delta_i = \sum_{j \in N_i} e_{ij}(x_i - x_j) + g_i(x_i - x_0) \tag{2}$$
The pinning gain $g_i \ge 0$ is nonzero for a small number of nodes $i$ that are coupled directly to the leader or control node $x_0$, and $g_i > 0$ for at least one $i$ (Li, Wang, and Chen, 2004). We refer to the nodes $i$ for which $g_i \neq 0$ as the pinned or controlled nodes. Note that $\delta_i$ represents the information available to node $i$ for state feedback purposes as dictated by the graph structure.
The state of the control or target node is $x_0(t) \in \mathbb{R}^n$, which satisfies the dynamics
$$\dot{x}_0 = A x_0 \tag{3}$$
Note that this is in fact a command generator (Lewis, 1992) and we seek to design a cooperative control command generator tracker. Note that the trajectory generator $A$ may not be stable.
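The synchronization control design problem is to design local control protocols for all the nodes in $G$ to synchronize to the state of the control node, i.e. one requires $x_i(t) \to x_0(t)$, $\forall i$.
From (2), the overall error vector for the network is given by
$$\delta = ((L + G) \otimes I_n)(x - \underline{x}_0) = ((L + G) \otimes I_n)\,\zeta \tag{4}$$
where the global vectors are $x = [x_1^T \; x_2^T \cdots x_N^T]^T \in \mathbb{R}^{nN}$, $\delta = [\delta_1^T \; \delta_2^T \cdots \delta_N^T]^T \in \mathbb{R}^{nN}$, and $\underline{x}_0 = \underline{I} x_0 \in \mathbb{R}^{nN}$, with $\underline{I} = \underline{1} \otimes I_n \in \mathbb{R}^{nN \times n}$ and $\underline{1}$ the $N$-vector of ones. The Kronecker product is $\otimes$ (Brewer, 1978). $G \in \mathbb{R}^{N \times N}$ is a diagonal matrix with diagonal entries equal to the pinning gains $g_i$. The (global) consensus or synchronization error (e.g. the disagreement vector in (Olfati-Saber, and Murray, 2004)) is
$$\zeta = x - \underline{x}_0 \in \mathbb{R}^{nN} \tag{5}$$
The communication digraph is assumed to be strongly connected. Then, if $g_i \neq 0$ for at least one $i$, $(L + G)$ is nonsingular with all eigenvalues having positive real parts (Khoo, Xie, and Man, 2009). The next result therefore follows from (4), the Cauchy-Schwarz inequality, and the properties of the Kronecker product (Brewer, 1978).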
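Lemma 1. Let the graph be strongly connected and $G \neq 0$. Then the synchronization error is bounded by
$$\|\zeta\| \le \|\delta\| / \underline{\sigma}(L + G) \tag{6}$$
with $\underline{\sigma}(L + G)$ the minimum singular value of $(L + G)$, and $\delta(t) \equiv 0$ if and only if the nodes synchronize, that is
$$x(t) = \underline{I} x_0(t) \tag{7}$$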

■
Our objective now shall be to make small the local neighborhood tracking errors $\delta_i(t)$, which in view of Lemma 1 will guarantee synchronization.
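To find the dynamics of the local neighborhood tracking error, differentiate (2) and substitute (1) and (3) to write
$$\dot{\delta}_i = \sum_{j \in N_i} e_{ij}(\dot{x}_i - \dot{x}_j) + g_i(\dot{x}_i - \dot{x}_0) = A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j \tag{8}$$
This is a dynamical system with multiple control inputs, from node $i$ and all of its neighbors.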
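As a numerical illustration of the error definitions, the following sketch computes the local neighborhood tracking errors node by node, compares them with the global Kronecker form $\delta = ((L+G) \otimes I_n)(x - \underline{x}_0)$, and checks the bound $\|\zeta\| \le \|\delta\| / \underline{\sigma}(L+G)$. The graph, pinning gains, and random states are hypothetical values chosen for the example, and NumPy is assumed to be available.

```python
import numpy as np

n, N = 2, 3                                  # state dimension, number of nodes
E = np.array([[0., 0., 1.],                  # hypothetical cycle 1 -> 2 -> 3 -> 1
              [1., 0., 0.],
              [0., 1., 0.]])
g = np.array([0., 0., 1.])                   # only node 3 is pinned to the leader
L = np.diag(E.sum(axis=1)) - E               # graph Laplacian L = D - E
G = np.diag(g)

rng = np.random.default_rng(0)
x = rng.standard_normal((N, n))              # row i holds the state of node i
x0 = rng.standard_normal(n)                  # leader state

# Local errors: delta_i = sum_j e_ij (x_i - x_j) + g_i (x_i - x_0)
delta_local = np.array([
    sum(E[i, j] * (x[i] - x[j]) for j in range(N)) + g[i] * (x[i] - x0)
    for i in range(N)
])

# Global form: delta = ((L + G) kron I_n) zeta, with zeta = x - stacked x0
zeta = (x - x0).reshape(-1)
delta_global = np.kron(L + G, np.eye(n)) @ zeta

sigma_min = np.linalg.svd(L + G, compute_uv=False).min()
bound_holds = np.linalg.norm(zeta) <= np.linalg.norm(delta_global) / sigma_min + 1e-12
```

The bound follows because $\zeta = ((L+G) \otimes I_n)^{-1}\delta$ and the Kronecker product with $I_n$ leaves the singular values of $L+G$ unchanged apart from multiplicity.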

Cooperative multi-player games on graphs
We wish to achieve synchronization while simultaneously optimizing some performance specifications on the agents. To capture this, we intend to use the machinery of multi-player games (Basar, and Olsder, 1999).

Cooperative performance index
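Define the local performance indices
$$J_i(\delta_i(0), u_i, u_{-i}) = \frac{1}{2} \int_0^\infty \Big( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \Big)\, dt \tag{9}$$
where $u_{-i} = \{u_j : j \in N_i\}$ denotes the control inputs of the neighbors of node $i$, and all weighting matrices are constant and symmetric with $Q_{ii} > 0$, $R_{ii} > 0$, $R_{ij} \ge 0$. Note that the $i$-th performance index includes only information about the inputs of node $i$ and its neighbors.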
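For dynamics (8) with performance objectives (9), introduce the associated Hamiltonians
$$H_i(\delta_i, p_i, u_i, u_{-i}) \equiv p_i^T \Big( A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j \Big) + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} u_i^T R_{ii} u_i + \frac{1}{2} \sum_{j \in N_i} u_j^T R_{ij} u_j \tag{10}$$
where $p_i$ is the costate variable. Necessary conditions (Lewis, and Syrmos, 1995) for a minimum of (9) are (1) and
$$-\dot{p}_i = \frac{\partial H_i}{\partial \delta_i} \equiv A^T p_i + Q_{ii} \delta_i \tag{11}$$
$$0 = \frac{\partial H_i}{\partial u_i} \;\Rightarrow\; u_i = -(d_i + g_i) R_{ii}^{-1} B_i^T p_i \tag{12}$$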

Graphical games
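Interpreting the control inputs $u_i$, $u_j$ as state-dependent policies or strategies, the value function for node $i$ corresponding to those policies is
$$V_i(\delta_i(t)) = \frac{1}{2} \int_t^\infty \Big( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \Big)\, d\tau \tag{13}$$
When $V_i$ is finite, using Leibniz' formula, a differential equivalent to (13) is given in terms of the Hamiltonian function by the Bellman equation
$$H_i\Big(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i, u_{-i}\Big) \equiv \frac{\partial V_i^T}{\partial \delta_i} \Big( A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j \Big) + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} u_i^T R_{ii} u_i + \frac{1}{2} \sum_{j \in N_i} u_j^T R_{ij} u_j = 0 \tag{14}$$
(With a slight abuse of notation, the gradient is taken here as a column vector.) That is, solution of equation (14) serves as an alternative to evaluating the infinite integral (13) for finding the value associated with the current feedback policies. It is shown in the proof of Theorem 2 that (14) is a Lyapunov equation. According to (13) and (10) one equates $p_i = \partial V_i / \partial \delta_i$.
The local dynamics (8) and performance indices (9) only depend for each node $i$ on its own control actions and those of its neighbors. We call this a graphical game. It depends on the topology of the communication graph $G = (V, \mathcal{E})$. We assume throughout the chapter that the game is well-formed in the following sense.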
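Definition 2. The graphical game with local dynamics (8) and performance indices (9) is well-formed if [...].
The control objective of agent $i$ in the graphical game is to determine
$$V_i^*(\delta_i(t)) = \min_{u_i} \int_t^\infty \frac{1}{2} \Big( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \Big)\, d\tau \tag{15}$$
Employing the stationarity condition (12) (Lewis, and Syrmos, 1995) one obtains the control policies
$$u_i = u_i(V_i) \equiv -(d_i + g_i) R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} \tag{16}$$
The game defined in (15) corresponds to Nash equilibrium.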
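Definition 3. (Global Nash equilibrium) An $N$-tuple of policies $\{u_1^*, u_2^*, \ldots, u_N^*\}$ is said to constitute a global Nash equilibrium solution for an $N$-player game if, for all $i \in N$,
$$J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*) \tag{17}$$
where $u_{G-i}$ denotes the policies of all nodes other than $i$. The $N$-tuple of game values $\{J_1^*, J_2^*, \ldots, J_N^*\}$ is known as a Nash equilibrium outcome of the $N$-player game.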
The distributed multiplayer graphical game with local dynamics (8) and local performance indices (9) should be contrasted with standard multiplayer games (Abou-Kandil, Freiling, Ionescu, and Jank, 2003; Basar, and Olsder, 1999), which have centralized dynamics
$$\dot{z} = A z + \sum_{i=1}^N B_i u_i \tag{18}$$
where $z \in \mathbb{R}^n$ is the state, $u_i(t) \in \mathbb{R}^{m_i}$ is the control input for every player, and where the performance index of each player depends on the control inputs of all other players. In the graphical games, by contrast, each node's dynamics and performance index only depend on its own state, its control, and the controls of its immediate neighbors.
It is desired to study the distributed game on a graph defined by (15) with distributed dynamics (8). It is not clear in this scenario how global Nash equilibrium is to be achieved.
Graphical games have been studied in the computational intelligence community (Kakade, Kearns, Langford, and Ortiz, 2003; Kearns, Littman, and Singh, 2001; Shoham, and Leyton-Brown, 2009). A (nondynamic) graphical game has been defined there as a tuple $(G, U, v)$, with $G$ the graph, $U = U_1 \times \cdots \times U_N$ where $U_i$ is the set of actions available to node $i$, and $v = [v_1 \cdots v_N]^T$ a payoff vector, where $v_i$ is the payoff of node $i$. It is important to note that the payoff of node $i$ only depends on its own action and those of its immediate neighbors. The work on graphical games has focused on developing algorithms to find standard Nash equilibria for payoffs generally given in terms of matrices. Such algorithms are simplified in that they only have complexity on the order of the maximum node degree in the graph, not on the order of the number of players $N$.
Undirected graphs are studied, and it is assumed that the graph is connected.
The intention in this chapter is to provide online real-time adaptive methods for solving differential graphical games that are distributed in nature. That is, the control protocols and adaptive algorithms of each node are allowed to depend only on information about itself and its neighbors. Moreover, as the game solution is being learned, all node dynamics are required to be stable, until finally all the nodes synchronize to the state of the control node. These online methods are discussed in Section V.
The following notions are needed in the study of differential graphical games.
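Definition 4. (Shoham, and Leyton-Brown, 2009) Agent $i$'s best response to fixed policies $u_{-i}$ of his neighbors is the policy $u_i^*$ such that
$$J_i(u_i^*, u_{-i}) \le J_i(u_i, u_{-i}) \tag{19}$$
for all policies $u_i$ of agent $i$.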
For centralized multi-agent games, where the dynamics is given by (18) and the performance of each agent depends on the actions of all other agents, an equivalent definition of Nash equilibrium is that each agent is in best response to all other agents. In graphical games, if all agents are in best response to their neighbors, then all agents are in Nash equilibrium, as seen in the proof of Theorem 1.
However, a counterexample shows the problems with the definition of Nash equilibrium in graphical games. Consider the completely disconnected graph with empty edge set where each node has no neighbors. Then Definition 4 holds if each agent simply chooses his single-player optimal control solution $J_i^* = J_i(u_i^*)$, since, for the disconnected graph case one has
$$J_i(u_i) = J_i(u_i, u_{G-i}) = J_i(u_i, u_{G-i}'), \quad \forall i \tag{20}$$
for any choices of the two sets $u_{G-i}$, $u_{G-i}'$ of the policies of all the other nodes. That is, the value function of each node does not depend on the policies of any other nodes.
Note, however, that Definition 3 also holds, that is, the nodes are in a global Nash equilibrium. Pathological cases such as this counterexample cannot occur in the standard games with centralized dynamics (18), particularly because stabilizability conditions are usually assumed.

Interactive Nash equilibrium
The counterexample in the previous section shows that in pathological cases when the graph is disconnected, agents can be in Nash equilibrium, yet have no influence on each other's games. In such situations, the definition of coalition-proof Nash equilibrium (Shinohara, 2010) may also hold, that is, no set of agents has an incentive to break away from the Nash equilibrium and seek a new Nash solution among themselves.
To rule out such undesirable situations and guarantee that all agents in a graph are involved in the same game, we make the following stronger definition of global Nash equilibrium.
Definition 5. (Interactive global Nash equilibrium) An $N$-tuple of policies $\{u_1^*, u_2^*, \ldots, u_N^*\}$ is said to constitute an interactive global Nash equilibrium solution for an $N$-player game if, for all $i \in N$, the Nash condition (17) holds and in addition there exists a policy $u_k'$ such that
$$J_i(u_k^*, u_{G-k}^*) \neq J_i(u_k', u_{G-k}^*) \tag{21}$$
for all $i, k \in N$. That is, at equilibrium there exists a policy of every player $k$ that influences the performance of all other players $i$.
If the systems are in interactive Nash equilibrium, the graphical game is well-defined in the sense that all players are in a single Nash equilibrium with each player affecting the decisions of all other players. Condition (21) means that the reaction curve (Basar, and Olsder, 1999) of any player $i$ is not constant with respect to all variations in the policy of any other player $k$.
The next results give conditions under which the local best responses in Definition 4 imply the interactive global Nash of Definition 5.
Consider the systems (8) in closed loop with admissible feedbacks (12), (16), denoted by [...] for a single node $k$ and [...]. The global closed-loop dynamics are [...]. [...] where $[\cdot]_{ik}$ denotes the element $(i, k)$ of a matrix. That is, $M$ is the length of the shortest directed path from $k$ to $i$. Denote the nodes along this path by [...] where [...] and $[\cdot]_{ik}$ denotes the position of the block element in the block matrix.
[...] All shortest paths to node $i$ from node $k$ pass through a single neighbor of $i$. An example case where Assumption 1a holds is when there is a single shortest path from $k$ [...].
Lemma 2. Let $(A, B_j)$ be reachable for all $j \in N$ and let Assumption 1 hold. Then the $i$-th closed-loop system (22) is reachable from input $v_k$ if and only if there exists a directed path from node $k$ to node $i$.

Proof:
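Sufficiency. If $k = i$ the result is obvious. Otherwise, the reachability matrix from node $k$ to node $i$ has the $n \times m$ block element in block row $i$ and block column $k$ given as [...] where $*$ denotes nonzero entries. Under the assumptions, the matrix on the right has full row rank and the matrix on the left is written as [...].
Necessity. If there is no path from node $k$ to node $i$, then the control input of node $k$ cannot influence the state or value of node $i$.
■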

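Theorem 1. Let $(A, B_i)$ be reachable for all $i \in N$. Let every node $i$ be in best response to all his neighbors $j \in N_i$. Let Assumption 1 hold. Then all nodes in the graph are in interactive global Nash equilibrium if and only if the graph is strongly connected.
Proof: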
Let every node $i$ be in best response to all his neighbors $j \in N_i$. Then $J_i(u_i^*, u_{-i}^*) \le J_i(u_i, u_{-i}^*)$, $\forall i$, [...] and the nodes are in Nash equilibrium.
Necessity. If the graph is not strongly connected, then there exist nodes $k$ and $i$ such that there is no path from node $k$ to node $i$. Then, the control input of node $k$ cannot influence the state or the value of node $i$. Therefore, the Nash equilibrium is not interactive.

Sufficiency. Let $(A, B_i)$ be reachable for all $i \in N$. Then if there is a path from node $k$ to node $i$, the state $\delta_i$ is reachable from $u_k$, and from (9) input $u_k$ can change the value $J_i$. Strong connectivity means there is a path from every node $k$ to every node $i$ and condition (21) holds for all $i, k \in N$.

■
The reachability condition is sufficient but not necessary for Interactive Nash equilibrium.
According to the results just established, the following assumptions are made.

a. $(A, B_i)$ is reachable for all $i \in N$.
b. The graph is strongly connected and at least one pinning gain $g_i$ is nonzero. Then $(L + G)$ is nonsingular.
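The graph condition in assumption b can be checked mechanically. The sketch below is one possible way to do so, using a breadth-first search over the forward and reversed arcs; it assumes the convention that `E[i][j] > 0` encodes an arc from node `j` to node `i`, and the function names are hypothetical.

```python
from collections import deque


def reachable_from(adj, start):
    """Set of nodes reachable from `start` in the adjacency-list graph `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(E):
    """True if every node can reach every other node (E[i][j] > 0 is an arc j -> i)."""
    n = len(E)
    fwd = {j: [i for i in range(n) if E[i][j] > 0] for j in range(n)}
    rev = {i: [j for j in range(n) if E[i][j] > 0] for i in range(n)}
    # Node 0 reaches all nodes, and all nodes reach node 0.
    return len(reachable_from(fwd, 0)) == n and len(reachable_from(rev, 0)) == n


def satisfies_graph_assumption(E, g):
    """Strong connectivity plus at least one nonzero pinning gain."""
    return is_strongly_connected(E) and any(gi != 0 for gi in g)


cycle = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]   # 1 -> 2 -> 3 -> 1
chain = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]   # 1 -> 2 -> 3, no arc back to node 1
```

For the hypothetical chain, node 1 has no incoming arc, so no other agent can influence it and the interactive Nash condition cannot hold.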

Stability and solution of graphical games
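Substituting control policies (16) into (14) yields the coupled cooperative game Hamilton-Jacobi (HJ) equations
$$\frac{\partial V_i^T}{\partial \delta_i} A_i^c + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} (d_i + g_i)^2 \frac{\partial V_i^T}{\partial \delta_i} B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \frac{1}{2} \sum_{j \in N_i} (d_j + g_j)^2 \frac{\partial V_j^T}{\partial \delta_j} B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \delta_j} = 0, \quad i \in N \tag{25}$$
where the closed-loop matrix is
$$A_i^c = A\delta_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \sum_{j \in N_i} e_{ij} (d_j + g_j) B_j R_{jj}^{-1} B_j^T \frac{\partial V_j}{\partial \delta_j} \tag{26}$$
For a given $V_i$, define $u_i^* = u_i(V_i)$ as (16) given in terms of $V_i$. Then HJ equations (25) can be written as
$$H_i\Big(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i^*, u_{-i}^*\Big) = 0 \tag{27}$$
There is one coupled HJ equation corresponding to each node, so solution of this $N$-player game problem is blocked by requiring a solution to $N$ coupled partial differential equations.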
In the next sections we show how to solve this N-player cooperative game online in a distributed fashion at each node, requiring only measurements from neighbor nodes, by using techniques from reinforcement learning.
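It is now shown that the coupled HJ equations (25) can be written as coupled Riccati equations. For the global state $\delta$ given in (4) we can write the dynamics as
$$\dot{\delta} = (I_N \otimes A)\,\delta + ((L + G) \otimes I_n)\, \mathrm{diag}(B_i)\, u \tag{28}$$
where $u$ is the control given by
$$u = -\mathrm{diag}(R_{ii}^{-1} B_i^T)\, ((D + G) \otimes I_n)\, p \tag{29}$$
where $\mathrm{diag}(\cdot)$ denotes a diagonal matrix of appropriate dimensions. Furthermore the global costate dynamics are
$$-\dot{p} = \frac{\partial H}{\partial \delta} \equiv (I_N \otimes A)^T p + \mathrm{diag}(Q_{ii})\, \delta \tag{30}$$
This is a set of coupled dynamic equations reminiscent of standard multi-player games (Basar, and Olsder, 1999) or single agent optimal control (Lewis, and Syrmos, 1995). Therefore the solution can be written without any loss of generality as
$$p = P\delta \tag{31}$$
for some matrix $P > 0 \in \mathbb{R}^{nN \times nN}$.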
Lemma 3. HJ equations (25) are equivalent to the coupled Riccati equations [...] or equivalently, in closed-loop form, [...] where $P$ is defined by (31), and [...].
Proof: Take (14) and write it with respect to the global state and costate as [...]. By definition of the costate one has [...].
It is now shown that if solutions can be found for the coupled design equations ( 25), they provide the solution to the graphical game problem.
Theorem 2. Stability and Solution for Cooperative Nash Equilibrium.
Let Assumptions 1 and 2a hold. Let $V_i > 0 \in C^1$, $i \in N$ be smooth solutions to HJ equations (25) and control policies $u_i^*$, $i \in N$ be given by (16) in terms of these solutions $V_i$. Then
a. Systems (8) are asymptotically stable so all agents synchronize.
b. $\{u_1^*, u_2^*, \ldots, u_N^*\}$ are in global Nash equilibrium and the corresponding game values are $J_i^*(\delta_i(0)) = V_i$, $i \in N$.
Proof:
a. If $V_i > 0$ satisfies (25) then it also satisfies (14). Take the time derivative to obtain
$$\dot{V}_i = \frac{\partial V_i^T}{\partial \delta_i} \dot{\delta}_i = -\frac{1}{2} \Big( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j \Big)$$
which is negative definite since $Q_{ii} > 0$. Therefore $V_i$ is a Lyapunov function for $\delta_i$ and systems (8) are asymptotically stable.
b. According to part a, $\delta_i(t) \to 0$ for the selected control policies. For any smooth functions $V_i(\delta_i)$, $i \in N$, [...]. Let $u_i^*$, $u_{-i}^*$ be the optimal controls given by (16). By completing the squares one has [...]. Since this is true for all $i$, Nash condition (17) is satisfied.

■
The next result shows when the systems are in interactive Nash equilibrium. This means that the graphical game is well defined in the sense that all players are in a single Nash equilibrium with each player affecting the decisions of all other players.
Corollary 1. Let the hypotheses of Theorem 2 hold. Let Assumptions 1 and 2 hold so that the graph is strongly connected. Then $\{u_1^*, u_2^*, \ldots, u_N^*\}$ are in interactive Nash equilibrium and all agents synchronize.

Global and local performance objectives: Cooperation and competition
The overall objective of all the nodes is to ensure synchronization of all the states $x_i(t)$ to $x_0(t)$. The multi-player game formulation allows for considerable freedom of each agent while achieving this objective. Each agent has a performance objective that can embody team objectives as well as individual node objectives.
The performance objective of each node can be written as
$$J_i = \frac{1}{N} \sum_{j=1}^N J_j + \frac{1}{N} \sum_{j=1}^N (J_i - J_j) \equiv J_{team} + J_i^{conflict}$$
where $J_{team}$ is the overall ('center of gravity') performance objective of the networked team and $J_i^{conflict}$ is the conflict of interest or competitive objective. $J_{team}$ measures how much the players are vested in common goals, and $J_i^{conflict}$ expresses to what extent their objectives differ. The objective functions can be chosen by the individual players, or they may be assigned to yield some desired team behavior.
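A small numerical sketch may make the team and conflict terms concrete. It assumes the decomposition $J_i = \frac{1}{N}\sum_j J_j + \frac{1}{N}\sum_j (J_i - J_j)$; the cost values and the function name `team_and_conflict` are hypothetical.

```python
def team_and_conflict(J):
    """Split node costs J into a common team part and per-node conflict parts."""
    N = len(J)
    team = sum(J) / N                                        # average cost of the team
    conflict = [sum(Ji - Jj for Jj in J) / N for Ji in J]    # deviation of node i from the team
    return team, conflict


J = [4.0, 1.0, 1.0]            # hypothetical cost values for three nodes
team, conflict = team_and_conflict(J)
recombined = [team + c for c in conflict]
```

By construction the conflict terms sum to zero across the team, so a game in which all conflict terms vanish is one in which every node shares the same objective.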

Policy iteration algorithms for cooperative multi-player games
Reinforcement learning (RL) techniques have been used to solve the single-player optimal control problem online using adaptive learning techniques to determine the optimal value function. Especially effective are the approximate dynamic programming (ADP) methods (Werbos, 1974; Werbos, 1992). RL techniques have also been applied for multiplayer games with centralized dynamics (18). See for example (Busoniu, Babuska, and De Schutter, 2008; Vrancx, Verbeeck, and Nowe, 2008). Most applications of RL for solving optimal control problems or games online have been to finite-state systems or discrete-time dynamical systems. This section gives a policy iteration algorithm for solving continuous-time differential games on graphs. The structure of this algorithm is used in the next section to provide online adaptive solutions for graphical games.

Best response
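Theorem 2 and Corollary 1 reveal that, under Assumptions 1 and 2, the systems are in interactive Nash equilibrium if, for all $i \in N$, node $i$ selects his best response policy to his neighbors' policies and the graph is strongly connected. Define the best response HJ equation as the Bellman equation (14) with control $u_i = u_i^*$ given by (16) and arbitrary policies $u_{-i} = \{u_j : j \in N_i\}$,
$$0 = H_i\Big(\delta_i, \frac{\partial V_i}{\partial \delta_i}, u_i^*, u_{-i}\Big) \equiv \frac{\partial V_i^T}{\partial \delta_i} A_i^c + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} (d_i + g_i)^2 \frac{\partial V_i^T}{\partial \delta_i} B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} + \frac{1}{2} \sum_{j \in N_i} u_j^T R_{ij} u_j \tag{38}$$
where the closed-loop matrix is
$$A_i^c = A\delta_i - (d_i + g_i)^2 B_i R_{ii}^{-1} B_i^T \frac{\partial V_i}{\partial \delta_i} - \sum_{j \in N_i} e_{ij} B_j u_j \tag{39}$$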

Theorem 3. Solution for Best Response Policy
Given fixed neighbor policies $u_{-i} = \{u_j : j \in N_i\}$, assume there is an admissible policy $u_i$. Let $V_i > 0 \in C^1$ be a smooth solution to the best response HJ equation (38) and let control policy $u_i^*$ be given by (16) in terms of this solution $V_i$. Then
a. Systems (8) are asymptotically stable so that all agents synchronize.
b. $u_i^*$ is the best response to the fixed policies $u_{-i}$ of its neighbors.
Proof:
a. [...]
b. According to part a, $\delta_i(t) \to 0$ for the selected control policies. For any smooth functions $V_i(\delta_i)$, [...]. Let $V_i$ satisfy (38), $u_i^*$ be the optimal controls given by (16), and $u_{-i}$ be arbitrary policies. By completing the squares one has [...]. The agents are in best response to fixed policies $u_{-i}$ when [...]. Then clearly [...].
■

Policy iteration solution for graphical games
The following algorithm for the $N$-player distributed games is motivated by the structure of policy iteration algorithms in reinforcement learning (Bertsekas, and Tsitsiklis, 1996; Sutton, and Barto, 1998), which rely on repeated policy evaluation (e.g. solution of (14)) and policy improvement (solution of (16)). These two steps are repeated until the policy improvement step no longer changes the present policy. If the algorithm converges for every $i$, then it converges to the solution to HJ equations (25), and hence provides the distributed Nash equilibrium. One must note that the costs can be evaluated only in the case of admissible control policies, admissibility being a condition for the control policy which initializes the algorithm.
Algorithm 1. Policy Iteration (PI) Solution for N-player distributed games.
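Step 0: Start with admissible initial policies $u_i^0$, $\forall i$.
Step 1: (Policy evaluation) Solve for $V_i^k$ using the Bellman equation (14),
$$H_i\Big(\delta_i, \frac{\partial V_i^k}{\partial \delta_i}, u_i^k, u_{-i}^k\Big) = 0, \quad \forall i \tag{40}$$
Step 2: (Policy improvement) Update the $N$-tuple of control policies using (16),
$$u_i^{k+1} = -(d_i + g_i) R_{ii}^{-1} B_i^T \frac{\partial V_i^k}{\partial \delta_i}, \quad \forall i \tag{41}$$
Go to Step 1.
On convergence, End.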

■
The following two theorems prove convergence of the policy iteration algorithm for distributed games for two different cases. The two cases considered are the following: i) only agent $i$ updates its policy and ii) all the agents update their policies.
Theorem 4. Convergence of the policy iteration algorithm when only the $i$-th agent updates its policy and all players $u_{-i}$ in its neighborhood do not change. Given fixed neighbors' policies $u_{-i}$, assume there exists an admissible policy $u_i$. Assume that agent $i$ performs Algorithm 1 and its neighbors do not update their control policies. Then the algorithm converges to the best response $u_i$ to policies $u_{-i}$ of the neighbors and to the solution $V_i$ of the best response HJ equation (38).
Proof: Using the next control policy $u_i^{k+1}$ and the current policies $u_{-i}^k$ one has the orbital derivative (Leake, and Liu, 1967) [...]. From (42) and (43) one has [...]. Because only agent $i$ updates its control it is true that [...] and by integration it follows that [...]. The algorithm therefore converges to $V_i^*$, the solution of the best response HJ equation (38).

■
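The evaluation and improvement steps can be illustrated in a special case that is simple enough to check by hand. The sketch below assumes that the neighbors' policies are held fixed at zero and that the value is quadratic, $V_i = \frac{1}{2}\delta_i^T P \delta_i$, so that the Bellman equation becomes a Lyapunov equation and the best response HJ equation (38) becomes an algebraic Riccati equation with input matrix $(d_i + g_i)B_i$. This is an offline, model-based illustration in the style of Kleinman's iteration, not the online algorithm of the chapter; the double-integrator data, the weights, the initial gain, and the function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def lyap(Ac, Qbar):
    """Solve Ac^T P + P Ac + Qbar = 0 for P by Kronecker vectorization."""
    n = Ac.shape[0]
    I = np.eye(n)
    M = np.kron(I, Ac.T) + np.kron(Ac.T, I)
    return np.linalg.solve(M, -Qbar.reshape(-1)).reshape(n, n)


def best_response_pi(A, Bbar, Q, R, K0, iters=15):
    """Policy iteration for node i with u_i = -K delta_i and neighbor inputs fixed at zero."""
    K = K0
    for _ in range(iters):
        Ac = A - Bbar @ K                        # closed-loop error dynamics
        P = lyap(Ac, Q + K.T @ R @ K)            # policy evaluation (Lyapunov equation)
        K = np.linalg.solve(R, Bbar.T @ P)       # policy improvement, cf. (16)
    return P, K


A = np.array([[0., 1.], [0., 0.]])               # hypothetical double integrator
B = np.array([[0.], [1.]])
d_plus_g = 2.0                                   # hypothetical value of d_i + g_i
Bbar = d_plus_g * B
Q, R = np.eye(2), np.eye(1)
K0 = np.array([[1., 1.]])                        # admissible (stabilizing) initial gain

P, K = best_response_pi(A, Bbar, Q, R, K0)
# Residual of the Riccati equation that (38) reduces to in this special case.
residual = A.T @ P + P @ A - P @ Bbar @ np.linalg.solve(R, Bbar.T @ P) + Q
```

The initial gain must be admissible: with a non-stabilizing `K0` the Lyapunov equation in the evaluation step no longer returns the cost of the policy and the iteration is not justified.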
The next result concerns the case where all nodes update their policies at each step of the algorithm. Define the relative control weighting as $\bar{\sigma}(R_{jj}^{-1} R_{ij})$, the maximum singular value of $R_{jj}^{-1} R_{ij}$.
Theorem 5. Convergence of the policy iteration algorithm when all agents update their policies. Assume all nodes $i$ update their policies at each iteration of PI. Then for small enough edge weights $e_{ij}$ and relative control weightings, $u_i$ converges to the global Nash equilibrium for all $i$, and the values converge to the optimal game values [...].

Proof:
It is clear that [...] and so [...]. By continuity, it holds for small values of $e_{ij}$, [...].

■
This proof indicates that for the PI algorithm to converge, the neighbors' controls should not unduly influence the $i$-th node dynamics (8), and the $j$-th node should weight its own control $u_j$ in its performance index $J_j$ relatively more than node $i$ weights $u_j$ in $J_i$. These requirements are consistent with selecting the weighting matrices to obtain proper performance in the simulation examples. An alternative condition for convergence in Theorem 5 is that the norm $\|B_j\|$ should be small. This is similar to the case of weakly coupled dynamics in multi-player games in (Basar, and Olsder, 1999).

Online solution of multi-agent cooperative games using neural networks
In this section an online algorithm for solving cooperative Hamilton-Jacobi equations (25) based on (Vamvoudakis, and Lewis, 2011) is presented. This algorithm uses the structure in the PI Algorithm 1 to develop an actor/critic adaptive control architecture for approximate online solution of (25). Approximate solutions of (40), (41) are obtained using value function approximation (VFA). The algorithm uses two approximator structures at each node, which are taken here as neural networks (NN) (Abu-Khalaf, and Lewis, 2005; Bertsekas, and Tsitsiklis, 1996; Vamvoudakis, and Lewis, 2010; Werbos, 1974; Werbos, 1992). One critic NN is used at each node for value function approximation, and one actor NN at each node to approximate the control policy (41). The critic NN seeks to solve Bellman equation (40). We give tuning laws for the actor NN and the critic NN such that equations (40) and (41) are solved simultaneously online for each node. Then, the solutions to the coupled HJ equations (25) are determined. Though these coupled HJ equations are difficult to solve, and may not even have analytic solutions, we show how to tune the NN so that the approximate solutions are learned online. The next assumption is made.
Assumption 2. For each admissible control policy the nonlinear Bellman equations (14), (40) have smooth solutions $V_i \ge 0$.
In fact, only local smooth solutions are needed. To solve the Bellman equations (40), approximation is required of both the value functions $V_i$ and their gradients $\partial V_i / \partial \delta_i$. This requires approximation in Sobolev space (Abu-Khalaf, and Lewis, 2005).

Critic neural network
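According to the Weierstrass higher-order approximation theorem (Abu-Khalaf, and Lewis, 2005) there are NN weights $W_i$ such that the smooth value functions $V_i$ are approximated using a critic NN as
$$V_i(\delta_i) = W_i^T \phi_i(z_i) + \varepsilon_i \tag{47}$$
where $z_i(t)$ is an information vector constructed at node $i$ using locally available measurements, e.g. $\delta_i(t)$, $\{\delta_j(t) : j \in N_i\}$, $\phi_i(z_i)$ is the critic NN activation function vector, and $\varepsilon_i$ is the NN approximation error. With current weight estimates $\hat{W}_i$, the critic NN output is
$$\hat{V}_i = \hat{W}_i^T \phi_i(z_i) \tag{48}$$
Then, the Bellman equation (40) can be approximated at each step $k$ as
$$H_i(\delta_i, \hat{W}_i, u_i, u_{-i}) \equiv \hat{W}_i^T \frac{\partial \phi_i}{\partial \delta_i} \Big( A\delta_i + (d_i + g_i) B_i u_i - \sum_{j \in N_i} e_{ij} B_j u_j \Big) + \frac{1}{2} \delta_i^T Q_{ii} \delta_i + \frac{1}{2} u_i^T R_{ii} u_i + \frac{1}{2} \sum_{j \in N_i} u_j^T R_{ij} u_j = e_{H_i} \tag{49}$$
It is desired to select $\hat{W}_i$ to minimize the square residual error
$$\frac{1}{2} e_{H_i}^T e_{H_i} \tag{50}$$
Then $\hat{W}_i \to W_i$, which solves (49) in a least-squares sense, and $e_{H_i}$ becomes small. Theorem 6 gives a tuning law for the critic weights that achieves this.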

Action neural network and online learning
Define the control policy in the form of an action neural network which computes the control input (41) in the structured form [...] (51), where $\hat{W}_{i+N}$ denotes the current estimated values of the ideal actor NN weights $W_i$. The notation $\hat{u}_{i+N}$ is used to keep indices straight in the proof. Define the critic and actor NN estimation errors as $\tilde{W}_i = W_i - \hat{W}_i$ and $\tilde{W}_{i+N} = W_i - \hat{W}_{i+N}$.
The next results show how to tune the critic NN and actor NN in real time at each node so that equations (40) and (41) are simultaneously solved, while closed-loop system stability is also guaranteed. Simultaneous solution of (40) and (41) guarantees that the coupled HJ equations (25) are solved for each node $i$.
[...] Select the tuning law for the $i$-th critic NN as [...] (52) where [...], and the tuning law for the $i$-th actor NN as [...] (53) where [...] are tuning parameters.
Theorem 6. Let the error dynamics be given by (8), and consider the cooperative game formulation in (15). Let the critic NN at each node be given by (48) and the control input be given for each node by actor NN (51). Let the tuning law for the $i$-th critic NN be provided by (52) and the tuning law for the $i$-th actor NN be provided by (53). Assume [...] is persistently exciting. Then the closed-loop system states $\delta_i(t)$, the critic NN errors $\tilde{W}_i$, and the actor NN errors $\tilde{W}_{i+N}$ are uniformly ultimately bounded.

■
Remark 1. Theorem 6 provides algorithms for tuning the actor/critic networks of the $N$ agents at the same time to guarantee stability and make the system errors $\delta_i(t)$ small and the NN approximation errors bounded. Small errors guarantee synchronization of all the node trajectories.
Remark 2. Persistence of excitation is needed for proper identification of the value functions by the critic NNs, and nonstandard tuning algorithms are required for the actor NNs to guarantee stability. It is important to notice that the actor NN tuning law of every agent needs information of the critic weights of all his neighbors, while the critic NN tuning law of every agent needs information of the actor weights of all his neighbors.
Remark 3. NN usage suggests starting with random, nonzero control NN weights in (51) in order to converge to the coupled HJ equation solutions. However, extensive simulations show that convergence is more sensitive to the persistence of excitation in the control inputs than to the NN weight initialization. If the proper persistence of excitation is not selected, the control weights may not converge to the correct values.
Remark 4. The issue of which inputs $z_i(t)$ to use for the critic and actor NNs needs to be addressed. According to the dynamics (8), the value functions (13), and the control inputs (16), the NN inputs at node $i$ should consist of its own state, the states of its neighbors, and the costates of its neighbors. However, in view of (31) the costates are functions of the states.
In view of the approximation capabilities of NN, it is found in simulations that it is suitable to take as the NN inputs at node i its own state and the states of its neighbors.
The next result shows that the tuning laws given in Theorem 6 guarantee approximate solution to the coupled HJ equations (25) and convergence to the Nash equilibrium.
Theorem 7. Convergence to Cooperative Nash Equilibrium.
[...] $\hat{u}_{i+N}$ converge to the approximate cooperative Nash equilibrium (Definition 3) for every $i$.

Proof:
The proof is similar to (Vamvoudakis, 2011) but is done only with respect to the neighbors (local information) of each agent and not with respect to all agents.
Consider the weights $\hat{W}_i$, $\hat{W}_{i+N}$ to be UUB as proved in Theorem 6.
a. The approximate coupled HJ equations are [...], where $\varepsilon_{HJ_i}$, $\forall i$, are the residual errors due to approximation.
After adding zero we have [...] (55). After taking norms in (55) and letting [...], one obtains (56). All the signals on the right-hand side of (56) are UUB and convergence to the approximate coupled HJ solution is obtained for every agent.

Simulation results
This section shows the effectiveness of the online approach described in Theorem 6 for two different cases.
Consider the three-node strongly connected digraph structure shown in Figure 1 with a leader node connected to node 3. The edge weights and the pinning gains are taken equal to 1 so that [...]. Select the weight matrices in (9) as [...]. In the examples below, every node is a second-order system. Then, for every agent [...]. According to the graph structure, the information vector at each node is [...]. Since the value is quadratic, the critic NNs basis sets were selected as the quadratic vector in the agent's components and its neighbors' components. [...]
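A quadratic basis of this kind can be sketched as follows. The example assumes that the information vector stacks the two error components of a node with those of a single neighbor, which gives a four-dimensional input and ten distinct quadratic terms; the ordering of the terms, the sample values, and the function names are hypothetical and need not match the basis used in the simulations.

```python
from itertools import combinations_with_replacement


def quadratic_basis(z):
    """All distinct products z[a] * z[b] with a <= b, in lexicographic index order."""
    return [z[a] * z[b] for a, b in combinations_with_replacement(range(len(z)), 2)]


def critic_value(W_hat, z):
    """Critic output W_hat^T phi(z) for the quadratic basis phi."""
    return sum(w * p for w, p in zip(W_hat, quadratic_basis(z)))


z = [1.0, 2.0, 3.0, 4.0]     # hypothetical information vector [own errors, neighbor errors]
phi = quadratic_basis(z)
v = critic_value([1.0] * len(phi), z)
```

An information vector of length $m$ yields $m(m+1)/2$ basis terms, so the number of critic weights grows quadratically with the number of neighbors whose states are included.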

Position and velocity regulated to zero
For the graph structure shown, consider the node dynamics [...] and the command generator [...]. The graphical game is implemented as in Theorem 6. Persistence of excitation was ensured by adding a small exponentially decreasing probing noise to the control inputs. Figure 2 shows the convergence of the critic parameters for every agent. Figure 3 shows the evolution of the states for the duration of the experiment.

All the nodes synchronize to the curve behavior of the leader node
For the graph structure shown above consider the following node dynamics [...] with target generator [...]. The command generator is marginally stable with poles at $s = \pm j$, so it generates a sinusoidal reference trajectory. The graphical game is implemented as in Theorem 6. Persistence of excitation was ensured by adding a small exponentially decreasing probing noise to the control inputs. Figure 4 shows the critic parameters converging for every agent. Figure 5 shows the synchronization of all the agents to the leader's behavior as given by the circular Lissajous plot.
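The behavior of a command generator with poles at $s = \pm j$ can be illustrated with a short sketch. It assumes the generator matrix $A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$, which is one matrix with these poles and not necessarily the one used in the simulation; the function name `leader_state` and the initial condition are hypothetical.

```python
import math


def leader_state(x0, t):
    """Exact solution of x0' = [[0, 1], [-1, 0]] x0, i.e. a rotation of the initial state."""
    c, s = math.cos(t), math.sin(t)
    return (c * x0[0] + s * x0[1], -s * x0[0] + c * x0[1])


x_init = (1.0, 0.5)                      # hypothetical initial leader state
radius0 = math.hypot(*x_init)
# The trajectory stays on a circle in the phase plane.
radii = [math.hypot(*leader_state(x_init, 0.1 * k)) for k in range(100)]
quarter_turn = leader_state((1.0, 0.0), math.pi / 2)
```

Since the matrix exponential of this generator is a rotation, the norm of the leader state is constant, which is consistent with a circular Lissajous plot for agents that have synchronized to the leader.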

Conclusion
This chapter brings together cooperative control, reinforcement learning, and game theory to solve multi-player differential games on communication graph topologies. It formulates graphical games for dynamic systems and provides policy iteration and online learning algorithms along with proof of convergence to the Nash equilibrium or best response. Simulation results show the effectiveness of the proposed algorithms.

[...] are the critic NN activation function vectors, with $h$ the number of neurons in the critic NN hidden layer. According to the Weierstrass theorem, the NN approximation error $\varepsilon_i$ converges to zero uniformly as $h \to \infty$. Assuming current weight estimates $\hat{W}_i$, the outputs of the critic NN are given by [...]

Fig. 3. Evolution of the system states and regulation.

Fig. 5. Synchronization of all the agents to the leader node.

[...] If there is no path from node $k$ to node $i$, then the control input of node $k$ cannot influence the state or value of node $i$.
■
Theorem 1. Let $(A, B_i)$ be reachable for all $i \in N$. Let every node $i$ be in best response to all his neighbors $j \in N_i$. Let Assumption 1 hold. Then all nodes in the graph are in interactive global Nash equilibrium if and only if the graph is strongly connected.
[...] are solved for each node $i$. System (8) is said to be uniformly ultimately bounded (UUB) if there exists a compact set [...]. Thus the NN activation functions are [...]