Open access peer-reviewed chapter

Distributionally Robust Optimization

Written By

Jian Gao, Yida Xu, Julian Barreiro-Gomez, Massa Ndong, Michalis Smyrnakis and Hamidou Tembine

Submitted: 18 November 2017 Reviewed: 22 March 2018 Published: 05 September 2018

DOI: 10.5772/intechopen.76686

From the Edited Volume

Optimization Algorithms - Examples

Edited by Jan Valdman

Chapter metrics overview

1,699 Chapter Downloads

View Full Metrics

Abstract

This chapter presents a class of distributionally robust optimization problems in which a decision-maker has to choose an action in an uncertain environment. The decision-maker has a continuous action space and aims to learn her optimal strategy. The true distribution of the uncertainty is unknown to the decision-maker. This chapter provides alternative ways to select a distribution based on empirical observations of the decision-maker. This leads to a distributionally robust optimization problem. Simple algorithms, whose dynamics are inspired from the gradient flows, are proposed to find local optima. The method is extended to a class of optimization problems with orthogonal constraints and coupled constraints over the simplex set and polytopes. The designed dynamics do not use the projection operator and are able to satisfy both upper- and lower-bound constraints. The convergence rate of the algorithm to generalized evolutionarily stable strategy is derived using a mean regret estimate. Illustrative examples are provided.

Keywords

  • distribution robustness
  • gradient flow
  • Bregman divergence
  • Wasserstein metric
  • f-divergence

1. Introduction

Robust optimization can be defined as the process of determining the best or most effective result, utilizing a quantitative measurement system under worst case uncertain functions or parameters. The optimization may occur in terms of best robust design, net cash flows, profits, costs, benefit/cost ratio, quality-of-experience, satisfaction, end-to-end delay, completion time, etc. Other measurement units may be used, such as units of production or production time, and optimization may occur in terms of maximizing production units, minimizing processing time, production time, maximizing profits, or minimizing costs under uncertain parameters. There are numerous techniques of robust optimization methods such as robust linear programming, robust dynamic programming, robust geometric programming, queuing theory, risk analysis, etc. One of the main drawbacks of the robust optimization is that the worst scenario may be too conservative. The bounds provided by the worst case scenarios may not be useful in many interesting problems (see the wireless communication example provided below). However, distributionally robust optimization is not based on the worst case parameters. The distributional robustness method is based the probability distribution instead of worst parameters. The worse case distribution within a certain carefully designed distributional uncertainty set may provide interesting features. Distributionally robust programming can be used not only to provide a distributionally robust solution to a problem when the true distribution is unknown, but it also can, in many instances, give a general solution taking into account some risk. The presented methodology is simple and reduces significantly the dimensionality of the distributionally robust optimization. We hope that the designs of distributionally robust programming presented here can help designers, engineers, cost–benefit analyst, managers to solve concrete problems under unknown distribution.

The rest of the chapter is organized as follows. Section 2 presents some preliminary concepts of distributionally robust optimization. A class of constrained distributionally robust optimization problems are presented in Section 3. Section 4 focuses on distributed distributionally robust optimization. Afterwards, illustrative examples in distributed power networks and in wireless communication networks are provided to evaluate the performance of the method. Finally, prior works and concluding remarks are drawn in Section 5.

Notation: Let R, R+, denote the set of real and non-negative real numbers, respectively, Ωd be a separable completely metrizable topological space with d:Ω×ΩR+ a metric (distance). Let PΩ be the set of all probability measures over Ω.

Advertisement

2. Distributionally robust optimization

This section introduces distributionally robust optimization models. We will first present a generic formulation of the problem. Then, individual components of the optimization and their solvability issues via equivalent formulations will be discussed.

2.1. Model

Consider a decision-maker who wants to select an action aARn in order to optimize her objective raω, where ω is an uncertain parameter. The information structure is the following:

  • The true distribution of ω is not known to the decision-maker.

  • The upper/lower bound (if any) of ω are unknown to the decision-maker.

  • The decision-maker can measure/observe realization of the random variable ω.

The decision-maker chooses to experiment several trials and obtains statistical realizations of ω from measurements. The measurement data can be noisy, imperfect and erroneous. Then, an empirical distribution (or histogram) m is built from the realizations of ω. However, m is not the true distribution of the random variable ω, and m may not be a reliable measure due to statistical, bias, measurement, observation or computational errors. Therefore, the decision-maker is facing a risk. The risk-sensitive decision-maker should decide action that improves the performance of Em˜raω among alternative distributions m˜ within a certain level of deviation ρ>0 from the distribution m. The distributionally robust optimization problem is therefore formulated as

supaAinfm˜BρmEωm˜raω.E1

where Bρm is the uncertainty set of alternative admissible distributions from m within a certain radius ρ>0. Different distributional uncertainty sets are presented: the f-divergence and the Wasserstein metric, defined below.

2.1.1. f-divergence

We introduce the notion of f divergence which will be used to compute the discrepancy between probability distributions.

Definition 1. Let m and m˜ be two probability measures over Ω such that m is absolutely continuous with respect to m˜. Let f be a convex function. Then, the f-divergence between m and m˜ is defined as follows:

Dfmm˜Ωfdmdm˜dm˜f1,

where dmdm˜ is the Radon-Nikodym derivative of the measure m with the respect the measure m˜.

By Jensen’s inequality:

Dfmm˜=Ωfdmdm˜dm˜f1fΩdmdm˜dm˜f1=fΩdmf1=f1f1=0.E2

Thus, Dfmm˜0 for any convex function f. Note however that, the f divergence Dfmm˜ is not a distance (for example, it does not satisfy the symmetry property). Here the distributional uncertainty set imposed to the alternative distribution m˜ is given by

Bρm=m˜m˜.0Ωdm˜=m˜Ω=1Dfm˜mρ.

Example 1. From the notion of f divergence one can derive the following important concept:

  • α-divergence for

fa=4α+11α1aα+12ifα1+1,alogaifα=1,logaifα=1,
  • In particular, Kullback–Leibler divergence (or relative entropy) is retrieved as α goes to 1.

2.1.2. Wasserstein metric

The Wasserstein metric between two probability distributions m˜ and m is defined as follows:

Definition 2. For m,m˜PΩ, let Πm˜m be the set of all couplings between m and m˜. That is,

πPΩ×ΩπA×Ω=mAπΩ×B=m˜BABB2Ω.

BΩ denotes the measurable sets of Ω. Let θ1. The Wasserstein metric between m˜ and m is defined as

Wθm˜m=infπΠm˜mdLπθ=infπΠm˜mabdθabπdadb,

It is well-known that for every θ1, Wθm˜m is a true distance in the sense that it satisfies the following three axioms:

  • positive-definiteness,

  • the symmetry property,

  • the triangle inequality.

Note that m˜ is not necessarily absolutely continuous with respect to m. Now the distributional uncertainty/constraint set is the set of all possible probability distributions within a Lθ-Wasserstein distance below ρ.

B˜ρm=m˜Ωdm˜=m˜Ω=1Wθm˜mρ,

Note that, if m is a random measure (obtained from a sampled realization), we use the expected value of the Wasserstein metric.

Example 2. The Lθ-Wasserstein distance between two Dirac measures δω0 and δω˜0 is Wθδω0δω˜0=dω0ω˜o. More generally, for K2, the L2-Wasserstein distance between empirical measures μK=1Kk=1Kδωk and νK=1Kk=1Kδω˜k is W22μKνK1Ki=1Kωkω˜k2.

We have defined Bρm and B˜ρm. The goal now is to solve (1) under both f divergence and Wasserstein metric. One of the difficulties of the problem is the curse of dimensionality. The distributionally robust optimization problem (1) of the decision-maker is an infinite-dimensional robust optimization problem because Bρ is of infinite dimensions. Below we will show that (1) can be transformed into an optimization in the form of supinfsup. The latter problem has three alternating terms. Solving this problem requires a triality theory.

2.2. Triality theory

We first present the duality gap and develop a triality theory to solve equivalent formulations of (1). Consider uncoupled domains Ai,i1,2,3. For a general function r2, one has

supa2A2infa1A1r2a1a2infa1A1supa2A2r2a1a2

and the difference

mina1A1maxa2A2r2a1a2maxa2A2mina1A1r2a1a2,

is called duality gap. As it is widely known in duality theory from Sion’s Theorem [1] (which is an extension of von Neumann minimax Theorem) the duality gap vanishes, for example for convex-concave function, and the value is achieved by a saddle point in the case of non-empty convex compact domain.

Triality theory focuses on optimization problems of the forms: supinfsup or infsupinf. The term triality is used here because there are three key alternating terms in these optimizations.

Proposition 1. Let a1a2a3r3a1a2a3R be a function defined on the product space i=13Ai. Then, the following inequalities hold:

supa2A2infa1A1,a3A3r3a1a2a3infa3A3supa2A2infa1A1r3a1a2a3infa1A1,a3A3supa2A2r3a1a2a3,E3

and similarly

supa1A1,a3A3infa2A2r3a1a2a3supa3A3infa2A2supa1A1r3a1a2a3infa2A2supa1A1,a3A3r3a1a2a3.E4

Proof. Define

ĝa2a3infa1A1r3a1a2a3.

Thus, for all a2,a3, one has ĝa2a3r3a1a2a3. It follows that, for any a1,a3,

supa2A2ĝa2a3supa2A2r3a1a2a3.

Using the definition of ĝ, one obtains

supa2A2infa1A1r3a1a2a3supa2A2r3a1a2a3,a1,a3.

Taking the infimum in a1 yields:

supa2A2infa1A1r3a1a2a3infa1A1supa2A2r3a1a2a3,a3E5

Now, we use two operations for the variable a3:

  • Taking the infimum in the inequality (5) in a3 yields

infa3A3supa2A2infa1A1r3a1a2a3infa3A3infa1A1supa2A2r3a1a2a3
=infa1a3A1×A3supa2A2r3a1a2a3,

which proves the second part of the inequalities (3). The first part of the inequalities (3) follows immediately from (5).

  • Taking the supremum in inequality (5) in a3 yields

supa2a3A2×A3infa1A1r3a1a2a3supa3A3infa1A1supa2A2r3a1a2a3,

which proves the first part of the inequalities (4). The second part of the inequalities (4) follows immediately from (5).

This completes the proof.

2.3. Equivalent formulations

Below we explain how the dimensionality of problem (1) can be significantly reduced using a representation by means of the triality theory inequalities of Proposition 1.

2.3.1. f-divergence

Interestingly, the distributionally robust optimization problem (1) under f-divergence is equivalent to the finite dimensional stochastic optimization problem (when A are of finite dimensions). To see this, the original problem need to be transformed. Let us introduce the likelihood functional Lω˜=dm˜dmω˜, and set

Lρm=Lω˜fLω˜dmf1ρω˜Lω˜dmω˜=1.

Then, the Lagrangian of the problem is

r˜aLλμ=ω˜raω˜Lω˜dmω˜λρ+f1ω˜fLω˜dmω˜μ1ω˜Lω˜dmω˜,

where λ0 and μR. The problem becomes

supainfLLρmsupλ0,μRr˜aLλμ.E6

A full understanding of problem 6 requires a triality theory (not a duality theory). The use of triality theory leads to the following equation:

supaAinfm˜BρmEm˜r=supaA,λ0,μREmh,E7

where h is the integrand function λρ+f1μλfr+μλ, where f is Legendre-Fenchel transform of f defined by

fξ=supLLξfL=infLfLLξ.E8

Note that the righthand side of (7) is of dimension n+2, which reduces considerably the dimensionality of the original problem (1).

2.3.2. Wasserstein metric

Similarly, the distributionally robust optimization problem under Wasserstein metric is equivalent to the finite dimensional stochastic optimization problem (when A is a set of finite dimension). If the function ωraω is upper semi-continuous and Ωd is a Polish space then the Wasserstein distributionally robust optimization problem is equivalent to

supaAinfm˜B˜ρmEm˜r=supaAsupλ0Emh˜,h˜=λρθ+μ+supω̂Ωraωμλdθωω̂;E9

The next subsection presents algorithms for computing a distributionally robust solution from the equivalent formulations above.

2.4. Learning algorithms

Learning algorithms are crucial for finding approximate solutions to optimization and control problems. They are widely used for seeking roots/kernel of a function and for finding feasible solutions to variational inequalities. Practically, a learning algorithm generates a certain trajectory (or a set of trajectories) toward a potential approximate solution. Selecting a learning algorithm that has specific properties such as better accuracy, more stability, less-oscillatory and quick convergence is a challenging task [2, 3, 4, 5]. From the calculus of variations point of view, however, a learning algorithm generates curves. Therefore, selecting an algorithm among the others leads to an optimal control problem on the spaces of curves. Hence, it is natural to use optimal control theory to derive faster algorithms for a family of curves. Bergman-based algorithms and risk-aware version of it are introduced below to meet specific properties. We start by introducing the Bregman divergence.

Definition 3. The Bregman divergence dg:A×AR is defined on a differentiable strictly convex function g:AR. For two points abA2, it measures the gap between ga and the first-order Taylor expansion of g around a evaluated at b

dgabgagbgbab.

Example 3. From the Bregman divergence one gets other features by choosing specific functions g:

  • If ga=i=1nai2 then the Bregman divergence dgab=i=1naibi2 is the squared standard Euclidean distance.

  • If ga=i=1nailogai is defined on the relative interior of the simplex, i.e., abb01ni=1nbi=1 then the Bregman divergence dgab=i=1nailogaibi, is the Kullback–Leibler divergence.

We are now ready to define algorithms for solving the righthand side of (7) and (9). One of the key approaches for error quantification of the algorithm with respect to the distributionally robust optimum is the so-called average regret. When the regret vanishes one gets close to a distributionally robust optimum.

Definition 4. The average regret of an algorithm which generates the trajectory at=a˜tλtμt within t0T,t0>0 is

regretT1Tt0t0TmaxbA×R+×REmhbωEmhatωdt

2.4.1. Armijo gradient flow

Algorithm 1. The Armijo’s gradient pseudocode is as follows:

1: Procedure Armijo gradient a0ϵTgmh The Armijo’s gradient starting from a0 within 0T

2:       aa0

3:       while regret>ϵ and tT do We have the answer if regret is 0

4:              Compute at solution of (10)

5:              Compute regrett

6:       end while

7:       return at,regrett get a(t) and the regret

8: end procedure

Proposition 2. Let aEmhaω:Rn+2R be a concave function that has a unique global maximizer a. Assume that a be a feasible action profile, i.e., aA. Consider the continuous time analogue of the Armijo gradient flow [6], which is given by

ddtat=2g1.aEmhatω,a0=a0Rn+2,E10

where a0=a0 is the initial point of the algorithm and g is a strictly convex function on a. Let at be the solution to (10).

Then the average regret within t0T,t0>0 is bounded above by

regretT1Tt0t0TEmhaωhatωdtdgaa0logTt0Tt0.

Proof. Let

Wat=tEmhaωhatω+dgaat,

where a is solution to (10). The function W is positive and ddtW=EmhaωhatωtEmahaωgaa1Emahatω+ddtdgaat. By concavity of Emhaω one has

EmahaωaaEmhaωhaω,a.

On the other hand,

ddtdgaat=ȧgaagaaȧaa+gaȧ=gaaȧaa=Emahaωaa.E11

Hence,

ddtWEmahaωaatEmahaωgaa1EmahaωEmahaωaa=tEmahaωgaa1Emahaω0,E12

where the last inequality is by convexity of g. It follows that ddtWat0 along the path of the gradient flow. This decreasing property implies 0WatWa0=dgaa0. In particular, 0tEmhaωhaωWa0<+. Thus, the error to the value Emhaω is bounded by

0EmhaωhaωWa0t.

The announced result on the regret follows by integration over t0T and by averaging. This completes the proof.

Note that the above regret-bound is established without assuming strong convexity of aEmhaω. Also no Lipschitz continuity bound of the gradient is assumed.

2.4.2. Bregman learning algorithms

Algorithm 2. The Bregman learning pseudocode is as follows:

1: procedure Bregman a0ϵTgαβmh The Bregman learning starting from a0 within 0T

2:       aa0

3:       while regret>ϵ and tT do We have the answer if regret is 0

4:              Compute at solution of (13)

5:              Compute regrett

6:       end while

7:       return at,regrett get at and the regret

8: end procedure

Proposition 3. Let aEmhaω:Rn+2R be a concave function that has a unique global maximizer a. Assume that a be a feasible action profile, i.e., aA. Let α and β be two functions such that β̇teαt. Consider the following Bregman learning algorithm

ddtgaat+eαtȧt=eαt+βtaEmhatω,a0Rn+2,ȧ0Rn+2,E13

where a0 is the initial point of the algorithm and g is a strictly convex function on a. Let at be the solution to (13). Then the average regret within t0T,t0>0 is bounded above by

regretTc0Tt0t0Teβsds,E14

where c0dgaa0+eα0ȧ0)+eβ0Emhaωha0ω>0.

Proof. Let Waȧta=dgaat+eαtȧt+eβtEmhaωhatω. It is clear that W is positive. Moreover, ddtWatȧtta0 for β̇eα. Thus WatȧttaWa0ȧ00a=c0. By integration between t0T it follows

1Tt0t0TEmhaωhatωdtc0Tt0t0Teβsds.

This completes the proof.

In particular, for βs=s+es, one obtains an error bound to the minimum value as

c0t0teβsds=c0t0teseesds=c01eeett,

and for βs=s, the regret bound becomes

c0t0teβsds=c01ett.

Figure 1 illustrates the advantage of algorithm (13) compared with the gradient flow (10). It plots the regret bound c0Tt0t0Teβsds for β=s and dgaa0logTt0Tt0 with an initial gap of c0=25.

Figure 1.

Global regret bound under Bregman vs. gradient. The initial gap is c0=25.

The advantage of algorithms (10) and (13) is that it is not required to compute the Hessian of Emhaω as it is the case in the Newton scheme. As a corollary of Proposition 2 the regret vanishes as T grows. Thus, it is a no-regret algorithm. However, Algorithm (10) may not be sufficiently fast. Algorithm (13) provides a higher order convergence rate by carefully designing αβ. The average regret decays very quickly to zero [7]. However, it may generate an oscillatory trajectory with a big magnitude. The next subsection presents risk-aware algorithms that reduce the oscillatory phase of the trajectory.

2.4.3. Risk-aware Bregman learning algorithm

In order to reduce the oscillatory phase, we introduce a risk-aware Bregman learning algorithm [7] which is a speed-up-and-average version of (13) called mean dynamics m¯ of a given by

m¯=3tm¯¨eαα̇m¯¨+2tm¯̇+e2α+βtgm¯m¯1m¯+t+2eαm¯̇+teαm¯¨Ehm¯tm¯̇+m¯ω,E15

with starting vector m¯0=a0,m¯̇0,m¯¨0.

Algorithm 3. The risk-aware Bregman learning pseudocode is as follows:

1: procedure Risk-aware Bregman m¯0ϵTgαβmh The risk-aware Bregman learning starting from m¯0 within 0T

2:       m¯m¯0=a0,m¯̇0,m¯¨0

3:       while regret>ϵ and tT do We have the answer if regret is 0

4:             Compute m¯t solution of (15)

5:             Compute regret

6:       end while

7:       return m¯t,regrett get m¯t and the regret

8: end procedure

Proposition 4. The time-average trajectory of the learning algorithm (13) generates the mean dynamics (15).

Proof. We use the average relation m¯t=1t0tasds where a solves Eq. (13). From the definition of m¯, and by Hopital’s rule, m¯0=a0. Moreover, m¯t and at share the following equations:

at=m¯t+tm¯̇t,ȧt=2m¯̇t+tm¯¨t,a¨t=3m¯¨t+tm¯t.E16

Substituting these values in Eq. (13) yields the mean dynamics (15). This completes the proof.

The risk-aware Bregman dynamics (15) generates a less oscillatory trajectory due to its averaging nature. The next result provides an accuracy bound for (15).

Proposition 5. The risk-aware Bregman dynamics (15) satisfies

0Emhaωhm¯tωc0t0teβsds.

Proof. Let m¯t=1t0tasds. Then, m¯t=Ras1t1l0tsds. Thus, m¯t=Eμta where μt is the measure with density ts=1t1l0tds. By convexity of Emhaω we apply the Jensen’s inequality:

Emh1t0tasdsω=Emhm¯tω=EmhEμtaω
EμtEmhaω=1t0tEmhasωds.

In view of (14) one has

0EmhaωEmh1t0tasdsω
1t0tEmhaωEmhasωds
c01t0teβsds,
0EmhaωEmhm¯tωc0t0teβsds.

This completes the proof.

Definition 5. (Convergence time). Let δ>0 and at be the trajectory generated by Bregman algorithm starting from a0 at time t0. The convergence time to be within a ball BEmhaωδ of radius δ>0 from the center ra is given by

Tδ=inftEmhaωhatωδt>t0.

Proposition 6. Under the assumptions above, the error generated by the algorithm is at most (14) which means that it takes at most Tδ=β1logc0δ time units to the algorithm to be within a ball Braδ of radius δ>0 from the center Emhaω.

Proof. The proof is immediate. For δ>0 the average regret bound of Proposition 5,

regretTc0Tt0t0Teβsdsδ,E17

provides the announced convergence time bound. This completes the proof.

See Table 1 for detailed parametric functions on the bound Tδ.

Example 4. Let fy=ylogy defined on R+. Then, f1=0, and derivatives of f are fy=1+logy,fy=1y>0. The Legendre-Fenchel transform of f is fξ=y=eξ1. Let a1a2ga=a22, and a1a2ωra1a2ω=1+k=12ωik2ak2. The coefficient ω distribution is unknown but a sampled empirical measure m is considered to be similar to uniform distribution in 01 with 104 samples. We illustrate the quick convergence rate of the algorithm in a basic example and plot in Figure 2 the trajectories under standard gradient, Bregman dynamics and risk-aware Bregman dynamics (15). In particular, we observe that risk-aware Bregman dynamics (15) provides very quickly a satisfactory value. In this particular setup, we observe that the accuracy of the risk-aware Bregman algorithm (15) at t=0.5 will need four times (t=2) less than the standard Bregman algorithm to reach a similar level of error. It takes 40 times more t=20 than the gradient ascent to reach that level. Also, we observe that the risk-aware Bregman algorithm is less oscillatory and the amplitude decays very fast compared to the risk-neutral algorithm.

ConvergenceError boundTime-to-reach Tδ
Triple exponentialeeetc0logloglogc0δ
αt=t+et,βt=eet
Double exponential rateeetc0loglogc0δ
αt=t,βt=et
Exponential rateetc0logc0δ
αt=0,βt=t
Polynomial order kc0tkc01/kδ1/k
αt=logklogt,βt=klogt

Table 1.

Convergence rate under different set of functions.

Figure 2.

Gradient ascent vs. risk-aware Bregman dynamics for r=1+k=12ωk2ak2.

Advertisement

3. Constrained distributionally robust optimization

In the constrained case i.e., when A is a strict subset of Rn+2, algorithms (10) and (13) present some drawbacks: The trajectory at may not be feasible, i.e., atA×R+×R even when it starts in A. In order to design feasible trajectories, projected gradient has been widely studied in the literature. However, a projection into A at each time t involves additional optimization problems and the computation of the projected gradient adds extra complexity to the algorithm. We restrict our attention to the following constraints:

A=aRnala¯la¯ll1nl=1nclalb.

We impose the following feasibility condition: a¯l<a¯l,l1n,cl>0, l=1ncla¯l<b. Under this setting, the constraint set A is non-empty, convex and compact.

We propose a method to compute a constrained solution that has a full support (whenever it exists). We do not use the projection operator. Indeed we transform the domain a¯la¯l=ξ01 where ξxl=a¯lxl+a¯l1xl=al. ξ is a one-to-one mapping and

xl=ξ1al=ala¯la¯la¯l01.
l=1ncla¯la¯lxlbl=1ncla¯lb̂.

The algorithm (18)

ẏ=2g1aEmhaωf̂a,ala¯lxl+a¯l1xl,xl=min1eylk=1neykb̂cla¯la¯l,l1n,E18

generates a trajectory at that satisfies the constraint.

Algorithm 4. The constrained learning pseudocode is as follows:

1: procedure Constrained gradient a0ϵTgmh The constrained learning algorithm starting from a0 within 0T

2:       aa0

3:       while regret>ϵ and tT do We have the answer if regret is 0

4:             Compute at solution of (18)

5:             Compute regret

6:       end while

7:       return at,regrett get a(t) and the regret

8: end procedure

Proposition 7. If b̂minlcla¯la¯l then Algorithm (18) reduces to

ala¯lxl+a¯l1xl,ẋl=xlelf̂a1b̂lelf̂axlcla¯la¯l,l1nE19

Proof. It suffices to check that for b̂minlcla¯la¯l, the vector z defined by zl=eylk=1neyk solves the replicator equation,

żl=zlẏlzẏ.

Thus, xl=eylk=1neykb̂cla¯la¯l solves ẋl=xlelf̂a1b̂lelf̂axlcla¯la¯l. This completes the proof. 

Note that the dynamics of x in Eq. (19) is a constrained replicator dynamics [8] which is widely used in evolutionary game dynamics. This observation establishes a relationship between optimization and game dynamics and explains that the replicator dynamics is the gradient flow of the (expected payoff) under simplex constraint.

The next example illustrates a constrained distributionally robust optimization in wireless communication networks.

Example 5 (Wireless communication). Consider a power allocation problem over n medium access channels. The signal-to-interference-plus-noise ratio (SINR) is

SINRl=alωll2d2srlstl+ε2o2N0srl+Ilsrl,

where

  • N0>0 is the background noise.

  • The interference on channel l is denoted Il0. One typical model for Il is

Il=klakωkl2d2srlstk+ε2o2.

  • ϵ>0 is the height of the transmitter antenna.

  • ωll is the channel state at l. The channel state is unknown. Its true distribution is also unknown.

  • srl is the location of the receiver of l

  • stl is the location of the transmitter of l

  • o2,3,4 is the pathloss exponent.

  • al is the power allocated to channel l. It is assumed to be between a¯l0 and a¯l with 0a¯l<a¯l<+. Moreover, a total power budget constraint is imposed l=1nala¯ where a¯>l=1na¯l0.

It is worth mentioning that the action constraint of the power allocation problem are similar to the ones analyzed in Section 3. The admissible action space is

AaR+n:a¯lala¯ll=1nala¯.

Clearly, A is a non-empty convex compact set. The payoff function is the sum-rate raω=l=1nWllog1+SINRl where Wl>0. The mapping aωraω is continuously differentiable.

  • Robust optimization is too conservative: Part of the robust optimization problem [9, 7] consists of choosing the channel gain ωll20ω¯ll were the bound ω¯ need to be carefully designed. However the worst case is achieved when the channel gain is zero: infωl0ω¯llraω=0. Hence the robust performance is zero. This is too conservative as several realizations of the channel may give better performance than zero. Another way is to re-design the bounds ω¯ll and ω¯ll. But if ω¯ll>0 it means that very low channel gains are not allowed, which may be too optimistic. Below we use the distributional robust optimization approach which eliminates this design issue.

  • Distributional robust optimization: By means of the training sequence or channel estimation method, a certain (statistical) distribution m is derived. However m cannot be considered as the true distribution of the channel state due to estimation error. The true distribution of ω is unknown. Based on this observation, an uncertainty set Bρm with radius ρ0 is constructed for alternative distribution candidates. Note that ρ=0 means that B0m=m. The distributional robust optimization problem is supainfm˜BρmEm˜raω. In presence of interference, the function raω is not necessarily concave in a. In absence of interference, the problem becomes concave.

Advertisement

4. Distributed optimization

This section presents distributed distributionally robust optimization problems over a direct graph. A large number of virtual agents can potentially choose a node (vertex) subject to constraint. The vector a represents the population state. Since a has n components, the graph has n vertices. The interactions between virtual agents are interpreted as possible connections of the graph. Let us suppose that the current interactions are represented by a directed graph G=LE, where EL2 is the set of links representing the possible interaction among the proportion of agents, i.e., if lkE, then the component l of a can interact with the kth component of a. In other words, lkE means that virtual agents selecting the strategy lL could migrate to strategy kL. Moreover, Λ01n×n is the adjacency matrix of the graph G, and whose entries are λlk=1, if lkE; and λlk=0, otherwise.

Definition 6. The distributionally robust fitness function is the marginal distributionally robust payoff function. If aEmhaω is continuously differentiable, the distributionally robust fitness function is Emahaω.

Definition 7. The virtual population state a is an equilibrium if aA and it solves the variational inequality

ab,Emahaω0,bA.

Proposition 8. Let the set of virtual population state A be non-empty convex compact and bEmhbω be continuous. Then the following conditions are equivalent:

  • ab,Emhaω0,bA.

  • the action a satisfies a=projAa+ηEmhaω

Proof. Let a be a feasible action that solves the variational inequality:

ab,Emhaω0,bA.

Let η>0. By multiplying both sides by η, we obtain

ab,ηEmhaω0,bA.

We add the term aba to both sides to obtain the following relationships:

abηEmhaω0bA,abηEmhaω+abaababA,baa+ηEmhaω+aba0bA,baaa+ηEmhaω0bA,E20

Recall that the projection operator on a convex and closed set A is uniquely determined by

zRn,z=projAzzzbz0,bA.

Thus

baaa+ηEmhaω0,bAa=projAa+ηEmhaω.E21

This completes the proof.

As a consequence we can derive the following existence result.

Proposition 9. Let the set of virtual population states A be a non-empty convex compact and the mapping bEmhbω be continuous. Then, there exists at least one equilibrium in A.

Proof. A direct application of the Brouwer-Schauder’s fixed-point theorem which states that if ϕ:AA is continuous and A non-empty convex compact then ϕ has at least one fixed-point in A. Here we choose ϕa=projAa+ηEmhaω. Clearly ϕAA and ϕ is continuous on A as the mapping bEmhbω and the projection operator bprojAb are both continuous. Then the announced result follows. This completes the proof.

Note that we do not need sophisticated set-valued fixed-point theory to obtain this result.

Definition 8. The virtual population state a is evolutionarily stable if aA and for any alternative deviant state ba there is an invasion barrier ϵb>0 such that

ab,Emha+ϵbaω>0,ϵ0ϵb.

The function ϱ:A×Rn×R+n×nRn×n is the revision protocol, which describes how virtual agents are making decisions. The revision protocol ϱ takes a population state a, the corresponding fitness Emh, the adjacency matrix Λ and returns a matrix. Therefore, let ϱlkahΛ be the switching rate from the lth to kth component. Then, the virtual agents selecting the strategy lL have incentives to migrate to the strategy lL only if ϱlkahΛ>0, and it is also possible to design switch rates depending on the topology describing the migration constraints, i.e., λlk=0ϱlkahΛ=0. The distributed distributionally robust optimization consists to perform the optimization problem above over the distributed network that is subject to communication restriction. We construct a distributed distributionally robust game dynamics to perform such a task. The distributed distributionally robust evolutionary game dynamics emerge from the combination of the (robust) fitness h and the constrained switching rates ϱ. The evolution of the portion al is given by the distributed distributional robust mean dynamics

ȧl=kLakϱklahΛalkLϱlkahΛ,lL,E22

Since the distributionally robust function h is obtained after the transformation from payoff function r by means of triality theory, the dynamics (22) is seeking for distributed distributionally robust solution.

Algorithm 5. The distributed distributional robust mean dynamics pseudocode is as follows:

1: procedure Population-inspired algorithm a0ϵTϱgmhΛ The population-inspired learning starting from a0 within 0T

2:       aa0

3:       while regret>ϵ and tT do We have the answer if regret is 0

4:             Compute at solution of (22)

5:             Compute regrett

6:       end while

7:       return at,regrett get at and the regret

8: end procedure

The next example establishes evolutionarily stable state, equilibria and rest-point of the dynamics (22) by designing ϱ.

Example 6. Let us consider a power system that is composed of 10 generators, i.e., let L=110. Let alR+ be the power generated by the generator lL. Each power generation should satisfy the physical and/or operation constraints ala¯la¯l, for all lL. It is desired to satisfy the power demand given by dR, i.e., it is necessary to guarantee that lLal=d, i.e., the supply meets the demand. The objective is to minimize the generation quadratic costs for all the generators, i.e.,

Maximizeraω=lLrlal=lLc0l+c1lal+c2lal2,
s.t.lLal=d,a¯lala¯l,lL,

where r:RnR is concave, and the parameters are possibly uncertain and selected as c0l=25+6l, c1l=15+4l+ω1l, c2l=5+l+ω2l, and d=20+ω3l. Therefore, the fitness functions for the corresponding full potential game are given by fla=2alc2lc1l, for all lL, and action space is given by

A=aR+n:lLal=dala¯la¯l.

The distributed revision protocol is set to

ϱlkahΛ=λlkalmax0a¯kakmax0ala¯lmax0Emhkhl,

for al0. We evaluate four different scenarios, i.e.,

  1. a¯=0n and a¯=d1ln,

  2. a¯l=0, for all lL\910, a¯9=1.1, and a¯10=1; and a¯l=d, for all lL\12, a¯1=3, and a¯2=2.5,

  3. Case 1 constraints and with interaction restricted to the cycle graph G=LE with set of links E=lL\nll+1n1,

  4. Case 2 constraints and with interaction restricted as in Case 3.

Figure 3 presents the evolution of the generated power, the fitness functions corresponding to the marginal costs and the total cost. For the first scenario, the evolutionary game dynamics converge to a standard evolutionarily stable state in which f̂a=c1n. In contrast, for the second scenario, the dynamics converge to a constrained evolutionarily stable state.

Figure 3.

Economic power dispatch. Evolution of the population states (generated power), fitness functions f̂a=Ehaω, and the costs Eraω. Figures (a)-(c) for case 1, (d)-(f) for case 2, (g)-(i) for case 3, and (j)-(l) for case 4.

4.1. Extension to multiple decision-makers

Consider a constrained game G in strategic-form given by

  • P=1P is the set of players. The cardinality of P is P2.

  • Player p has a decision space ApRnp,np1. Players are coupled through their actions and their payoffs. The set of all feasible action profiles is ARn, with n=pPnp. Player p can choose an action ap in the set Apap=apAp:apapA.

  • Player p has a payoff function rp:AR.

We restrict our attention to the following constraints:

Ap=apRnpapla¯pla¯pll1npl=1npcplaplbp

The coupled constraint is

A=apAppPc¯papb¯.

Feasibility condition: If a¯pl<a¯pl,l1np,cpl>0, l=1npcpla¯pl<bp, c¯pR>0np and pPc¯pa¯p<b¯, the constraint set A is non-empty, convex and compact.

We propose a method to compute a constrained equilibrium that has a full support (whenever it exists). We do not use the projection operator. Indeed we transform the domain a¯pla¯pl=ξ01 where ξxpl=a¯plxpl+a¯pl1xpl=apl. ξ is a one-to-one mapping and

xpl=ξ1apl=apla¯pla¯pla¯pl01.
l=1npcpla¯pla¯plxplbpl=1npcpla¯plb̂p.

The learning algorithm (23) is

ẏp=p2g1aprpaω,apla¯plxpl+a¯pl1xpl,xpl=min1eyplk=1npeypkb̂pcpla¯pla¯pl,l1np,E23

generates a trajectory apt=apltl that satisfies the constraint of player p at any time t.

Advertisement

5. Notes

The work in [10] provides a nice intuitive introduction to robust optimization emphasizing the parallel with static optimization. Another nice treatment [11], focusing on robust empirical risk minimization problem, is designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance [12, 13]. f-divergence based performance evaluations are conducted in [11, 14, 15]. The connection between risk-sensitivity measures such as the exponentiated payoff and distributionally robustness can be found in [16]. Distributionally robust optimization and learning are extended to multiple strategic decision-making problems i.e., distributionally robust games in [17, 18].

Advertisement

Acknowledgments

We gratefully acknowledge support from U.S. Air Force Office of Scientific Research under grant number FA9550-17-1-0259.

References

  1. 1. Sion M. On general minimax theorems. Pacific Journal of Mathematics. 1958;8(1):171-176
  2. 2. Bach FR. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization. 2015;1(25):115-129
  3. 3. Kim D, Fessler JA. Optimized first-order methods for smooth convex minimization. Mathematical Programming. 2016;159(1):81-107
  4. 4. Nesterov Y. Accelerating the cubic regularization of newton’s method on convex problems. Mathematical Programming. 2008;112(1):159-181
  5. 5. Nesterov Y. Primal-dual subgradient methods for convex problems. Mathematical Programming. 2009;120(1):221-259
  6. 6. Larry A. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics. Tokyo Japan: International Academic Printing Co., Ltd.; 1966;16(1):1-3
  7. 7. Tembine H. Distributed Strategic Learning for Wireless Engineers. Boca Raton, FL, USA: CRC Press, Inc.; 2012. p. 496
  8. 8. Taylor PD, Jonker L. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences. 1978;40:145-156
  9. 9. Ben-Tal A, Ghaoui LE, Nemirovski A. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press; August 30, 2009. 576 p. ISBN-10: 0691143684, ISBN-13: 978-0691143682
  10. 10. Esfahani PM, Kuhn D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. In: Mathematical Programming. Jul 2017
  11. 11. Duchi JC, Namkoong H. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems. 2015;2208-2216
  12. 12. Ben-Tal A, den Hertog D, Waegenaere AD, Melenberg B, Rennen G. Robust solutions of optimization problems affected by uncertain probabilities. Management Science. Catonsville, USA: Informs PubsOnLine; 2013;59(2):341-357
  13. 13. Ben-Tal A, Hazan E, Koren T, Mannor S. Oracle-based robust optimization via online learning. Operations Research. 2015;63(3):628-638
  14. 14. Ahmadi-Javid A. An information-theoretic approach to constructing coherent risk measures. IEEE International Symposium on Information Theory Proceedings. 2011;2125-2127
  15. 15. Ahmadi-Javid A. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications. 2012;155(3):1105-1123
  16. 16. Föllmer H, Knispel T. Entropic risk measures: Coherence vs. convexity, model ambiguity, and robust large deviations. Stochastics Dynamics. 2011;11:333-351
  17. 17. Bauso D, Gao J, Tembine H. Distributionally robust games: F-divergence and learning. ValueTools, International Conference on Performance Evaluation Methodologies and Tools, Venice, Italy; December 5-7, 2017
  18. 18. Tembine H. Dynamic robust games in mimo systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). Aug 2011;41(4):990-1002

Written By

Jian Gao, Yida Xu, Julian Barreiro-Gomez, Massa Ndong, Michalis Smyrnakis and Hamidou Tembine

Submitted: 18 November 2017 Reviewed: 22 March 2018 Published: 05 September 2018