Open access peer-reviewed chapter

# Distributionally Robust Optimization

By Jian Gao, Yida Xu, Julian Barreiro-Gomez, Massa Ndong, Michalis Smyrnakis and Hamidou Tembine

Submitted: November 18th 2017Reviewed: March 22nd 2018Published: September 5th 2018

DOI: 10.5772/intechopen.76686

Downloaded: 586

## Abstract

This chapter presents a class of distributionally robust optimization problems in which a decision-maker has to choose an action in an uncertain environment. The decision-maker has a continuous action space and aims to learn her optimal strategy. The true distribution of the uncertainty is unknown to the decision-maker. This chapter provides alternative ways to select a distribution based on empirical observations of the decision-maker. This leads to a distributionally robust optimization problem. Simple algorithms, whose dynamics are inspired from the gradient flows, are proposed to find local optima. The method is extended to a class of optimization problems with orthogonal constraints and coupled constraints over the simplex set and polytopes. The designed dynamics do not use the projection operator and are able to satisfy both upper- and lower-bound constraints. The convergence rate of the algorithm to generalized evolutionarily stable strategy is derived using a mean regret estimate. Illustrative examples are provided.

### Keywords

• distribution robustness
• gradient flow
• Bregman divergence
• Wasserstein metric
• f-divergence

## 1. Introduction

Robust optimization can be defined as the process of determining the best or most effective result, utilizing a quantitative measurement system under worst case uncertain functions or parameters. The optimization may occur in terms of best robust design, net cash flows, profits, costs, benefit/cost ratio, quality-of-experience, satisfaction, end-to-end delay, completion time, etc. Other measurement units may be used, such as units of production or production time, and optimization may occur in terms of maximizing production units, minimizing processing time, production time, maximizing profits, or minimizing costs under uncertain parameters. There are numerous techniques of robust optimization methods such as robust linear programming, robust dynamic programming, robust geometric programming, queuing theory, risk analysis, etc. One of the main drawbacks of the robust optimization is that the worst scenario may be too conservative. The bounds provided by the worst case scenarios may not be useful in many interesting problems (see the wireless communication example provided below). However, distributionally robust optimization is not based on the worst case parameters. The distributional robustness method is based the probability distribution instead of worst parameters. The worse case distribution within a certain carefully designed distributional uncertainty set may provide interesting features. Distributionally robust programming can be used not only to provide a distributionally robust solution to a problem when the true distribution is unknown, but it also can, in many instances, give a general solution taking into account some risk. The presented methodology is simple and reduces significantly the dimensionality of the distributionally robust optimization. We hope that the designs of distributionally robust programming presented here can help designers, engineers, cost–benefit analyst, managers to solve concrete problems under unknown distribution.

The rest of the chapter is organized as follows. Section 2 presents some preliminary concepts of distributionally robust optimization. A class of constrained distributionally robust optimization problems are presented in Section 3. Section 4 focuses on distributed distributionally robust optimization. Afterwards, illustrative examples in distributed power networks and in wireless communication networks are provided to evaluate the performance of the method. Finally, prior works and concluding remarks are drawn in Section 5.

Notation: Let R, R+, denote the set of real and non-negative real numbers, respectively, Ωdbe a separable completely metrizable topological space with d:Ω×ΩR+a metric (distance). Let PΩbe the set of all probability measures over Ω.

## 2. Distributionally robust optimization

This section introduces distributionally robust optimization models. We will first present a generic formulation of the problem. Then, individual components of the optimization and their solvability issues via equivalent formulations will be discussed.

### 2.1. Model

Consider a decision-maker who wants to select an action aARnin order to optimize her objective raω,where ωis an uncertain parameter. The information structure is the following:

• The true distribution of ωis not known to the decision-maker.

• The upper/lower bound (if any) of ωare unknown to the decision-maker.

• The decision-maker can measure/observe realization of the random variable ω.

The decision-maker chooses to experiment several trials and obtains statistical realizations of ωfrom measurements. The measurement data can be noisy, imperfect and erroneous. Then, an empirical distribution (or histogram) mis built from the realizations of ω.However, mis not the true distribution of the random variable ω,and mmay not be a reliable measure due to statistical, bias, measurement, observation or computational errors. Therefore, the decision-maker is facing a risk. The risk-sensitive decision-maker should decide action that improves the performance of Em˜raωamong alternative distributions m˜within a certain level of deviation ρ>0from the distribution m.The distributionally robust optimization problem is therefore formulated as

supaAinfm˜BρmEωm˜raω.E1

where Bρmis the uncertainty set of alternative admissible distributions from mwithin a certain radius ρ>0.Different distributional uncertainty sets are presented: the f-divergence and the Wasserstein metric, defined below.

#### 2.1.1. f-divergence

We introduce the notion of fdivergence which will be used to compute the discrepancy between probability distributions.

Definition 1. Let mand m˜be two probability measures over Ωsuch that mis absolutely continuous with respect to m˜.Let fbe a convex function. Then, the f-divergence between mand m˜is defined as follows:

Dfmm˜Ωfdmdm˜dm˜f1,

where dmdm˜is the Radon-Nikodym derivative of the measure mwith the respect the measure m˜.

By Jensen’s inequality:

Dfmm˜=Ωfdmdm˜dm˜f1fΩdmdm˜dm˜f1=fΩdmf1=f1f1=0.E2

Thus, Dfmm˜0for any convex function f.Note however that, the fdivergence Dfmm˜is not a distance (for example, it does not satisfy the symmetry property). Here the distributional uncertainty set imposed to the alternative distribution m˜is given by

Bρm=m˜m˜.0Ωdm˜=m˜Ω=1Dfm˜mρ.

Example 1. From the notion of fdivergence one can derive the following important concept:

• α-divergence for

fa=4α+11α1aα+12ifα1+1,alogaifα=1,logaifα=1,
• In particular, Kullback–Leibler divergence (or relative entropy) is retrieved as αgoes to 1.

#### 2.1.2. Wasserstein metric

The Wasserstein metric between two probability distributions m˜and mis defined as follows:

Definition 2. For m,m˜PΩ,let Πm˜mbe the set of all couplings between mand m˜.That is,

πPΩ×ΩπA×Ω=mAπΩ×B=m˜BABB2Ω.

BΩdenotes the measurable sets of Ω.Let θ1.The Wasserstein metric between m˜and mis defined as

Wθm˜m=infπΠm˜mdLπθ=infπΠm˜mabdθabπdadb,

It is well-known that for every θ1,Wθm˜mis a true distance in the sense that it satisfies the following three axioms:

• positive-definiteness,

• the symmetry property,

• the triangle inequality.

Note that m˜is not necessarily absolutely continuous with respect to m.Now the distributional uncertainty/constraint set is the set of all possible probability distributions within a Lθ-Wasserstein distance below ρ.

B˜ρm=m˜Ωdm˜=m˜Ω=1Wθm˜mρ,

Note that, if mis a random measure (obtained from a sampled realization), we use the expected value of the Wasserstein metric.

Example 2. The Lθ-Wasserstein distance between two Dirac measures δω0and δω˜0is Wθδω0δω˜0=dω0ω˜o.More generally, for K2,the L2-Wasserstein distance between empirical measures μK=1Kk=1Kδωkand νK=1Kk=1Kδω˜kis W22μKνK1Ki=1Kωkω˜k2.

We have defined Bρmand B˜ρm.The goal now is to solve (1) under both fdivergence and Wasserstein metric. One of the difficulties of the problem is the curse of dimensionality. The distributionally robust optimization problem (1) of the decision-maker is an infinite-dimensional robust optimization problem because Bρis of infinite dimensions. Below we will show that (1) can be transformed into an optimization in the form of supinfsup.The latter problem has three alternating terms. Solving this problem requires a triality theory.

### 2.2. Triality theory

We first present the duality gap and develop a triality theory to solve equivalent formulations of (1). Consider uncoupled domains Ai,i1,2,3.For a general function r2,one has

supa2A2infa1A1r2a1a2infa1A1supa2A2r2a1a2

and the difference

mina1A1maxa2A2r2a1a2maxa2A2mina1A1r2a1a2,

is called duality gap. As it is widely known in duality theory from Sion’s Theorem [1] (which is an extension of von Neumann minimax Theorem) the duality gap vanishes, for example for convex-concave function, and the value is achieved by a saddle point in the case of non-empty convex compact domain.

Triality theory focuses on optimization problems of the forms: supinfsupor infsupinf.The term triality is used here because there are three key alternating terms in these optimizations.

Proposition 1. Let a1a2a3r3a1a2a3Rbe a function defined on the product space i=13Ai.Then, the following inequalities hold:

supa2A2infa1A1,a3A3r3a1a2a3infa3A3supa2A2infa1A1r3a1a2a3infa1A1,a3A3supa2A2r3a1a2a3,E3

and similarly

supa1A1,a3A3infa2A2r3a1a2a3supa3A3infa2A2supa1A1r3a1a2a3infa2A2supa1A1,a3A3r3a1a2a3.E4

Proof. Define

ĝa2a3infa1A1r3a1a2a3.

Thus, for all a2,a3,one has ĝa2a3r3a1a2a3.It follows that, for any a1,a3,

supa2A2ĝa2a3supa2A2r3a1a2a3.

Using the definition of ĝ,one obtains

supa2A2infa1A1r3a1a2a3supa2A2r3a1a2a3,a1,a3.

Taking the infimum in a1yields:

supa2A2infa1A1r3a1a2a3infa1A1supa2A2r3a1a2a3,a3E5

Now, we use two operations for the variable a3:

• Taking the infimum in the inequality (5) in a3yields

infa3A3supa2A2infa1A1r3a1a2a3infa3A3infa1A1supa2A2r3a1a2a3
=infa1a3A1×A3supa2A2r3a1a2a3,

which proves the second part of the inequalities (3). The first part of the inequalities (3) follows immediately from (5).

• Taking the supremum in inequality (5) in a3yields

supa2a3A2×A3infa1A1r3a1a2a3supa3A3infa1A1supa2A2r3a1a2a3,

which proves the first part of the inequalities (4). The second part of the inequalities (4) follows immediately from (5).

This completes the proof.

### 2.3. Equivalent formulations

Below we explain how the dimensionality of problem (1) can be significantly reduced using a representation by means of the triality theory inequalities of Proposition 1.

#### 2.3.1. f-divergence

Interestingly, the distributionally robust optimization problem (1) under f-divergence is equivalent to the finite dimensional stochastic optimization problem (when Aare of finite dimensions). To see this, the original problem need to be transformed. Let us introduce the likelihood functional Lω˜=dm˜dmω˜,and set

Lρm=Lω˜fLω˜dmf1ρω˜Lω˜dmω˜=1.

Then, the Lagrangian of the problem is

r˜aLλμ=ω˜raω˜Lω˜dmω˜λρ+f1ω˜fLω˜dmω˜μ1ω˜Lω˜dmω˜,

where λ0and μR.The problem becomes

supainfLLρmsupλ0,μRr˜aLλμ.E6

A full understanding of problem 6requires a triality theory (not a duality theory). The use of triality theory leads to the following equation:

supaAinfm˜BρmEm˜r=supaA,λ0,μREmh,E7

where his the integrand function λρ+f1μλfr+μλ,where fis Legendre-Fenchel transform of fdefined by

fξ=supLLξfL=infLfLLξ.E8

Note that the righthand side of (7) is of dimension n+2,which reduces considerably the dimensionality of the original problem (1).

#### 2.3.2. Wasserstein metric

Similarly, the distributionally robust optimization problem under Wasserstein metric is equivalent to the finite dimensional stochastic optimization problem (when Ais a set of finite dimension). If the function ωraωis upper semi-continuous and Ωdis a Polish space then the Wasserstein distributionally robust optimization problem is equivalent to

supaAinfm˜B˜ρmEm˜r=supaAsupλ0Emh˜,h˜=λρθ+μ+supω̂Ωraωμλdθωω̂;E9

The next subsection presents algorithms for computing a distributionally robust solution from the equivalent formulations above.

### 2.4. Learning algorithms

Learning algorithms are crucial for finding approximate solutions to optimization and control problems. They are widely used for seeking roots/kernel of a function and for finding feasible solutions to variational inequalities. Practically, a learning algorithm generates a certain trajectory (or a set of trajectories) toward a potential approximate solution. Selecting a learning algorithm that has specific properties such as better accuracy, more stability, less-oscillatory and quick convergence is a challenging task [2, 3, 4, 5]. From the calculus of variations point of view, however, a learning algorithm generates curves. Therefore, selecting an algorithm among the others leads to an optimal control problem on the spaces of curves. Hence, it is natural to use optimal control theory to derive faster algorithms for a family of curves. Bergman-based algorithms and risk-aware version of it are introduced below to meet specific properties. We start by introducing the Bregman divergence.

Definition 3. The Bregman divergence dg:A×ARis defined on a differentiable strictly convex function g:AR.For two points abA2, it measures the gap between gaand the first-order Taylor expansion of g around aevaluated at b

dgabgagbgbab.

Example 3. From the Bregman divergence one gets other features by choosing specific functions g:

• If ga=i=1nai2then the Bregman divergence dgab=i=1naibi2is the squared standard Euclidean distance.

• If ga=i=1nailogaiis defined on the relative interior of the simplex, i.e., abb01ni=1nbi=1then the Bregman divergence dgab=i=1nailogaibi,is the Kullback–Leibler divergence.

We are now ready to define algorithms for solving the righthand side of (7) and (9). One of the key approaches for error quantification of the algorithm with respect to the distributionally robust optimum is the so-called average regret. When the regret vanishes one gets close to a distributionally robust optimum.

Definition 4. The average regret of an algorithm which generates the trajectory at=a˜tλtμtwithin t0T,t0>0is

regretT1Tt0t0TmaxbA×R+×REmhbωEmhatωdt

#### 2.4.1. Armijo gradient flow

Algorithm 1. The Armijo’s gradient pseudocode is as follows:

1: Procedure Armijo gradienta0ϵTgmhThe Armijo’s gradient starting from a0within 0T

2:       aa0

3:       while regret>ϵand tTdo We have the answer if regret is 0

4:              Compute atsolution of (10)

5:              Compute regrett

6:       end while

7:       return at,regrettget a(t) and the regret

8: end procedure

Proposition 2. Let aEmhaω:Rn+2Rbe a concave function that has a unique global maximizer a.Assume that abe a feasible action profile, i.e., aA.Consider the continuous time analogue of the Armijo gradient flow [6], which is given by

ddtat=2g1.aEmhatω,a0=a0Rn+2,E10

where a0=a0is the initial point of the algorithm and gis a strictly convex function on a.Let atbe the solution to (10).

Then the average regret within t0T,t0>0is bounded above by

regretT1Tt0t0TEmhaωhatωdtdgaa0logTt0Tt0.

Proof. Let

Wat=tEmhaωhatω+dgaat,

where ais solution to (10). The function Wis positive and ddtW=EmhaωhatωtEmahaωgaa1Emahatω+ddtdgaat.By concavity of Emhaωone has

EmahaωaaEmhaωhaω,a.

On the other hand,

ddtdgaat=ȧgaagaaȧaa+gaȧ=gaaȧaa=Emahaωaa.E11

Hence,

ddtWEmahaωaatEmahaωgaa1EmahaωEmahaωaa=tEmahaωgaa1Emahaω0,E12

where the last inequality is by convexity of g.It follows that ddtWat0along the path of the gradient flow. This decreasing property implies 0WatWa0=dgaa0.In particular, 0tEmhaωhaωWa0<+.Thus, the error to the value Emhaωis bounded by

0EmhaωhaωWa0t.

The announced result on the regret follows by integration over t0Tand by averaging. This completes the proof.

Note that the above regret-bound is established without assuming strong convexity of aEmhaω.Also no Lipschitz continuity bound of the gradient is assumed.

#### 2.4.2. Bregman learning algorithms

Algorithm 2. The Bregman learning pseudocode is as follows:

1: procedure Bregman a0ϵTgαβmhThe Bregman learning starting from a0within 0T

2:       aa0

3:       while regret>ϵand tTdo We have the answer if regret is 0

4:              Compute atsolution of (13)

5:              Compute regrett

6:       end while

7:       return at,regrettget atand the regret

8: end procedure

Proposition 3. Let aEmhaω:Rn+2Rbe a concave function that has a unique global maximizer a.Assume that abe a feasible action profile, i.e., aA.Let αand βbe two functions such that β̇teαt.Consider the following Bregman learning algorithm

ddtgaat+eαtȧt=eαt+βtaEmhatω,a0Rn+2,ȧ0Rn+2,E13

where a0is the initial point of the algorithm and gis a strictly convex function on a.Let atbe the solution to (13). Then the average regret within t0T,t0>0is bounded above by

regretTc0Tt0t0Teβsds,E14

where c0dgaa0+eα0ȧ0)+eβ0Emhaωha0ω>0.

Proof. Let Waȧta=dgaat+eαtȧt+eβtEmhaωhatω.It is clear that Wis positive. Moreover, ddtWatȧtta0for β̇eα.Thus WatȧttaWa0ȧ00a=c0.By integration between t0Tit follows

1Tt0t0TEmhaωhatωdtc0Tt0t0Teβsds.

This completes the proof.

In particular, for βs=s+es,one obtains an error bound to the minimum value as

c0t0teβsds=c0t0teseesds=c01eeett,

and for βs=s,the regret bound becomes

c0t0teβsds=c01ett.

Figure 1 illustrates the advantage of algorithm (13) compared with the gradient flow (10). It plots the regret bound c0Tt0t0Teβsdsfor β=sand dgaa0logTt0Tt0with an initial gap of c0=25.

The advantage of algorithms (10) and (13) is that it is not required to compute the Hessian of Emhaωas it is the case in the Newton scheme. As a corollary of Proposition 2 the regret vanishes as Tgrows. Thus, it is a no-regret algorithm. However, Algorithm (10) may not be sufficiently fast. Algorithm (13) provides a higher order convergence rate by carefully designing αβ.The average regret decays very quickly to zero [7]. However, it may generate an oscillatory trajectory with a big magnitude. The next subsection presents risk-aware algorithms that reduce the oscillatory phase of the trajectory.

#### 2.4.3. Risk-aware Bregman learning algorithm

In order to reduce the oscillatory phase, we introduce a risk-aware Bregman learning algorithm [7] which is a speed-up-and-average version of (13) called mean dynamics m¯of agiven by

m¯=3tm¯¨eαα̇m¯¨+2tm¯̇+e2α+βtgm¯m¯1m¯+t+2eαm¯̇+teαm¯¨Ehm¯tm¯̇+m¯ω,E15

with starting vector m¯0=a0,m¯̇0,m¯¨0.

Algorithm 3. The risk-aware Bregman learning pseudocode is as follows:

1: procedure Risk-awareBregmanm¯0ϵTgαβmhThe risk-aware Bregman learning starting from m¯0within 0T

2:       m¯m¯0=a0,m¯̇0,m¯¨0

3:       while regret>ϵand tTdo We have the answer if regret is 0

4:             Compute m¯tsolution of (15)

5:             Compute regret

6:       end while

7:       return m¯t,regrettget m¯tand the regret

8: end procedure

Proposition 4. The time-average trajectory of the learning algorithm (13) generates the mean dynamics (15).

Proof. We use the average relation m¯t=1t0tasdswhere asolves Eq. (13). From the definition of m¯,and by Hopital’s rule, m¯0=a0.Moreover, m¯tand atshare the following equations:

at=m¯t+tm¯̇t,ȧt=2m¯̇t+tm¯¨t,a¨t=3m¯¨t+tm¯t.E16

Substituting these values in Eq. (13) yields the mean dynamics (15). This completes the proof.

The risk-aware Bregman dynamics (15) generates a less oscillatory trajectory due to its averaging nature. The next result provides an accuracy bound for (15).

Proposition 5. The risk-aware Bregman dynamics (15) satisfies

0Emhaωhm¯tωc0t0teβsds.

Proof. Let m¯t=1t0tasds.Then, m¯t=Ras1t1l0tsds.Thus, m¯t=Eμtawhere μtis the measure with density ts=1t1l0tds.By convexity of Emhaωwe apply the Jensen’s inequality:

Emh1t0tasdsω=Emhm¯tω=EmhEμtaω
EμtEmhaω=1t0tEmhasωds.

In view of (14) one has

0EmhaωEmh1t0tasdsω
1t0tEmhaωEmhasωds
c01t0teβsds,
0EmhaωEmhm¯tωc0t0teβsds.

This completes the proof.

Definition 5. (Convergence time). Let δ>0and atbe the trajectory generated by Bregman algorithm starting from a0at time t0.The convergence time to be within a ball BEmhaωδof radius δ>0from the center rais given by

Tδ=inftEmhaωhatωδt>t0.

Proposition 6. Under the assumptions above, the error generated by the algorithm is at most (14) which means that it takes at most Tδ=β1logc0δtime units to the algorithm to be within a ball Braδof radius δ>0from the center Emhaω.

Proof. The proof is immediate. For δ>0the average regret bound of Proposition 5,

regretTc0Tt0t0Teβsdsδ,E17

provides the announced convergence time bound. This completes the proof.

See Table 1 for detailed parametric functions on the bound Tδ.

Example 4. Let fy=ylogydefined on R+.Then, f1=0,and derivatives of fare fy=1+logy,fy=1y>0.The Legendre-Fenchel transform of fis fξ=y=eξ1.Let a1a2ga=a22,and a1a2ωra1a2ω=1+k=12ωik2ak2.The coefficient ωdistribution is unknown but a sampled empirical measure mis considered to be similar to uniform distribution in 01with 104samples. We illustrate the quick convergence rate of the algorithm in a basic example and plot in Figure 2 the trajectories under standard gradient, Bregman dynamics and risk-aware Bregman dynamics (15). In particular, we observe that risk-aware Bregman dynamics (15) provides very quickly a satisfactory value. In this particular setup, we observe that the accuracy of the risk-aware Bregman algorithm (15) at t=0.5will need four times (t=2) less than the standard Bregman algorithm to reach a similar level of error. It takes 40times more t=20than the gradient ascent to reach that level. Also, we observe that the risk-aware Bregman algorithm is less oscillatory and the amplitude decays very fast compared to the risk-neutral algorithm.

ConvergenceError boundTime-to-reach Tδ
Triple exponentialeeetc0logloglogc0δ
αt=t+et,βt=eet
Double exponential rateeetc0loglogc0δ
αt=t,βt=et
Exponential rateetc0logc0δ
αt=0,βt=t
Polynomial order kc0tkc01/kδ1/k
αt=logklogt,βt=klogt

### Table 1.

Convergence rate under different set of functions.

## 3. Constrained distributionally robust optimization

In the constrained case i.e., when Ais a strict subset of Rn+2,algorithms (10) and (13) present some drawbacks: The trajectory atmay not be feasible, i.e., atA×R+×Reven when it starts in A.In order to design feasible trajectories, projected gradient has been widely studied in the literature. However, a projection into Aat each time tinvolves additional optimization problems and the computation of the projected gradient adds extra complexity to the algorithm. We restrict our attention to the following constraints:

A=aRnala¯la¯ll1nl=1nclalb.

We impose the following feasibility condition: a¯l<a¯l,l1n,cl>0,l=1ncla¯l<b.Under this setting, the constraint set Ais non-empty, convex and compact.

We propose a method to compute a constrained solution that has a full support (whenever it exists). We do not use the projection operator. Indeed we transform the domain a¯la¯l=ξ01where ξxl=a¯lxl+a¯l1xl=al.ξis a one-to-one mapping and

xl=ξ1al=ala¯la¯la¯l01.
l=1ncla¯la¯lxlbl=1ncla¯lb̂.

The algorithm (18)

ẏ=2g1aEmhaωf̂a,ala¯lxl+a¯l1xl,xl=min1eylk=1neykb̂cla¯la¯l,l1n,E18

generates a trajectory atthat satisfies the constraint.

Algorithm 4. The constrained learning pseudocode is as follows:

1: procedure Constrained gradienta0ϵTgmhThe constrained learning algorithm starting from a0within 0T

2:       aa0

3:       while regret>ϵand tTdo We have the answer if regret is 0

4:             Compute atsolution of (18)

5:             Compute regret

6:       end while

7:       return at,regrettget a(t) and the regret

8: end procedure

Proposition 7. If b̂minlcla¯la¯lthen Algorithm (18) reduces to

ala¯lxl+a¯l1xl,ẋl=xlelf̂a1b̂lelf̂axlcla¯la¯l,l1nE19

Proof. It suffices to check that for b̂minlcla¯la¯l,the vector zdefined by zl=eylk=1neyksolves the replicator equation,

żl=zlẏlzẏ.

Thus, xl=eylk=1neykb̂cla¯la¯lsolves ẋl=xlelf̂a1b̂lelf̂axlcla¯la¯l.This completes the proof.

Note that the dynamics of xin Eq. (19) is a constrained replicator dynamics [8] which is widely used in evolutionary game dynamics. This observation establishes a relationship between optimization and game dynamics and explains that the replicator dynamics is the gradient flow of the (expected payoff) under simplex constraint.

The next example illustrates a constrained distributionally robust optimization in wireless communication networks.

Example 5 (Wireless communication). Consider a power allocation problem over nmedium access channels. The signal-to-interference-plus-noise ratio (SINR) is

SINRl=alωll2d2srlstl+ε2o2N0srl+Ilsrl,

where

• N0>0is the background noise.

• The interference on channel lis denoted Il0.One typical model for Ilis

Il=klakωkl2d2srlstk+ε2o2.

• ϵ>0is the height of the transmitter antenna.

• ωllis the channel state at l.The channel state is unknown. Its true distribution is also unknown.

• srlis the location of the receiver of l

• stlis the location of the transmitter of l

• o2,3,4is the pathloss exponent.

• alis the power allocated to channel l.It is assumed to be between a¯l0and a¯lwith 0a¯l<a¯l<+.Moreover, a total power budget constraint is imposed l=1nala¯where a¯>l=1na¯l0.

It is worth mentioning that the action constraint of the power allocation problem are similar to the ones analyzed in Section 3. The admissible action space is

AaR+n:a¯lala¯ll=1nala¯.

Clearly, Ais a non-empty convex compact set. The payoff function is the sum-rate raω=l=1nWllog1+SINRlwhere Wl>0.The mapping aωraωis continuously differentiable.

• Robust optimization is too conservative: Part of the robust optimization problem [9, 7] consists of choosing the channel gain ωll20ω¯llwere the bound ω¯need to be carefully designed. However the worst case is achieved when the channel gain is zero: infωl0ω¯llraω=0.Hence the robust performance is zero. This is too conservative as several realizations of the channel may give better performance than zero. Another way is to re-design the bounds ω¯lland ω¯ll.But if ω¯ll>0it means that very low channel gains are not allowed, which may be too optimistic. Below we use the distributional robust optimization approach which eliminates this design issue.

• Distributional robust optimization: By means of the training sequence or channel estimation method, a certain (statistical) distribution mis derived. However mcannot be considered as the true distribution of the channel state due to estimation error. The true distribution of ωis unknown. Based on this observation, an uncertainty set Bρmwith radius ρ0is constructed for alternative distribution candidates. Note that ρ=0means that B0m=m.The distributional robust optimization problem is supainfm˜BρmEm˜raω.In presence of interference, the function raωis not necessarily concave in a.In absence of interference, the problem becomes concave.

## 4. Distributed optimization

This section presents distributed distributionally robust optimization problems over a direct graph. A large number of virtual agents can potentially choose a node (vertex) subject to constraint. The vector arepresents the population state. Since ahas ncomponents, the graph has nvertices. The interactions between virtual agents are interpreted as possible connections of the graph. Let us suppose that the current interactions are represented by a directed graph G=LE, where EL2is the set of links representing the possible interaction among the proportion of agents, i.e., if lkE, then the component lof acan interact with the kth component of a. In other words, lkEmeans that virtual agents selecting the strategy lLcould migrate to strategy kL.Moreover, Λ01n×nis the adjacency matrix of the graph G, and whose entries are λlk=1, if lkE; and λlk=0, otherwise.

Definition 6. The distributionally robust fitness function is the marginal distributionally robust payoff function. If aEmhaωis continuously differentiable, the distributionally robust fitness function is Emahaω.

Definition 7. The virtual population state ais an equilibrium if aAand it solves the variational inequality

ab,Emahaω0,bA.

Proposition 8. Let the set of virtual population state Abe non-empty convex compact and bEmhbωbe continuous. Then the following conditions are equivalent:

• ab,Emhaω0,bA.

• the action asatisfies a=projAa+ηEmhaω

Proof. Let abe a feasible action that solves the variational inequality:

ab,Emhaω0,bA.

Let η>0.By multiplying both sides by η,we obtain

ab,ηEmhaω0,bA.

We add the term abato both sides to obtain the following relationships:

abηEmhaω0bA,abηEmhaω+abaababA,baa+ηEmhaω+aba0bA,baaa+ηEmhaω0bA,E20

Recall that the projection operator on a convex and closed set Ais uniquely determined by

zRn,z=projAzzzbz0,bA.

Thus

baaa+ηEmhaω0,bAa=projAa+ηEmhaω.E21

This completes the proof.

As a consequence we can derive the following existence result.

Proposition 9. Let the set of virtual population states Abe a non-empty convex compact and the mapping bEmhbωbe continuous. Then, there exists at least one equilibrium in A.

Proof. A direct application of the Brouwer-Schauder’s fixed-point theorem which states that if ϕ:AAis continuous and Anon-empty convex compact then ϕhas at least one fixed-point in A.Here we choose ϕa=projAa+ηEmhaω.Clearly ϕAAand ϕis continuous on Aas the mapping bEmhbωand the projection operator bprojAbare both continuous. Then the announced result follows. This completes the proof.

Note that we do not need sophisticated set-valued fixed-point theory to obtain this result.

Definition 8. The virtual population state ais evolutionarily stable if aAand for any alternative deviant state bathere is an invasion barrier ϵb>0such that

ab,Emha+ϵbaω>0,ϵ0ϵb.

The function ϱ:A×Rn×R+n×nRn×nis the revision protocol, which describes how virtual agents are making decisions. The revision protocol ϱtakes a population state a,the corresponding fitness Emh, the adjacency matrix Λand returns a matrix. Therefore, let ϱlkahΛbe the switching rate from the lthto kthcomponent. Then, the virtual agents selecting the strategy lLhave incentives to migrate to the strategy lLonly if ϱlkahΛ>0, and it is also possible to design switch rates depending on the topology describing the migration constraints, i.e., λlk=0ϱlkahΛ=0.The distributed distributionally robust optimization consists to perform the optimization problem above over the distributed network that is subject to communication restriction. We construct a distributed distributionally robust game dynamics to perform such a task. The distributed distributionally robust evolutionary game dynamics emerge from the combination of the (robust) fitness hand the constrained switching rates ϱ.The evolution of the portion alis given by the distributed distributional robust mean dynamics

ȧl=kLakϱklahΛalkLϱlkahΛ,lL,E22

Since the distributionally robust function his obtained after the transformation from payoff function rby means of triality theory, the dynamics (22) is seeking for distributed distributionally robust solution.

Algorithm 5. The distributed distributional robust mean dynamics pseudocode is as follows:

1: procedure Population-inspired algorithma0ϵTϱgmhΛThe population-inspired learning starting from a0within 0T

2:       aa0

3:       while regret>ϵand tTdo We have the answer if regret is 0

4:             Compute atsolution of (22)

5:             Compute regrett

6:       end while

7:       return at,regrettget atand the regret

8: end procedure

The next example establishes evolutionarily stable state, equilibria and rest-point of the dynamics (22) by designing ϱ.

Example 6. Let us consider a power system that is composed of 10generators, i.e., let L=110. Let alR+be the power generated by the generator lL. Each power generation should satisfy the physical and/or operation constraints ala¯la¯l, for all lL. It is desired to satisfy the power demand given by dR, i.e., it is necessary to guarantee that lLal=d,i.e., the supply meets the demand. The objective is to minimize the generation quadratic costs for all the generators, i.e.,

Maximizeraω=lLrlal=lLc0l+c1lal+c2lal2,
s.t.lLal=d,a¯lala¯l,lL,

where r:RnRis concave, and the parameters are possibly uncertain and selected as c0l=25+6l, c1l=15+4l+ω1l, c2l=5+l+ω2l, and d=20+ω3l. Therefore, the fitness functions for the corresponding full potential game are given by fla=2alc2lc1l, for all lL, and action space is given by

A=aR+n:lLal=dala¯la¯l.

The distributed revision protocol is set to

ϱlkahΛ=λlkalmax0a¯kakmax0ala¯lmax0Emhkhl,

for al0.We evaluate four different scenarios, i.e.,

1. a¯=0nand a¯=d1ln,

2. a¯l=0, for all lL\910, a¯9=1.1, and a¯10=1; and a¯l=d, for all lL\12, a¯1=3, and a¯2=2.5,

3. Case 1 constraints and with interaction restricted to the cycle graph G=LEwith set of links E=lL\nll+1n1,

4. Case 2 constraints and with interaction restricted as in Case 3.

Figure 3 presents the evolution of the generated power, the fitness functions corresponding to the marginal costs and the total cost. For the first scenario, the evolutionary game dynamics converge to a standard evolutionarily stable state in which f̂a=c1n. In contrast, for the second scenario, the dynamics converge to a constrained evolutionarily stable state.

### 4.1. Extension to multiple decision-makers

Consider a constrained game Gin strategic-form given by

• P=1Pis the set of players. The cardinality of Pis P2.

• Player phas a decision space ApRnp,np1.Players are coupled through their actions and their payoffs. The set of all feasible action profiles is ARn,with n=pPnp.Player pcan choose an action apin the set Apap=apAp:apapA.

• Player phas a payoff function rp:AR.

We restrict our attention to the following constraints:

Ap=apRnpapla¯pla¯pll1npl=1npcplaplbp

The coupled constraint is

A=apAppPc¯papb¯.

Feasibility condition: If a¯pl<a¯pl,l1np,cpl>0,l=1npcpla¯pl<bp,c¯pR>0npand pPc¯pa¯p<b¯,the constraint set Ais non-empty, convex and compact.

We propose a method to compute a constrained equilibrium that has a full support (whenever it exists). We do not use the projection operator. Indeed we transform the domain a¯pla¯pl=ξ01where ξxpl=a¯plxpl+a¯pl1xpl=apl.ξis a one-to-one mapping and

xpl=ξ1apl=apla¯pla¯pla¯pl01.
l=1npcpla¯pla¯plxplbpl=1npcpla¯plb̂p.

The learning algorithm (23) is

ẏp=p2g1aprpaω,apla¯plxpl+a¯pl1xpl,xpl=min1eyplk=1npeypkb̂pcpla¯pla¯pl,l1np,E23

generates a trajectory apt=apltlthat satisfies the constraint of player pat any time t.

## 5. Notes

The work in [10] provides a nice intuitive introduction to robust optimization emphasizing the parallel with static optimization. Another nice treatment [11], focusing on robust empirical risk minimization problem, is designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance [12, 13]. f-divergence based performance evaluations are conducted in [11, 14, 15]. The connection between risk-sensitivity measures such as the exponentiated payoff and distributionally robustness can be found in [16]. Distributionally robust optimization and learning are extended to multiple strategic decision-making problems i.e., distributionally robust games in [17, 18].

## Acknowledgments

We gratefully acknowledge support from U.S. Air Force Office of Scientific Research under grant number FA9550-17-1-0259.

## How to cite and reference

### Cite this chapter Copy to clipboard

Jian Gao, Yida Xu, Julian Barreiro-Gomez, Massa Ndong, Michalis Smyrnakis and Hamidou Tembine (September 5th 2018). Distributionally Robust Optimization, Optimization Algorithms - Examples, Jan Valdman, IntechOpen, DOI: 10.5772/intechopen.76686. Available from:

### chapter statistics

586total chapter downloads

### More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

### Related Content

#### Optimization Algorithms

Edited by Jan Valdman

Next chapter

#### Polyhedral Complementarity Approach to Equilibrium Problem in Linear Exchange Models

By Vadim I. Shmyrev

#### Applications from Engineering with MATLAB Concepts

Edited by Jan Valdman

First chapter

#### Digital Image Processing with MATLAB

By Mahmut Sinecen

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

View all Books