Open access peer-reviewed chapter

Distributionally Robust Optimization

Written By

Jian Gao, Yida Xu, Julian Barreiro-Gomez, Massa Ndong, Michalis Smyrnakis and Hamidou Tembine

Submitted: 18 November 2017 Reviewed: 22 March 2018 Published: 05 September 2018

DOI: 10.5772/intechopen.76686

From the Edited Volume

Optimization Algorithms - Examples

Edited by Jan Valdman


Abstract

This chapter presents a class of distributionally robust optimization problems in which a decision-maker has to choose an action in an uncertain environment. The decision-maker has a continuous action space and aims to learn her optimal strategy. The true distribution of the uncertainty is unknown to the decision-maker. This chapter provides alternative ways to select a distribution based on empirical observations of the decision-maker, which leads to a distributionally robust optimization problem. Simple algorithms, whose dynamics are inspired by gradient flows, are proposed to find local optima. The method is extended to a class of optimization problems with orthogonal constraints and coupled constraints over the simplex set and polytopes. The designed dynamics do not use the projection operator and are able to satisfy both upper- and lower-bound constraints. The convergence rate of the algorithm to a generalized evolutionarily stable strategy is derived using a mean regret estimate. Illustrative examples are provided.

Keywords

  • distribution robustness
  • gradient flow
  • Bregman divergence
  • Wasserstein metric
  • f-divergence

1. Introduction

Robust optimization can be defined as the process of determining the best or most effective result, using a quantitative measurement system, under worst-case uncertain functions or parameters. The optimization may occur in terms of best robust design, net cash flows, profits, costs, benefit/cost ratio, quality-of-experience, satisfaction, end-to-end delay, completion time, etc. Other measurement units may be used, such as units of production or production time, and optimization may occur in terms of maximizing production units or profits, or minimizing processing time, production time, or costs, under uncertain parameters. There are numerous robust optimization techniques, such as robust linear programming, robust dynamic programming, robust geometric programming, queuing theory, risk analysis, etc. One of the main drawbacks of robust optimization is that the worst-case scenario may be too conservative: the bounds provided by worst-case scenarios may not be useful in many interesting problems (see the wireless communication example provided below). Distributionally robust optimization, in contrast, is not based on the worst-case parameters; it is based on the probability distribution of the parameters instead. The worst-case distribution within a carefully designed distributional uncertainty set may provide interesting features. Distributionally robust programming can be used not only to provide a distributionally robust solution to a problem when the true distribution is unknown, but it can also, in many instances, give a general solution that takes some risk into account. The methodology presented here is simple and significantly reduces the dimensionality of the distributionally robust optimization problem. We hope that the designs of distributionally robust programming presented here can help designers, engineers, cost-benefit analysts, and managers solve concrete problems under unknown distributions.

The rest of the chapter is organized as follows. Section 2 presents some preliminary concepts of distributionally robust optimization. A class of constrained distributionally robust optimization problems is presented in Section 3. Section 4 focuses on distributed distributionally robust optimization. Afterwards, illustrative examples in distributed power networks and in wireless communication networks are provided to evaluate the performance of the method. Finally, prior works and concluding remarks are drawn in Section 5.

Notation: Let $\mathbb{R}$ and $\mathbb{R}_+$ denote the sets of real and non-negative real numbers, respectively, and let $(\Omega, d)$ be a separable completely metrizable topological space with a metric (distance) $d:\Omega\times\Omega\to\mathbb{R}_+$. Let $\mathcal{P}(\Omega)$ be the set of all probability measures over $\Omega$.


2. Distributionally robust optimization

This section introduces distributionally robust optimization models. We will first present a generic formulation of the problem. Then, individual components of the optimization and their solvability issues via equivalent formulations will be discussed.

2.1. Model

Consider a decision-maker who wants to select an action $a\in A\subset\mathbb{R}^n$ in order to optimize her objective $r(a,\omega)$, where $\omega$ is an uncertain parameter. The information structure is the following:

  • The true distribution of ω is not known to the decision-maker.

  • The upper/lower bounds (if any) of $\omega$ are unknown to the decision-maker.

  • The decision-maker can measure/observe realizations of the random variable $\omega$.

The decision-maker runs several experimental trials and obtains statistical realizations of $\omega$ from measurements. The measurement data can be noisy, imperfect and erroneous. Then, an empirical distribution (or histogram) $m$ is built from the realizations of $\omega$. However, $m$ is not the true distribution of the random variable $\omega$, and $m$ may not be a reliable measure due to statistical, bias, measurement, observation or computational errors. Therefore, the decision-maker is facing a risk. A risk-sensitive decision-maker should choose an action that improves the performance $E_{\tilde m}[r(a,\omega)]$ among alternative distributions $\tilde m$ within a certain level of deviation $\rho>0$ from the distribution $m$. The distributionally robust optimization problem is therefore formulated as

$$\sup_{a\in A}\ \inf_{\tilde m\in B_\rho(m)} E_{\omega\sim\tilde m}\big[r(a,\omega)\big], \tag{1}$$

where $B_\rho(m)$ is the uncertainty set of alternative admissible distributions within a certain radius $\rho>0$ of $m$. Two families of distributional uncertainty sets are presented: those based on the $f$-divergence and those based on the Wasserstein metric, both defined below.

2.1.1. f-divergence

We introduce the notion of $f$-divergence, which will be used to quantify the discrepancy between probability distributions.

Definition 1. Let $m$ and $\tilde m$ be two probability measures over $\Omega$ such that $m$ is absolutely continuous with respect to $\tilde m$. Let $f$ be a convex function. Then, the $f$-divergence between $m$ and $\tilde m$ is defined as follows:

$$D_f(m\,\|\,\tilde m) \triangleq \int_\Omega f\!\left(\frac{dm}{d\tilde m}\right) d\tilde m - f(1),$$

where $\frac{dm}{d\tilde m}$ is the Radon-Nikodym derivative of the measure $m$ with respect to the measure $\tilde m$.

By Jensen’s inequality:

$$D_f(m\,\|\,\tilde m) = \int_\Omega f\!\left(\frac{dm}{d\tilde m}\right)d\tilde m - f(1) \ \ge\ f\!\left(\int_\Omega \frac{dm}{d\tilde m}\,d\tilde m\right) - f(1) = f\!\left(\int_\Omega dm\right) - f(1) = f(1) - f(1) = 0. \tag{2}$$

Thus, $D_f(m\,\|\,\tilde m)\ge 0$ for any convex function $f$. Note, however, that the $f$-divergence $D_f(m\,\|\,\tilde m)$ is not a distance (for example, it does not satisfy the symmetry property). Here, the distributional uncertainty set imposed on the alternative distribution $\tilde m$ is given by

$$B_\rho(m) = \left\{ \tilde m\ \middle|\ \tilde m \ge 0,\ \int_\Omega d\tilde m = \tilde m(\Omega) = 1,\ D_f(\tilde m\,\|\,m) \le \rho \right\}.$$

Example 1. From the notion of f divergence one can derive the following important concept:

  • the $\alpha$-divergence, generated by

$$f_\alpha(a) = \begin{cases} \dfrac{4}{1-\alpha^2}\left(1 - a^{(1+\alpha)/2}\right) & \text{if } \alpha \neq \pm 1,\\[4pt] a\log a & \text{if } \alpha = 1,\\[2pt] -\log a & \text{if } \alpha = -1.\end{cases}$$

  • In particular, the Kullback-Leibler divergence (or relative entropy) is retrieved as $\alpha$ goes to $1$ (see the numerical sketch below).
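For discrete distributions (e.g., normalized histograms with full support), the $f$-divergence reduces to a finite sum. The following minimal sketch, with illustrative distributions, evaluates it for the Kullback-Leibler generator $f(y) = y\log y$:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_i q_i * f(p_i / q_i) - f(1) for discrete
    distributions p, q with full support (all entries positive)."""
    return np.sum(q * f(p / q)) - f(1.0)

# Kullback-Leibler divergence: the alpha -> 1 case of the alpha-divergence
kl = lambda p, q: f_divergence(p, q, lambda y: y * np.log(y))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))  # non-negative and asymmetric: not a distance
```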

2.1.2. Wasserstein metric

The Wasserstein metric between two probability distributions $\tilde m$ and $m$ is defined as follows:

Definition 2. For $m, \tilde m \in \mathcal{P}(\Omega)$, let $\Pi(\tilde m, m)$ be the set of all couplings between $m$ and $\tilde m$. That is,

$$\Pi(\tilde m, m) = \left\{ \pi\in\mathcal{P}(\Omega\times\Omega)\ \middle|\ \pi(A\times\Omega) = m(A),\ \pi(\Omega\times B) = \tilde m(B),\ \forall A, B\in\mathcal{B}(\Omega) \right\}.$$

$\mathcal{B}(\Omega)$ denotes the measurable sets of $\Omega$. Let $\theta \ge 1$. The Wasserstein metric between $\tilde m$ and $m$ is defined as

$$W_\theta(\tilde m, m) = \inf_{\pi\in\Pi(\tilde m, m)} \|d\|_{L_\theta(\pi)} = \inf_{\pi\in\Pi(\tilde m, m)} \left(\int_{\Omega\times\Omega} d^\theta(a,b)\,\pi(da,db)\right)^{1/\theta}.$$

It is well known that for every $\theta\ge1$, $W_\theta(\tilde m, m)$ is a true distance in the sense that it satisfies the following three axioms:

  • positive-definiteness,

  • the symmetry property,

  • the triangle inequality.

Note that $\tilde m$ is not necessarily absolutely continuous with respect to $m$. Now the distributional uncertainty/constraint set is the set of all possible probability distributions within an $L_\theta$-Wasserstein distance of at most $\rho$:

$$\tilde B_\rho(m) = \left\{ \tilde m\ \middle|\ \int_\Omega d\tilde m = \tilde m(\Omega) = 1,\ W_\theta(\tilde m, m) \le \rho \right\}.$$

Note that, if m is a random measure (obtained from a sampled realization), we use the expected value of the Wasserstein metric.

Example 2. The $L_\theta$-Wasserstein distance between two Dirac measures $\delta_{\omega_0}$ and $\delta_{\tilde\omega_0}$ is $W_\theta(\delta_{\omega_0}, \delta_{\tilde\omega_0}) = d(\omega_0, \tilde\omega_0)$. More generally, for $K\ge2$, the $L_2$-Wasserstein distance between the empirical measures $\mu_K = \frac{1}{K}\sum_{k=1}^K \delta_{\omega_k}$ and $\nu_K = \frac{1}{K}\sum_{k=1}^K \delta_{\tilde\omega_k}$ satisfies $W_2^2(\mu_K, \nu_K) \le \frac{1}{K}\sum_{k=1}^K \|\omega_k - \tilde\omega_k\|^2$ (see the sketch below).
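On the real line with $d(a,b) = |a-b|$ and equal sample sizes, the optimal coupling between two empirical measures matches sorted samples, which gives a closed form that is easy to check numerically. A minimal sketch under these assumptions:

```python
import numpy as np

def wasserstein_1d(xs, ys, theta=2):
    """L_theta-Wasserstein distance between the empirical measures
    (1/K) sum_k delta_{x_k} and (1/K) sum_k delta_{y_k} on the real
    line: sorting realizes the optimal coupling for convex costs."""
    x, y = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(x - y) ** theta) ** (1.0 / theta)

rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0.0, 1.0, 10_000),
                     rng.normal(0.5, 1.0, 10_000)))  # ~0.5, the mean shift
```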

We have defined $B_\rho(m)$ and $\tilde B_\rho(m)$. The goal now is to solve (1) under both the $f$-divergence and the Wasserstein metric. One of the difficulties of the problem is the curse of dimensionality: the distributionally robust optimization problem (1) of the decision-maker is an infinite-dimensional robust optimization problem because $B_\rho(m)$ is infinite-dimensional. Below we show that (1) can be transformed into an optimization of the form $\sup\inf\sup$. The latter problem has three alternating terms, and solving it requires a triality theory.

2.2. Triality theory

We first present the duality gap and develop a triality theory to solve equivalent formulations of (1). Consider uncoupled domains $A_i$, $i\in\{1,2,3\}$. For a general function $r_2$, one has

$$\sup_{a_2\in A_2}\ \inf_{a_1\in A_1} r_2(a_1,a_2) \ \le\ \inf_{a_1\in A_1}\ \sup_{a_2\in A_2} r_2(a_1,a_2),$$

and the difference

$$\min_{a_1\in A_1}\max_{a_2\in A_2} r_2(a_1,a_2) - \max_{a_2\in A_2}\min_{a_1\in A_1} r_2(a_1,a_2)$$

is called the duality gap. As is widely known in duality theory, by Sion's theorem [1] (an extension of the von Neumann minimax theorem) the duality gap vanishes, for example, for a convex-concave function on non-empty convex compact domains, and the value is achieved at a saddle point.

Triality theory focuses on optimization problems of the forms $\sup\inf\sup$ or $\inf\sup\inf$. The term triality is used here because there are three key alternating terms in these optimizations.

Proposition 1. Let $(a_1,a_2,a_3)\mapsto r_3(a_1,a_2,a_3)\in\mathbb{R}$ be a function defined on the product space $\prod_{i=1}^3 A_i$. Then, the following inequalities hold:

$$\sup_{a_2\in A_2}\ \inf_{a_1\in A_1,\,a_3\in A_3} r_3(a_1,a_2,a_3) \ \le\ \inf_{a_3\in A_3}\ \sup_{a_2\in A_2}\ \inf_{a_1\in A_1} r_3(a_1,a_2,a_3) \ \le\ \inf_{a_1\in A_1,\,a_3\in A_3}\ \sup_{a_2\in A_2} r_3(a_1,a_2,a_3), \tag{3}$$

and similarly

$$\sup_{a_1\in A_1,\,a_3\in A_3}\ \inf_{a_2\in A_2} r_3(a_1,a_2,a_3) \ \le\ \sup_{a_3\in A_3}\ \inf_{a_2\in A_2}\ \sup_{a_1\in A_1} r_3(a_1,a_2,a_3) \ \le\ \inf_{a_2\in A_2}\ \sup_{a_1\in A_1,\,a_3\in A_3} r_3(a_1,a_2,a_3). \tag{4}$$

Proof. Define

$$\hat g(a_2,a_3) \triangleq \inf_{a_1\in A_1} r_3(a_1,a_2,a_3).$$

Thus, for all $a_1, a_2, a_3$, one has $\hat g(a_2,a_3) \le r_3(a_1,a_2,a_3)$. It follows that, for any $(a_1, a_3)$,

$$\sup_{a_2\in A_2}\hat g(a_2,a_3) \ \le\ \sup_{a_2\in A_2} r_3(a_1,a_2,a_3).$$

Using the definition of $\hat g$, one obtains

$$\sup_{a_2\in A_2}\inf_{a_1\in A_1} r_3(a_1,a_2,a_3) \ \le\ \sup_{a_2\in A_2} r_3(a_1,a_2,a_3), \quad \forall (a_1, a_3).$$

Taking the infimum in $a_1$ yields

$$\sup_{a_2\in A_2}\inf_{a_1\in A_1} r_3(a_1,a_2,a_3) \ \le\ \inf_{a_1\in A_1}\sup_{a_2\in A_2} r_3(a_1,a_2,a_3), \quad \forall a_3. \tag{5}$$

Now, we use two operations for the variable a3:

  • Taking the infimum in the inequality (5) over $a_3$ yields

$$\inf_{a_3\in A_3}\sup_{a_2\in A_2}\inf_{a_1\in A_1} r_3(a_1,a_2,a_3) \ \le\ \inf_{a_3\in A_3}\inf_{a_1\in A_1}\sup_{a_2\in A_2} r_3(a_1,a_2,a_3) = \inf_{(a_1,a_3)\in A_1\times A_3}\sup_{a_2\in A_2} r_3(a_1,a_2,a_3),$$

which proves the second part of the inequalities (3). The first part of the inequalities (3) follows immediately from (5).

  • Taking the supremum in inequality (5) over $a_3$, with the roles of $a_1$ and $a_2$ exchanged, yields

$$\sup_{(a_1,a_3)\in A_1\times A_3}\inf_{a_2\in A_2} r_3(a_1,a_2,a_3) \ \le\ \sup_{a_3\in A_3}\inf_{a_2\in A_2}\sup_{a_1\in A_1} r_3(a_1,a_2,a_3),$$

which proves the first part of the inequalities (4). The second part of the inequalities (4) follows immediately from (5).

This completes the proof.

2.3. Equivalent formulations

Below we explain how the dimensionality of problem (1) can be significantly reduced using a representation by means of the triality theory inequalities of Proposition 1.

2.3.1. f-divergence

Interestingly, the distributionally robust optimization problem (1) under the $f$-divergence is equivalent to a finite-dimensional stochastic optimization problem (when $A$ is of finite dimension). To see this, the original problem needs to be transformed. Let us introduce the likelihood functional $L(\tilde\omega) = \frac{d\tilde m}{dm}(\tilde\omega)$, and set

$$\mathcal{L}_\rho(m) = \left\{ L\ \middle|\ \int_{\Omega} f\big(L(\tilde\omega)\big)\,dm(\tilde\omega) - f(1) \le \rho,\ \int_{\Omega} L(\tilde\omega)\,dm(\tilde\omega) = 1 \right\}.$$

Then, the Lagrangian of the problem is

$$\tilde r(a, L, \lambda, \mu) = \int_{\Omega} r(a,\tilde\omega)L(\tilde\omega)\,dm(\tilde\omega) - \lambda\left(\rho + f(1) - \int_{\Omega} f\big(L(\tilde\omega)\big)\,dm(\tilde\omega)\right) - \mu\left(1 - \int_{\Omega} L(\tilde\omega)\,dm(\tilde\omega)\right),$$

where λ0 and μR. The problem becomes

$$\sup_{a}\ \inf_{L\in\mathcal{L}_\rho(m)}\ \sup_{\lambda\ge0,\,\mu\in\mathbb{R}} \tilde r(a, L, \lambda, \mu). \tag{6}$$

A full understanding of problem (6) requires a triality theory (not a duality theory). The use of triality theory leads to the following equation:

$$\sup_{a\in A}\ \inf_{\tilde m\in B_\rho(m)} E_{\tilde m}[r] = \sup_{a\in A,\,\lambda\ge0,\,\mu\in\mathbb{R}} E_m[h], \tag{7}$$

where $h$ is the integrand function

$$h = -\lambda\big(\rho + f(1)\big) - \mu - \lambda f^*\!\left(-\frac{r+\mu}{\lambda}\right),$$

and $f^*$ is the Legendre-Fenchel transform of $f$, defined by

$$f^*(\xi) = \sup_{L}\ \big\{ L\xi - f(L) \big\} = -\inf_{L}\ \big\{ f(L) - L\xi \big\}. \tag{8}$$

Note that the right-hand side of (7) is of dimension $n+2$, which considerably reduces the dimensionality of the original problem (1).
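The reduced problem (7) can be attacked directly with an off-the-shelf optimizer once the expectation over $m$ is replaced by a sample average. The sketch below does this for the Kullback-Leibler case $f(y) = y\log y$, where $f(1) = 0$ and $f^*(\xi) = e^{\xi-1}$ (Example 4 below); the payoff, sample size, radius $\rho$ and starting point are illustrative choices, not the chapter's experiment:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
omega = rng.uniform(0.0, 1.0, size=(10_000, 2))  # empirical measure m
rho = 0.1                                        # divergence radius

def r(a, w):
    # concave illustrative payoff r(a, omega) = 1 - sum_k omega_k^2 a_k^2
    return 1.0 - np.sum(w**2 * a**2, axis=-1)

def neg_value(v):
    # h = -lambda*rho - mu - lambda * f*( -(r + mu)/lambda ), f*(xi) = e^{xi-1}
    a, lam, mu = v[:2], v[2], v[3]
    h = -lam * rho - mu - lam * np.exp(-(r(a, omega) + mu) / lam - 1.0)
    return -np.mean(h)                           # maximize E_m[h]

res = minimize(neg_value, x0=np.array([0.5, 0.5, 1.0, 0.0]),
               bounds=[(None, None), (None, None), (1e-3, None), (None, None)])
print("robust action:", res.x[:2], "robust value:", -res.fun)
```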

2.3.2. Wasserstein metric

Similarly, the distributionally robust optimization problem under the Wasserstein metric is equivalent to a finite-dimensional stochastic optimization problem (when $A$ is a set of finite dimension). If the function $\omega\mapsto r(a,\omega)$ is upper semi-continuous and $(\Omega, d)$ is a Polish space, then the Wasserstein distributionally robust optimization problem is equivalent to

$$\sup_{a\in A}\ \inf_{\tilde m\in\tilde B_\rho(m)} E_{\tilde m}[r] = \sup_{a\in A}\ \sup_{\lambda\ge0} E_m[\tilde h], \qquad \tilde h = -\lambda\rho^\theta + \inf_{\hat\omega\in\Omega}\big\{ r(a,\hat\omega) + \lambda\, d^\theta(\omega,\hat\omega) \big\}. \tag{9}$$

The next subsection presents algorithms for computing a distributionally robust solution from the equivalent formulations above.

2.4. Learning algorithms

Learning algorithms are crucial for finding approximate solutions to optimization and control problems. They are widely used for seeking roots/kernels of a function and for finding feasible solutions to variational inequalities. Practically, a learning algorithm generates a certain trajectory (or a set of trajectories) toward a potential approximate solution. Selecting a learning algorithm that has specific properties such as better accuracy, more stability, less oscillation and quick convergence is a challenging task [2, 3, 4, 5]. From the calculus of variations point of view, however, a learning algorithm generates curves. Therefore, selecting an algorithm among the others leads to an optimal control problem on the space of curves. Hence, it is natural to use optimal control theory to derive faster algorithms for a family of curves. Bregman-based algorithms and a risk-aware version of them are introduced below to meet specific properties. We start by introducing the Bregman divergence.

Definition 3. The Bregman divergence $d_g: A\times A\to\mathbb{R}$ is defined from a differentiable strictly convex function $g: A\to\mathbb{R}$. For two points $(a,b)\in A^2$, it measures the gap between $g(a)$ and the first-order Taylor expansion of $g$ around $b$ evaluated at $a$:

$$d_g(a,b) \triangleq g(a) - g(b) - \big\langle \nabla g(b),\ a - b\big\rangle.$$

Example 3. From the Bregman divergence one gets other features by choosing specific functions $g$ (see the sketch below):

  • If $g(a) = \sum_{i=1}^n a_i^2$, then the Bregman divergence $d_g(a,b) = \sum_{i=1}^n (a_i - b_i)^2$ is the squared standard Euclidean distance.

  • If $g(a) = \sum_{i=1}^n a_i\log a_i$ is defined on the relative interior of the simplex, i.e., $a, b\in\{b\in(0,1)^n\ |\ \sum_{i=1}^n b_i = 1\}$, then the Bregman divergence $d_g(a,b) = \sum_{i=1}^n a_i\log\frac{a_i}{b_i}$ is the Kullback-Leibler divergence.
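Both special cases follow from the single defining formula, as this minimal sketch (with illustrative points on the simplex) verifies:

```python
import numpy as np

def bregman(g, grad_g, a, b):
    """d_g(a, b) = g(a) - g(b) - <grad g(b), a - b>."""
    return g(a) - g(b) - grad_g(b) @ (a - b)

sq = lambda x: np.sum(x**2)                 # g(a) = sum_i a_i^2
sq_grad = lambda x: 2.0 * x
negent = lambda x: np.sum(x * np.log(x))    # g(a) = sum_i a_i log a_i
negent_grad = lambda x: 1.0 + np.log(x)

a, b = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(sq, sq_grad, a, b))           # squared Euclidean distance: 0.18
print(bregman(negent, negent_grad, a, b))   # Kullback-Leibler divergence KL(a||b)
```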

We are now ready to define algorithms for solving the right-hand sides of (7) and (9). One of the key approaches for quantifying the error of an algorithm with respect to the distributionally robust optimum is the so-called average regret. When the regret vanishes, one gets close to a distributionally robust optimum.

Definition 4. The average regret of an algorithm which generates the trajectory $a(t) = (\tilde a(t), \lambda(t), \mu(t))$ within $[t_0, T]$, $t_0 > 0$, is

$$\mathrm{regret}(T) \triangleq \frac{1}{T - t_0}\int_{t_0}^{T}\left[\max_{b\in A\times\mathbb{R}_+\times\mathbb{R}} E_m h(b,\omega) - E_m h(a(t),\omega)\right]dt.$$

2.4.1. Armijo gradient flow

Algorithm 1. The Armijo gradient pseudocode is as follows:

1: procedure ArmijoGradient($a_0$, $\epsilon$, $T$, $g$, $m$, $h$)    ▷ Armijo gradient flow starting from $a_0$ within $[0, T]$

2:       $a \leftarrow a_0$

3:       while regret $> \epsilon$ and $t \le T$ do    ▷ we have the answer if the regret is 0

4:              compute $a(t)$, the solution of (10)

5:              compute regret$(t)$

6:       end while

7:       return $a(t)$, regret$(t)$    ▷ get $a(t)$ and the regret

8: end procedure

Proposition 2. Let $a\mapsto E_m h(a,\omega): \mathbb{R}^{n+2}\to\mathbb{R}$ be a concave function that has a unique global maximizer $a^*$. Assume that $a^*$ is a feasible action profile, i.e., $a^*\in A$. Consider the continuous-time analogue of the Armijo gradient flow [6], which is given by

$$\frac{d}{dt}a(t) = \big[\nabla^2 g\big]^{-1}\nabla_a E_m h(a(t),\omega), \quad a(0) = a_0\in\mathbb{R}^{n+2}, \tag{10}$$

where $a_0$ is the initial point of the algorithm and $g$ is a strictly convex function of $a$. Let $a(t)$ be the solution to (10).

Then the average regret within $[t_0, T]$, $t_0 > 0$, is bounded above by

$$\mathrm{regret}(T) \le \frac{1}{T - t_0}\int_{t_0}^{T} E_m\big[h(a^*,\omega) - h(a(t),\omega)\big]\,dt \ \le\ d_g(a^*, a_0)\,\frac{\log T - \log t_0}{T - t_0}.$$

Proof. Let

$$W(a(t)) = t\,E_m\big[h(a^*,\omega) - h(a(t),\omega)\big] + d_g(a^*, a(t)),$$

where $a(t)$ is the solution to (10). The function $W$ is positive and $\frac{d}{dt}W = E_m\big[h(a^*,\omega) - h(a(t),\omega)\big] - t\,\nabla_a E_m h(a(t),\omega)^\top\big[\nabla^2 g\big]^{-1}\nabla_a E_m h(a(t),\omega) + \frac{d}{dt}d_g(a^*, a(t))$. By concavity of $E_m h(\cdot,\omega)$ one has

$$\big\langle \nabla_a E_m h(a,\omega),\ a^* - a\big\rangle \ \ge\ E_m\big[h(a^*,\omega) - h(a,\omega)\big], \quad \forall a.$$

On the other hand,

$$\frac{d}{dt}d_g(a^*, a(t)) = -\big\langle \nabla g(a(t)),\ \dot a\big\rangle - \big\langle \nabla^2 g(a(t))\,\dot a,\ a^* - a(t)\big\rangle + \big\langle \nabla g(a(t)),\ \dot a\big\rangle = -\big\langle \nabla^2 g(a(t))\,\dot a,\ a^* - a(t)\big\rangle = -\big\langle \nabla_a E_m h(a(t),\omega),\ a^* - a(t)\big\rangle. \tag{11}$$

Hence,

$$\frac{d}{dt}W \ \le\ \big\langle \nabla_a E_m h(a(t),\omega),\ a^* - a(t)\big\rangle - t\,\nabla_a E_m h(a(t),\omega)^\top\big[\nabla^2 g\big]^{-1}\nabla_a E_m h(a(t),\omega) - \big\langle \nabla_a E_m h(a(t),\omega),\ a^* - a(t)\big\rangle = -t\,\nabla_a E_m h(a(t),\omega)^\top\big[\nabla^2 g\big]^{-1}\nabla_a E_m h(a(t),\omega) \ \le\ 0, \tag{12}$$

where the last inequality is by convexity of $g$. It follows that $\frac{d}{dt}W(a(t)) \le 0$ along the path of the gradient flow. This decreasing property implies $0 \le W(a(t)) \le W(a(0)) = d_g(a^*, a_0)$. In particular, $0 \le t\,E_m\big[h(a^*,\omega) - h(a(t),\omega)\big] \le W(a(0)) < +\infty$. Thus, the error to the value $E_m h(a^*,\omega)$ is bounded by

$$0 \ \le\ E_m\big[h(a^*,\omega) - h(a(t),\omega)\big] \ \le\ \frac{W(a(0))}{t}.$$

The announced result on the regret follows by integration over $[t_0, T]$ and by averaging. This completes the proof.

Note that the above regret bound is established without assuming strong concavity of $a\mapsto E_m h(a,\omega)$. Also, no Lipschitz continuity bound on the gradient is assumed.
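In practice, the flow (10) is integrated numerically. A minimal forward-Euler sketch with $g(a) = \|a\|^2/2$ (so the factor $[\nabla^2 g]^{-1}$ is the identity), a sample-average gradient, and an illustrative concave objective:

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(0.0, 1.0, 10_000)        # empirical measure m

def grad_obj(a):
    # gradient of the illustrative objective E_m[1 - omega^2 a^2]
    return -2.0 * np.mean(omega**2) * a

a, dt, T = np.array([2.0]), 0.01, 10.0       # initial point and step size
for _ in range(int(T / dt)):
    a = a + dt * grad_obj(a)                 # Euler step of the flow (10)
print(a)                                     # decays toward the maximizer a* = 0
```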

2.4.2. Bregman learning algorithms

Algorithm 2. The Bregman learning pseudocode is as follows:

1: procedure Bregman a0ϵTgαβmh The Bregman learning starting from a0 within 0T

2:       aa0

3:       while regret>ϵ and tT do We have the answer if regret is 0

4:              Compute at solution of (13)

5:              Compute regrett

6:       end while

7:       return at,regrett get at and the regret

8: end procedure

Proposition 3. Let $a\mapsto E_m h(a,\omega): \mathbb{R}^{n+2}\to\mathbb{R}$ be a concave function that has a unique global maximizer $a^*$. Assume that $a^*$ is a feasible action profile, i.e., $a^*\in A$. Let $\alpha$ and $\beta$ be two functions such that $\dot\beta(t) \le e^{\alpha(t)}$. Consider the following Bregman learning algorithm:

$$\frac{d}{dt}\nabla g\big(a(t) + e^{-\alpha(t)}\dot a(t)\big) = e^{\alpha(t)+\beta(t)}\,\nabla_a E_m h(a(t),\omega), \quad a(0)\in\mathbb{R}^{n+2},\ \dot a(0)\in\mathbb{R}^{n+2}, \tag{13}$$

where $a(0)$ is the initial point of the algorithm and $g$ is a strictly convex function of $a$. Let $a(t)$ be the solution to (13). Then the average regret within $[t_0, T]$, $t_0 > 0$, is bounded above by

$$\mathrm{regret}(T) \le \frac{c_0}{T - t_0}\int_{t_0}^{T} e^{-\beta(s)}\,ds, \tag{14}$$

where $c_0 \triangleq d_g\big(a^*,\ a(0) + e^{-\alpha(0)}\dot a(0)\big) + e^{\beta(0)}\,E_m\big[h(a^*,\omega) - h(a(0),\omega)\big] > 0$.

Proof. Let $W(a, \dot a, t, a^*) = d_g\big(a^*,\ a(t) + e^{-\alpha(t)}\dot a(t)\big) + e^{\beta(t)}\,E_m\big[h(a^*,\omega) - h(a(t),\omega)\big]$. It is clear that $W$ is positive. Moreover, $\frac{d}{dt}W(a(t), \dot a(t), t, a^*) \le 0$ whenever $\dot\beta \le e^{\alpha}$. Thus $W(a(t), \dot a(t), t, a^*) \le W(a(0), \dot a(0), 0, a^*) = c_0$, so that $E_m\big[h(a^*,\omega) - h(a(t),\omega)\big] \le c_0 e^{-\beta(t)}$. By integration between $t_0$ and $T$ it follows that

$$\frac{1}{T - t_0}\int_{t_0}^{T} E_m\big[h(a^*,\omega) - h(a(t),\omega)\big]\,dt \ \le\ \frac{c_0}{T - t_0}\int_{t_0}^{T} e^{-\beta(s)}\,ds.$$

This completes the proof.

In particular, for $\beta(s) = s + e^{-s}$, one obtains an error bound to the optimal value as

$$\frac{c_0}{t}\int_{0}^{t} e^{-\beta(s)}\,ds = \frac{c_0}{t}\int_{0}^{t} e^{-s}e^{-e^{-s}}\,ds = \frac{c_0\big(e^{-e^{-t}} - e^{-1}\big)}{t},$$

and for $\beta(s) = s$, the regret bound becomes

$$\frac{c_0}{t}\int_{0}^{t} e^{-\beta(s)}\,ds = \frac{c_0\big(1 - e^{-t}\big)}{t}.$$

Figure 1 illustrates the advantage of algorithm (13) compared with the gradient flow (10). It plots the regret bound $\frac{c_0}{T - t_0}\int_{t_0}^{T} e^{-\beta(s)}\,ds$ for $\beta(s) = s$ against the gradient-flow bound $d_g(a^*, a_0)\,\frac{\log T - \log t_0}{T - t_0}$, with an initial gap of $c_0 = 25$.

Figure 1.

Global regret bound under Bregman vs. gradient dynamics. The initial gap is $c_0 = 25$.

The advantage of algorithms (10) and (13) is that they do not require computing the Hessian of $E_m h(a,\omega)$, as is the case in the Newton scheme. As a corollary of Proposition 2, the regret vanishes as $T$ grows; thus, it is a no-regret algorithm. However, algorithm (10) may not be sufficiently fast. Algorithm (13) provides a higher-order convergence rate obtained by carefully designing $(\alpha, \beta)$: the average regret decays very quickly to zero [7]. However, it may generate an oscillatory trajectory with large amplitude. The next subsection presents risk-aware algorithms that reduce the oscillatory phase of the trajectory.
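With $g(a) = \|a\|^2/2$ the map $\nabla g$ is the identity, and setting $w = a + e^{-\alpha}\dot a$ turns (13) into the first-order system $\dot a = e^{\alpha}(w - a)$, $\dot w = e^{\alpha+\beta}\nabla_a E_m h(a,\omega)$. A minimal sketch with $\alpha(t) = 0$, $\beta(t) = t$ (the exponential-rate row of Table 1) and the same illustrative objective as above:

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(0.0, 1.0, 10_000)
grad_obj = lambda a: -2.0 * np.mean(omega**2) * a   # grad of E_m[1 - omega^2 a^2]

a = np.array([2.0])
w = a.copy()                                        # w = a + e^{-alpha} da/dt
dt, t = 1e-4, 0.0
while t < 6.0:
    a, w = a + dt * (w - a), w + dt * np.exp(t) * grad_obj(a)
    t += dt
print(a)   # oscillates toward a* = 0; the value gap decays like c0 * exp(-t)
```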

2.4.3. Risk-aware Bregman learning algorithm

In order to reduce the oscillatory phase, we introduce a risk-aware Bregman learning algorithm [7], which is a speed-up-and-average version of (13) called the mean dynamics $\bar m$ of $a$, given by

$$\dddot{\bar m} = -\frac{3}{t}\ddot{\bar m} - \big(e^{\alpha} - \dot\alpha\big)\left(\ddot{\bar m} + \frac{2}{t}\dot{\bar m}\right) + \frac{e^{2\alpha+\beta}}{t}\Big[\nabla^2 g\big(\bar m + (t + 2e^{-\alpha})\dot{\bar m} + t e^{-\alpha}\ddot{\bar m}\big)\Big]^{-1}\nabla_a E_m h\big(\bar m + t\dot{\bar m},\ \omega\big), \tag{15}$$

with starting vector $\big(\bar m(0), \dot{\bar m}(0), \ddot{\bar m}(0)\big)$, where $\bar m(0) = a_0$.

Algorithm 3. The risk-aware Bregman learning pseudocode is as follows:

1: procedure RiskAwareBregman($\bar m_0$, $\epsilon$, $T$, $g$, $\alpha$, $\beta$, $m$, $h$)    ▷ risk-aware Bregman learning starting from $\bar m_0$ within $[0, T]$

2:       $\bar m \leftarrow \bar m_0 = \big(a_0, \dot{\bar m}(0), \ddot{\bar m}(0)\big)$

3:       while regret $> \epsilon$ and $t \le T$ do    ▷ we have the answer if the regret is 0

4:             compute $\bar m(t)$, the solution of (15)

5:             compute regret$(t)$

6:       end while

7:       return $\bar m(t)$, regret$(t)$    ▷ get $\bar m(t)$ and the regret

8: end procedure

Proposition 4. The time-average trajectory of the learning algorithm (13) generates the mean dynamics (15).

Proof. We use the averaging relation $\bar m(t) = \frac{1}{t}\int_0^t a(s)\,ds$, where $a$ solves Eq. (13). From the definition of $\bar m$, and by l'Hôpital's rule, $\bar m(0) = a_0$. Moreover, $\bar m(t)$ and $a(t)$ are related through the following equations:

$$a(t) = \bar m(t) + t\dot{\bar m}(t), \qquad \dot a(t) = 2\dot{\bar m}(t) + t\ddot{\bar m}(t), \qquad \ddot a(t) = 3\ddot{\bar m}(t) + t\dddot{\bar m}(t). \tag{16}$$

Substituting these values in Eq. (13) yields the mean dynamics (15). This completes the proof.

The risk-aware Bregman dynamics (15) generates a less oscillatory trajectory due to its averaging nature. The next result provides an accuracy bound for (15).

Proposition 5. The risk-aware Bregman dynamics (15) satisfies

$$0 \ \le\ E_m\big[h(a^*,\omega) - h(\bar m(t),\omega)\big] \ \le\ \frac{c_0}{t}\int_0^t e^{-\beta(s)}\,ds.$$

Proof. Let $\bar m(t) = \frac{1}{t}\int_0^t a(s)\,ds$. Then, $\bar m(t) = \int_{\mathbb{R}} a(s)\,\frac{1}{t}\mathbb{1}_{[0,t]}(s)\,ds$. Thus, $\bar m(t) = E_{\mu_t}[a]$, where $\mu_t$ is the measure with density $s\mapsto \frac{1}{t}\mathbb{1}_{[0,t]}(s)$. By concavity of $E_m h(\cdot,\omega)$ we apply Jensen's inequality:

$$E_m h\!\left(\frac{1}{t}\int_0^t a(s)\,ds,\ \omega\right) = E_m h\big(\bar m(t),\omega\big) = E_m h\big(E_{\mu_t}[a],\ \omega\big) \ \ge\ E_{\mu_t}\big[E_m h(a,\omega)\big] = \frac{1}{t}\int_0^t E_m h\big(a(s),\omega\big)\,ds.$$

In view of (14) one has

$$0 \ \le\ E_m h(a^*,\omega) - E_m h\!\left(\frac{1}{t}\int_0^t a(s)\,ds,\ \omega\right) \ \le\ \frac{1}{t}\int_0^t \Big[E_m h(a^*,\omega) - E_m h\big(a(s),\omega\big)\Big]\,ds \ \le\ \frac{c_0}{t}\int_0^t e^{-\beta(s)}\,ds,$$

that is, $0 \le E_m h(a^*,\omega) - E_m h(\bar m(t),\omega) \le \frac{c_0}{t}\int_0^t e^{-\beta(s)}\,ds$.

This completes the proof.

Definition 5. (Convergence time). Let $\delta > 0$ and let $a(t)$ be the trajectory generated by the Bregman algorithm starting from $a_0$ at time $t_0$. The convergence time to be within a ball of radius $\delta > 0$ centered at the optimal value $E_m h(a^*,\omega)$ is given by

$$T_\delta = \inf\Big\{ t > t_0\ \Big|\ E_m\big[h(a^*,\omega) - h(a(t),\omega)\big] \le \delta \Big\}.$$

Proposition 6. Under the assumptions above, the error generated by the algorithm is at most (14), which means that it takes at most $T_\delta = \beta^{-1}\big(\log\frac{c_0}{\delta}\big)$ time units for the algorithm to be within a ball of radius $\delta > 0$ centered at $E_m h(a^*,\omega)$.

Proof. The proof is immediate. For $\delta > 0$, the average regret bound of Proposition 5,

$$\mathrm{regret}(T) \le \frac{c_0}{T - t_0}\int_{t_0}^{T} e^{-\beta(s)}\,ds \le \delta, \tag{17}$$

provides the announced convergence time bound. This completes the proof.

See Table 1 for the bound $T_\delta$ under different parametric functions.

Example 4. Let $f(y) = y\log y$ be defined on $\mathbb{R}_+$. Then, $f(1) = 0$, and the derivatives of $f$ are $f'(y) = 1 + \log y$ and $f''(y) = \frac{1}{y} > 0$. The Legendre-Fenchel transform of $f$ is $f^*(\xi) = e^{\xi - 1}$. Let $g(a) = \frac{\|a\|^2}{2}$ and $r(a_1,a_2,\omega) = 1 - \sum_{k=1}^2 \omega_k^2 a_k^2$. The distribution of the coefficient $\omega$ is unknown, but a sampled empirical measure $m$, built from $10^4$ samples, is considered to be close to the uniform distribution on $(0,1)$. We illustrate the quick convergence rate of the algorithm in this basic example and plot in Figure 2 the trajectories under the standard gradient, the Bregman dynamics, and the risk-aware Bregman dynamics (15). In particular, we observe that the risk-aware Bregman dynamics (15) provides a satisfactory value very quickly. In this particular setup, the accuracy reached by the risk-aware Bregman algorithm (15) at $t = 0.5$ requires four times longer ($t = 2$) for the standard Bregman algorithm, and forty times longer ($t = 20$) for the gradient ascent. Also, we observe that the risk-aware Bregman algorithm is less oscillatory, and its amplitude decays very fast compared to the risk-neutral algorithm.

| Convergence | Error bound | Time-to-reach $T_\delta$ |
| --- | --- | --- |
| Triple exponential: $\alpha(t) = t + e^t$, $\beta(t) = e^{e^t}$ | $e^{-e^{e^t}} c_0$ | $\log\log\log\frac{c_0}{\delta}$ |
| Double exponential: $\alpha(t) = t$, $\beta(t) = e^t$ | $e^{-e^{t}} c_0$ | $\log\log\frac{c_0}{\delta}$ |
| Exponential: $\alpha(t) = 0$, $\beta(t) = t$ | $e^{-t} c_0$ | $\log\frac{c_0}{\delta}$ |
| Polynomial of order $k$: $\alpha(t) = \log k - \log t$, $\beta(t) = k\log t$ | $c_0 t^{-k}$ | $\big(\frac{c_0}{\delta}\big)^{1/k}$ |

Table 1.

Convergence rates under different sets of functions $(\alpha, \beta)$.
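As a quick numerical check of the last column, the following sketch evaluates $T_\delta = \beta^{-1}\big(\log\frac{c_0}{\delta}\big)$ for each row, using the initial gap $c_0 = 25$ of Figure 1 and an illustrative tolerance $\delta = 10^{-3}$:

```python
import numpy as np

c0, delta = 25.0, 1e-3
L = np.log(c0 / delta)                               # log(c0/delta) ~ 10.13
print("exponential       :", L)                      # beta(t) = t
print("double exponential:", np.log(L))              # beta(t) = e^t
print("triple exponential:", np.log(np.log(L)))      # beta(t) = e^{e^t}
print("polynomial, k = 3 :", (c0 / delta) ** (1/3))  # beta(t) = k log t
```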

Figure 2.

Gradient ascent vs. risk-aware Bregman dynamics for $r = 1 - \sum_{k=1}^2 \omega_k^2 a_k^2$.

Advertisement

3. Constrained distributionally robust optimization

In the constrained case, i.e., when $A$ is a strict subset of $\mathbb{R}^{n+2}$, algorithms (10) and (13) present some drawbacks: the trajectory $a(t)$ may not be feasible, i.e., $a(t)\notin A\times\mathbb{R}_+\times\mathbb{R}$, even when it starts in $A$. In order to design feasible trajectories, the projected gradient has been widely studied in the literature. However, a projection onto $A$ at each time $t$ involves additional optimization problems, and the computation of the projected gradient adds extra complexity to the algorithm. We restrict our attention to the following constraints:

$$A = \left\{ a\in\mathbb{R}^n\ \middle|\ a_l\in[\underline a_l, \bar a_l],\ l\in\{1,\dots,n\},\ \sum_{l=1}^n c_l a_l \le b \right\}.$$

We impose the following feasibility condition: $\underline a_l < \bar a_l$ for all $l\in\{1,\dots,n\}$, $c_l > 0$, and $\sum_{l=1}^n c_l\underline a_l < b$. Under this setting, the constraint set $A$ is non-empty, convex and compact.

We propose a method to compute a constrained solution that has full support (whenever it exists). We do not use the projection operator. Instead, we transform the domain via $[\underline a_l, \bar a_l] = \xi([0,1])$, where $\xi(x_l) = \bar a_l x_l + \underline a_l(1 - x_l) = a_l$. The map $\xi$ is one-to-one, and

$$x_l = \xi^{-1}(a_l) = \frac{a_l - \underline a_l}{\bar a_l - \underline a_l}\in[0,1],$$

$$\sum_{l=1}^n c_l(\bar a_l - \underline a_l)x_l \ \le\ b - \sum_{l=1}^n c_l\underline a_l \ \triangleq\ \hat b.$$

The algorithm

$$\dot y = \big[\nabla^2 g\big]^{-1}\underbrace{\nabla_a E_m h(a,\omega)}_{\hat f(a)}, \qquad a_l = \bar a_l x_l + \underline a_l(1 - x_l), \qquad x_l = \min\left\{1,\ \frac{e^{y_l}}{\sum_{k=1}^n e^{y_k}}\cdot\frac{\hat b}{c_l(\bar a_l - \underline a_l)}\right\}, \quad l\in\{1,\dots,n\}, \tag{18}$$

generates a trajectory $a(t)$ that satisfies the constraint.

Algorithm 4. The constrained learning pseudocode is as follows:

1: procedure ConstrainedGradient($a_0$, $\epsilon$, $T$, $g$, $m$, $h$)    ▷ constrained learning algorithm starting from $a_0$ within $[0, T]$

2:       $a \leftarrow a_0$

3:       while regret $> \epsilon$ and $t \le T$ do    ▷ we have the answer if the regret is 0

4:             compute $a(t)$, the solution of (18)

5:             compute regret$(t)$

6:       end while

7:       return $a(t)$, regret$(t)$    ▷ get $a(t)$ and the regret

8: end procedure

Proposition 7. If $\hat b \le \min_l c_l(\bar a_l - \underline a_l)$, then Algorithm (18) reduces to

$$a_l = \bar a_l x_l + \underline a_l(1 - x_l), \qquad \dot x_l = x_l\left( e_l^\top \hat f(a) - \frac{1}{\hat b}\sum_{k=1}^{n} e_k^\top \hat f(a)\, x_k\, c_k(\bar a_k - \underline a_k) \right), \quad l\in\{1,\dots,n\}. \tag{19}$$

Proof. It suffices to check that for $\hat b \le \min_l c_l(\bar a_l - \underline a_l)$, the vector $z$ defined by $z_l = \frac{e^{y_l}}{\sum_{k=1}^n e^{y_k}}$ solves the replicator equation

$$\dot z_l = z_l\big(\dot y_l - \langle z, \dot y\rangle\big).$$

Thus, $x_l = \frac{e^{y_l}}{\sum_{k=1}^n e^{y_k}}\cdot\frac{\hat b}{c_l(\bar a_l - \underline a_l)}$ solves $\dot x_l = x_l\big( e_l^\top \hat f(a) - \frac{1}{\hat b}\sum_{k} e_k^\top \hat f(a)\, x_k\, c_k(\bar a_k - \underline a_k)\big)$. This completes the proof.

Note that the dynamics of $x$ in Eq. (19) is a constrained replicator dynamics [8], which is widely used in evolutionary game dynamics. This observation establishes a relationship between optimization and game dynamics, and shows that the replicator dynamics is the gradient flow of the expected payoff under the simplex constraint (see the sketch below).
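A minimal forward-Euler sketch of the constrained replicator dynamics (19), with illustrative bounds, costs, budget, and a concave stand-in payoff whose gradient plays the role of $\hat f$; the budget satisfies $\hat b \le \min_l c_l(\bar a_l - \underline a_l)$, as Proposition 7 requires:

```python
import numpy as np

n = 3
lo, hi = np.zeros(n), 2.0 * np.ones(n)      # box constraints [lo_l, hi_l]
c, b = np.ones(n), 1.5                      # budget sum_l c_l a_l <= b
b_hat = b - c @ lo                          # transformed budget (1.5 <= min span)
span = c * (hi - lo)

target = np.array([1.0, 0.75, 0.5])
fhat = lambda a: -2.0 * (a - target)        # gradient of -sum_l (a_l - target_l)^2

x = np.full(n, b_hat / span.sum())          # interior start with span @ x = b_hat
dt = 0.01
for _ in range(5000):
    f = fhat(lo + (hi - lo) * x)            # a_l = lo_l + (hi_l - lo_l) x_l
    x = x + dt * x * (f - (span * x) @ f / b_hat)   # Eq. (19)

a = lo + (hi - lo) * x
print(a, "budget used:", c @ a, "<=", b)    # stays feasible without projection
```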

The next example illustrates a constrained distributionally robust optimization in wireless communication networks.

Example 5 (Wireless communication). Consider a power allocation problem over n medium access channels. The signal-to-interference-plus-noise ratio (SINR) is

$$\mathrm{SINR}_l = \frac{a_l\,|\omega_{ll}|^2\,\big(d^2(s_{rl}, s_{tl}) + \epsilon^2\big)^{-o/2}}{N_0(s_{rl}) + I_l(s_{rl})},$$

where

  • $N_0 > 0$ is the background noise.

  • The interference on channel $l$ is denoted $I_l \ge 0$. One typical model for $I_l$ is

$$I_l = \sum_{k\neq l} a_k\,|\omega_{kl}|^2\,\big(d^2(s_{rl}, s_{tk}) + \epsilon^2\big)^{-o/2}.$$

  • $\epsilon > 0$ is the height of the transmitter antenna.

  • $\omega_{ll}$ is the channel state at $l$. The channel state is unknown; its true distribution is also unknown.

  • $s_{rl}$ is the location of the receiver of link $l$.

  • $s_{tl}$ is the location of the transmitter of link $l$.

  • $o\in\{2,3,4\}$ is the pathloss exponent.

  • $a_l$ is the power allocated to channel $l$. It is assumed to lie between $\underline a_l \ge 0$ and $\bar a_l$, with $0 \le \underline a_l < \bar a_l < +\infty$. Moreover, a total power budget constraint $\sum_{l=1}^n a_l \le \bar a$ is imposed, where $\bar a > \sum_{l=1}^n \underline a_l \ge 0$.

It is worth mentioning that the action constraints of the power allocation problem are similar to the ones analyzed in Section 3. The admissible action space is

$$A \triangleq \left\{ a\in\mathbb{R}_+^n\ :\ \underline a_l \le a_l \le \bar a_l,\ \sum_{l=1}^n a_l \le \bar a \right\}.$$

Clearly, $A$ is a non-empty convex compact set. The payoff function is the sum-rate $r(a,\omega) = \sum_{l=1}^n W_l\log(1 + \mathrm{SINR}_l)$, where $W_l > 0$. The mapping $(a,\omega)\mapsto r(a,\omega)$ is continuously differentiable.

  • Robust optimization is too conservative: Part of the robust optimization problem [9, 7] consists of choosing the channel gain $|\omega_{ll}|^2\in[0, \bar\omega_{ll}]$, where the bound $\bar\omega_{ll}$ needs to be carefully designed. However, the worst case is achieved when the channel gain is zero: $\inf_{|\omega_{ll}|^2\in[0,\bar\omega_{ll}]} r(a,\omega) = 0$. Hence the robust performance is zero. This is too conservative, as several realizations of the channel may give a better performance than zero. Another way is to re-design the bounds to $[\underline\omega_{ll}, \bar\omega_{ll}]$. But if $\underline\omega_{ll} > 0$, it means that very low channel gains are not allowed, which may be too optimistic. Below we use the distributionally robust optimization approach, which eliminates this design issue.

  • Distributionally robust optimization: By means of a training sequence or a channel estimation method, a certain (statistical) distribution $m$ is derived. However, $m$ cannot be considered as the true distribution of the channel state due to estimation errors; the true distribution of $\omega$ is unknown. Based on this observation, an uncertainty set $B_\rho(m)$ with radius $\rho \ge 0$ is constructed for the alternative distribution candidates. Note that $\rho = 0$ means that $B_0(m) = \{m\}$. The distributionally robust optimization problem is $\sup_a \inf_{\tilde m\in B_\rho(m)} E_{\tilde m}\,r(a,\omega)$. In the presence of interference, the function $r(a,\omega)$ is not necessarily concave in $a$; in the absence of interference, the problem becomes concave. (A sketch of the sample-average sum-rate objective follows below.)
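The following sketch evaluates the expected sum-rate over an empirical channel measure $m$. The geometry, noise level, pathloss exponent, and the exponential draw of $|\omega|^2$ (Rayleigh fading) are illustrative stand-ins for measured data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 3, 10_000
gain2 = rng.exponential(1.0, (K, n, n))        # samples of |omega_kl|^2
dist = 1.0 + 2.0 * rng.random((n, n))          # distances d(s_rl, s_tk)
eps, o, N0, W = 0.5, 3, 0.1, np.ones(n)
path = (dist**2 + eps**2) ** (-o / 2)          # pathloss factors

def expected_sum_rate(a):
    recv = gain2 * path * a                    # power from transmitter k at receiver l
    sig = recv[:, np.arange(n), np.arange(n)]  # direct links l -> l
    interf = recv.sum(axis=2) - sig            # interference I_l: sum over k != l
    return np.mean(np.sum(W * np.log(1.0 + sig / (N0 + interf)), axis=1))

print(expected_sum_rate(np.array([0.5, 0.5, 0.5])))   # sample average of r(a, omega)
```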


4. Distributed optimization

This section presents distributed distributionally robust optimization problems over a directed graph. A large number of virtual agents can potentially choose a node (vertex), subject to constraints. The vector $a$ represents the population state. Since $a$ has $n$ components, the graph has $n$ vertices. The interactions between virtual agents are interpreted as possible connections of the graph. Let us suppose that the current interactions are represented by a directed graph $G = (L, E)$, where $E\subseteq L^2$ is the set of links representing the possible interactions among the proportions of agents, i.e., if $(l,k)\in E$, then the $l$th component of $a$ can interact with the $k$th component of $a$. In other words, $(l,k)\in E$ means that virtual agents selecting the strategy $l\in L$ could migrate to the strategy $k\in L$. Moreover, $\Lambda\in\{0,1\}^{n\times n}$ is the adjacency matrix of the graph $G$, whose entries are $\lambda_{lk} = 1$ if $(l,k)\in E$, and $\lambda_{lk} = 0$ otherwise.

Definition 6. The distributionally robust fitness function is the marginal distributionally robust payoff function. If $a\mapsto E_m h(a,\omega)$ is continuously differentiable, the distributionally robust fitness function is $\nabla_a E_m h(a,\omega)$.

Definition 7. The virtual population state $a^*$ is an equilibrium if $a^*\in A$ and it solves the variational inequality

$$\big\langle a^* - b,\ \nabla_a E_m h(a^*,\omega)\big\rangle \ \ge\ 0, \quad \forall b\in A.$$

Proposition 8. Let the set of virtual population states $A$ be non-empty, convex and compact, and let $b\mapsto \nabla_b E_m h(b,\omega)$ be continuous. Then the following conditions are equivalent:

  • $\big\langle a^* - b,\ \nabla_a E_m h(a^*,\omega)\big\rangle \ge 0$, $\forall b\in A$;

  • the action $a^*$ satisfies $a^* = \mathrm{proj}_A\big(a^* + \eta\nabla_a E_m h(a^*,\omega)\big)$ for $\eta > 0$.

Proof. Let $a^*$ be a feasible action that solves the variational inequality:

$$\big\langle a^* - b,\ \nabla_a E_m h(a^*,\omega)\big\rangle \ \ge\ 0, \quad \forall b\in A.$$

Let $\eta > 0$. Multiplying both sides by $\eta$, we obtain

$$\big\langle a^* - b,\ \eta\nabla_a E_m h(a^*,\omega)\big\rangle \ \ge\ 0, \quad \forall b\in A.$$

We add the term $\langle a^* - b,\ a^*\rangle$ to both sides to obtain the following relationships:

$$\begin{aligned}
&\big\langle a^* - b,\ \eta\nabla_a E_m h(a^*,\omega)\big\rangle \ \ge\ 0, \quad \forall b\in A,\\
&\big\langle a^* - b,\ \eta\nabla_a E_m h(a^*,\omega)\big\rangle + \big\langle a^* - b,\ a^*\big\rangle \ \ge\ \big\langle a^* - b,\ a^*\big\rangle, \quad \forall b\in A,\\
&\big\langle b - a^*,\ a^* + \eta\nabla_a E_m h(a^*,\omega)\big\rangle + \big\langle a^* - b,\ a^*\big\rangle \ \le\ 0, \quad \forall b\in A,\\
&\big\langle b - a^*,\ a^* + \eta\nabla_a E_m h(a^*,\omega) - a^*\big\rangle \ \le\ 0, \quad \forall b\in A. 
\end{aligned} \tag{20}$$

Recall that the projection operator onto a convex and closed set $A$ is uniquely determined by

$$\forall z\in\mathbb{R}^n,\quad z^* = \mathrm{proj}_A(z) \iff \big\langle z - z^*,\ b - z^*\big\rangle \le 0, \quad \forall b\in A.$$

Thus,

$$\big\langle b - a^*,\ \big(a^* + \eta\nabla_a E_m h(a^*,\omega)\big) - a^*\big\rangle \le 0,\ \forall b\in A \iff a^* = \mathrm{proj}_A\big(a^* + \eta\nabla_a E_m h(a^*,\omega)\big). \tag{21}$$

This completes the proof.

As a consequence we can derive the following existence result.

Proposition 9. Let the set of virtual population states $A$ be non-empty, convex and compact, and let the mapping $b\mapsto \nabla_b E_m h(b,\omega)$ be continuous. Then, there exists at least one equilibrium in $A$.

Proof. This is a direct application of the Brouwer-Schauder fixed-point theorem, which states that if $\phi: A\to A$ is continuous and $A$ is non-empty, convex and compact, then $\phi$ has at least one fixed point in $A$. Here we choose $\phi(a) = \mathrm{proj}_A\big(a + \eta\nabla_a E_m h(a,\omega)\big)$. Clearly $\phi(A)\subseteq A$, and $\phi$ is continuous on $A$ since the mapping $b\mapsto \nabla_b E_m h(b,\omega)$ and the projection operator $b\mapsto \mathrm{proj}_A(b)$ are both continuous. The announced result then follows. This completes the proof.

Note that we do not need sophisticated set-valued fixed-point theory to obtain this result.

Definition 8. The virtual population state $a^*$ is evolutionarily stable if $a^*\in A$ and, for any alternative deviant state $b\neq a^*$, there is an invasion barrier $\epsilon_b > 0$ such that

$$\big\langle a^* - b,\ \nabla_a E_m h\big(a^* + \epsilon(b - a^*),\ \omega\big)\big\rangle > 0, \quad \forall \epsilon\in(0, \epsilon_b).$$

The function $\varrho: A\times\mathbb{R}^n\times\{0,1\}^{n\times n}\to\mathbb{R}_+^{n\times n}$ is the revision protocol, which describes how virtual agents make decisions. The revision protocol $\varrho$ takes a population state $a$, the corresponding fitness $E_m h$, and the adjacency matrix $\Lambda$, and returns a matrix. Let $\varrho_{lk}(a, h, \Lambda)$ be the switching rate from the $l$th to the $k$th component. Then, the virtual agents selecting the strategy $l\in L$ have an incentive to migrate to the strategy $k\in L$ only if $\varrho_{lk}(a, h, \Lambda) > 0$, and it is also possible to design switching rates depending on the topology describing the migration constraints, i.e., $\lambda_{lk} = 0 \Rightarrow \varrho_{lk}(a, h, \Lambda) = 0$. Distributed distributionally robust optimization consists in performing the optimization problem above over a distributed network that is subject to communication restrictions. We construct a distributed distributionally robust game dynamics to perform such a task. The distributed distributionally robust evolutionary game dynamics emerge from the combination of the (robust) fitness $h$ and the constrained switching rates $\varrho$. The evolution of the proportion $a_l$ is given by the distributed distributionally robust mean dynamics

$$\dot a_l = \sum_{k\in L} a_k\,\varrho_{kl}(a, h, \Lambda) - a_l\sum_{k\in L}\varrho_{lk}(a, h, \Lambda), \quad l\in L, \tag{22}$$

Since the distributionally robust function $h$ is obtained from the payoff function $r$ after the transformation by means of triality theory, the dynamics (22) seeks a distributed distributionally robust solution (see the sketch after Algorithm 5).

Algorithm 5. The distributed distributionally robust mean dynamics pseudocode is as follows:

1: procedure PopulationInspired($a_0$, $\epsilon$, $T$, $\varrho$, $g$, $m$, $h$, $\Lambda$)    ▷ population-inspired learning starting from $a_0$ within $[0, T]$

2:       $a \leftarrow a_0$

3:       while regret $> \epsilon$ and $t \le T$ do    ▷ we have the answer if the regret is 0

4:             compute $a(t)$, the solution of (22)

5:             compute regret$(t)$

6:       end while

7:       return $a(t)$, regret$(t)$    ▷ get $a(t)$ and the regret

8: end procedure
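A minimal sketch of the mean dynamics (22) on a cycle graph. The revision protocol $\varrho_{lk} = \lambda_{lk}\max\{0, F_k - F_l\}$ used here is one simple pairwise-comparison choice consistent with the requirement $\lambda_{lk} = 0 \Rightarrow \varrho_{lk} = 0$; the fitness $F$ comes from an illustrative concave quadratic payoff:

```python
import numpy as np

n = 5
Lam = np.zeros((n, n))                   # adjacency matrix of the cycle graph
for l in range(n):
    Lam[l, (l + 1) % n] = Lam[(l + 1) % n, l] = 1.0

target = np.linspace(0.1, 0.3, n)        # targets summing to 1, like the state
fitness = lambda a: -2.0 * (a - target)  # gradient of -sum_l (a_l - target_l)^2

a = np.full(n, 1.0 / n)                  # initial population state
dt = 0.05
for _ in range(4000):
    F = fitness(a)
    rho = Lam * np.maximum(0.0, F[None, :] - F[:, None])  # rho[l, k]: rate l -> k
    a = a + dt * (rho.T @ a - a * rho.sum(axis=1))        # Eq. (22)

print(a, a.sum())    # total mass is conserved; a approaches the target profile
```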

The next example establishes evolutionarily stable states, equilibria and rest points of the dynamics (22) by designing $\varrho$.

Example 6. Let us consider a power system that is composed of 10 generators, i.e., let $L = \{1,\dots,10\}$. Let $a_l\in\mathbb{R}_+$ be the power generated by the generator $l\in L$. Each power generation should satisfy the physical and/or operational constraints $a_l\in[\underline a_l, \bar a_l]$ for all $l\in L$. It is desired to satisfy the power demand given by $d\in\mathbb{R}$, i.e., it is necessary to guarantee that $\sum_{l\in L} a_l = d$, i.e., the supply meets the demand. The objective is to minimize the quadratic generation costs over all the generators, i.e.,

$$\begin{aligned}
\text{Maximize } \ & r(a,\omega) = \sum_{l\in L} r_l(a_l) = -\sum_{l\in L}\big(c_{0l} + c_{1l}a_l + c_{2l}a_l^2\big),\\
\text{s.t. } \ & \sum_{l\in L} a_l = d, \qquad \underline a_l \le a_l \le \bar a_l, \quad \forall l\in L,
\end{aligned}$$

where $r:\mathbb{R}^n\to\mathbb{R}$ is concave, and the parameters are possibly uncertain and selected as $c_{0l} = 25 + 6l$, $c_{1l} = 15 + 4l + \omega_{1l}$, $c_{2l} = 5 + l + \omega_{2l}$, and $d = 20 + \omega_3$. Therefore, the fitness functions for the corresponding full potential game are given by $f_l(a) = -2a_l c_{2l} - c_{1l}$ for all $l\in L$, and the action space is given by

$$A = \left\{ a\in\mathbb{R}_+^n\ :\ \sum_{l\in L} a_l = d,\ a_l\in[\underline a_l, \bar a_l] \right\}.$$

The distributed revision protocol is set to

$$\varrho_{lk}(a, h, \Lambda) = \frac{\lambda_{lk}}{a_l}\,\max\{0,\ \bar a_k - a_k\}\,\max\{0,\ a_l - \underline a_l\}\,\max\{0,\ E_m[h_k - h_l]\},$$

for $a_l \neq 0$. We evaluate four different scenarios:

  1. $\underline a = 0_n$ and $\bar a = d\,\mathbb{1}_n$;

  2. $\underline a_l = 0$ for all $l\in L\setminus\{9,10\}$, $\underline a_9 = 1.1$, and $\underline a_{10} = 1$; and $\bar a_l = d$ for all $l\in L\setminus\{1,2\}$, $\bar a_1 = 3$, and $\bar a_2 = 2.5$;

  3. Case 1 constraints, with interaction restricted to the cycle graph $G = (L, E)$ with set of links $E = \{(l, l+1)\ |\ l\in L\setminus\{n\}\}\cup\{(n, 1)\}$;

  4. Case 2 constraints, with interaction restricted as in Case 3.

Figure 3 presents the evolution of the generated power, the fitness functions corresponding to the marginal costs, and the total cost. For the first scenario, the evolutionary game dynamics converge to a standard evolutionarily stable state in which $\hat f(a) = c\,\mathbb{1}_n$ for some constant $c$, i.e., the marginal costs equalize. In contrast, for the second scenario, the dynamics converge to a constrained evolutionarily stable state.

Figure 3.

Economic power dispatch. Evolution of the population states (generated power), the fitness functions $\hat f(a) = \nabla_a E[h(a,\omega)]$, and the costs $E[r(a,\omega)]$. Panels (a)-(c) correspond to Case 1, (d)-(f) to Case 2, (g)-(i) to Case 3, and (j)-(l) to Case 4.

4.1. Extension to multiple decision-makers

Consider a constrained game $G$ in strategic form, given by

  • $\mathcal{P} = \{1,\dots,P\}$ is the set of players; the cardinality of $\mathcal{P}$ is $P\ge2$.

  • Player $p$ has a decision space $A_p\subset\mathbb{R}^{n_p}$, $n_p\ge1$. Players are coupled through their actions and their payoffs. The set of all feasible action profiles is $A\subset\mathbb{R}^n$, with $n = \sum_{p\in\mathcal{P}} n_p$. Player $p$ can choose an action $a_p$ in the set $A_p(a_{-p}) = \{a_p\in A_p\ :\ (a_p, a_{-p})\in A\}$.

  • Player $p$ has a payoff function $r_p: A\to\mathbb{R}$.

We restrict our attention to the following constraints:

$$A_p = \left\{ a_p\in\mathbb{R}^{n_p}\ \middle|\ a_{pl}\in[\underline a_{pl}, \bar a_{pl}],\ l\in\{1,\dots,n_p\},\ \sum_{l=1}^{n_p} c_{pl}a_{pl} \le b_p \right\}.$$

The coupled constraint is

$$A = \left\{ a\ \middle|\ a_p\in A_p\ \ \forall p\in\mathcal{P},\ \ \sum_{p\in\mathcal{P}} \bar c_p^\top a_p \le \bar b \right\}.$$

Feasibility condition: if $\underline a_{pl} < \bar a_{pl}$ for all $l\in\{1,\dots,n_p\}$, $c_{pl} > 0$, $\sum_{l=1}^{n_p} c_{pl}\underline a_{pl} < b_p$, $\bar c_p\in\mathbb{R}_{>0}^{n_p}$, and $\sum_{p\in\mathcal{P}} \bar c_p^\top \underline a_p < \bar b$, then the constraint set $A$ is non-empty, convex and compact.

We propose a method to compute a constrained equilibrium that has full support (whenever it exists). We do not use the projection operator. Instead, we transform the domain via $[\underline a_{pl}, \bar a_{pl}] = \xi([0,1])$, where $\xi(x_{pl}) = \bar a_{pl}x_{pl} + \underline a_{pl}(1 - x_{pl}) = a_{pl}$. The map $\xi$ is one-to-one, and

$$x_{pl} = \xi^{-1}(a_{pl}) = \frac{a_{pl} - \underline a_{pl}}{\bar a_{pl} - \underline a_{pl}}\in[0,1],$$

$$\sum_{l=1}^{n_p} c_{pl}(\bar a_{pl} - \underline a_{pl})x_{pl} \ \le\ b_p - \sum_{l=1}^{n_p} c_{pl}\underline a_{pl} \ \triangleq\ \hat b_p.$$

The learning algorithm

$$\dot y_p = \big[\nabla_p^2 g\big]^{-1}\nabla_{a_p} r_p(a,\omega), \qquad a_{pl} = \bar a_{pl}x_{pl} + \underline a_{pl}(1 - x_{pl}), \qquad x_{pl} = \min\left\{1,\ \frac{e^{y_{pl}}}{\sum_{k=1}^{n_p} e^{y_{pk}}}\cdot\frac{\hat b_p}{c_{pl}(\bar a_{pl} - \underline a_{pl})}\right\}, \quad l\in\{1,\dots,n_p\}, \tag{23}$$

generates a trajectory $a_p(t) = \big(a_{pl}(t)\big)_l$ that satisfies the constraints of player $p$ at any time $t$.
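A minimal two-player sketch of the per-player dynamics (23), with $g(a) = \|a\|^2/2$ so the Hessian-inverse factor is the identity; the quadratic payoffs, boxes, costs and budgets are illustrative choices:

```python
import numpy as np

n_p = 2                                    # actions per player
lo, hi = np.zeros(n_p), np.ones(n_p)       # per-player box constraints
c, b_hat = np.ones(n_p), 0.8               # transformed per-player budget

def grad_r(a_own, a_other):
    # player's marginal payoff: pull toward 0.5, minus a coupling term
    return -2.0 * (a_own - 0.5) - 0.2 * a_other

y = [np.zeros(n_p), np.zeros(n_p)]
dt = 0.01
for _ in range(3000):
    x = [np.minimum(1.0, np.exp(yp) / np.exp(yp).sum()
                    * b_hat / (c * (hi - lo))) for yp in y]   # map of Eq. (23)
    a = [lo + (hi - lo) * xp for xp in x]
    y = [y[0] + dt * grad_r(a[0], a[1]),
         y[1] + dt * grad_r(a[1], a[0])]

print(a[0], a[1], [float(c @ ap) for ap in a])  # feasible, budgets respected
```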


5. Notes

The work in [10] provides a nice intuitive introduction to distributionally robust optimization, emphasizing the parallel with static optimization. Another nice treatment [11], focusing on the robust empirical risk minimization problem, is designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance [12, 13]. $f$-divergence-based performance evaluations are conducted in [11, 14, 15]. The connection between risk-sensitivity measures, such as the exponentiated payoff, and distributional robustness can be found in [16]. Distributionally robust optimization and learning are extended to multiple strategic decision-making problems, i.e., distributionally robust games, in [17, 18].


Acknowledgments

We gratefully acknowledge support from U.S. Air Force Office of Scientific Research under grant number FA9550-17-1-0259.

References

  1. Sion M. On general minimax theorems. Pacific Journal of Mathematics. 1958;8(1):171-176
  2. Bach FR. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization. 2015;25(1):115-129
  3. Kim D, Fessler JA. Optimized first-order methods for smooth convex minimization. Mathematical Programming. 2016;159(1):81-107
  4. Nesterov Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming. 2008;112(1):159-181
  5. Nesterov Y. Primal-dual subgradient methods for convex problems. Mathematical Programming. 2009;120(1):221-259
  6. Armijo L. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics. 1966;16(1):1-3
  7. Tembine H. Distributed Strategic Learning for Wireless Engineers. Boca Raton, FL, USA: CRC Press, Inc.; 2012. 496 p
  8. Taylor PD, Jonker L. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences. 1978;40:145-156
  9. Ben-Tal A, El Ghaoui L, Nemirovski A. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press; 2009. 576 p. ISBN-13: 978-0691143682
  10. Esfahani PM, Kuhn D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming. July 2017
  11. Duchi JC, Namkoong H. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems. 2016:2208-2216
  12. Ben-Tal A, den Hertog D, De Waegenaere A, Melenberg B, Rennen G. Robust solutions of optimization problems affected by uncertain probabilities. Management Science. 2013;59(2):341-357
  13. Ben-Tal A, Hazan E, Koren T, Mannor S. Oracle-based robust optimization via online learning. Operations Research. 2015;63(3):628-638
  14. Ahmadi-Javid A. An information-theoretic approach to constructing coherent risk measures. In: IEEE International Symposium on Information Theory Proceedings. 2011:2125-2127
  15. Ahmadi-Javid A. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications. 2012;155(3):1105-1123
  16. Föllmer H, Knispel T. Entropic risk measures: Coherence vs. convexity, model ambiguity, and robust large deviations. Stochastics and Dynamics. 2011;11:333-351
  17. Bauso D, Gao J, Tembine H. Distributionally robust games: f-divergence and learning. In: ValueTools, International Conference on Performance Evaluation Methodologies and Tools, Venice, Italy; December 5-7, 2017
  18. Tembine H. Dynamic robust games in MIMO systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2011;41(4):990-1002
