Conic Reformulations for Kullback-Leibler Divergence Constrained Distributionally Robust Optimization and Applications

In this paper, we consider a distributionally robust optimization (DRO) model in which the ambiguity set is defined as the set of distributions whose Kullback-Leibler (KL) divergence to an empirical distribution is bounded. Utilizing the fact that KL divergence is an exponential cone representable function, we obtain the robust counterpart of the KL divergence constrained DRO problem as a dual exponential cone constrained program under mild assumptions on the underlying optimization problem. The resulting conic reformulation of the original optimization problem can be directly solved by a commercial conic programming solver. We specialize our generic formulation to two classical optimization problems, namely, the Newsvendor Problem and the Uncapacitated Facility Location Problem. Our computational study in an out-of-sample analysis shows that the solutions obtained via the DRO approach yield significantly better performance in terms of the dispersion of the cost realizations while the central tendency deteriorates only slightly compared to the solutions obtained by stochastic programming.


Introduction
Decision making under uncertainty is one of the most challenging tasks in operations research. Two paradigms are predominantly used in the literature to address uncertainty: stochastic programming and robust optimization. In classical stochastic programming [7,33], a predefined set of scenarios (or samples) is determined, either taken directly from observed data or after fitting an appropriate distribution. Then, the objective function is replaced with an expectation taken with respect to the random elements, and the constraints are replicated for each scenario. In addition to the assumption that the underlying distribution is known, this basic stochastic programming approach has some limitations: Firstly, the size of the resulting deterministic equivalent grows with the number of scenarios. Secondly, an expectation may not be an appropriate performance measure. Thirdly, satisfying the constraints for every scenario might be too restrictive. Respective remedies for these shortcomings have been proposed, such as sample average approximation to limit the problem size, risk-averse objective functions for a more appropriate performance measure, and chance constraints to allow constraint satisfaction with high probability. However, the implicit assumption of stochastic programming remains: the need to assume a distribution by analyzing the data or fitting one. Unfortunately, this step may not be performed satisfactorily in all cases.
In robust optimization [4,5,6], a predefined uncertainty set, which includes all possible values of the uncertain elements, is used. Then, the optimization is performed with respect to the worst possible realization from the uncertainty set. There are two main advantages of using robust optimization. Firstly, the decision maker does not need to make any assumptions about the distribution of the uncertain elements in the problem, as opposed to the stochastic programming approach. Secondly, the deterministic equivalent (the so-called robust counterpart) of the robust optimization problem typically has the same computational complexity as the deterministic version of the problem under reasonable assumptions on the uncertainty sets. On the other hand, the main disadvantage of the robust optimization approach is that, depending on the construction of the uncertainty set, it might lead to overly conservative solutions, which might perform poorly in terms of central tendency measures such as the expectation.
Distributionally robust optimization (DRO) is a relatively new paradigm that aims to combine stochastic programming and robust optimization approaches. The main modeling assumption of DRO is that some partial information about the distribution governing the underlying uncertainty is available, and the optimization is performed with respect to the worst distribution from an ambiguity set, which contains all distributions consistent with this partial information. There are mainly two streams in the DRO literature based on how the ambiguity set is defined: moment-based and distance-based.
In moment-based DRO, ambiguity sets are defined as the set of distributions whose first few moments are assumed to be known or constrained to lie in certain subsets. If certain structural properties hold for the ambiguity sets, such as convexity (or conic representability), then tractable convex (or conic) reformulations can be obtained [30,11,34]. Recently, chance constrained DRO problems have also drawn attention [37,35]. In distance-based DRO, ambiguity sets are defined as the set of distributions whose distance (or divergence) from a reference distribution is constrained. For Wasserstein distance [13,24,15,36] and φ-divergence [2,17,19] constrained DRO, tractable convex reformulations have been proposed.
As summarized above, in many cases, tractable convex robust counterparts or reformulations can be obtained for robust and DR optimization problems. However, an even more special structure, such as conic representability, is preferable whenever available. In particular, if the robust counterpart can be expressed as a conic program for which the underlying cone admits a self-concordant barrier, then efficient polynomial-time interior point methods can be applied directly [28]. This desired property holds for linear programs, second-order cone programs and semidefinite programs, which appear extensively in both the robust and DR optimization literature. We note that the efficiency of the conic programming solvers specialized in these three problem classes has improved considerably.
There is some recent interest in conic programs for which the underlying cone is not self-dual, such as the exponential cone. There are two main reasons: i) the exponential cone has extensive expressive power that is useful to model optimization problems involving the exponential and logarithm functions (see e.g. [32]), and ii) a practical implementation of a primal-dual interior point method has been developed [10], although its polynomial-time complexity has not been proven yet. Our paper will exploit both the expressive power of the exponential cone and the practical implementation that can be used to solve the resulting optimization problem, as detailed below.
The Kullback-Leibler (KL) divergence [18] is a popular divergence measure in information theory that can be used to quantify the divergence of one distribution from another (see Definition 8), and we prove that it is exponential cone representable (see Definitions 3 and 6, and Proposition 2). Although the robust counterpart of KL divergence constrained DRO is proven to be a tractable convex program [16], to the best of our knowledge, its exponential cone representability has not been exploited in the literature before. Also, its practical performance against stochastic programming has not been analyzed in detail except for a limited number of applications in power systems [8,21].
In this paper, we consider KL divergence constrained DRO problems and propose their dual exponential cone constrained reformulation under the mild assumption of conic representability. This allows us to solve the corresponding robust counterpart using a conic programming solver such as MOSEK [26]. We also present how the generic formulation can be specialized for two classical problems: Newsvendor and Uncapacitated Facility Location. Although the DRO methodology has been applied to variations of these problems [14,27,20,22,31,1], to the best of our knowledge, their KL divergence constrained versions have not been studied in detail. Our computational results suggest that solutions obtained via a DR approach give slightly higher cost realizations when central tendencies such as mean and median are considered compared to solutions obtained via stochastic programming in an out-of-sample analysis. However, the dispersion (measured by the standard deviation and range of the cost realizations) and the risk (measured by the average of worst cost realizations and the third quartile) metrics improve significantly with solutions obtained via a DR approach.
The rest of the paper is organized as follows: In Section 2, we review basic concepts from convex analysis and probability theory which serve as the basis of our main result about conic reformulation of KL divergence constrained DRO problems in Section 3. Then, we analyze two applications, namely, the Newsvendor Problem in Section 4 and the Uncapacitated Facility Location Problem in Section 5, and present the results of our computational study. Finally, we conclude our paper in Section 6.

Preliminaries
Before stating our main result in Section 3, we will first review some important concepts from convex analysis in Section 2.1 and probability theory in Section 2.2.

Convex Analysis
For a set X ⊆ R^m, we denote its interior as int(X), its relative interior as ri(X) and its closure as cl(X). We use the shorthand notation [n] for the set {1, . . . , n}.
We will first review some basic concepts from convex analysis related to cones.
Definition 1 (Regular cone). A cone K ⊆ R^m is called regular if it is closed, convex, pointed (i.e., K ∩ −K = {0}) and has nonempty interior.

Examples of regular cones include the nonnegative orthant, the Lorentz (or second-order) cone and the cone of positive semidefinite matrices. We will refer to these cones as canonical cones in this paper.
Definition 2 (Dual cone). The dual cone to a cone K ⊆ R^m is defined as

K^* := {y ∈ R^m : x^T y ≥ 0 for all x ∈ K}.

It is well-known that the dual cone to a regular cone is also regular. In addition, the three canonical cones mentioned above are self-dual.
We will now define the exponential cone, which is the key ingredient of this paper.
Definition 3 (Exponential cone). The exponential cone, denoted as K_exp, is defined as

K_exp := cl{(x_1, x_2, x_3) ∈ R^3 : x_1 ≥ x_2 e^{x_3/x_2}, x_2 > 0}.

As opposed to the three canonical cones mentioned above, the exponential cone is not self-dual although it is a regular cone.
Proposition 1 (See e.g. [32]). The dual cone to the exponential cone (or simply the dual exponential cone) is given as

K_exp^* = cl{(u_1, u_2, u_3) ∈ R^3 : u_1 ≥ −u_3 e^{u_2/u_3 − 1}, u_3 < 0}.

The following definitions are instrumental in the description of conic programming problems:

Definition 4 (Conic inequality). A conic inequality with respect to a regular cone K is defined as x ⪰_K y, meaning that x − y ∈ K. We will denote the relation x ∈ int(K) alternatively as x ≻_K 0.
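For concreteness, membership in these two cones can be checked numerically as follows. This is a small sketch under the conventions stated above; the tolerance handling and the treatment of the closure (boundary) cases are ours, not part of the definitions.

```python
import math

def in_exp_cone(x1, x2, x3, tol=1e-9):
    """Check (x1, x2, x3) in K_exp = cl{x1 >= x2*exp(x3/x2), x2 > 0}."""
    if x2 > tol:
        return x1 >= x2 * math.exp(x3 / x2) - tol
    # boundary part of the closure: x2 = 0, x3 <= 0, x1 >= 0
    return abs(x2) <= tol and x3 <= tol and x1 >= -tol

def in_dual_exp_cone(u1, u2, u3, tol=1e-9):
    """Check (u1, u2, u3) in K_exp* = cl{u1 >= -u3*exp(u2/u3 - 1), u3 < 0}."""
    if u3 < -tol:
        return u1 >= -u3 * math.exp(u2 / u3 - 1.0) - tol
    # boundary part of the closure: u3 = 0, u1 >= 0, u2 >= 0
    return abs(u3) <= tol and u1 >= -tol and u2 >= -tol

print(in_exp_cone(math.e, 1.0, 1.0))        # True: e >= 1*exp(1)
print(in_dual_exp_cone(1.0, 0.0, -1.0))     # True: 1 >= exp(-1)
```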

Definition 5 (Conic representability of a set). A set X ⊆ R^n is called conic representable if it can be expressed as

X = {x ∈ R^n : ∃u ∈ R^k : Ax + Bu ⪰_K c}

for some appropriately chosen regular cone K, matrices A, B and vector c. If any of the variables in this representation are integer, then X is called a mixed-integer conic representable set.

Definition 6 (Conic representability of a function). A function is called conic representable if its epigraph is a conic representable set.
In this paper, when we say that "a set or function is conic representable", we will implicitly assume that the cone used in its representation is either one of the three canonical cones or the (dual) exponential cone.

Probability Theory
Definition 7 (Probability simplex). The probability simplex in dimension S is denoted as

∆_S := {p ∈ R^S_+ : Σ_{s∈[S]} p_s = 1}.

The following function can be used to measure the "divergence" of one distribution from another.
Definition 8 (KL Divergence). For two discrete probability distributions p ∈ ∆_S and q ∈ ri(∆_S), the KL divergence from p to q is defined as

D_KL(p||q) := Σ_{s∈[S]} p_s log(p_s/q_s),

with the convention 0 log 0 = 0. We note that the KL divergence does not define a distance metric between two probability distributions since it is not symmetric. However, it has the following useful property.
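A direct implementation of Definition 8, with the usual convention 0 log 0 = 0, is straightforward:

```python
import math

def kl_divergence(p, q):
    """D_KL(p||q) = sum_s p_s * log(p_s / q_s), with the convention 0*log(0) = 0.
    Requires q_s > 0 for all s, i.e. q in the relative interior of the simplex."""
    assert all(qs > 0 for qs in q)
    return sum(ps * math.log(ps / qs) for ps, qs in zip(p, q) if ps > 0)

print(kl_divergence([1.0, 0.0], [0.5, 0.5]))   # = log 2, approximately 0.693
```

The asymmetry noted above is easy to observe: `kl_divergence([0.8, 0.2], [0.5, 0.5])` and `kl_divergence([0.5, 0.5], [0.8, 0.2])` differ.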
Proposition 2. The function D_KL(p||q) is exponential cone representable.

Proof. Due to Definition 6, it suffices to show that the epigraph of the function D_KL(p||q) is an exponential cone representable set. Since the set {(x, y, t) ∈ R^3 : t ≥ x log(x/y)} has the exponential cone representation (y, x, −t) ∈ K_exp [25], we obtain an exponential cone representation for the function D_KL(p||q) as follows:

{(p, q, t) : t ≥ D_KL(p||q)} = {(p, q, t) : ∃δ ∈ R^S : Σ_{s∈[S]} δ_s ≤ t, (q_s, p_s, −δ_s) ∈ K_exp, s ∈ [S]}.

The following proposition gives an upper bound on the KL divergence of a given distribution from any other distribution.
Proposition 3. For a given q ∈ ri(∆_S), it holds that sup_{p∈∆_S} D_KL(p||q) = max_{s∈[S]} log(1/q_s).

Proof. Notice that the objective function of the optimization problem sup_{p∈∆_S} {D_KL(p||q)} is convex and its feasible region is a polytope. Therefore, there exists an optimal solution which is an extreme point of ∆_S. Note that the extreme points of ∆_S are the unit vectors of R^S, denoted by e_s for s ∈ [S]. Then, we have

sup_{p∈∆_S} D_KL(p||q) = max_{s∈[S]} D_KL(e_s||q) = max_{s∈[S]} log(1/q_s).

In the second equality, we use the fact that lim_{x→0+} x log(x/y) = 0 for y > 0. Proposition 3 is useful to quantify the ambiguity sets in KL divergence constrained DRO problems, as we will see later.
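The claim of Proposition 3 can be spot-checked numerically: the bound max_s log(1/q_s) is attained at a unit vector and dominates the KL divergence of random points of the simplex. The distribution q below is illustrative.

```python
import math, random

def kl(p, q):
    # D_KL(p||q) with the convention 0*log(0/q) = 0
    return sum(ps * math.log(ps / qs) for ps, qs in zip(p, q) if ps > 0)

q = [0.2, 0.3, 0.5]
bound = max(math.log(1.0 / qs) for qs in q)   # Proposition 3: here log 5

# the bound is attained at the unit vector of the least likely outcome under q
assert abs(kl([1.0, 0.0, 0.0], q) - bound) < 1e-12

# and no randomly drawn point of the simplex exceeds it
random.seed(0)
for _ in range(1000):
    w = [random.expovariate(1.0) for _ in q]
    p = [wi / sum(w) for wi in w]
    assert kl(p, q) <= bound + 1e-12
```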

Main Results
In this section, we present our main result about the reformulation of a KL divergence constrained DRO problem as a conic program under mild conditions.

Generic Problem Formulation
We first give the generic problem setting considered in this paper. Suppose that there are m random variables ξ_i ∈ R, i ∈ [m], each with a discrete distribution q_i ∈ ri(∆_{S_i}) estimated from the historical data, with realizations d_s^i, s ∈ [S_i]. Under this probabilistic setting, we define the ambiguity set

P_i(q_i, ε_i) := {p_i ∈ ∆_{S_i} : D_KL(p_i||q_i) ≤ ε_i},

where ε_i ∈ R_+ controls the divergence from the historical data (or the robustness level). Then, we consider the following KL divergence constrained DRO problem, where each expectation is taken with respect to an ambiguous distribution p_i ∈ P_i(q_i, ε_i):

min_{y∈Y} h(y) + Σ_{i∈[m]} sup_{p_i ∈ P_i(q_i, ε_i)} E_{p_i}[H_i(y, ξ_i)].   (2)

In problem (2), h is a real-valued function defined on R^n; H_i is a real-valued function defined on R^n × R; and Y is a subset of R^n. Observe that given the y decisions, the inner maximization problem is decomposable over the random elements ξ_i, i ∈ [m].

Robust Counterpart and Conic Reformulation
We will now obtain the robust counterpart [5] of problem (2) utilizing Conic Duality under mild conditions.

Theorem 1. Consider the KL divergence constrained DRO problem (2) as described in Section 3.1, and assume that ε_i > 0, i ∈ [m]. Then, the robust counterpart is given as follows:

min_{y, α, β, u, v}  h(y) + Σ_{i∈[m]} [ ε_i α_i + β_i + Σ_{s∈[S_i]} q_s^i u_s^i ]   (3a)
s.t.  (u_s^i, v_s^i, −α_i) ∈ K_exp^*,  s ∈ [S_i], i ∈ [m]   (3b)
      H_i(y, d_s^i) + v_s^i ≤ β_i,  s ∈ [S_i], i ∈ [m]   (3c)
      α_i ≥ 0, i ∈ [m],  y ∈ Y.   (3d)

Proof. We will start the proof by analyzing the inner maximization problems. Given a vector y ∈ Y, let us write the i-th inner maximization problem explicitly as the following exponential cone constrained program:

max_{p^i, δ^i}  Σ_{s∈[S_i]} p_s^i H_i(y, d_s^i)   (4a)
s.t.  Σ_{s∈[S_i]} δ_s^i ≤ ε_i   (4b)
      Σ_{s∈[S_i]} p_s^i = 1   (4c)
      (q_s^i, p_s^i, −δ_s^i) ∈ K_exp,  s ∈ [S_i]   (4d)
      p_s^i ≥ 0,  s ∈ [S_i].   (4e)

Here, constraints (4b)-(4e) model the relation p_i ∈ P_i(q_i, ε_i), as stated in Proposition 2.
Recall that ε_i > 0 for each i ∈ [m]. Then, each inner maximization problem (4) satisfies essential strict feasibility [3] (e.g. consider p_s^i = q_s^i and δ_s^i = ε_i/|S_i| for s ∈ [S_i]), and its optimal value is bounded above (e.g. by max_{s∈[S_i]} H_i(y, d_s^i)). Therefore, strong duality holds between problem (4) and its conic dual, given as follows:

min_{α_i, β_i, u^i, v^i, w^i}  ε_i α_i + β_i + Σ_{s∈[S_i]} q_s^i u_s^i   (5a)
s.t.  H_i(y, d_s^i) + v_s^i ≤ β_i,  s ∈ [S_i]   (5b)
      w_s^i = −α_i,  s ∈ [S_i]   (5c)
      (u_s^i, v_s^i, w_s^i) ∈ K_exp^*,  s ∈ [S_i]   (5d)
      α_i ≥ 0.   (5e)

Here, α_i, β_i and (u_s^i, v_s^i, w_s^i) are the dual variables associated with the primal constraints (4b), (4c) and (4d), respectively. Notice that problem (5) is a dual exponential cone constrained program.
As the final step in the proof, we write the dual of each inner maximization problem and obtain the robust counterpart of problem (2) as problem (3).
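This strong duality can also be checked numerically. The sketch below uses the well-known single-variable form of the dual of the KL-constrained worst-case expectation, obtained from problem (5) by optimizing out β and the cone variables; the data H, q and ε are illustrative values, not from any instance in this paper.

```python
import math
import numpy as np
from scipy.optimize import minimize_scalar

def worst_case_expectation(H, q, eps):
    """Dual of sup{ E_p[H] : D_KL(p||q) <= eps } after eliminating beta:
       min_{alpha > 0}  alpha*eps + alpha*log sum_s q_s exp(H_s/alpha).
    The maximum of H is subtracted inside the exponential for numerical stability."""
    H, q = np.asarray(H, float), np.asarray(q, float)
    m = H.max()
    dual = lambda a: a * eps + m + a * math.log(np.sum(q * np.exp((H - m) / a)))
    res = minimize_scalar(dual, bounds=(1e-8, 1e3), method="bounded")
    return res.fun, res.x

H = np.array([1.0, 3.0, 7.0])     # cost realizations H(y, d_s) for a fixed y
q = np.array([0.5, 0.3, 0.2])     # empirical distribution
eps = 0.1
val, alpha = worst_case_expectation(H, q, eps)

# primal optimality certificate: the worst-case distribution is an
# exponential tilting of q, and the KL budget is tight at the optimum
p = q * np.exp(H / alpha); p /= p.sum()
assert abs(np.sum(p * np.log(p / q)) - eps) < 1e-4
assert abs(p @ H - val) < 1e-4
assert val >= q @ H - 1e-9        # worst case dominates the nominal expectation
```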
We will now discuss the consequences of Theorem 1 under additional structural properties such as convexity and conic representability.

Corollary 1. Consider the KL divergence constrained DRO problem (2) as described in Theorem 1. In addition, let us assume that Y is a convex set, and that h(y) and H_i(y, ξ_i) are convex functions of y, i ∈ [m]. Then, the robust counterpart (3) is a convex program.
Corollary 2. Consider the KL divergence constrained DRO problem (2) as described in Theorem 1. In addition, let us assume that Y is a (mixed-integer) conic representable set, h(y) and H i (y, ξ i ) are conic representable functions, i ∈ [m]. Then, the robust counterpart (3) is a dual exponential cone constrained (mixed-integer) program.
As an application of Corollary 2, we will consider the Newsvendor Problem in Section 4 and the Uncapacitated Facility Location Problem in Section 5. The common characteristic of these two problems is that the set Y is a mixed-integer linear set, the function h is a linear function and the functions H i are the maxima of linear functions (hence, they are polyhedrally representable).

Application to the Newsvendor Problem
In this section, we analyze a toy example, the KL divergence constrained DR version of the single-period, single-product Newsvendor Problem. In this case, since there is only one random variable ξ (that is, m = 1), representing the unknown demand, we will omit the superscript i for convenience.

Problem Formulation
Consider the generic formulation (2) with the following specifications: We let y ∈ Y := Z_+ be the order quantity, and consider the function

H(y, ξ) := c_b (ξ − y)^+ + c_h (y − ξ)^+,

where c_b is the back-order penalty for unsatisfied demand and c_h is the inventory holding cost. Notice that H(y, ξ) is a piecewise linear convex function of y and can be rewritten as

H(y, ξ) = max{c_b (ξ − y), c_h (y − ξ)}.

This observation will be useful to linearize constraint (3c).
By omitting the i indices and simplifying the notation of problem (3) by taking into account the special structure of the Newsvendor Problem, we obtain the following dual exponential cone constrained MIP as its robust counterpart:

min  εα + β + Σ_{s∈[S]} q_s u_s   (6a)
s.t.  (u_s, v_s, −α) ∈ K_exp^*,  s ∈ [S]   (6b)
      c_b (d_s − y) + v_s ≤ β,  s ∈ [S]   (6c)
      c_h (y − d_s) + v_s ≤ β,  s ∈ [S]   (6d)
      α ≥ 0,  y ∈ Z_+.   (6e)

Experimental Setup
To study the effect of the robustness level in the KL divergence constrained DR version of the Newsvendor Problem, we propose Algorithm 1. Note that setting ε = 0 in problem (6) reduces it to a stochastic program, while larger values of ε indicate more robustness (and conservativeness) in the solutions.

Algorithm 1. Input: A probability distribution D, the number of samples R, the set of robustness levels T .
1: Sample R random variates from D for training, and obtain the empirical distribution q and the maximum KL divergence ε(q) as computed in Proposition 3.
2: Solve problem (6) with ε := θε(q) for each θ ∈ T to obtain a decision y*(θ).
3: Sample R random variates from D for testing, and then compute the cost realizations for each realization under the decision y*(θ), θ ∈ T .
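For small instances, Algorithm 1 can be prototyped without a conic solver: since each inner worst-case expectation admits a one-dimensional dual, the integer order quantity can simply be enumerated. The sketch below makes several assumptions not fixed in the text (cost coefficients c_b = 2, c_h = 1, Binomial(10, 0.5) demand, and the one-dimensional dual in place of the conic MIP (6)); it is a prototype, not the implementation used in our study.

```python
import math
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
c_b, c_h = 2.0, 1.0      # hypothetical back-order and holding costs
R = 100                  # training / test sample size

def worst_case(H, q, eps):
    # 1-D dual of sup{E_p[H] : D_KL(p||q) <= eps}, for eps > 0
    m = H.max()
    f = lambda a: a * eps + m + a * math.log(np.sum(q * np.exp((H - m) / a)))
    return minimize_scalar(f, bounds=(1e-8, 1e3), method="bounded").fun

# Step 1: training sample -> empirical distribution q on support d
train = rng.binomial(10, 0.5, size=R)
d, counts = np.unique(train, return_counts=True)
q = counts / R
eps_bar = max(math.log(1.0 / qs) for qs in q)   # Proposition 3

def objective(y, theta):
    H = np.maximum(c_b * (d - y), c_h * (y - d)).astype(float)
    return float(q @ H) if theta == 0.0 else worst_case(H, q, theta * eps_bar)

# Step 2: solve the DR newsvendor by enumerating the integer order quantity
def dr_order(theta):
    return min(range(0, 11), key=lambda y: objective(y, theta))

# Step 3: out-of-sample cost realizations on a fresh test sample
test = rng.binomial(10, 0.5, size=R)
for theta in (0.0, 0.05, 0.10):
    y = dr_order(theta)
    costs = np.maximum(c_b * (test - y), c_h * (y - test))
    print(f"theta={theta:.2f}  y*={y}  avg={costs.mean():.2f}  sd={costs.std():.2f}")
```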
We implement Algorithm 1 in the Python programming language and use MOSEK 9.2 [26] to solve the dual exponential cone constrained MIP problem (6).

Results
For this illustration, we choose the following cost coefficients: We will now specify the parameters of Algorithm 1. Firstly, we experiment with three different discrete distributions:
• Discrete Uniform Distribution with parameters 0 and 10, denoted as Uniform(0, 10).
• Binomial Distribution with parameters 10 and 0.5, denoted as Binomial(10, 0.5).
• Poisson Distribution with parameter 5, denoted as Poisson(5).
We sample R = 100 random variates separately to obtain the "training" and "test" datasets. Then, we repeat the experiments for a set of "robustness" levels θ ∈ T. The summary statistics of our experiments are reported in Tables 1-3 for the Uniform, Binomial and Poisson distributions, respectively. In particular, we report the average and standard deviation of the cost realizations, abbreviated as "Avg." and "St. Dev.", respectively. In addition, we compute the average of the worst 10% of the realizations, abbreviated as "Worst 10%", to quantify the risk. We observe that as the robustness level θ increases, the optimal order quantity y* increases (recall that θ = 0.00 corresponds to the stochastic programming approach). Moreover, with increasing θ, the average cost increases while the standard deviation and the average of the worst realizations decrease for each distribution. This is expected behavior when robust optimization is utilized. We note that the Binomial distribution is the least sensitive with respect to θ, as the order quantity (and the performance measures) do not change for θ ≥ 0.05. On the other hand, the Uniform and Poisson distributions are more sensitive with respect to this parameter.
We also repeat the experiments with even higher values of θ and observe that only the results corresponding to the Poisson distribution change, which we attribute to its right-skewness. However, the order quantities in those cases are very high, which results in overly conservative policies and deteriorated performance measures.

In addition to the summary statistics, we also provide the box plots of the cost realizations in Figures 1-3 for the Uniform, Binomial and Poisson distributions, respectively. We observe that as the robustness level θ increases, the median of the cost realizations increases while the range shrinks for each distribution. We also note that the maximum and upper quartile values decrease for θ ∈ [0.05, 0.15]. This is a desired property since it implies that the risk of the stochastic programming approach (θ = 0.00) can be lowered.

Application to the Uncapacitated Facility Location Problem
In this section, we analyze the KL divergence constrained DR version of the Uncapacitated Facility Location (UFL) Problem.

Deterministic Version
We first remind the reader of the deterministic version of the well-known UFL Problem. Suppose that we have m customers, each with demand d_i, i ∈ [m]. The demand must be satisfied by opening new facilities. There are n potential facilities, each with a fixed cost f_j, j ∈ [n]. The unit transportation cost between customer i and facility j is given as t_ij. The objective is to minimize the total fixed cost and transportation cost.
The UFL Problem can be modeled as an integer program by defining two sets of binary decision variables. The first set of decision variables, denoted as y_j, represents the status of each facility j, and the second set of decision variables, denoted as x_ij, represents the assignment of customer i to facility j. The complete model is given as follows:

min  Σ_{j∈[n]} f_j y_j + Σ_{i∈[m]} Σ_{j∈[n]} d_i t_ij x_ij   (7a)
s.t.  Σ_{j∈[n]} x_ij = 1,  i ∈ [m]   (7b)
      x_ij ≤ y_j,  i ∈ [m], j ∈ [n]   (7c)
      x_ij, y_j ∈ {0, 1},  i ∈ [m], j ∈ [n].   (7d)

Here, constraint (7b) guarantees that each customer is served by exactly one facility while constraint (7c) ensures that each customer is served by an open facility. We point out two useful observations about the UFL Problem. Firstly, in any feasible solution to problem (7), at least one facility must be opened, hence we must have

Σ_{j∈[n]} y_j ≥ 1.   (8)
Secondly, given any optimal y* vector, the optimal objective function value can be computed as

Σ_{j∈[n]} f_j y_j* + Σ_{i∈[m]} d_i min{t_ij : y_j* = 1},   (9)

since each customer can be served by the closest open facility.
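This closed-form evaluation is easy to implement. The sketch below uses a small hypothetical instance; the costs and demands are illustrative, not those of our computational study in Section 5.

```python
import numpy as np

def ufl_cost(y, f, d, t):
    """Objective value of a feasible UFL solution y with at least one open
    facility: fixed costs plus each customer's demand routed to its
    closest open facility."""
    open_ = np.flatnonzero(y)
    assert open_.size >= 1, "at least one facility must be open"
    return f[open_].sum() + (d * t[:, open_].min(axis=1)).sum()

f = np.array([10.0, 5.0, 10.0])           # facility fixed costs
d = np.array([3.0, 1.0, 2.0, 4.0])        # customer demands
t = np.array([[1.0, 2.0, 3.0],            # unit transport costs t[i, j]
              [2.0, 1.0, 2.0],
              [3.0, 2.0, 1.0],
              [2.5, 1.5, 0.5]])

print(ufl_cost(np.array([0, 1, 0]), f, d, t))   # only the cheap middle facility -> 22.0
print(ufl_cost(np.array([1, 0, 1]), f, d, t))   # the two outer facilities -> 29.0
```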

Distributionally Robust Version
Now, suppose that we replace the deterministic demand d_i with a random variable ξ_i having an empirical distribution q_i ∈ ri(∆_{S_i}), with realizations d_s^i, s ∈ [S_i]. Then, the DR version of the UFL Problem can be modeled as an instance of the generic model (2) with Y := {y ∈ {0, 1}^n : Σ_{j∈[n]} y_j ≥ 1}, h(y) := Σ_{j∈[n]} f_j y_j and H_i(y, ξ_i) := ξ_i min_{j: y_j = 1} {t_ij}, i = 1, . . . , m, due to (9). In the remainder of this subsection, we will obtain the robust counterpart of the KL divergence constrained DR UFL Problem as a dual exponential cone constrained MIP with the help of Lemma 1.

A Lemma
The following lemma will be critical to linearize constraint (3c). Proof. Let y ∈ {0, 1}^n satisfying (8) be given. We define the following nonempty sets T := {j : y_j = 1} and T* := argmin{t_j : j ∈ T}. Notice that g(y) = t_l for l ∈ T*. Also, let us define the quantity for convenience. Notice that we have This observation helps us to rewrite z_l as Now, we will look at the following cases to compute or bound z_l: Case 1: Let l* ∈ T*. Then, we have z_{l*} = t_{l*}. Case 2: Let l ∉ T*, and choose any j* ∈ T*. Then, we have This analysis indicates that where l* ∈ T*. Hence, we conclude that equation (10) holds true.
An alternative proof of Lemma 1 can be obtained via LP duality: First, one would write the problem min{t_j : y_j = 1} as an IP by introducing additional binary variables x_j. Second, this IP can be relaxed to an LP due to its totally unimodular structure. Then, the extreme points of the feasible region of the dual LP can be characterized, enabling the dual LP to be solved in closed form (see the dual-based arguments in [12,9]).
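The total unimodularity step of this alternative argument can be verified numerically: the LP relaxation below attains its optimum at an integral vertex, so its value coincides with the combinatorial minimum min{t_j : y_j = 1}. The instance data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def closest_open_lp(t, y):
    """LP relaxation of min{t_j : y_j = 1}:
         min t'x  s.t.  sum_j x_j = 1,  0 <= x_j <= y_j.
    The constraint matrix is totally unimodular, so the LP optimum is
    attained at an integral vertex and equals the combinatorial minimum."""
    n = len(t)
    res = linprog(c=t, A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, float(yj)) for yj in y], method="highs")
    return res.fun

t = np.array([4.0, 1.0, 3.0, 2.0])
y = np.array([1, 0, 1, 1])
assert abs(closest_open_lp(t, y) - min(t[y == 1])) < 1e-9   # both equal 2.0
```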

The Final Formulation
Taking into account the special structure of the UFL Problem and utilizing Lemma 1 by setting g := H_i for each i ∈ [m], we obtain the following dual exponential cone constrained MIP:

Results
For this illustration, we assume that there are 12 customers and three potential facilities located in the unit interval. Their locations are shown in Figure 4. As evident from the figure, the potential facilities are placed evenly across the unit interval and there are two clusters of customers, which are also distributed evenly in their respective regions. The fixed costs of opening the facilities are given as f_1 = f_3 = 10 for the two facilities in the middle of these clusters (marked by a large diamond), and f_2 = 5 for the other facility (marked by a small diamond). Finally, the unit transportation cost between a facility-customer pair is assumed to be equal to their distance from each other. We specify the parameters of Algorithm 2 regarding the generation of random variates similarly to those of Algorithm 1, as described in Section 4.2.2.
The summary statistics of our experiments are reported in Tables 4-6 for the Uniform, Binomial and Poisson distributions, respectively. We first observe that the optimal solutions and the performance measures are similar for every distribution, therefore, we will summarize our observations together. Due to the choice of parameters and the locations of the facilities and customers, as can be seen from Figure 4, there is a fundamental trade-off in this instance: We can i) either open a single facility at the middle of the line segment with the lower fixed cost and serve customers via longer distances, ii) or open two facilities at the middle of the two customer clusters with higher fixed cost and serve customers via shorter distances. In the stochastic programming approach (θ = 0.00), the first policy becomes optimal whereas in the DR approach (θ ≥ 0.05), the second policy becomes optimal. We note that considering the ambiguity of the demand distributions increases the average cost only slightly, whereas both the standard deviation and the average of the worst 10% of the realizations decrease significantly. We remind the reader that the total fixed cost of the stochastic programming approach is only 5 while the fixed cost of the DR approach is 20. This also shows that the corresponding transportation cost, which is affected by the random uncertainty, is significantly smaller in the DR approach.

In addition to the summary statistics, we also provide the box plots of the cost realizations in Figures 5-7 for the Uniform, Binomial and Poisson distributions, respectively. We observe that the median of the cost realizations either stays the same or increases slightly in the DR approach while the range shrinks significantly compared to the stochastic programming approach for each distribution. We also note that the maximum and upper quartile values decrease considerably with the DR approach (especially for the Binomial and Poisson distributions).

Conclusion
In this paper, we analyzed KL divergence constrained DRO problems and proposed their dual exponential cone constrained reformulations utilizing the exponential cone representability of the KL divergence and Conic Duality. The resulting robust counterpart can be solved directly by a commercial conic programming solver. We specialized our results to the Newsvendor and UFL Problems by providing problem-specific reformulations, and conducted a computational analysis comparing, from different aspects, the performance of the solutions obtained via the DR approach and via stochastic programming. We observed that although the mean and median of the cost realizations deteriorate slightly when the DR approach is preferred, the range, standard deviation and worst-case values of the cost realizations improve significantly compared to the stochastic programming approach.
Two future research directions seem promising: We would like to test the success of the proposed method on different problems with real datasets, and adapt our results to the decision-dependent setting, which is a recent active research area in the DRO literature [29,1,23].