I give a clean and self-contained presentation of the multiplicative weights framework that is somewhat more direct than the most popular analysis (see the comparison below). The framework provides a simple algorithm for (approximately) solving many important problems (e.g., maximum flow, densest subgraph, linear classification, zero-sum games and even general linear programs) and is a beautiful demonstration of leveraging continuous analysis to solve discrete problems. Warning: this is a technical post.

All of the above problems can be written in the following simple form:

Symbol $A \in \mathbb{R}^{n \times m}$ represents a matrix, $x \in \mathbb{R}^m$ is a vector we are trying to optimize over, $\max : \mathbb{R}^n \to \mathbb{R}$ returns the maximum entry of a vector (e.g., $\max([1, 3, -4]^T) = 3$), and $K$ is some convex set (often called the easy constraints).

Example: Casting maximum flow in the above form. (click to expand) Suppose we are given an uncapacitated directed graph $G = (V, E)$ and we want to compute the maxflow between some $s, t \in V$. Suppose that the optimal value of this problem is $\OPT$ (i.e., there are $\OPT$ edge-disjoint paths between $s$ and $t$). We cast the problem in the above form. On a high-level, easy constraints $K$ will correspond to unit flows from $s$ to $t$. We first fix the representation of $K$: a flow $x \in K$ is a vector of dimension $\vert E \vert$, one scalar for each directed edge $e \in E$. Precisely, $x_e$ will represent the amount of flow that is pushed through $e$. This determines the representation. Now, define $K$ to be the convex hull of all simple paths from $s$ to $t$. On a side-note, this is equivalent to saying the flow $x \in K$ is conserving in $V \setminus \{s, t\}$, pushes 1 unit from $s$, and contains no circulations; equivalency is due to the integrality of the flow polytope. While it might be strange to call this polytope with exponential number of vertices “easy constraints”, we only interact with $K$ by optimizing a linear function over $K$. However, optimizing a linear function $f(x) = \inner{c, x} = \sum_{e \in E} c_e \cdot x_e$ over $x \in K$ is exactly the problem of finding a shortest path between $s$ and $t$ with edge costs being $c \in \mathbb{R}^{\vert E \vert}$. We want the objective $\max(Ax)$ of our form to correspond to the maximum amount of flow going through any edge in the graph (call this the congestion). Using the above representation for members of $K$, this means we set $A$ to be the identity matrix $A = I$. Minimizing congestion might seem counterintuitive (or at least unrelated) to maxflow which wishes to maximize the flow from $s$ to $t$. However, we claim that minimizing the congestion over $K$ is equivalent to maxflow. To show one side of the equivalence, if there are $\OPT$ disjoint paths we can find a unit flow with congestion $1/\OPT$. Moreover, one can easily show the converse holds (if there are unit-flows with smaller congestion it leads to more edge-disjoint paths) and hence solving the above problem solves the maximum flow. Of course, the optimal value in the disjoint-path formulation and congestion-minimizing formulation will be reciprocals of each other, but the problems are otherwise equivalent.

High-level comparison with other analyses: (click to expand) This post is inspirated by my personal struggles I had a few years back while trying to learn the multiplicative weights framework. Most popular analyses motivate the approach by the expert prediction algorithm [AHZ]. While the approach is intuitive by itself, my intuition completely dissapeared when using them to solve problems such as maximum flow. This is because the experts from [AHZ] essentially correspond to dual variables which are largely disconnected from the original (primal) problem. This analysis keeps the entire discussion in the primal. I have not seen this analysis written down anywhere, but I am sure researchers in the area are well-aware of it. The analysis is essentially equivalent to the Frank-Wolfe method of optimization applied to the log-sum-exp function (denoted below as $\smax$) over $x \in K$. One can say that multiplicative weights is an instance of the Frank-Wolfe method.

On a high-level, multiplicative weights smoothen the problem by replacing (regular) max with a smooth version of max. The smoothened problem is solved iteratively: the current solution $x_{t-1}$ is updated by linearizing the smooth max around $x_{t-1}$, finding the optimizer of the linearization $h_t$ via any black-box method and updating $x_{t} \gets x_{t-1} + h_t$. The following fact introduces the smooth max and a few of its simple-to-prove properties.

Fact (smooth max): Define $\smax : \mathbb{R}^n \to \mathbb{R}$ as where $\beta$ is some “accuracy” parameter (increasing $\beta$ increases accuracy but decreases smoothness). The following properties hold: $\smax$ approximates max: The gradient is some probability distribution over $[n]$: $$

abla \mathrm{smax}_{\beta}(x) = ( \frac{1}{Z} \exp(\beta \cdot x_i) )_{i=1}^n$$ where $Z := \sum_{i=1}^n \exp(\beta \cdot x_i)$ is the normalization factor $\smax_{\beta}$ is $\beta$-smooth with respect to $\dnorm{\cdot}_\infty$: , where $\pmb{0} = [0,0,\ldots,0]$.

We introduce and compare the original problem, the smoothened version and the linearized version around $x_{t-1}$.

$$\begin{align} \text{min. }\ & \mathrm{max} (Ax) \\ \text{s.t. }\ & x \in K \end{align}$$ $$\to$$ $$\begin{align} \text{min. }\ & \color{red}{\mathrm{s}}\mathrm{max}_{\color{red}{\beta}} (Ax) \\ \text{s.t. }\ & x \in K \end{align}$$ $$\to$$ $$\begin{align} \text{min. }\ & \inner{ A^T

abla \smax_{\beta}(A x_{t-1}), h_t } \\ \text{s.t. }\ & h_t \in K \end{align}$$

In the maxflow problem, the parameter $\beta$ is chosen to be approximately $\eps$. Let’s elaborate on the objective of the last problem. Let $\Phi(x) := \smax_{\beta}(Ax)$ be the objective of the smoothened (middle) problem. Then $\inner{ A^T

abla \smax_{\beta}(A x_{t-1}), h_t }$ is exactly the linearization of $\Phi(x)$ around $x_{t-1}$. This is because $\Phi(x_{t-1} + h_t) - \Phi(x_{t-1})$ is approximately $\inner{

abla \Phi(x_{t-1}), h_t }$ and $

abla \Phi(x_{t-1}) = A^T

abla \smax_{\beta}(A x_{t-1})$ by chain rule.

We can now present the full multiplicative weights algorithm. We repeat the linearize-solve-update loop for a total of $T$ rounds (to be specified later). Solving the linearized problem is done via some black-box method (called the oracle) that needs to be supplied to the algorithm upfront. Precisely, the oracle is supplied a vector $p \in \mathbb{R}^n$ and needs to return $\arg\min_{x \in K} \inner{ p, x }$. Luckily, the linearized problem is often much simpler than the original one and can be easily optimized.

How to choose the number of rounds $T$? It will mostly depend on two important parameters: the accuracy $\eps$ we require from the final solution, and a new parameter $\rho$ that we call the width of the oracle. The only requirement on the width is that $\rho$ must be larger than for any $x$ that can be returned by the oracle (e.g., $\max_{x \in K} \dnorm{Ax}_\infty$ suffices, but sometimes we can implement the oracle that achieves a better bound).

Algorithm (Multiplicative Weights): Input: matrix $A \in \mathbb{R}^{n \times m}$, oracle for the linearized problem, width $\rho$, accuracy $\eps$.

Define , $x_0 := 0$, $\Phi(x) := \smax_{\beta}(Ax)$, .

For $t = 1 \ldots T$ via the oracle $x_t := x_{t-1} + h_t \qquad$ (equivalent: $x_t = \sum_{i=1}^t h_i$)

Output $(1/T) \cdot x_T$





Example: Multiplicative weights for maximum flow. (click to expand) Suppose we want to solve maxflow between $s$ and $t$ with $\eps$ relative error. We assume for simplicity that the graph is directed and uncapacitated which allows us to set $\rho = 1$. Set $\beta$ and $T$ accordingly. Let $\OPT$ be the optimal value of the problem when cast in the aforementioned standard form (which is the reciprocal of the number of edge-disjoint paths between $s$ and $t$, note that the relative error is unchanged when taking reciprocals). Initialize a “congestion” vector $x := [0, 0, \ldots, 0] \in \mathbb{R}^{\vert E \vert}$ that remembers for each edge how many times has it been used. We repeat the following for $T$ rounds: for each directed edge $e$ we compute a cost $c_e := \exp(\beta x_i) > 0$. Normalize this vector of costs by dividing all entries by that makes their sum equal to $1$ (note: this is unnecessary, but we do it to stay true to the algorithm). Find the shortest path $P$ between $s$ and $t$ with respect to the edge costs $c$ (e.g., with a Dijkstra). Update $x$ by incrementing $x_e$ for each edge $e$ that was used in the shortest path $P$. After the above loop terminates, the collection of all shortest paths found throughout the algorithm provide us with $T$ paths that incur congestion of at most $T \cdot (OPT + \eps)$. Scaling by $1/T$, we find a unit-flow that incurs congestion $OPT + \eps$ and we are done. Note: in the capacitated version we would need to set $\rho := 1 / c_{\min}$, where $c_{\min}$ is the minimum positive edge capacity. This is a significant downside of the method and a long line of research has been developed in order to reduce this width.

Low-level comparison with other analyses: (click to expand) Many other authors take a dual approach of the one stated above, in which one maintains a set of “multiplicative weights” or “dual variables” $y \in \mathbb{R}^n$ (one for each constraint of $K$, i.e., row of $A$). The dual roughly tells us how “important” a constraint is (e.g., if $y_3 = 100$ and $y_7 = 1$, it would mean that constraint 3 gets violated much more often and more egregiously than constraint 7). Initially, $y \gets [1, 1, \ldots, 1]$ and then for each $h$ we multiplicatively update $y$ with a value proportional to the violation $Ah$ via the formula for $i \in [n]$. This multiplicative reweighting of the constraint importance is the reason behind the method’s name. The “linearize+solve” steps now correspond to solving $\min_{h \in K} \inner{y, A h} = \min_{h \in K} \inner{A^T y, h}$. Iteratively performing this procedure yields the same dynamics as the one described in the pseudocode above. Pattern-matching $\inner{A^T y, h}$ with $\inner{ A^T

abla \smax_{\beta}(A x_{t-1}), h }$ one can, correctly, presume that the connection between the methods is that $y =

abla \smax_{\beta}(A x_{t-1})$. This is indeed the case (up to a factor of proportionality that can be ignored), $

abla \smax(A x_{t-1})$ is a probability distribution proportional to $\exp(\beta A(h_1 + \ldots + h_{t+1}))$ (see property 2 of $\smax$ Fact); $y$ can be easily deduced to have the same value.

We now prove that the above algorithm works.

Theroem: Let $\OPT$ be the value of the optimal solution and $\tilde{x} := x_T/T$ be the output of the algorithm. Then $\tilde{x} \in K$ and $\max(A \tilde{x}) \le OPT + \eps$.

Proof: It is sufficient to show $\max(A x_T) \le T \cdot (OPT + \eps)$ since it is equivalent to $\max(A \tilde{x}) = \max(A x_T/T) \le OPT + \eps$, by scaling. Furthermore, note that solving the smoothened version of the problem $\Phi(x) = \smax_{\beta}(Ax)$ is “harder” than solving the original one since $\Phi(x) \ge \max(A x)$ by property 1 of Fact (smooth max). Therefore, it is sufficient to show $\Phi(x_T) \le T \cdot (OPT + \eps)$. Also, note that $x_T/T = \sum_{t=1}^T \frac{1}{T} h_t \in K$ since $K$ is convex and $h_t \in K$. We track the evolution of $\Phi(x_t) = \Phi(h_1 + h_2 + \ldots + h_t)$ by arguing about the increase $\Phi(x_{t-1} + h_t) - \Phi(x_{t-1})$ in each round. Up to first order terms, the increase is exactly $\inner{

abla \Phi(x_{t-1}), h_t}$, the linearized problem we are optimizing. We argue that this (approximate) increase is small, $\inner{

abla \Phi(x_{t-1}), h_t} \le \OPT$. However, this is immediate because the linearized problem is a “relaxation” of the original problem. Let $x_{\ast} \in K$, $\max(A x_{\ast}) = \OPT$ be the optimal solution. By definition of $\Phi$ we have that where $

abla \smax(\cdot)$ is some probability distribution. It is easy to see that for any probability distribution $p$, hence the oracle will always return a value not larger than $\OPT$. We argued that the first-order increase in $\Phi$ is $\inner{

abla \Phi(x_{t-1}), h_t} \le \OPT$. In reality, we have to account for higher-order errors via property 3 of Fact (smooth max). This allows us to effectively bound $\Phi$: Straighforward algebra with $T = \frac{\ln n}{\beta \cdot \eps}$ gives us that the right-hand side is at most $T \cdot (OPT + \eps)$ and we are done.





A few final points:

One can often get a better dependence on $\rho$ for special problems. E.g., one can get a linear instead of quadratic dependence of $\rho$ for positive packing and positive covering problems such as maxflow. See [AHZ] for details.

It would be interesting to rederive matrix multiplicative weights using an analogue of the above analysis.

The algorithm is numerically stable: it can be implemented using only addition and exponentiation of positive integers.

References: