## Notes of Claire Mathieu’s lecture nr 5

Scribe: Eric Colin de Verdière

Here is the plan for today:

1. Duality in linear programming,
2. duality in semidefinite programming,
3. application to correlation clustering (work by Warren Schudy and
Claire Mathieu, SODA’10).

1. Duality

As an example, let us consider the following linear program (taken from
Vašek Chvátal’s book):

${\max (4x_1+x_2+5x_3+3x_4)}$

subject to ${x_i\ge0}$ for each ${i}$ and:

${\left\{\begin{array}{rcll}x_1-x_2-x_3+3x_4 & \leq & 1 & (1)\\5x_1+x_2+3x_3+8x_4 & \leq & 55 & (2)\\-x_1+2x_2+3x_3-5x_4 & \leq &3 & (3)\\\end{array}\right.}$

To prove a lower bound on the value ${OPT}$ of this program, it suffices to
find a feasible solution: ${(1,0,1,0)}$ is feasible and has value 9, so
${OPT\geq9}$.

Can we prove an upper bound on ${OPT}$? Yes! The second constraint,
multiplied by two, reads ${10x_1+2x_2+6x_3+16x_4\leq 110}$, and its LHS is
at least the value of the objective (since the ${x_i}$'s are nonnegative).
In other words, the second constraint, multiplied by two, has each
coefficient at least as large as the corresponding one in the objective
function. So ${OPT\leq 110}$.

We may try some other combinations. (2)+(3) gives a proof that ${OPT\leq 55+3=58}$.

More generally, consider a nonnegative linear combination of the
constraints: ${(1)\times y_1+(2)\times y_2+(3)\times y_3}$ for nonnegative
${y_i}$'s. The condition that each coefficient is at least as large as the
corresponding one in the objective function rewrites as:

${\left\{\begin{array}{rcl} y_1+5y_2-y_3 & \ge & 4\\ -y_1+y_2+2y_3 & \ge & 1\\ -y_1+3y_2+3y_3 & \ge & 5\\ 3y_1+8y_2-5y_3 & \ge & 3\\\end{array}\right.}$

As long as these constraints are satisfied, we have ${y_1+55y_2+3y_3\ge OPT}$. So we want to minimize ${y_1+55y_2+3y_3}$ subject to these
constraints. This is a linear program in the ${y_i}$'s, called the
dual linear program. Of course, taking the dual of the dual gives
back the initial linear program.
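These bounds are easy to check mechanically. Here is a minimal sketch in Python (the helper names are mine; the dual point ${(0,1,1)}$ encodes the combination ${(2)+(3)}$, and ${(0,2,0)}$ encodes “twice constraint (2)”):

```python
# Primal: max c.x s.t. A x <= b, x >= 0.  Dual: min b.y s.t. A^T y >= c, y >= 0.
c = [4, 1, 5, 3]
A = [[1, -1, -1, 3],
     [5, 1, 3, 8],
     [-1, 2, 3, -5]]
b = [1, 55, 3]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def primal_feasible(x):
    return all(xi >= 0 for xi in x) and \
        all(dot(row, x) <= bi for row, bi in zip(A, b))

def dual_feasible(y):
    cols = list(zip(*A))  # columns of A = rows of A^T
    return all(yi >= 0 for yi in y) and \
        all(dot(col, y) >= ci for col, ci in zip(cols, c))

x = (1, 0, 1, 0)  # feasible primal point of value 9
y = (0, 1, 1)     # feasible dual point of value 58, i.e. (2)+(3)
assert primal_feasible(x) and dot(c, x) == 9
assert dual_feasible(y) and dot(b, y) == 58   # hence 9 <= OPT <= 58
```

Any pair of feasible points sandwiches ${OPT}$ in this way; the dual point ${(0,2,0)}$ gives back the weaker bound ${OPT\le110}$.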

Theorem 1

• If both values of the primal and dual LP are finite, then their
values are equal: ${\max(\text{primal})=\min(\text{dual})}$.
• If the solution to the primal is unbounded
(${\max(\text{primal})=+\infty}$), then the dual is not feasible
(${\min(\text{dual})=\min(\emptyset)}$).
• Dually, if the primal is not feasible
(${\max(\text{primal})=\max(\emptyset)}$), then the dual is unbounded
(${\min(\text{dual})=-\infty}$).

No case distinction is needed under the convention that
${\max(\emptyset)=-\infty}$ and ${\min(\emptyset)=+\infty}$.

Proof:
The inequality ${\max(\text{primal})\le\min(\text{dual})}$ (weak duality) is
easy; we proved it above on an example. The other direction is harder, and
we will skip it.
$\Box$
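In matrix form, the easy direction is exactly the computation carried out on the example: for ${x}$ primal-feasible and ${y}$ dual-feasible,

```latex
c^T x \;\le\; (A^T y)^T x \;=\; y^T (A x) \;\le\; y^T b,
```

where the first inequality uses ${x\ge0}$ and ${A^T y\ge c}$, and the second uses ${y\ge0}$ and ${Ax\le b}$.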

2. Semi-definite programming duality

In general, in approximation algorithms, we follow these steps:

• Solve a linear programming relaxation (say a maximization problem);
• round the solution to obtain an integer solution;
• for the analysis, argue that the value of the output is at least ${c}$
times the value of the linear program, for some value of ${c}$. This gives
a ${c}$-approximation.

Duality gives a new possibility for approximation algorithms: If we can
find one feasible solution ${(y_i)}$ in the dual, such that the value of the
output is at least ${c}$ times the value of this feasible solution, then we
know that ${OPT}$ is at most ${c}$ times the output.

SDPs also have duals, so this proof technique can in principle be used.
I will illustrate this with an example on correlation clustering, but SDP
duality potentially goes far beyond.

Let us consider the following SDP: ${\max\sum_{ij} c_{ij}x_{ij}}$

subject to: for every ${k}$, ${\sum_{ij} a_{ijk}x_{ij}\le b_k}$

and the SDP constraint: ${X=(x_{ij})}$ is positive semi-definite.

(This presentation differs from the one used in previous lectures, but it
is more convenient to demonstrate duality.)

We rewrite the SDP constraint as: for every ${u}$, ${u^T X u\ge 0}$. This
constraint is linear in the ${x_{ij}}$'s. In this sense the SDP is
just an LP with an infinite number of constraints.

We can write the dual as for a linear program. Eventually, we find that
${M=\sum_w\lambda_w\, w w^T}$, where each ${w}$ is a vector, ${\lambda_w\ge0}$, and ${M}$
depends on the parameters of the primal. We recognize that this is
equivalent to ${M}$ being positive semi-definite.
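One direction of this equivalence is immediate: with each ${w}$ a column vector and ${\lambda_w\ge0}$, we have ${u^T\big(\sum_w\lambda_w\,w w^T\big)u=\sum_w\lambda_w (w\cdot u)^2\ge0}$. A minimal numeric sketch (names are mine):

```python
# u^T (sum_w lam_w * w w^T) u = sum_w lam_w * (w . u)^2 >= 0 when lam_w >= 0.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_form(terms, u):
    """u^T M u for M = sum of lam * w w^T, with terms = [(lam, w), ...]."""
    return sum(lam * dot(w, u) ** 2 for lam, w in terms)

terms = [(2.0, (1, 0, -1)), (0.5, (1, 1, 1)), (3.0, (0, 2, 1))]
for u in [(1, 2, 3), (-1, 0, 4), (0.5, -2.5, 1)]:
    assert quad_form(terms, u) >= 0  # M is PSD
```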

Interlude by Nati Linial: This works for convex programming also, via
polarity.

More precisely, let ${X\subseteq{\mathbb R}^n}$ be a set containing ${0}$ in its interior. Let
${X^o=\{y\in{\mathbb R}^n\mid\forall x\in X,\ x\cdot y\ge0\}}$ be the polar
of ${X}$. Observe that if ${X}$ is the set of PSD matrices, then ${X^o=X}$.

${M}$ is PSD if and only if, for every PSD matrix ${A}$, we have ${A\cdot M\geq0}$,
where ${A\cdot M=\mathrm{tr}(AM)}$. (This is not entirely trivial, but easy.)

For each ${u}$, ${u^T X u\geq 0}$ is equivalent to: for each PSD matrix ${A}$, we
have ${\sum_{ij} x_{ij}a_{ij}\ge0}$. This is the same phenomenon as in LP duality.
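One half of the self-polarity claim is a quick computation: writing ${A=B^T B}$ and ${M=C^T C}$, we get ${\mathrm{tr}(AM)=\|CB^T\|_F^2\ge0}$. A minimal sketch in pure Python (small dense matrices, helper names mine):

```python
# For PSD A = B^T B and PSD M = C^T C, the trace inner product
# A . M = tr(A M) is nonnegative.
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def gram(B):
    """B^T B, which is always PSD."""
    return matmul(transpose(B), B)

def trace_dot(A, M):
    """A . M = tr(A M)."""
    return sum(matmul(A, M)[i][i] for i in range(len(A)))

A = gram([[1, 2, 0], [0, -1, 3]])
M = gram([[2, 0, 1], [1, 1, -1], [0, 4, 2]])
assert trace_dot(A, M) >= 0
```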

3. Correlation clustering

Today we consider a graph with similarity or dissimilarity constraints, and
we want to partition the vertex set by minimizing the number of
disagreements. (This is more difficult than the previous clustering
problem we studied: the exact problem is the same, but in terms of
approximation, this is different.)

More precisely, given the complete graph where all edges are labeled either
${+}$ (similar) or ${-}$ (dissimilar), find a partition of the vertex set that
minimizes the number of inconsistent edges: an edge is inconsistent if it
is labeled ${+}$ but its two vertices are in different pieces of the
partition, or it is labeled ${-}$ but its two vertices are in the same piece
of the partition.

The problem is of course NP-hard, so let us look for an approximation
algorithm. If ${OPT=0}$, there exists a partition satisfying all the
constraints. In this case, the partition is easy to find using a simple
greedy algorithm: iteratively take a vertex ${i}$ and put in the same cluster
as ${i}$ every ${j}$ such that ${ij}$ is a ${+}$ edge. That’s one cluster. Set it
aside and repeat.

A simple randomized variant provides a good approximation for the problem
(even if ${OPT\neq0}$): pick the vertex ${i}$ uniformly at random at each step.
Let us call this variant “Randomized greedy”.
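The algorithm can be sketched as follows (a scribe-style sketch in Python; the encoding of the labeling as a set `plus` of ${+}$ edges, and the helper names, are mine):

```python
import random

def randomized_greedy(vertices, plus):
    """Pivot clustering: 'plus' is the set of frozenset({i, j}) edges labeled +."""
    remaining = set(vertices)
    clusters = []
    while remaining:
        i = random.choice(sorted(remaining))  # pivot chosen uniformly at random
        cluster = {i} | {j for j in remaining
                         if j != i and frozenset((i, j)) in plus}
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def disagreements(vertices, plus, clusters):
    """Count inconsistent edges of the complete graph for a given partition."""
    where = {v: k for k, c in enumerate(clusters) for v in c}
    vs = sorted(vertices)
    cost = 0
    for a in range(len(vs)):
        for b in range(a + 1, len(vs)):
            same = where[vs[a]] == where[vs[b]]
            positive = frozenset((vs[a], vs[b])) in plus
            if same != positive:
                cost += 1
    return cost
```

On an instance with ${OPT=0}$ (say, two ${+}$ cliques joined by ${-}$ edges), any pivot order recovers the perfect partition, so the cost is 0.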

Theorem 2
Randomized greedy is a 3-approximation for this problem.

Here is an example showing why randomization is necessary. Denote by
${\{1,\ldots,n\}}$ the vertex set, and label edge ${ij}$ with ${+}$ iff ${i}$ or
${j}$ is equal to ${1}$. If ${1}$ is picked first by the algorithm, then we are
really far from the optimum: we get a solution of cost ${\Theta(n^2)}$, while
${OPT=n-2}$ (put ${1}$ together with a single other vertex). On the other
hand, with “Randomized greedy”, ${1}$ is picked first with probability
${\frac1n}$, and in the remaining cases the solution has cost ${\Theta(n)}$.
On average, we obtain a solution of cost ${\Theta(n)=\Theta(OPT)}$.
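The costs on this instance can be checked by hand; a minimal self-contained sketch (the case analysis in the docstring is worked out from the greedy rule, and the helper name is mine):

```python
def star_cost(n, first_pivot):
    """Greedy cost on the star instance (+ edges are exactly the pairs {1, j}).

    If 1 is the first pivot, all vertices join its cluster, and every - edge
    among {2, ..., n} is violated: (n-1)(n-2)/2 mistakes.
    If some i != 1 is picked first, its cluster is {i, 1}, the rest become
    singletons, and the violated edges are the + edges {1, j} for j != i:
    n - 2 mistakes, which matches OPT = n - 2.
    """
    if first_pivot == 1:
        return (n - 1) * (n - 2) // 2  # Theta(n^2)
    return n - 2                       # Theta(n)

n = 100
expected = (1 / n) * star_cost(n, 1) + ((n - 1) / n) * star_cost(n, 2)
assert expected <= 3 * (n - 2)  # expected cost is Theta(n) = Theta(OPT)
```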

Proof:
The analysis uses the notion of “bad triangle”. A triangle ${ijk}$ is
bad if exactly one of its edges is labeled ${-}$. No partition can be
consistent on all three edges of a bad triangle, so ${OPT}$ is at
least the maximum number of edge-disjoint bad triangles.

More generally, to every bad triangle ${t}$, we associate a value
${a_t\ge0}$. We consider nonnegative combinations of bad triangles
satisfying the following condition: for each edge ${e}$, the sum of the ${a_t}$,
over all bad triangles ${t}$ containing ${e}$, is at most one. The same
argument shows that ${OPT\ge\sum_t a_t}$.
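Bad triangles are easy to enumerate; a minimal sketch (the `plus`-set encoding of the labeling is mine):

```python
from itertools import combinations

def bad_triangles(vertices, plus):
    """Triangles with exactly one '-' edge (a pair is '+' iff it is in plus)."""
    bad = []
    for i, j, k in combinations(sorted(vertices), 3):
        minus_count = sum(frozenset(e) not in plus
                          for e in [(i, j), (i, k), (j, k)])
        if minus_count == 1:
            bad.append((i, j, k))
    return bad

# Two + edges and one - edge: a single bad triangle, so OPT >= 1.
plus = {frozenset((0, 1)), frozenset((0, 2))}  # edge {1, 2} is '-'
assert bad_triangles(range(3), plus) == [(0, 1, 2)]
```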

Analysis: let ${a_t}$ be the probability that, during a random execution of
“Randomized greedy”, one of ${t}$'s vertices is chosen before ${t}$
disappears. Then I claim that the ${a_t/3}$ form a feasible solution in
the dual.

Proof: let ${e}$ be an edge, contained in ${k}$ bad triangles. Consider the
first vertex of the union of these triangles picked by the algorithm. If
it is an endpoint of ${e}$, then the algorithm destroys all ${k}$ bad triangles
at once. If it is one of the ${k}$ other vertices, it destroys a single
triangle. Thus the sum of the ${a_t}$, where ${t}$ ranges over all bad
triangles containing ${e}$, is at most 3.

The rest of the proof is left as an exercise.
$\Box$

Theorem 3
Consider now the complete graph with weighted edges ${w_e^+}$ and ${w_e^-}$
such that ${w_e^++w_e^-=1}$ (instead of ${+}$ and ${-}$, we have weights).
Then “Randomized greedy” (where you put the neighbor ${v}$ of the picked
vertex ${u}$ in the same cluster as ${u}$ iff ${w_{uv}^+}$ is at least ${.5}$) is
a 5-approximation.

Kleinberg proved that the problem does not admit a PTAS (unless P=NP).
Which variant of the problem admits a PTAS? Let us change the model and
move away from the worst case to the “planted model”.\footnote{The name
comes from the original study of such models in the max-clique problem;
cliques were “planted”, imposed on a random graph. See Uriel Feige’s
work.}

4. Another model

Assume that we are given a (possibly non-complete) graph with ${+}$ and ${-}$
on the edges. Assume that there is a perfect clustering, but that the
input is generated by a random perturbation: Flip independently the label
of every edge with probability~${p}$.
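The noise process can be sketched directly (names are mine; `all_edges` is the edge set of the given graph, and `plus_edges` its ${+}$ edges before the perturbation):

```python
import random

def perturb(plus_edges, all_edges, p, rng=random):
    """Flip each edge label independently with probability p; return the
    resulting set of + edges."""
    noisy = set()
    for e in all_edges:
        label_plus = e in plus_edges
        if rng.random() < p:
            label_plus = not label_plus
        if label_plus:
            noisy.add(e)
    return noisy
```

With ${p=0}$ the input is the perfect clustering itself; with ${p=1}$ every label is flipped.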

Theorem 4
There is an algorithm that achieves a ${(1+\frac1{\sqrt n})}$-approximation in expectation.

This slightly improves the state of the art, but with very different
(SDP) techniques.

\note{From here on, refer to Claire’s slides also.}

View the solution as a set of vectors ${v_i}$, such that ${v_{ij}}$ is ${1}$ iff
the ${i}$th vertex is in cluster ${j}$. In this sense we want to minimize the
Hamming distance between the input and ${v^T v}$.
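A minimal sketch of this encoding (indices and helper names are mine): the matrix ${v^T v}$ has a ${1}$ in entry ${(i,j)}$ exactly when ${i}$ and ${j}$ share a cluster.

```python
def cluster_matrix(clusters, n):
    """Build v (row c is the indicator of cluster c) and the product v^T v."""
    k = len(clusters)
    v = [[1 if i in clusters[c] else 0 for i in range(n)] for c in range(k)]
    vtv = [[sum(v[c][i] * v[c][j] for c in range(k)) for j in range(n)]
           for i in range(n)]
    return v, vtv

def hamming(M, N):
    """Number of off-diagonal entries where two 0/1 matrices disagree."""
    n = len(M)
    return sum(M[i][j] != N[i][j] for i in range(n) for j in range(n) if i != j)

v, vtv = cluster_matrix([{0, 1}, {2}], 3)
assert vtv[0][1] == 1 and vtv[0][2] == 0  # 0,1 together; 2 alone
```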

The algorithm is now very simple: Solve the dual SDP relaxation using the
ellipsoid method. This defines ${x_{ij}}$ for every edge ${ij}$, which we
interpret as ${w_{ij}^+}$. This gives another clustering problem. We solve
it using the 5-approximation algorithm described before.

We assume here ${p}$ is a constant (say ${.3}$).

Let ${B}$ be the “true” clustering before applying the noise, ${E}$ be the
input, and ${X}$ be the solution of the SDP.

Lemma 5
${d(B,X)\leq 1000 n\sqrt n}$.

Proof:
Assume not. Imagine that we add to the initial SDP, called SDP1, the
constraint ${d(B,X)\leq 1000n\sqrt n}$, and let SDP2 be the resulting
SDP. We get a new solution, whose value must be larger.

\note{To be filled…}

We obtain that a matrix ${M=1000n\sqrt n\,\text{Id} + \text{some noise}}$
must be PSD, where the noise comes from the perturbation of the input.
For this step, we find an oracle (a mathematician) and ask him about the
eigenvalues of random matrices. It turns out that the eigenvalues of the
noise matrix are ${O(n\sqrt n)}$ in absolute value in expectation, so all
eigenvalues of ${M}$ are nonnegative, so the matrix ${M}$ is PSD (with high
probability). So with high probability, the dual is feasible, so the two
SDPs are different.
$\Box$

In applications, ${w_e^+}$ and ${w_e^-}$ do not sum up to one. This extension
of the model has also been studied. The problem becomes more difficult to
approximate (the best known result is an ${O(\log n)}$-approximation, or worse).

If ${w_{ij}^+=w_{ij}^-=0}$, it is just as if the edge ${ij}$ were not there.
And if the input is not really a complete graph, things become much more
complicated.

References:

• 3-approximation and 5-approximation: Ailon, Charikar, Newman, 2005.
• Inapproximability: Jon Kleinberg, 1990s.

SDP duality has been used in problems related to the Unique Games
Conjecture. Maybe this will prove useful for other problems as well.
There are also recent algorithms for solving SDP that take advantage of the
dual, by Arora and one of his PhD students.

Next course in March: detailed presentation of the subexponential algorithm
for UGC presented by Arora and Steurer.