Approximate kernel clustering
1. Statistical motivation
Let $X_1,\dots,X_n$ be random variables, $A=(a_{ij})$ their covariance matrix, called the kernel. We want to cluster $X_1,\dots,X_n$, i.e. find a partition $S_1,\dots,S_k$ of $\{1,\dots,n\}$, and form the much smaller $k\times k$ matrix $C=(c_{pq})$, where $c_{pq}=\sum_{i\in S_p,\ j\in S_q}a_{ij}$.
Given a small, $k\times k$, positive semidefinite matrix $B=(b_{pq})$, called the hypothesis matrix, find a clustering of $A$ such that $\sum_{p,q=1}^{k}b_{pq}c_{pq}$ is maximal. Notation: $\mathrm{Clust}_B(A)=\max_{S_1,\dots,S_k}\sum_{p,q=1}^{k}b_{pq}\sum_{i\in S_p,\ j\in S_q}a_{ij}$.
This contains many problems, related for instance to locating low dimensional manifolds in large data spaces.
Question: is there a polytime algorithm to compute (or at least approximate) $\mathrm{Clust}_B(A)$?
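To make the objective concrete, here is a minimal brute-force sketch (in Python, with illustrative variable names of my choosing) that evaluates $\mathrm{Clust}_B(A)$ by enumerating all $k^n$ assignments. It is exponential in $n$, which is exactly why the question above asks for a polytime (approximation) algorithm.

```python
import itertools
import numpy as np

def clust(A, B):
    """Brute-force Clust_B(A): maximize sum_{i,j} A[i,j] * B[sigma(i), sigma(j)]
    over all assignments sigma of the n indices to the k clusters."""
    n, k = A.shape[0], B.shape[0]
    best = -np.inf
    for sigma in itertools.product(range(k), repeat=n):   # k^n assignments
        s = np.array(sigma)
        best = max(best, float(np.sum(A * B[np.ix_(s, s)])))
    return best

# Tiny example: a centered positive semidefinite kernel A and hypothesis B = I_2.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X -= X.mean(axis=0)          # centering the data makes sum_{i,j} a_ij = 0
A = X @ X.T                  # covariance-type kernel, PSD by construction
print(clust(A, np.eye(2)))
```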
Theorem 1 Exact computation is NP-hard, but there exists a polytime algorithm which computes $\mathrm{Clust}_B(A)$ up to a universal constant factor.
2. Towards optimal results?
Say $A$ is centered if $\sum_{i,j=1}^{n}a_{ij}=0$. In case $A$ is centered, we believe that our results are optimal.
Example 1 $B=I_k$, the identity matrix. Then the goal is to find a partition which maximizes the sum of correlations within each piece, i.e. $\sum_{p=1}^{k}\sum_{i,j\in S_p}a_{ij}$.
Example 2 $k=2$ and $B=\begin{pmatrix}1&-1\\-1&1\end{pmatrix}$. Then the goal is the positive semidefinite Grothendieck problem: maximize $\sum_{i,j}a_{ij}\,\varepsilon_i\varepsilon_j$ over signs $\varepsilon_i\in\{-1,1\}$.
There is a natural relaxation, which consists in replacing the signs $\varepsilon_i$ in the objective by inner products $\langle x_i,x_j\rangle$ between unit vectors. Grothendieck showed that the integrality gap of this SDP relaxation is bounded. Later on, Rietz showed that it is equal to $\pi/2$. For general (not necessarily positive semidefinite) matrices $A$, the best factor is $\Theta(\log n)$: Nemirovski et al., Alon and the Makarychev brothers.
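A sketch of the two standard steps behind the $\pi/2$ (nothing here is specific to this talk): the relaxation reads
\[
\max_{\varepsilon\in\{-1,1\}^n}\sum_{i,j}a_{ij}\,\varepsilon_i\varepsilon_j\ \le\ \mathrm{SDP}(A):=\max_{\|x_i\|=1}\sum_{i,j}a_{ij}\langle x_i,x_j\rangle ,
\]
and rounding an optimal SDP solution by $\varepsilon_i=\mathrm{sign}\,\langle g,x_i\rangle$, with $g$ a standard Gaussian vector, uses Grothendieck's identity
\[
\mathbb{E}\bigl[\mathrm{sign}\,\langle g,x_i\rangle\ \mathrm{sign}\,\langle g,x_j\rangle\bigr]=\frac{2}{\pi}\arcsin\langle x_i,x_j\rangle .
\]
Since $\arcsin t - t$ has a power series with nonnegative coefficients and Schur products of positive semidefinite matrices are positive semidefinite, the rounded value is at least $\frac{2}{\pi}\,\mathrm{SDP}(A)$ when $A$ is positive semidefinite, which is the factor $\pi/2$.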
Example 3 $A$ = the Laplacian of a graph $G$, i.e. $a_{ii}=\deg(i)$, $a_{ij}=-1$ if $ij$ is an edge, $a_{ij}=0$ otherwise.
With $B$ as in Example 2, $\mathrm{Clust}_B(A)=4\,\mathrm{MAXCUT}(G)$, so CLUST contains MAX CUT as a special case. So there is some inapproximability threshold for CLUST.
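Indeed, encoding the two-piece partition by signs $\varepsilon_i\in\{-1,1\}$,
\[
\sum_{i,j}a_{ij}\,\varepsilon_i\varepsilon_j=\sum_{i}\deg(i)-2\sum_{ij\in E}\varepsilon_i\varepsilon_j=2\sum_{ij\in E}\bigl(1-\varepsilon_i\varepsilon_j\bigr)=4\,\bigl|\{ij\in E:\varepsilon_i\neq\varepsilon_j\}\bigr| ,
\]
so maximizing the clustering objective is the same as maximizing the cut, and any hardness of approximation for MAX CUT transfers to CLUST.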
Theorem 2 Assume $A$ is centered and $B$ is centered and spherical (i.e. $1$'s on the diagonal) or the identity matrix $I_k$. Then there exists a polytime algorithm that computes $\mathrm{Clust}_B(A)$ up to the factor
- If $k=2$, $\frac{\pi}{2}$, sharp assuming UGC.
- If $k=3$, $\frac{16\pi}{27}$, sharp assuming UGC.
- If $k\ge 4$, $\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$.
The last value may be tight as well. It depends on a geometric conjecture which I discuss now.
3. A geometric puzzle
Puzzle: Partition the plane into 3 measurable sets $A_1,A_2,A_3$. Define the Gaussian moment of a set $A$ as the vector $z_A=\int_A x\,d\gamma(x)$, whose components are the integrals of the coordinate functions over $A$ with respect to the standard Gaussian measure $\gamma$. For which partition of the plane is $\|z_{A_1}\|^2+\|z_{A_2}\|^2+\|z_{A_3}\|^2$ maximal?
Theorem 3 The maximum is attained at partitions into 3 equal sectors (of angle $2\pi/3$).
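As a quick numerical sanity check (a Monte Carlo sketch, not part of the proof, with variable names of my choosing), one can estimate the sum of squared Gaussian moments for a few partitions of the plane; three equal sectors give $\tfrac{9}{8\pi}\approx 0.358$, beating for instance two half-planes with an empty third piece, which give $\tfrac{1}{\pi}\approx 0.318$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2_000_000, 2))      # samples of the standard planar Gaussian

def moment_sum(labels, k=3):
    """Estimate sum_p ||z_{A_p}||^2, where z_{A_p} = int_{A_p} x dgamma(x)
    is approximated by the empirical mean of x * 1_{x in A_p}."""
    total = 0.0
    for p in range(k):
        z = X[labels == p].sum(axis=0) / len(X)
        total += float(z @ z)
    return total

theta = np.arctan2(X[:, 1], X[:, 0])
sectors = (np.floor((theta + np.pi) / (2 * np.pi / 3)).astype(int)) % 3   # three 120-degree sectors
halfplanes = (X[:, 0] > 0).astype(int)                                    # two half-planes, third piece empty

print("three equal sectors:", moment_sum(sectors),    "  exact 9/(8*pi) =", 9 / (8 * np.pi))
print("two half-planes:    ", moment_sum(halfplanes), "  exact 1/pi     =", 1 / np.pi)
```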
Next puzzle: Partition $\mathbb{R}^k$ into $k$ measurable sets. Similar question. We do not know the answer, but we formulate the Propeller conjecture: only 3 pieces are needed (the others are empty), and they are products of $2$-dimensional sectors with complementary subspaces.
Theorem 4 In a partition of $\mathbb{R}^k$ which maximizes the sum of the squares of the Gaussian moments, the non-empty pieces are products of vector subspaces with pieces of a simplicial and conical partition in lower dimension.
If we could prove that the simplices are regular, the Propeller Conjecture would follow.
A similar phenomenon (only a constant number of pieces are necessary) already showed up in work of Keith Ball.
4. Back to the algorithm for CLUST
Given a positive definite matrix $B$, view it as a Gram matrix: $b_{pq}=\langle v_p,v_q\rangle$ for vectors $v_1,\dots,v_k\in\mathbb{R}^k$. Let $R(B)$ be the radius of the smallest Euclidean ball containing all the $v_p$'s. Let $C(B)$ be the maximum of $\sum_{p,q}b_{pq}\langle z_{A_p},z_{A_q}\rangle$ over all vectors $z_{A_1},\dots,z_{A_k}$ arising as Gaussian moments of a partition $A_1,\dots,A_k$ of Euclidean space.
Theorem 5 There is a polytime algorithm which, given a centered positive semidefinite matrix $A$, computes $\mathrm{Clust}_B(A)$ up to the factor $R(B)^2/C(B)$. Assuming UGC, no polytime algorithm can do better.
For $B=I_k$, $k\ge 3$, the UGC threshold is $\frac{R(I_k)^2}{C(I_k)}=\frac{8\pi}{9}\left(1-\frac{1}{k}\right)$.
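Here is the short computation behind this value, using the definitions of $R$ and $C$ above. Viewing $I_k$ as the Gram matrix of the orthonormal basis $e_1,\dots,e_k$, the smallest enclosing ball is centered at the barycenter $\frac1k\sum_p e_p$, so
\[
R(I_k)^2=\Bigl\|e_1-\tfrac1k\textstyle\sum_p e_p\Bigr\|^2=\Bigl(1-\tfrac1k\Bigr)^2+\tfrac{k-1}{k^2}=1-\tfrac1k ,
\qquad
C(I_k)=\max_{\text{partitions}}\ \sum_{p=1}^{k}\|z_{A_p}\|^2=\frac{9}{8\pi}\quad(k\ge 3),
\]
whence $R(I_k)^2/C(I_k)=\frac{8\pi}{9}\bigl(1-\frac1k\bigr)$. For $k=3$ the value of $C$ comes from Theorem 3 (each $120°$ sector has $\|z\|^2=\frac{3}{8\pi}$); for $k\ge 4$ it is exactly what the Propeller conjecture asserts.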
4.1. Connection with Grothendieck’s constant
Grothendieck's inequality, in its positive semidefinite form, is
\[
\max_{x_1,\dots,x_n\in S^{n-1}}\ \sum_{i,j}a_{ij}\langle x_i,x_j\rangle\ \le\ \frac{\pi}{2}\ \max_{\varepsilon\in\{-1,1\}^n}\ \sum_{i,j}a_{ij}\,\varepsilon_i\varepsilon_j ,
\]
for every positive semidefinite matrix $A=(a_{ij})$.
Given vectors $v_1,\dots,v_k$ in $\mathbb{R}^k$, with Gram matrix $B=(\langle v_p,v_q\rangle)$, there is a generalization: for every centered positive semidefinite matrix $A$,
\[
\max_{x_1,\dots,x_n\in S^{n-1}}\ \sum_{i,j}a_{ij}\langle x_i,x_j\rangle\ \le\ K\ \max_{\sigma:\{1,\dots,n\}\to\{1,\dots,k\}}\ \sum_{i,j}a_{ij}\,\langle v_{\sigma(i)},v_{\sigma(j)}\rangle .
\]
Theorem 6 The best constant $K$ in the generalized Grothendieck inequality above is $K=\frac{R(B)^2}{C(B)}$, where $R(B)$ and $C(B)$ are the quantities defined above.
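As a consistency check, take $k=2$ and $B=\begin{pmatrix}1&-1\\-1&1\end{pmatrix}$ as in Example 2. Then $v_1=-v_2$ are opposite unit vectors, so $R(B)=1$, and the extremal partition of the line into two pieces consists of the two half-lines, whose Gaussian moments are $\pm\frac{1}{\sqrt{2\pi}}$; hence
\[
C(B)=\max\ \bigl(\|z_{A_1}\|^2+\|z_{A_2}\|^2-2\langle z_{A_1},z_{A_2}\rangle\bigr)=\Bigl(\tfrac{2}{\sqrt{2\pi}}\Bigr)^2=\frac{2}{\pi},
\qquad
\frac{R(B)^2}{C(B)}=\frac{\pi}{2},
\]
recovering the constant of the classical positive semidefinite Grothendieck inequality.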
4.2. Reduction from UGC to CLUST
Clustering amounts to maximizing $\sum_{i,j}a_{ij}\,b_{\sigma(i)\sigma(j)}$ over maps $\sigma:\{1,\dots,n\}\to\{1,\dots,k\}$. The reduction amounts to assigning to every function $f$ arising in the dictatorship test a number, its clustering value. Assume that all coordinate functions have zero mean. If $f$ has no influential coordinate, its value is at most $C(B)$, up to $o(1)$, whereas dictator functions achieve $R(B)^2$. This implies that the UGC threshold is $R(B)^2/C(B)$.