November 27, 2006

The ACM Computing Classification: G.1.3 Numerical Linear Algebra, F.2.1 Numerical Algorithms and Problems . The first author is partially supported by the NSF grant DMS 0245380. The second author is an Alfred P. Sloan Research Fellow. He is also partially supported by the NSF grant DMS 0401032.
<ph f="cmbx">Sampling from large matrices: an approach through geometric functional analysis </ph>

Mark Rudelson

Roman Vershynin

Department of Mathematics, University of Missouri, Columbia, MO 65211, U.S.A. E-mail address : rudelson@math.missouri.edu Department of Mathematics, University of California, Davis, CA 95616, U.S.A. E-mail address : vershynin@math.ucdavis.edu

1 Introduction

This paper studies random submatrices of a large matrix A   . The study of random submatrices spans several decades and is related to diverse areas of mathematics and computer science. Two main reasons for the interest in random submatrices are:
We address both aspects of random submatrices in this paper. We show how to approximate A   by its random submatrix in the spectral norm, and we compute the asymptotics of the spectral and the cut norms of random submatrices. We look for a dimension-free picture, which is only controlled by the norms of the matrix A   and not by its size (or rank). This yields improvements upon known algorithms for computing low rank approximations, Singular Value Decompositions, and approximations to MAX-2CSP problems.

1.1 The spectral norm: low rank approximations and SVD

Can one approximate A   by only knowing a random submatrix of A   of a fixed size?
One is especially interested in the sample complexity of this problem, the minimal size of a submatrix which yields a good approximation with a small error in some natural norm, and with high probability.
This problem belongs to a class of problems the Statistical Learning Theory is concerned with. These problems inevitably bear the assumption that the the object to be learned belongs to a relatively small “target” class. To be able to learn A   from a matrix of small size thus of small rank, we have to assume that A   itself has small rank–or can be approximated by an (unknown) matrix of a small rank. We thus strive to find a low rank approximation to a matrix A   , whenever such an approximation exists, from only knowing a small random submatrix of A   .
Solving this problem is essential for development of fast Monte-Carlo algorithms for computations on large matrices. An extremely large matrix A   – say, of the order of 10 5 × 10 5   – is impossible to upload into the Random Access Memory (RAM) of a computer; it is instead stored in an external memory. On the other hand, sampling a small submatrix of A   , storing it in RAM and computing its small rank approximation is feasible.
The crucial assumption that A   is essentially a low rank matrix holds in many applications. For example, this is a model hypothesis in the Latent Semantic Indexing (see [6, 21, 4, 9, 5, 2). There A   is the “document-term matrix”, which is formed of the frequencies of occurrence of various terms in the documents of a large collection. The hypothesis that the documents are related to a small number of (unknown) topics translates into the assumption that A   can be approximated by an (unknown) low rank matrix. Finding such an approximation would determine the “best” topics the collection is really about. Other examples where this problem arises include clustering of graphs [10, DNA microarray data, facial recognition, web search (see [12), lossy data compression and cryptography (see [5).
The best fixed rank approximation to A   is obviously given by the partial sums of the Singular Value Decomposition (SVD) A = j σ j ( A ) u j v j   where σ j ( A )   is the nonincreasing and nonnegative sequence of the singular values of A   , and u j   and v j   are left and right singular vectors of A   respectively. The best rank k   approximation to A   in both the spectral and Frobenius norms1 is thus A P k   , where P k   is the orthogonal projection onto the top k   left singular vectors of A   . In particular, for the spectral norm we have
min B , r a n k B k A B = A A P k = σ k + 1 ( A ) . (1)
However, computing P k   , which gives the first elements of the SVD of a m × n   matrix A   , is often impossible in practice because (1) it would take many passes through A   , which is prohibitively slow for a matrix stored in an external memory; (2) this would take superlinear time in m + n   . Instead, it was proposed in [16, 11, 12, 13to use the Monte-Carlo methodology: namely, approximate the k   -th partial sum of the SVD of A   by the k   -th partial sum of the SVD of a random submatrix of A   . In this paper, we show that this can be done:
  • (1) with sample complexity O ( r log r )   , that is by sampling only O ( r log r )   random rows of A   , if A   is is approximable by a rank r   matrix;
  • (2) in one pass through A   if the matrix is stored row-by-row, and in two passes if its entries are stored in arbitrary order;
  • (3) using RAM space and time O ( n + m )   (and polynomial in r   and k   ).
Theorem 1.1. Let A   be an m × n   matrix, and k > 0   be an integer. For ɛ ( 0 , 1 )   , and let
d C ( r ɛ 4 ) log ( r ɛ 4 ) , where r = A F 2 / A 2 . (2)
Let A ~   be a d × n   matrix consisting of d   normalized rows of A   , picked independently with replacement, with probabilities proportional to the squares of their Euclidean lengths. Let P k   be the orthogonal projection onto the top k   left singular vectors of A ~   . Then with high probability
A A P k σ k + 1 ( A ) + ɛ A . (3)
Here and in the sequel, C , c , C 1 ,   denote positive absolute constants.
Comparing  3 with the best approximation  1 given by the SVD, we see an additional error ɛ A   which can be made small by increasing the size d   of the sample.

1 the spectral norm is the operator norm l 2 l 2   , that is A = sup | x | = 1 | A x | = σ 1 ( A )   ; the Frobenius (Hilbert-Schmidt) norm is defined as A F 2 = i , j | A i j | 2 = j σ j 2   .

1.1.1 Sample complexity.

The approximation scheme in Theorem  1.1 is not new; it was developed in [16, 11, 12, 13.
However, the sample complexity d   in Theorem  1.1 improves significantly upon the previously known estimates. The sample complexity is essentially linear in the “average sparsity” r   of the matrix; r   is obviously bounded by the rank of A   . We can view r   as a relaxation of the rank, which is stable under small perturbations of A   (unlike the rank itself !) Thus for matrices A   which are essentially low-rank matrices, r   is also small.
This complexity is optimal in many natural cases. For example, if A   is a matrix of an orthogonal projection, then one must have r a n k ( A P k ) r a n k ( A )   for a meaningful approximation of A   by A P k   . But r a n k ( A P k ) r a n k P k d   , thus d r a n k ( A )   . One can also show that the logarithm is needed in  2 .

1.1.2 Law of large numbers for operator-valued random variables

The new feature in our proof of Theorem  1.1 is a use of the first author's argument about random vectors in the isotropic position [22. It yields a law of large numbers for operator-valued random variables. We apply it for independent copies of a rank one random operator, which is given by a random row of the matrix A T A   .

1.2 The cut-norm: decay, approximation to MAX-CSP problems

Alon, Fernandez de la Vega, Kannan and Karpinski [1, 3reduced the question of finding additive approximations to the MAX-CSP problems (which are NP-hard) to computing the cut-norm of random submatrices.
The input of a MAX-2CSP problem is a set F   of m   distinct Boolean functions f i   on n   Boolean variables x j   , where each function f i   depends on only two variables. The output M a x ( F )   is the maximal number of functions that can be simultaneously set to 1   by a truth assignment to the variables. This class of problems includes many graph problems, in particular the MAX-CUT problem, in which one wants to maximize, for a given graph, the number of edges that join a subset of vertices with its complement.
The exact problem is NP-hard. To approximately solve MAX-2CSP, it was proposed in [1, 3to use the Monte-Carlo methodology, that is to approximate the problem by the induced problem on a small random sample of variables. For a subset Q   of the variables { x 1 , , x n }   , then induced problem F | Q   consists of the functions from F   that depend on the variables from Q   . Our goal is to find the minimal q   (the sample complexity) such that for random subset Q   of q   variables,
| M a x ( F ) n 2 q 2 M a x ( F | Q ) | < ɛ n 2 . (4)
(note that the maximal number of formulas in F   and F | Q   is of order of n 2   and q 2   respectively).
Goldreich, Goldwasser and Ron [17were first to find a polynomial time algorithm for approximating MAX-2CSP within the additive error ɛ n 2   . The sample complexity in [17was q = O ( ɛ 5 )   . Independently, Fernandez de la Vega [14developed a different polynomial time algorithm for MAX-CUT. Alon, Fernandez de la Vega, Kannan and Karpinski [1, 3improved the sample complexity in  4 to q = O ( ɛ 4 log ( 1 / ɛ ) )   , which has remained best known up to now. Actually, the same sample complexity of this order was shown in [3to be sufficient for all MAX-rCSP problems (in which the functions dpend on r 2   variables). By improving the central part of the argument of [3, which is an estimate of the cut-norm of random submatrices, we shall reduce the sample complexity for MAX-2CSP to q = O ( ɛ 2 )   .
We shall call the random subset of expected cardinality q   a subset of an n   -element set formed by including each element independently with probability q / n   .
Theorem 1.2. For any ɛ > 0   , there exists a positive integer q = O ( ɛ 2 )   which satisfies the following. Let F   be a collection of Boolean functions on n   variables x 1 , , x n   , each function depending on two variables. Let Q   be a random subset of the variables { x 1 , , x n }   of expected cardinality q   .
Then  4 holds with high probability.
This result follows, by the argument of [3, from an estimate of the cut-norm of random submatrices. The cut norm of an n × n   matrix A   is the maximum sum of the entries of its sumbatrix, A C = max I , J | i I , j J A i j |   and it is equivalent to the l l 1   operator norm. We want to see how the cut norm of A   decreases when we pass to its random submatrix A | Q × Q = ( A i j ) i , j Q   , where Q   is a random subset of { 1 , , n }   of expected cardinality q   . Intuitively, A | Q × Q   is ( q / n ) 2   times smaller thatn A   if A   is diagonal-free, but only ( q / n )   times smaller than A   if A   is a diagonal matrix. The general case combines both types of decay:
Theorem 1.3. Let A   be an n × n   matrix. Let Q   be a random subset of { 1 , , n }   of expected cardinality q   . Then E A | Q × Q C C ( q n ) 2 A C + C ( q n ) d i a g ( A ) C + C ( q n ) 3 / 2 ( A Col + A T Col )   where A Col   is the sum of the Euclidean lengths of the columns of A   , and d i a g ( A )   is the diagonal part of A   .
Remark 1.4. Note that by Cauchy-Schwartz inequality A Col n A F   .
This result improves in several ways the estimate from [3. The argument of [3is combinatorial, while our proof uses the technique of probability in normed spaces, and includes decoupling, symmetrization, and application of a version of Slepian's lemma for Bernoulli random variables due to Talagrand.

1.3 The spectral norm: decay

Perhaps the most important matrix norm is the spectral norm (the l 2 l 2   operator norm). Nevertheless, its decay under passing to submatrices has not been sufficiently understood. Given an n × n   matrix A   , its submatrix A | Q = ( A i j ) i Q , j n   consists of the rows of A   indexed by a subset Q   . If Q   is a random subset of epected cardinality q   (as above), the length of any vector x R n   reduces by the factor of q n   when one orthogonally projects x   onto R Q   . One should then expect similar type of decay of the spectral norm–something like E A | Q q n A   . However, similarly to the previous section, the decay for diagonal matrices is different–for example, there is no decay at all for the identity matrix. The correct asymptotics in this case is given by A ( k ) = the average of k biggest Euclidean lengths of the columns of A ,   with k = n / q   . For diagonal matrices, it is the average of k   biggest absolute values of the entries. General matrices again combines both types of decay:
Theorem 1.5. Let A   be an n × n   matrix. Let Q   be a random subset of { 1 , , n }   of expected cardinality q   . Then the random submatrix A | Q   formed by the columns of A   indexed by Q   satisfies E A | Q C q n A + C log q A ( n / q ) .  
Remark 1.6. The example considered in Remark  2.2 below shows that the coefficient log q   is necessary here. The same example implies that the logarithmic term is required in Theorem  1.1 as well.
Generalizing an earlier result of Lunin [20, Kashin and Tzafriri [18(see [25) essentially proved the existence of a subset Q   of cardinality δ n   and such that A | Q C q n A + C A F n .   Note that A F n = ( 1 n i = 1 n | A i | 2 ) 1 / 2   is the average of the lengths of all columns of A   .
As the example of diagonal operators shows, for random subsets Q   this term has to be replaced by the average of the few biggest columns. Talagrand [23proved deep results on the more general operator norms l 2 X   , where X   is a 2   -smooth Banach space.
However, the decay on q n   in his results is logarithmic.

1.4 Dimension-free approach

Many problems on random submatrices, of both theoretical and practical importance, have functional-analytic rather than linear-algebraic nature. These problems, like those this paper considers, are about estimating operator norms. We thus see a matrix A   as a linear operator A   between finite dimensional normed spaces – say, between l 2 n   and l 2 n   for the spectral norm, and between l n   and l 1 n   for the cut norm.
From this perspective, the dimension n   of the underlying normed space should play a minor role, while the real control of the picture should be held by (hopefully few) quantities tied to the operator rather than the space. As a trivial example, if A   is not of full rank then the dimension n   of the range space is useless comparing to the rank of A   . Further, we are looking for stable results, those not ruined by small perturbations of the linear operators – this is a natural demand in applications, and this differs our analytic perspective from the linear algebraic one. It would thus be natural to look for stable quantities tied to linear operators, which govern the picture. For example, operator norms are stable quantities, while the rank is not.
This paper advances a dimension free approach to matrices. The low rank approximations in Theorem  1.1 are only controlled by the average sparsity r   of the matrix, which is a stable relaxation of the rank. The norms of random matrices in Theorems  1.3 and  1.5 are essentially controlled by the norms of the original matrix (and naturally by the sampling factor, the ratio of the size of the submatrix to the size of the original matrix). The size n   of the matrix, which is the dimension of the space, does not play a separate role in these results.
Another part of this picture which we omitted in this paper is the invertibility of submatrices. Dimension-free versions of the invertibility results due to Bourgain and Tzafriri can be found in [25, se also [8for random submatrices.
Acknowledgement. This project started when the authors participated in the PIMS Thematic Programme on Asymptotic Geometric Analysis at the University of British Columbia in Summer 2002. The first author was a PIMS postdoctoral fellow at that time. We are grateful to PIMS for its hospitality. The final part of this research was done when the first authour visited University of California, Davis. We are also grateful for R. Kannan for his comments on the initial version of this manuscript.

2 Low rank approximations

In this section, we prove Theorem  1.1 and discuss the algorithm for finding low rank approximations. Our argument is based on the law of large numbers for operator-valued random variables.

2.1 Law of large numbers for operator-valued random variables

Theorem  1.1 is about random independent sampling the rows of the matrix A   . The result of such sampling can be viewed as an empirical process taking values in the set of rows. If we sample enough rows, then the matrix constructed from them would nicely approximate the original matrix A   in the spectral norm. For the scalar random variables, this effect is the classical Law of Large Numbers. For example, let X   be a bounded random variable and let X 1 X d   be independent copies of X   . Then Markov inequality implies E | 1 d j = 1 d X j E X | C d .   The operator-valued analog of this inequality is harder to prove. In this case instead of estimating the absolute value of the difference between the sample mean and the expectation, we have to estimate the norm. Thus instead the large deviation estimate for a single random variable, we have to deal with estimating the supremum of a random process. This requires deeper probabilistic techniques. The following Theorem generalizes the main result of [22.
Theorem 2.1. Let y   be a random vector in R n   . Assume for normalization that E y y = 1   . Let d   be a natural number and let y 1 y d   be independent copies of y   . Then E 1 d i = 1 d y i y i E y y C log d d ( E y log d ) 1 / log d ,   provided that the last expression is smaller than 1   .
Remark 2.2. This estimate is in general optimal. A simplest example is A = I   , the identity matrix in R n   . Let us sample random rows of A   (scaled by n   for normalization); this amounts to the random vector X   taking values n e 1 , , n e n   each with probability 1 / n   , where ( e i )   is the canonical basis of R n   .
Then E X X = I   . Here and in the sequel, by u v   we denote the linear operator whose matrix is u T v   . We wish to replace this mean by the sample mean 1 d j = 1 d X j X j   , where X 1 X d   are independent copies of X   . Then E 1 d j = 1 d X j X j I = E max i = 1 n | n d | { j | X j = n e i } | 1 | .   If we want this quantity to be O ( 1 )   , then it is not hard to check that d   should be of order at least n log n   . Therefore, the coefficient log d / d   in Theorem  2.1 is optimal.

2.2 Proof of Theorem  2.1 .

The proof consists of two steps. First we introduce a Rademacher series that majorizes the expectation of the norm. Then we use the main Lemma of [22to obtain a bound for it.
The first step is relatively standard. Let ɛ 1 ɛ d   be independent Bernoulli variables taking values 1 , 1   with probability 1 / 2   and let y 1 y d , y ¯ 1 y ¯ d   be independent copies of y   . Denote E y , E ɛ   the expectation according to y   and ɛ   respectively. Since y i y i y ¯ i y ¯ i   is a symmetric random variable, we have
E y 1 d d i = 1 y i y i E y y E y E y ¯ 1 d d i = 1 y i y i 1 d d i = 1 y ¯ i y ¯ i =
E ɛ E y E y ¯ 1 d d i = 1 ɛ i ( y i y i y ¯ i y ¯ i ) 2 E y E ɛ 1 d d i = 1 ɛ i y i y i .
To estimate the last expectation, we need the following Lemma [22.
Lemma 2.3. Let y 1 y N   be vectors in R l   and let ɛ 1 ɛ N   be independent Bernoulli variables taking values 1 , 1   with probability 1 / 2   . Then E i = 1 N ɛ i y i y i C log l max i = 1 N y i i = 1 N y i y i 1 / 2 .  
Remark 2.4. We can consider the vectors y 1 y N   as vectors in their linear span, so the dimension of the ambient space may be always assumed less or equal to N   .
Applying the Lemma, we get
E 1 d d i = 1 y i y i E y y
C log d d ( E max i = 1 d y i 2 ) 1 / 2 ( E d i = 1 y i y i ) 1 / 2 . (5)
We have
( E max i = 1 d y i 2 ) 1 / 2 ( E ( d i = 1 y i log d ) 2 / log d ) 1 / 2
d 1 / log d ( E y log d ) 1 / log d .
Thus, denoting D = E 1 d d i = 1 y i y i E y y ,   we obtain by  5 
D C log d d ( E y log d ) 1 / log d ( D + 1 ) 1 / 2
C log d d ( E y log d ) 1 / log d ( D 1 / 2 + 1 ) .
If C log d d ( E y log d ) 1 / log d 1 / 2 ,   we get D 2 C log d d ( E y log d ) 1 / log d   which completes the proof of Theorem 1.

2.3 Proof of Theorem  1.1 

By the homogeneity, we can assume A = 1   .
The following lemma of Drineas and Kannan [11(see also [12) reduces Theorem  1.1 to a comparison of A   and a sample A ~   in the spectral norm.
Lemma 2.5 (Drineas, Kannan).
A A P k 2 σ k + 1 ( A ) 2 + 2 A T A A ~ T A ~ . (6)
  • Proof.
    A A P k 2 = sup x ker P k , | x | = 1 A x 2 = sup x ker P k , | x | = 1 A T A x , x
    sup x ker P k , | x | = 1 ( A T A A ~ T A ~ ) x , x + sup x ker P k , | x | = 1 A ~ T A ~ x , x
    = A T A A ~ T A ~ + σ k + 1 ( A ~ T A ~ ) .
    By a result of perturbation theory, | σ k + 1 ( A T A ) σ k + 1 ( A ~ T A ~ ) | A T A A ~ T A ~   . This proves the claim.
For j = 1 m   let x j   be the j   -th row of the matrix A   . Then A T A = j = 1 m x j x j .   We shall view the matrix A T A   as the true mean of a bounded operator valued random variable, whereas A ~ T A ~   will be its empirical mean; then we shall apply the Law of Large Numbers for operator-valued random variables, that is Theorem  2.1 . To this end, define a random vector X R m   as P ( X = A F x j x j ) = x j 2 A F 2 .   and let X 1 X d   be independent copies of X   . Let the matrix A ~   consist of rows 1 d X 1 1 d X d   . (The normalization of A ~   is different than in the statement of Theorem  1.1 : in the proof, it is convenient to multiply A ~   by the factor 1 d A F   . However note that the singular vectors of A ~   and thus P k   do not change.) Then A T A = E X X , A ~ T A ~ = 1 d d i = 1 X j X j ,   and ( E X log d ) 1 / log d sup X = A F .   Applying Theorem  2.1 we get E A ~ T A ~ A T A C log d d A F   whenever the last quantity is less than 1. Let δ < 1   be an absolute constant. Choose d C ( r / δ 2 ɛ 4 ) log ( r / δ 2 ɛ 4 )   , then C log d d A F < δ ɛ 2 .   Hence, by Chebychev's inequality, E A ~ T A ~ A T A ɛ 2 / 2   with probability at least 1 2 δ   . Finally, by  6  A A P k σ k + 1 ( A ) + 2 A T A A ~ T A ~ 1 / 2 σ k + 1 ( A ) + ɛ .   This proves Theorem  1.1 .

2.4 Algorithmic aspects of Theorem  1.1 .

Finding a good low rank approximation to a matrix A   amounts, due to Theorem  1.1 , to sampling a random submatrix A ~   and computing its SVD (actually, only left singular vectors are needed). The algorithm works well if the “average sparsity” r = A 2 / A F 2   is small. This is the case, in particular, when A   is essentially a low-rank matrix, because r r a n k ( A )   .
First, the algorithm samples d = O ( r log r )   random rows of A   . Namely, it should take d   independent samples of the random variable X   whose law is P ( X = A j A j ) = A j 2 A F 2   where A j   is the j   -th row of A   , and   is the Euclidean norm. This sampling can be done in one pass through A   if the matrix is stored row-by-row, and in two passes if its entries are stored in arbitrary order ([10,Section5.1.
Then the algorithm computes the SVD of the d × n   matrix A ~   . This can be done in time O ( d n ) +   the time needed to compute the SVD of a d × d   matrix. The latter can be done by one of the known methods. This takes significantly less time than computing SVD of the original m × n   matrix A   . In particular, the running time of this algorithm is linear in the dimensions of the matrix (and polynomial in d   ).

3 The cut norm and MAX-CSP problems

We will first prove Theorem  1.3 on the cut norm of random submatrices and then indicate how it, together with the argument of [3, implies Theorem  1.2 on approximating MAX-2CSP with the sample complexity O ( ɛ 2 )   .

3.1 Decay of the cut-norm

The proof of Theorem  1.3 uses tools of Probability in Banach spaces: decoupling, symmetrization, and Slepian's Lemma (more precisely, its version for the Rademacher random variables due to M.Talagrand). It is easy to show that the cut norm of any operator U : R n R n   is equivalent to U : n 1 n   , which we denote by U 1   . We start with a decoupling lemma due to Bourgain and Tzafriri [7.
Lemma 3.1. Let ( ξ i )   be a finite sequence of bounded i.i.d. random variables, and ( ξ i )   be its independent copy. Then for any sequence of vectors ( x i j )   in a Banach space with x i i = 0   , E i , j ξ i ξ j x i j 20 E i , j ξ i ξ j x i j .  
Let δ 1 δ n   be independent Bernoulli random variables taking value 1 with probability δ : = q / n   . For Δ = ( δ 1 δ n )   denote by P Δ   the coordinate projection on the random set of coordinates { j | δ j = 1 }   .
Let D ( A )   be the diagonal part of A   . Write P Δ A P Δ = P δ ( A D ( A ) ) P Δ + P Δ D ( A ) P Δ = i j δ i δ j A i j e i e j + i δ i A i i e i e i .   We apply the Lemma  3.1 to estimate the first summand, taking x i j = A i j e i e j   if i j   and x i j = 0   if i = j   . Then by the triangle inequality E P Δ A P Δ 1 20 E P Δ ( A D ( A ) ) P Δ 1 + δ i | A i i | ,   where Δ = ( δ i )   is an independent copy of the sequence ( δ i )   . Then to complete the proof it is enough to assume that the diagonal of A   is zero and prove the inequality in the stated in the theorem for E P Δ A P Δ 1   .
Denoting the unit ball of l n   by B n   , we write
E P Δ A P Δ 1 = E sup x B n i δ i | A P Δ x , e i |
= E sup x B n i ( δ i δ ) | A P Δ x , e i | + δ E A P Δ 1
Since δ i δ   are mean zero, one can replace δ   by δ i   , an independent copy of δ i   . More precisely, the first term does not exceed E sup x B n i ( δ i δ ) | A P Δ x , e i |   The random variable δ i δ i   is symmetric, hence it is distributed identically with ɛ i ( δ i δ i )   , where ɛ i   are Rademacher random variables independent of everything else (i.e. 1 , 1   valued symmetric r.v.'s). Therefore the expression above is bounded by 2 E sup x B n i ɛ i δ i | A P Δ x , e i | .   To estimate it, we use Slepian's inequality for Rademacher random variables, proved by Talagrand (see [19). This estimate, which allows to remove the absolue values, is known as Gine-Zinn argument. Namely, for any y 1 , , y n R n   , E sup x B n i ɛ i | x , y i | E sup x B n i ɛ i x , y i = E i ɛ i y i 1 .   So,
E sup x B n i ɛ i δ i | A P Δ x , e i | E P Δ A * ( i ɛ i δ i e i ) 1
= E j δ j | A * ( i ɛ i δ i e i ) , e j |
= δ E j | i ɛ i δ i A j i |
δ E δ max ɛ { 1 , 1 } n j | i ɛ i δ i A j i |
= δ E P Δ A 1 .
Hence we proved that
E P Δ A P Δ 1 δ ( E P Δ A 1 + E A P Δ 1 ) . (7)
By essentially repeating the above argument, we obtain
E P Δ A 1 E A * ( i ɛ i δ i e i ) 1 + δ E A 1
= E j | i ɛ i δ i A j i | + δ E A 1
and interchanging the expectation and the sum, we then apply Cauchy-Schwartz inequality:
2 j E δ ( i δ i | A j i | 2 ) 1 / 2 + δ E A 1
2 j ( E δ i δ i | A j i | 2 ) 1 / 2 + δ E A 1
= 2 δ 1 / 2 j ( i | A j i | 2 ) 1 / 2 + δ E A 1 .
Also, E A P Δ 1 = E P Δ A * 1 2 δ 1 / 2 j ( i | A i j | 2 ) 1 / 2 + δ E A 1 .   Putting this into  7 , we complete the proof.

3.2 Approximation of MAX-2CSP

Let us briefly indicate how Theorem  1.2 on approximating MAX-2CSP reduces to Theorem  1.3 on the cut norm of random submatrices. This reduction was done by Alon, Fernandez de la Vega, Kannan and Karpinski [1, 3.
For a z { 0 , 1 } 2   , one sets up the n × n   matrix A = A ( z )   whose ( i , j )   -th entry is the number of functions that depend on x i   and x j   and that are made true by the truth assignment x i = z 1   , x j = z 2   . Then the solution M a x ( F )   can be computed as the maximum of the polynomial P ( x ) = i , j A i j ( 0 , 0 ) ( 1 x i ) ( 1 x j ) + A i j ( 0 , 1 ) ( 1 x i ) x j + A i j ( 1 , 0 ) x i ( 1 x j ) + A i j ( 1 , 1 ) x i x j   over the Boolean cube { 0 , 1 } n   . Similarly, the solution M a x ( F | Q )   to the induced problem is the maximum of the polynomial P | Q   determined by the submatrix A | Q × Q   .
The problem is then to compare the maxima of the two polynomials, P   and P | Q   .
This can be proved (although still non-trivially, see (34) in [3) in the case when A ( z )   are “cut-matrices” (or their scalar multiples). A cut matrix is the indicator function of the cartesian product I × J   for some I , J { 1 , , n }   . To pass from cut matrices to general matrices, one uses the algorithmic version of Szemeredi's Regularity Lemma [15, which allows to approximate any matrix in the cut norm by a linear combination of a small number of cut matrices. It is easy to check that the maximum of P ( x )   is stable with respect to small variations of A ( z )   in the cut norm. But it is a nontrivial question if small variations of A ( z )   will result in small variations in the induced problem, that is of the random submatrix A ( z ) | Q × Q   . We thus have to show that the cut norm of a matrix decreases proportinally when one passes to its random submatrix. This is where Theorem  1.3 is used.
Specifically, for a matrix A = A ( z )   , the algorithmic version of Szemeredi's Regularity Lemma [15states that there exists at most 4 / ɛ 2   cut-matrices whose sum, denoted by B   , satisfies
A B C ɛ n A F , A B F A F , A B 2 ɛ n A F , (8)
where   denotes the maximal absolute value of the entries. Note that
A F n A 2 2 2 n (9)
because the number of different Boolean functions that can depend on two fixed variables is 2 2 2   . Thus the approximation is good in the cut norm, A B C = O ( ɛ n 2 )   .
To prove  4 , it remains to show that this error will decrease to ( A B ) | Q × Q C = O ( ɛ q 2 )   for the induced problem. This is done by using Theorem  1.3 (see remark below it) along with  8 and  9 :
E ( A B ) | Q × Q C C ( q n ) 2 A B C + C ( q n ) n A B + C q 3 / 2 n A B F
C ɛ q 2 + C ɛ q + C q 3 / 2 = O ( ɛ q 2 ) if q ɛ 2 .
This establishes nice approximations of both the original and the induced problem, which leads to Theorem  1.2 as described in the previous paragraph.

4 The decay of the spectral norm

In this section we prove Theorem  1.5 .
By homogeneity we can assume that A = 1   . The matrix A   with columns x 1 x n   can be written as A = j n e j x j .   We have to compute E : = E A | Q = E j n δ j e j x j = E j n δ j x j x j 1 / 2 ,   where δ j   are { 0 , 1 }   -valued independent random variables with E δ j = δ = q n   . This will be done by a usual symmetrization and applying Lemma  2.3 .
Now we pass to a detailed proof. The standard symmetrization procedure (see [19Lemma 6.3) yields
E E j n ( δ j δ ) x j x j 1 / 2 + δ A 1 / 2
2 E δ ( E ɛ j n ɛ j δ j x j x j ) 1 / 2 + δ .
Denote J ( δ ) = { j | δ j = 1 }   . We apply Lemma  2.3 to bound E ɛ j J ( δ ) ɛ j x j x j   .
By Remark  2.4 , we can assume l   in this Lemma equal n ( δ ) : = e + j n δ j .   Then using the Cauchy-Schwartz inequality we obtain
E E δ ( C log n ( δ ) max j J ( δ ) x j j J ( δ ) x j x j 1 / 2 ) 1 / 2 + δ
C ( E δ ( log n ( δ ) max j = 1 n δ j x j 2 ) ) 1 / 2 ( E δ j = 1 n δ j x j x j 1 / 2 ) 1 / 2
+ δ . (10)
To estimate the fist term in the product we use the following
Lemma 4.1. Let a 1 a 2 a n 0   and let δ 1 δ n   be independent Bernoulli random variables taking value 1   with probability δ > 2 / n   . Then δ 4 e log δ n j = 1 1 / δ a j E ( log n ( δ ) max j = 1 n δ j a j ) 4 δ log δ n j = 1 1 / δ a j .  
  • Proof. To prove the upper estimate note that max j = 1 n δ j a j j = 1 1 / δ δ j a j + a 1 / δ .   Hence,
    E ( log n ( δ ) max j = 1 n δ j a j ) E ( log n ( δ ) j = 1 1 / δ δ j a j ) + a 1 / δ E log n ( δ ) . (11)
    Jensen's inequality yields E log n ( δ ) log ( E i = 1 n δ i + e ) 2 log δ n .   By the linearity of expectation, the first term equals j = 1 1 / δ a j E ( δ j log n ( δ ) ) j = 1 1 / δ a j E ( δ j log ( i j δ i + 1 + e ) )   Here we estimated n ( δ )   replacing δ j   by 1. Taking the expectation first with respect to δ j   and then with respect to the other δ i   , we obtain that the last expression is bounded by δ j = 1 1 / δ a j log ( δ n + 1 + e ) 2 δ j = 1 1 / δ a j log δ n .   Finally, substituting this bound into  11 , we obtain E ( log i = 1 n δ i max j = 1 n δ j a j ) ( 2 δ j = 1 1 / δ a j + 2 a 1 / δ ) log δ n 4 δ j = 1 1 / δ a j log δ n .   To prove the lower bound, we estimate the product in Lemma  4.1 from below to make the terms independent. We have
    E ( log n ( δ ) max j = 1 n δ j a j ) E ( log ( i = 1 / δ + 1 n δ i + e ) max j = 1 1 / δ δ j a j )
    = E log ( i = 1 / δ + 1 n δ i + e ) E max j = 1 1 / δ δ j a j . (12)
    These terms will be estimated separately. Since P ( i = 1 / δ + 1 n δ i δ n / 2 ) 1 / 2   , E log ( i = 1 / δ + 1 n δ i + e ) 1 2 log δ n 2 .   Let 1 k 1 / δ   . Denote by A k   the event δ k = 1 , δ j = 0   for 1 j 1 / δ , j k   .
    Then P ( A k ) = δ ( 1 δ ) 1 / δ 1 δ / e .   Hence, E max j = 1 1 / δ δ j a j k = 1 1 / δ a k P ( A k ) δ e j = 1 1 / δ a j .   Substituting this estimate into  12 finishes the proof.
Now we can complete the proof of Theorem  1.5 . Combining Lemma  4.1 and  10 , we get E C δ log 1 / 4 δ n A ( 1 / δ ) 1 / 2 E 1 / 2 + δ .   Recall that δ = q / n   . It can be easily checked that E a E 1 / 2 + b   implies E 4 a 2 E + 2 b   .
Hence, E 4 C 2 log 1 / 2 q A ( n / q ) + 2 q / n .   This completes the proof.
References

  1. N. Alon, W. de la Vega, R. Kannan, M. Karpinski, Random Sampling and approximation of MAX-CSPs, 34'th ACM STOC (2002)
  2. Y. Azar, A. Fiat, A. Karlin, F. McSsherry, J. Saia, Spectral analysis of data, Proceedings of the 33rd ACM Symposium on THeory of Computing, 2001
  3. N. Alon, W. de la Vega, R. Kannan, M. Karpinski, Random Sampling and approximation of MAX-CSPs, Journal of Computer and System Sciences 67 (2003), 212-243
  4. M. W. Berry, S. T. Dumais, G. W. O'Brian, Using linear algebra for intelligent information retrieval, SIAM Review 37 (1995), 573–595
  5. M. W. Berry, Z. Drmac, E. R. Jessup, Matrices, vector spaces and information retrieval, SIAM Review 41 (1999), 335–362
  6. M. J. Berry, G. Linoff, Data mining techniques. John-Wiley, 1997
  7. J. Bourgain and L. Tzafriri: Invertibility of “large” sumatricies with applications to the geometry of Banach spaces and harmonic analysis. Israel J. Math., 57 (1987) 137–223
  8. P. G. Casazza, R. Vershynin, Kadison-Singer meets Bourgain-Tzafriri, submitted
  9. S. T. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. H. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (1990), 391–407
  10. P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay, Clustering in large graphs via Singular Value Decomposition, Journal of Machine Learning, to appear (2004)
  11. P. Drineas, R. Kannan, Pass efficient algorithms for approximating large matrices, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (Baltimore, MD, 2003), 223–232, ACM, New York, 2003
  12. P. Drineas, R. Kannan, M. Mahoney, Fast Monte-Carlo Algorithms for Matrices II: Computing a low-rank approximation to a matrix, preprint
  13. P. Drineas, M. Mahoney, R. Kannan, Fast Monte-Carlo Algorithms for Matrices III: Computing an Efficient Approximate Decomposition of a Matrix, preprint
  14. W. Fernandez de la Vega, MAX-CUT has a randomized approximation scheme in dense graphs, Random Structures and Algorithms 8 (1996), 187–199
  15. A. Frieze and R. Kannan, The Regularity Lemma and approximation schemes for dense problems, Proceedings of the 37th Annual IEEE Symposium on Foundations of Computing (1996), 12–20
  16. A. Frieze, R. Kannan and S. Vempala, Fast Monte-Carlo Algorithms for finding low-rank approximations, Proceedings of the Foundations of Computer Science, 1998, pp. 378–390, journal version in Journal of the ACM 51 (2004), 1025-1041
  17. O. Goldreich, S. Goldwasser, D. Ron, Property testing and its connection to learning and approximation, Proc. 37th IEEE FOCS 96, 339–348; journal version in Journal of the ACM 45 (1998), 653–750
  18. B. Kashin, L. Tzafriri, Some remarks on the restrictions of operators to coordinate subspaces, unpublished notes
  19. M. Ledoux and M. Talagrand, Probability in Banach spaces, Springer, 1991
  20. A. A. Lunin, On operator norms of submatrices, Math. USSR Sbornik 27 (1975), 481–502
  21. C. H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: A probabilistic analysis, Proceedings of the ACM symposium on principles of database systems, 1998
  22. M. Rudelson, Random vectors in isotropipc position, J. Funct. Anal. 164 (1999), no. 1, 60–72.
  23. M. Talagrand, Sections of smooth convex bodies via majorizing measures, Acta MAth. 175 (1995), 273–300
  24. M. Talagrand, Majorizing measures: the generic chaining, Ann. Probab. 24 (1996), 1049–1103
  25. R. Vershynin, John's decompositions: selecting a large part, Israel Journal of Mathematics 122 (2001), 253–277

Department of Mathematics, University of Missouri, Columbia, MO 65211, U.S.A. E-mail address : rudelson@math.missouri.edu Department of Mathematics, University of California, Davis, CA 95616, U.S.A. E-mail address : vershynin@math.ucdavis.edu