Learning symmetric $k$-juntas in time $n^{o(k)}$, April 2005

1 Introduction

We consider a fundamental problem in computational learning theory: learning in the presence of irrelevant information. One formalization of the problem is as follows: We want to learn an unknown boolean function of $n$ variables, which depends only on $k \ll n$ variables (typically $k = O(\log n)$). We call such a function a $k$-junta. We are provided with a set of labelled examples $\langle x, f(x) \rangle$, where the $x$'s are picked uniformly and independently at random from the domain $\{0, 1\}^{n}$ (this is the PAC model with uniform distribution). We wish to identify the $k$ relevant variables and the truth table of the function.

The problem was first posed by Blum [?] and Blum and Langley [?], and it is considered [?, ?] to be one of the most important open problems in the theory of uniform distribution learning. It has connections with learning DNF formulas and decision trees of super-constant size; see [?, ?, ?, ?, ?] for details. The general case is believed to be hard and has even been used to propose a cryptosystem [?]. A trivial algorithm runs in time roughly $n^{k}$ by doing an exhaustive search over all possible sets of relevant variables. Two important classes of juntas are learnable in polynomial time: parity and monotone functions. Learning parity functions can be reduced to solving a system of linear equations over $\mathbb{F}_{2}$ [?]. Monotone functions have non-zero singleton Fourier coefficients (e.g., see [?]).

For the general case, the first significant breakthrough was given in [?], which learns with confidence $1 - \delta$ in time $n^{0.7 k} \cdot \mathrm{poly}(2^{k}, n, \log 1/\delta)$. Note that we allow the running time to be polynomial in $2^{k}$, since this is the size of the truth table which is output. In the typical setting of $k = O(\log n)$, this becomes polynomial in $n$.

In this paper we consider the class of symmetric $k$-juntas, functions which are symmetric on their relevant variables. The only non-trivial algorithm known for this case is the standard Fourier-based algorithm, described in Section 2. The analysis of the running time of this algorithm reduces to the following question:

What is the smallest $t$ such that every symmetric boolean function on $k$ variables, which is not a constant or a parity function, has a non-zero Fourier coefficient of order at least $1$ and at most $t$ ?

A bound of $t_{0}$ implies a running time of roughly $n^{t_{0}}$. A bound of $\frac{2k}{3}$ was provided in [?]. This was improved to $\frac{3k}{31}$ in [?]. Here we show a bound of $O(k / \log k)$ (Theorem 3.3), giving the first algorithm for learning symmetric $k$-juntas in time $n^{o(k)}$.

Techniques. Our techniques involve a mix of number theory, combinatorics and probability. We start by reducing our problem to finding 0/1 solutions to a system of Diophantine equations involving binomial coefficients, as in [?]. We then depart from [?] by further reducing this to the problem of showing that a certain integer-valued polynomial $P$ is constant over the set $\{0, 1, \dots, k\}$. We prove this in two steps. First, we show that $P$ is constant over the union of two small intervals $\{0, \dots, t\} \cup \{k - t, \dots, k\}$. This is obtained by looking at $P$ modulo carefully chosen prime numbers. To choose these prime numbers we use the Siegel-Walfisz theorem on the density of primes in arithmetic progressions with modulus of moderate growth. In the second step, we extend the constant nature of $P$ to the whole interval $\{0, \dots, k\}$ by repeated applications of Lucas' Theorem. One additional interesting aspect of our proof is the use of an equivalence between a) the vanishing of Fourier coefficients and b) the equality of moments of certain random variables under the uniform measure on the hypercube and under the measure defined by the function itself. This equivalence helps us eliminate a lot of case analysis.

2 Preliminaries

Symmetric Juntas. Given a boolean function $f$ on $n$ variables $x_{1}, \dots, x_{n}$, we say that $x_{i}$ is a relevant variable for $f$ if there exist $x, y \in \{0, 1\}^{n}$ which differ only in the $i$-th coordinate and $f(x) \neq f(y)$. Variables that are not relevant are called irrelevant. We call $f$ a $k$-junta if $f$ has at most $k$ relevant variables.

We consider the class of symmetric juntas. A boolean function $f : \{0, 1\}^{k} \to \{0, 1\}$ on $k$ variables is a symmetric function if for any permutation $\pi \in S_{k}$, $f(x_{1}, \dots, x_{k}) = f(x_{\pi(1)}, \dots, x_{\pi(k)})$. Hence the value of $f(x_{1}, \dots, x_{k})$ depends only on the weight of $(x_{1}, \dots, x_{k})$, which is the number of variables that are set to $1$. A symmetric $k$-junta is a function on $n$ variables which is symmetric on the $k$ variables it depends on.

We will describe a symmetric boolean function on $k$ variables by a $(k + 1)$-bit string $f_{0} f_{1} \dots f_{k}$, where $f_{i}$ is the value of $f$ on an input of weight $i$. The following four special symmetric functions on $k$ variables will appear often: the two constant functions $0$ and $1$, the parity function $\oplus$, and its complement $\bar{\oplus}$.
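As a small illustration of this encoding, the following sketch evaluates a symmetric function from its $(k + 1)$-bit string. The helper names `symmetric_from_bits` and `parity_bits` are hypothetical and introduced only for this example.

```python
def symmetric_from_bits(bits):
    """Return f: {0,1}^k -> {0,1} given the string f_0 f_1 ... f_k (k = len(bits) - 1)."""
    table = [int(b) for b in bits]
    k = len(table) - 1

    def f(x):
        if len(x) != k:
            raise ValueError("expected an input of length %d" % k)
        return table[sum(x)]  # the value depends only on the weight of x

    return f


def parity_bits(k):
    """The (k+1)-bit string of the parity function on k variables."""
    return "".join(str(i % 2) for i in range(k + 1))


majority3 = symmetric_from_bits("0011")        # 1 exactly on inputs of weight 2 or 3
parity4 = symmetric_from_bits(parity_bits(4))  # the string 01010
```

The four special functions correspond to the all-zeros string, the all-ones string, the alternating string starting with `0`, and the alternating string starting with `1`.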

Learning in the PAC model. We consider the PAC learning model [?], in which we wish to learn a concept class $C = \cup_{n} C_{n}$, where each $C_{n}$ is a collection of boolean functions from $\{0, 1\}^{n} \to \{0, 1\}$. In our case, $C_{n}$ is the class of symmetric $k$-juntas on $n$ variables. Let $\epsilon$ be an accuracy parameter and $\delta$ a confidence parameter.

A learning algorithm $A$ for $C$ has access to an oracle for $f \in C_{n}$. A query to the oracle outputs a labeled example $\langle x, f(x) \rangle$, where $x$ is drawn from $\{0, 1\}^{n}$ according to some probability distribution $D$. The algorithm $A$ is said to be a learning algorithm for the class $C$ under the distribution $D$ if for all $f \in C$, it outputs, with probability at least $1 - \delta$, a hypothesis $h$ such that $\Pr_{x}[h(x) = f(x)] \geq 1 - \epsilon$. We will be concerned only with the uniform distribution and we will obtain an algorithm with accuracy parameter $\epsilon = 0$, i.e., we identify the exact function $f$.

Fourier Transform. We will consider functions of the form $f : \{0, 1\}^{n} \to \mathbb{R}$. An orthonormal basis for the functions defined on the Boolean cube is given by the characters of the group $\mathbb{Z}_{2}^{n}$. In particular, for every $S \subseteq \{1, \dots, n\}$, define the following function:

$$\chi_{S}(x) = (-1)^{\sum_{i \in S} x_{i}}.$$

Any real-valued function on the Boolean cube can be expressed as a linear combination of the functions $\chi_{S}$. Given $f$, we have that $f(x) = \sum_{S} \hat{f}(S) \chi_{S}(x)$, where $\hat{f}(S)$ is the Fourier coefficient of $f$ at $S$ and is equal to the inner product of $f$ with $\chi_{S}$:

$$\hat{f}(S) = \frac{1}{2^{n}} \sum_{x \in \{0, 1\}^{n}} f(x) \chi_{S}(x).$$
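To make the definition concrete, here is a minimal brute-force sketch that evaluates $\hat{f}(S)$ directly from the formula above. It enumerates the whole cube, so it is only meant for very small $n$; the function names and the majority example are assumptions made for illustration, and indices are 0-based.

```python
from itertools import product


def chi(S, x):
    """Character chi_S(x) = (-1)^(sum of x_i for i in S); S holds 0-based indices."""
    return -1 if sum(x[i] for i in S) % 2 else 1


def fourier_coefficient(f, n, S):
    """Exact coefficient: the average of f(x) * chi_S(x) over all of {0,1}^n."""
    total = sum(f(x) * chi(S, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n


def majority3(x):
    return 1 if sum(x) >= 2 else 0


coeffs = {S: fourier_coefficient(majority3, 3, S)
          for S in [(), (0,), (0, 1), (0, 1, 2)]}
# coeffs == {(): 0.5, (0,): -0.25, (0, 1): 0.0, (0, 1, 2): 0.25}
```

For majority on three bits the order-2 coefficients vanish while the order-1 and order-3 coefficients do not, so the first non-zero order is $1$.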

Fourier-based Learning. Let $f$ be a $k$-junta. It is known that we can exactly calculate the Fourier coefficients of $f$ in the uniform distribution PAC model, with confidence $1 - \delta$, in time $\mathrm{poly}(2^{k}, n, \log \frac{1}{\delta})$, using standard Chernoff-Hoeffding bounds (see [?, ?]). Observe further that if $x_{i}$ is an irrelevant variable for a $k$-junta $f$, then for any $S \subseteq \{x_{1}, \dots, x_{n}\}$ containing $x_{i}$ we have $\hat{f}(S) = 0$. Hence if $\hat{f}(S) \neq 0$ for some $S$, then $S$ contains only relevant variables.

This suggests the following algorithm: Starting with $l = 1$, compute the Fourier coefficients of all subsets of $\{x_{1}, \dots, x_{n}\}$ of size $l$. Collect the union of all relevant variables that correspond to subsets with non-zero Fourier coefficients. Stop as soon as you collect all $k$ relevant variables.

Since the function is symmetric, for any two sets $S, T$ of relevant variables such that $|S| = |T|$, we have $\hat{f}(S) = \hat{f}(T)$. Hence the first time that we identify some relevant variables in the algorithm, we will actually be able to identify all the relevant variables. Once we find the relevant variables, finding the truth table of the function can be done in time $\mathrm{poly}(2^{k}, \log \frac{1}{\delta})$.
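The search over subset sizes can be sketched as follows. This is only an illustration under a simplifying assumption: the Fourier coefficients are computed exactly by enumerating $\{0, 1\}^{n}$, which replaces the sample-based estimation used in the PAC setting. The names `find_relevant_variables` and `all_equal_junta` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def fourier_coefficient(f, n, S):
    """Exact coefficient of f at S, by enumerating all of {0,1}^n (small n only)."""
    total = 0
    for x in product((0, 1), repeat=n):
        sign = -1 if sum(x[i] for i in S) % 2 else 1
        total += f(x) * sign
    return Fraction(total, 2 ** n)


def find_relevant_variables(f, n, k):
    """Scan subsets by increasing size; return (relevant variables, last size scanned)."""
    relevant = set()
    for size in range(1, k + 1):
        for S in combinations(range(n), size):
            if fourier_coefficient(f, n, S) != 0:
                relevant.update(S)  # a non-zero coefficient certifies every variable in S
        if len(relevant) == k:
            return relevant, size
    return relevant, k


# A symmetric 3-junta on 6 variables: 1 exactly when x_1, x_3, x_4 all agree.
def all_equal_junta(x):
    return 1 if x[1] == x[3] == x[4] else 0


found, size = find_relevant_variables(all_equal_junta, 6, 3)
```

In this example all coefficients of order $1$ vanish, and the three order-2 coefficients on pairs of relevant variables are non-zero, so all relevant variables are found together at size $2$, as the symmetry argument predicts.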

The above algorithm would take time roughly $n^{k}$ for $f \in \{0, 1, \oplus, \bar{\oplus}\}$. However, these particular functions are well known to be learnable in time $\mathrm{poly}(n, \log \frac{1}{\delta})$. Hence the following is true:

Fact 2.1. If every symmetric function $f \notin \{0, 1, \oplus, \bar{\oplus}\}$ has a non-zero Fourier coefficient of order between $1$ and $t$, then we can learn symmetric $k$-juntas in time $n^{t} \cdot \mathrm{poly}(2^{k}, n, \log \frac{1}{\delta})$.

3 Main Section

3.1 An Equivalent Formulation

We state an equivalent condition for the existence of a non-zero Fourier coefficient of a boolean function $f$, as proved in [?]. Let $f : \{0, 1\}^{k} \to \{0, 1\}$ be a boolean function. For a vector $x = (x_{1}, \dots, x_{k})$ and a set $S \subseteq [k]$, let $x_{S}$ be the projection of $x$ on the indices of $S$. Let $\sigma \in \{0, 1\}^{|S|}$. Define the following probabilities:

$$p_{S, \sigma}(f) := \Pr[f(x) = 1 \mid x_{S} = \sigma].$$

Unless mentioned otherwise, all probabilities are over the uniform distribution on $\{0, 1\}^{k}$. For $t \geq 1$, call a boolean function $f$ on $k$ variables $t$-null if for all sets $S \subseteq [k]$ with $|S| = t$ and for all $\sigma \in \{0, 1\}^{t}$, the probabilities $p_{S, \sigma}(f)$ are all equal to each other. The following lemma reveals the connection with the Fourier coefficients of $f$.

Lemma 3.1. [?] Let $f$ be a boolean function on $k$ variables. Then $f$ is $t$-null for some $1 \leq t \leq k$ if and only if, for all $\emptyset \neq S \subseteq [k]$ with cardinality at most $t$, $\hat{f}(S) = 0$.

It is clear that if $s \leq t$ and $f$ is $t$-null then it is also $s$-null.

When we consider the case of symmetric functions, $p_{S, \sigma}(f)$ just depends on $t := |S|$ and the weight $w$ of $\sigma$. We denote this by $p_{t, w}(f)$. It is clear that:

$$p_{t, w}(f) = \frac{1}{2^{k - t}} \sum_{i = 0}^{k} f_{i} \binom{k - t}{i - w}, \tag{1}$$

where $\binom{l}{m} = 0$ if $m < 0$ or $m > l$, and $\binom{0}{0} = 1$. It follows that $f$ is $t$-null if the values $p_{t, w}(f)$, for $0 \leq w \leq t$, are all equal. It is easy to see that the constant boolean functions $\{0, 1\}$ are $t$-null for all $t$ with $1 \leq t \leq k$. The parity functions $\{\oplus, \bar{\oplus}\}$ are also $t$-null for all $t$ satisfying $1 \leq t < k$. From Lemma 3.1 and Equation (1) we get:

Corollary 3.2. All symmetric boolean functions $f \notin \{0, 1, \oplus, \bar{\oplus}\}$ have a non-zero Fourier coefficient of order at most $t_{0}$ (and at least $1$) iff $\{0, 1, \oplus, \bar{\oplus}\}$ are the only solutions to

$$\sum_{i = 0}^{k - t_{0}} f_{i} \binom{k - t_{0}}{i} = \sum_{i = 1}^{k - t_{0} + 1} f_{i} \binom{k - t_{0}}{i - 1} = \dots = \sum_{i = t_{0}}^{k} f_{i} \binom{k - t_{0}}{i - t_{0}}. \tag{2}$$

In the next section, we show that this is true for $t_{0} \leq C k / \log k$ for large enough $k$.
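For small $k$ the condition of this subsection can be checked exhaustively. The sketch below tests $t$-nullity of a symmetric function through the integer sums $\sum_{i} f_{i} \binom{k - t}{i - w}$ from Equation (1), and reports the smallest $t$ for which a function is not $t$-null, which by Lemma 3.1 is the order of its lowest non-zero non-constant Fourier coefficient. All function names are hypothetical, and the brute force over $2^{k + 1}$ strings is only an experiment, not part of the proof.

```python
from itertools import product
from math import comb


def binom(l, m):
    """Binomial coefficient with the convention binom(l, m) = 0 for m < 0 or m > l."""
    return comb(l, m) if 0 <= m <= l else 0


def is_t_null(f, t):
    """f is the tuple (f_0, ..., f_k); compare the numerators of p_{t,w}(f), w = 0..t."""
    k = len(f) - 1
    sums = {sum(f[i] * binom(k - t, i - w) for i in range(k + 1))
            for w in range(t + 1)}
    return len(sums) == 1


def first_nonzero_order(f):
    """Smallest t >= 1 with f not t-null, or None if f is t-null for all 1 <= t <= k."""
    k = len(f) - 1
    for t in range(1, k + 1):
        if not is_t_null(f, t):
            return t
    return None


def worst_case_order(k):
    """Largest first non-zero order over symmetric functions outside {0, 1, parity, co-parity}."""
    parity = tuple(i % 2 for i in range(k + 1))
    trivial = {(0,) * (k + 1), (1,) * (k + 1), parity, tuple(1 - b for b in parity)}
    return max(first_nonzero_order(f)
               for f in product((0, 1), repeat=k + 1) if f not in trivial)
```

For example, `first_nonzero_order((1, 0, 0, 1))` is `2`, the parity string `(0, 1, 0, 1)` gives `3`, and `worst_case_order(3)` is `2`.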

3.2 A bound of $O(k / \log k)$

The following is our main theorem.

Theorem 3.3. There is an absolute constant $C > 0$ such that for large $k$, every symmetric boolean function $f$ on $k$ bits with $f \notin \{0, 1, \oplus, \bar{\oplus}\}$ has a non-zero Fourier coefficient of order at most $C k / \log k$ and at least $1$.

The rest of this section is devoted to proving Theorem 3.3. Suppose $f$ is a symmetric boolean function on $G = \mathbb{Z}_{2}^{k}$, such that all its Fourier coefficients of order at least $1$ and at most $k - N$ are $0$. Then the values $f_{j}$ of $f$ satisfy (2) with $t_{0} = k - N$, which, changing parameters, can be rewritten as:

$$\sum_{j} \binom{N}{j} f_{\nu + j} = c_{N}, \quad \text{for all } \nu = 0, \dots, k - N, \tag{3}$$

where $c_{N}$ does not depend on $\nu$.

We want to show that if $k - N \geq C k / \log k$, for some appropriately large constant $C > 0$, then $f_{j}$ is either constant or alternates between $0$ and $1$. We prove this for all $k$ sufficiently large.

Define $X_{j} = f_{j + 1} - f_{j}$, for $j = 0, \dots, k - 1$, and observe that the sequence $X_{j}$ satisfies the homogeneous version of (3):

$$\sum_{j} \binom{N}{j} X_{\nu + j} = 0, \quad \text{for all } \nu = 0, \dots, k - N - 1. \tag{4}$$

Remark. In (4) the number $N$ can be replaced by any other integer $N_{1}$ in the interval $[N, k]$. This follows since all the non-constant Fourier coefficients up to order $k - N$ are $0$, and hence so are those up to order $k - N_{1} \leq k - N$.

From (4) the sequence $X_{j}$ may be defined for all $j \in \mathbb{Z}$, with $X_{j} \in \mathbb{Z}$ for all $j$. From the theory of recurrence relations we then know that the sequence $X_{j}$ may be written as a linear combination of the following sequences:

$$(-1)^{j}, \ (-1)^{j} j, \ (-1)^{j} j^{2}, \ \dots, \ (-1)^{j} j^{N - 1}.$$

The reason for this is that $-1$ is the only root of the characteristic polynomial of the recurrence, $\phi(z) = \sum_{j} \binom{N}{j} z^{j} = (1 + z)^{N}$. Therefore there is a polynomial $P(x)$, of degree at most $N - 1$, such that

$$X_{j} = (-1)^{j} P(j), \quad \text{for all } j \in \mathbb{Z}.$$
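The claim about the basic solutions can be checked numerically. The following sketch, with hypothetical helper names and an arbitrary choice of $N = 5$, verifies that each sequence $(-1)^{j} j^{m}$ with $m \leq N - 1$ satisfies the homogeneous recurrence $\sum_{j} \binom{N}{j} X_{\nu + j} = 0$, while $m = N$ does not.

```python
from math import comb


def recurrence_lhs(X, N, nu):
    """Left-hand side sum_j binom(N, j) * X(nu + j) of the homogeneous recurrence."""
    return sum(comb(N, j) * X(nu + j) for j in range(N + 1))


def signed_power(m):
    """The sequence j -> (-1)^j * j^m, valid for every integer j."""
    return lambda j: (1 - 2 * (j % 2)) * j ** m


N = 5  # arbitrary small value, chosen only for this check
solves = [all(recurrence_lhs(signed_power(m), N, nu) == 0 for nu in range(-10, 11))
          for m in range(N + 1)]
# solves == [True, True, True, True, True, False]
```

The check works because the left-hand side is, up to sign, the $N$-th finite difference of $j^{m}$, which vanishes exactly when $m < N$.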

Clearly $P(x)$ takes integer values on integers and in particular $P(j) \in \{-1, 0, 1\}$ for $j = 0, \dots, k - 1$. From the well known characterization of integer-valued polynomials it follows that we may write

$$P(x) = \sum_{j = 0}^{N - 1} a_{j} \binom{x}{j}, \quad \text{with } a_{j} \in \mathbb{Z}. \tag{5}$$

If $p \geq N$ is a prime then, since all the factors that appear in denominators in (5) are strictly less than $p$ (hence invertible mod $p$), it follows that the sequence $P(j) \bmod p$, $j \in \mathbb{Z}$, may be viewed as a polynomial with coefficients in $\mathbb{Z}_{p}$ and therefore is a $p$-periodic sequence mod $p$, i.e.

$$P(j + p) = P(j) \bmod p, \quad \text{for all } j \in \mathbb{Z} \text{ and primes } p \geq N. \tag{6}$$

If, in addition, $0 \leq j < j + p < k$, so that all $P$-values that appear in (6) are in $\{-1, 0, 1\}$, it follows that we have the non-modular equality

$$P(j + p) = P(j), \quad (N \leq p \leq j + p < k). \tag{7}$$
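The periodicity mod $p$ can be illustrated on a concrete integer-valued polynomial. In this sketch the coefficient list `coeffs` is a hypothetical choice of $a_{0}, \dots, a_{4}$ (so $N = 5$), and the helper names are likewise assumptions. The check succeeds for primes $p \geq N$ and fails for the prime $3 < N$.

```python
from math import factorial


def binom_poly(x, j):
    """binom(x, j) = x(x-1)...(x-j+1)/j!, an integer for every integer x."""
    num = 1
    for i in range(j):
        num *= x - i
    return num // factorial(j)  # exact: j! divides any product of j consecutive integers


def P(x, coeffs):
    """Evaluate sum_j coeffs[j] * binom(x, j)."""
    return sum(a * binom_poly(x, j) for j, a in enumerate(coeffs))


def is_periodic_mod(coeffs, p, lo=-30, hi=30):
    """Check P(j + p) = P(j) mod p for every j in [lo, hi]."""
    return all((P(j + p, coeffs) - P(j, coeffs)) % p == 0 for j in range(lo, hi + 1))


coeffs = [3, -2, 7, 1, 4]  # hypothetical a_0..a_4, degree 4, so N = 5
periodic = {p: is_periodic_mod(coeffs, p) for p in (3, 5, 7, 11)}
# periodic == {3: False, 5: True, 7: True, 11: True}
```

For $p = 3$ the term $\binom{x}{3}$ has a denominator divisible by $p$, and indeed $P(3) - P(0) = 16$ is not a multiple of $3$.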

We want to show that $f \in \{0, 1, \oplus, \bar{\oplus}\}$. Since $X_{j} = f_{j + 1} - f_{j}$ it is enough to show that either $X_{j}$ is identically $0$, or that $X_{j} = (-1)^{j}$ or $X_{j} = (-1)^{j + 1}$. This is equivalent to showing that $P$ is a constant polynomial, constantly equal to $-1$, $0$ or $1$.

Notation. 1. In what follows we repeatedly use the letter $C$ to denote a positive constant which depends on no parameter (unless we say otherwise). As is customary, this constant $C$ need not be the same in all its occurrences.

2. We define $\epsilon$ by the relation $k - N = \epsilon k$ and assume $\epsilon \geq C / \log k$, with $C$ a large enough positive constant.

We shall need various primes in intervals from now on. The version of the prime number theorem that we will be using is the Siegel-Walfisz theorem (see [?, Theorem 2]). Define the logarithmic integral

$$\mathrm{Li}\, x = \int_{2}^{x} \frac{dt}{\log t} \sim \frac{x}{\log x}, \quad (x \to \infty).$$

The Euler function $\phi(q)$ denotes the number of residues mod $q$ which are coprime to $q$.

Theorem A (Siegel-Walfisz). Let $\pi(x; M, a)$ be the number of primes $\leq x$ which are equal to $a \bmod M$ and assume that $(M, a) = 1$. Then if $M \leq (\log x)^{A}$, with $A$ a constant, we have

$$\pi(x; M, a) = \frac{\mathrm{Li}\, x}{\phi(M)} + O\left(x \exp(-c \sqrt{\log x})\right), \quad (\text{as } x \to \infty), \tag{8}$$

where $c$ depends on $A$ only (the constant in the $O(\cdot)$ term is absolute).

For $\pi(x)$, the number of primes up to $x$ without any restriction, the prime number theorem says $\pi(x) = \mathrm{Li}(x) + O\left(x \exp(-c \sqrt{\log x})\right)$, for some constant $c$.

These theorems guarantee that, for $x \to \infty$, the interval $[x, x + \Delta]$ has the "expected" number of primes whenever $\Delta \geq C x / (\log x)^{A}$, whatever the constant $A$, even if we impose the condition that these primes are equal to $a \bmod M$, as long as $M \leq (\log x)^{B}$, for any constant $B$.
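For intuition only, one can count primes of a given residue class in an interval with a plain sieve. The sketch below is a numerical experiment with hypothetical function names; it says nothing about the asymptotic statement, which is what the argument actually relies on.

```python
def primes_in_interval(lo, hi):
    """Primes p with lo <= p <= hi, via a plain sieve of Eratosthenes up to hi."""
    if hi < 2:
        return []
    sieve = [True] * (hi + 1)
    sieve[0] = sieve[1] = False
    i = 2
    while i * i <= hi:
        if sieve[i]:
            for multiple in range(i * i, hi + 1, i):
                sieve[multiple] = False
        i += 1
    return [p for p in range(max(lo, 2), hi + 1) if sieve[p]]


def count_in_progression(lo, hi, M, a):
    """Number of primes in [lo, hi] that are congruent to a mod M."""
    return sum(1 for p in primes_in_interval(lo, hi) if p % M == a % M)


small = primes_in_interval(10, 30)           # [11, 13, 17, 19, 23, 29]
ones = count_in_progression(10, 30, 4, 1)    # 13, 17, 29
minus_ones = count_in_progression(10, 30, 4, -1)  # 11, 19, 23
```

Already in this tiny interval the two residue classes coprime to $4$ receive the same number of primes, in line with the equidistribution that Theorem A quantifies.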

We use the above theorems along with the $p$-periodicity of $P$ to deduce that $P$ is in fact $2$-periodic on the union of $2$ small sub-intervals of $[0, k - 1]$.

Lemma 3.4. The polynomial $P$ satisfies the $2$-periodicity condition $P(j) = P(j + 2)$, whenever $j, j + 2 \in A = [0, k - N] \cup [N, k - 1]$.

Proof. Assume $q < r$ are two primes in $[N, N + h]$, where $h = (k - N) / 3 = \frac{\epsilon}{3} k$. (The length of the interval $[N, N + h]$ is large enough for the prime number theorem to guarantee the existence of many primes in it.) From (7) it follows that the finite sequences

$$P(0), \dots, P(k - q) \quad \text{and} \quad P(q), \dots, P(k)$$

are identical. Applying (7) again with $r$ we get that the finite sequences

$$P(0), \dots, P(k - r) \quad \text{and} \quad P(r), \dots, P(k)$$

are identical. It follows that

$$P(j + r - q) = P(j), \quad \text{if } N + h \leq j \leq N + 2h \text{ and } r > q \text{ are primes in } [N, N + h]. \tag{9}$$

We now assume that the difference $M = r - q$ is the smallest difference between two primes in $[N, N + h]$. By the prime number theorem $M \leq C \log k$. Hence, we can apply Theorem A. Since $\phi(M) \leq M \leq C \log k$, in that case Theorem A guarantees that the number of primes equal to $a \bmod M$ in $[N, N + h]$ is at least

$$C \frac{h}{\log^{2} k} \sim C \frac{k}{\log^{3} k},$$

whenever $(M, a) = 1$. All that matters here is that this number is positive.

Let $t \in [N, N + h]$ be the smallest prime which is equal to $-1 \bmod M$. By Theorem A, applied to $M$ and $-1$, its existence is guaranteed and furthermore $t \sim N$. The same theorem guarantees that we can find a prime $s \in (t, N + h]$ such that $s = 1 \bmod M$. Then $s - t = 2 \bmod M$, i.e., $s - t = \ell M + 2$, for some nonnegative integer $\ell$. Therefore, for $N + h \leq j \leq N + 2h$ we have

$$\begin{aligned} P(j) &= P(j + s - t) && \text{(applying diff-periodicity (9) for the primes } s, t\text{)} \\ &= P(j + \ell M + 2) \\ &= P(j + (\ell - 1) M + 2) && \text{(applying diff-periodicity (9) for the primes } r, q\text{)} \\ &\ \ \vdots \\ &= P(j + 2). \end{aligned}$$

This $2$-periodicity

$$P(j) = P(j + 2) \tag{10}$$

is transferred to all $j, j + 2 \in A$ by using (7) repeatedly for appropriate primes $p$.

Notice that in the sequence $X_{j}$, if one erases the $0$'s then one sees an alternation of $-1$ and $1$ (this follows from the fact that $f_{j} \in \{0, 1\}$). This property greatly reduces the number of allowed patterns in $X_{j}$ and in fact it implies that $P$ is constant in $A$.

Lemma 3.5. The polynomial $P$ is constant in $A$ (defined in Lemma 3.4).

Proof. From Lemma 3.4 the values of $P$ in $[N, k - 1]$ must be a $2$-periodic sequence. The only essentially different non-constant $2$-periodic patterns for the values of $P$ in $[N, k - 1]$ are $0, 1, 0, 1, 0, 1, \dots$ and $(-1), 1, (-1), 1, \dots$, and they both violate the property that $X_{j} = (-1)^{j} P(j)$ must satisfy, namely that if one erases the $0$'s then one must see an alternation of $1$ and $-1$. The same argument applies in $[0, k - N]$. Therefore $P$ is constant in each of the two intervals of $A$. From the $p$-periodicity (7) it follows that the constant is the same in both intervals.

We now extend the set on which $P$ is constant to a superset of $A$ that contains a small interval around $k / 2$. We will make use of the following theorem, which follows from Lucas' Theorem [?, Ch. 3].

Theorem 3.6. If $r$ is a prime which does not divide $n$ then $\binom{m r}{n} = 0 \bmod r$. Also, if $0 \leq m < r$ then $\binom{m r}{l r} = \binom{m}{l} \bmod r$.
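Both congruences are easy to test by direct computation for small parameters. The sketch below uses a hypothetical helper, `lucas_consequences_hold`, that checks them for every $m$ up to a chosen bound; it also shows that the first congruence can fail when $r$ is not prime.

```python
from math import comb


def lucas_consequences_hold(r, m_max):
    """Check both congruences of Theorem 3.6 for modulus r and all 0 <= m <= m_max."""
    for m in range(m_max + 1):
        top = m * r
        # binom(m r, n) = 0 mod r whenever r does not divide n
        for n in range(top + 1):
            if n % r != 0 and comb(top, n) % r != 0:
                return False
        # binom(m r, l r) = binom(m, l) mod r when 0 <= m < r
        if m < r:
            for l in range(m + 1):
                if (comb(top, l * r) - comb(m, l)) % r != 0:
                    return False
    return True


prime_checks = [lucas_consequences_hold(r, 6) for r in (2, 3, 5, 7, 11, 13)]
composite_check = lucas_consequences_hold(4, 2)  # fails: binom(4, 2) = 6 is not 0 mod 4
```

With $m = 2$ and an odd prime $r$, the only terms of $\sum_{j} (-1)^{j} \binom{2r}{j} P(j + \nu)$ that survive mod $r$ are $j = 0, r, 2r$, with coefficients $1, -2, 1$.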

Lemma 3.7. Let $a = (1/2 - \epsilon/2) k$ and $b = (1/2 + \epsilon/2) k$. Then $P(l) = P(0)$ for $a \leq l \leq b$.

Proof. We shall apply Theorem 3.6 with $m = 2$ and with a prime $r$ such that $2r - N$ takes the minimal possible nonnegative value. It follows from the prime number theorem that $2r - N = o(\epsilon k)$. It also follows from the remark after (4) that

$$\sum_{j} (-1)^{j} \binom{2r}{j} P(j + \nu) = 0, \quad (\nu \in \mathbb{Z}).$$

Taking residues mod $r$ and using Theorem 3.6 for $m = 2$ we obtain

$$P(\nu) - 2 P(\nu + r) + P(\nu + 2r) = 0 \bmod r, \quad (\nu \in \mathbb{Z}).$$

By our particular choice of $r$ we have $P(\nu) = P(\nu + 2r) = P(0)$ whenever $\nu \in [0, k - N - o(\epsilon k)]$. Since $r$ is a large odd prime and $P(\nu + r) \in \{-1, 0, 1\}$, it follows that $P(\nu + r) = P(0)$. Applying this for all $\nu \in [0, k - N - o(\epsilon k)]$ we get $P(l) = P(0)$ for all $l$ in the interval $(a + o(\epsilon k), b - o(\epsilon k))$. To get rid of the $o(\epsilon k)$ terms in the interval above, just choose a slightly larger $r$ and apply again for all $\nu \in [0, k - N - o(\epsilon k)]$.

So far we have proved $P(l) = P(0)$ on the set

$$A_{2} = [0, k - N] \cup [a, b] \cup [N, k - 1],$$

which consists of three equispaced intervals of roughly equal size $\epsilon k$. We consider $2$ cases for $P$. The first is when $P$ is $0$ on $A_{2}$ and the second is when $P$ is $1$ or $-1$ on $A_{2}$.

In the case that $P$ is $0$ on $A_{2}$, we shall need the following theorem, which already gives a lot of significant information about the function $f$. It should be thought of as analogous to the fact that the moments of a (vector) random variable can be read off the Fourier Transform of its distribution (the characteristic function) by looking at derivatives at $0$.

Theorem 3.8. Suppose $f : G = \mathbb{Z}_{2}^{k} = \{0, 1\}^{k} \to \mathbb{R}$ is nonnegative (and not identically $0$) and has all its Fourier coefficients of order at most $r$ (and at least $1$) equal to $0$. Let $\mu$ denote the uniform probability measure on the cube $G$ and $\nu$ denote the probability measure on $G$ defined by

$$\nu(A) = \sum_{x \in A} f(x) \Big/ \sum_{x \in G} f(x), \quad (A \subseteq G).$$

Let also $X_{1}, \dots, X_{k}$ denote the coordinate functions on $G$, which we view as random variables. Then for all $i_{1} < i_{2} < \dots < i_{s}$, with $0 \leq s \leq r$, we have

$$E_{\nu}(X_{i_{1}} \cdots X_{i_{s}}) = E_{\mu}(X_{i_{1}} \cdots X_{i_{s}}).$$

Proof. Let $F = \sum_{x \in G} f(x)$. We assume for simplicity that $i_{1} = 1, \dots, i_{s} = s$. Then, writing $x = (x_{1}, x_{2}, \dots, x_{k})$ and $[s] = \{1, \dots, s\}$, we have

$$\begin{aligned} E_{\nu}(X_{1} \cdots X_{s}) &= \frac{1}{F} \sum_{x \in G} f(x)\, x_{1} \cdots x_{s} \\ &= \frac{1}{F} \sum_{x \in G} f(x)\, \frac{1 + (-1)^{x_{1} + 1}}{2} \cdots \frac{1 + (-1)^{x_{s} + 1}}{2} \\ &= \frac{1}{2^{s} F} \sum_{x \in G} f(x) \sum_{S \subseteq [s]} (-1)^{|S| + \sum_{i \in S} x_{i}} \\ &= \frac{|G|}{2^{s} F} \sum_{S \subseteq [s]} (-1)^{|S|} \frac{1}{|G|} \sum_{x \in G} f(x) (-1)^{\sum_{i \in S} x_{i}} \\ &= \frac{|G|}{2^{s} F} \sum_{S \subseteq [s]} (-1)^{|S|} \hat{f}(S) \\ &= \frac{|G|}{2^{s} F} \hat{f}(\emptyset) \qquad \text{(by the vanishing of } \hat{f}\text{)} \\ &= 2^{-s} \\ &= E_{\mu}(X_{1} \cdots X_{s}). \end{aligned}$$
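The identity can be checked on a small example. The sketch below takes the 0/1 indicator of odd parity on $k = 4$ bits, whose Fourier coefficients of orders $1$ through $3$ all vanish, and compares the mixed moments under the induced measure with $2^{-s}$. The names `moment_under_f` and `odd_parity`, and the choice of example, are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def moment_under_f(f, k, idx):
    """E_nu(prod_{i in idx} X_i), where nu weights each x in {0,1}^k proportionally to f(x)."""
    cube = list(product((0, 1), repeat=k))
    F = sum(f(x) for x in cube)
    hit = sum(f(x) for x in cube if all(x[i] for i in idx))
    return Fraction(hit, F)


def odd_parity(x):
    return sum(x) % 2


k = 4
low_orders_match = all(
    moment_under_f(odd_parity, k, idx) == Fraction(1, 2 ** s)
    for s in range(k)                       # orders 0, 1, ..., k - 1
    for idx in combinations(range(k), s)
)
top_moment = moment_under_f(odd_parity, k, tuple(range(k)))  # order k
```

Here `low_orders_match` is `True`, while the order-$4$ moment is $0$ rather than $2^{-4}$, reflecting the single non-zero coefficient of order $k$.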

Remarks. 1. For functions $f : \{0, 1\}^{k} \to \{0, 1\}$, the above theorem follows directly from the definition of $t$-nullity in Section 3.1. However, as we shall see in the proof of Lemma 3.10, we need to apply this theorem for functions whose range is not $\{0, 1\}$.

2. If the nonnegative function $f$ is symmetric then the identity of moments up to order $r$ with those of the uniform distribution ($r$-wise independence) and the vanishing of the non-constant Fourier coefficients of weight up to $r$ are equivalent. This can be proved by induction on $r$. We do not use this here.

Corollary 3.9. Under the assumptions and definitions of Theorem 3.8 the random variable $S = X_{1} + \dots + X_{k}$ has the same power moments under the probability measures $\mu$ and $\nu$, up to order $r$.

Proof. The power $S^{s}$ , $s \leq r$ , can be written as a sum of terms of the type $X_{i_{1}} \dots X_{i_{t}}$ , for $t \leq s$ .
One uses the fact that $X_{j}^{2} = X_{j}$ .

Lemma 3.10. If $P$ is $0$ on $A_{2}$, then $f \in \{0, 1\}$.

Proof. Suppose the polynomial $P$ is constantly equal to $0$ on the set $A_{2}$ and that $f \notin \{0, 1\}$. The sequence $f_{j}$ is constant in each of the three intervals of $A_{2}$. By possibly considering $1 - f$ (whose Fourier coefficients vanish exactly where those of $f$ do), we may assume that $f_{j} = 0$ on the middle interval $(a, b)$. Define the nonnegative function $g : G \to \mathbb{R}$ by

$$g(x_{1}, \dots, x_{k}) = f(x_{1}, \dots, x_{k}) + f(1 - x_{1}, \dots, 1 - x_{k}),$$

and observe that the Fourier coefficients of $g$ of weight at least $1$ and at most $k - N$ vanish. Let $\tau$ be the distribution of the random variable $S = X_{1} + \dots + X_{k}$ under the measure induced by $g$ on $G$ (each vertex $x \in G$ has probability proportional to $g(x)$). Note that this is a well defined probability distribution since we assumed that $f$ and $1 - f$ are not the $0$ function. Clearly $\tau$ is symmetric about $k / 2$ and has no mass in $(a, b)$, since both $f(x_{1}, \dots, x_{k})$ and $f(1 - x_{1}, \dots, 1 - x_{k})$ are $0$ when $x_{1} + \dots + x_{k} \in (a, b)$. The $s$-th moment with respect to the measure $\tau$ of the variable $S$ in Corollary 3.9 is the expression

$$M(\tau, s) = \frac{1}{F} \sum_{j} g_{j} \binom{k}{j} j^{s},$$

where again $F = \sum_{j} g_{j} \binom{k}{j}$. By Corollary 3.9 this must equal the $s$-th moment with respect to the binomial measure $\mu$, which is the quantity

$$M(\mu, s) = 2^{-k} \sum_{j} \binom{k}{j} j^{s}.$$

But the variance of $S$ under $\mu$ is

$$M(\mu, 2) - M(\mu, 1)^{2} = k / 4,$$

since under $\mu$ the random variables $X_{1}, \dots, X_{k}$ are independent, while the variance of $S$ under $\tau$ is

$$M(\tau, 2) - M(\tau, 1)^{2} \geq C \epsilon^{2} k^{2},$$

as half the mass of $\tau$ sits to the left of $\frac{1 - \epsilon}{2} k$ and half to the right of $\frac{1 + \epsilon}{2} k$. These orders of magnitude are different whenever $\epsilon \geq C / \sqrt{k}$, which is true in our case as $\epsilon \geq C / \log k$. This contradiction proves that $P$ cannot equal $0$ on $A_{2}$ unless $f \in \{0, 1\}$.
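The gap between the two variances can be seen numerically. The sketch below computes the variance of the weight $S$ from the moment formulas for the binomial measure and for a symmetric weighting with no mass strictly between $a$ and $b$. The parameters $k = 100$, $\epsilon = 0.4$ and the particular weighting `gapped` are hypothetical; this weighting is not claimed to have vanishing Fourier coefficients and only illustrates the order-of-magnitude comparison.

```python
from fractions import Fraction
from math import comb


def weight_variance(k, g):
    """Variance of the weight S when weight j has mass proportional to g[j] * binom(k, j)."""
    mass = [g[j] * comb(k, j) for j in range(k + 1)]
    F = sum(mass)
    m1 = Fraction(sum(m * j for j, m in enumerate(mass)), F)
    m2 = Fraction(sum(m * j * j for j, m in enumerate(mass)), F)
    return m2 - m1 * m1


k, a, b = 100, 30, 70   # eps = 0.4, so a = (1 - eps) k / 2 and b = (1 + eps) k / 2
uniform = [1] * (k + 1)
gapped = [0 if a < j < b else 1 for j in range(k + 1)]  # symmetric, no mass in (a, b)
var_mu = weight_variance(k, uniform)   # binomial variance, equal to k / 4
var_tau = weight_variance(k, gapped)   # at least (eps * k / 2) ** 2 = 400
```

The binomial variance is of order $k$, while any symmetric distribution avoiding $(a, b)$ has variance at least $(\epsilon k / 2)^{2}$, which is of order $\epsilon^{2} k^{2}$.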

Extending $A_{2}$ to $[0, k - 1]$

The rest of the proof goes as follows. By Lemma 3.10, we may assume that $P(l) = 1$ or $-1$ for $l \in A_{2}$. Without loss of generality, assume $P$ is $1$ on $A_{2}$. We apply Theorem 3.6 for $m = 4, 8, 16, \dots$ successively and each time we choose a prime $r$ such that $m r - N$ is minimized. Theorem 3.6 gives for all $\nu \in \mathbb{Z}$

$$P(\nu) - m P(\nu + r) + \binom{m}{2} P(\nu + 2r) - \dots + P(\nu + m r) = 0 \bmod r. \tag{11}$$

When $\nu \in [0, k - N]$ the numbers $\nu + l r$ for even $l$ in (11) are in the set $A_{m/2}$ and therefore the corresponding $P$ values are all $1$, by induction on $m$. In order to deduce that (11) holds as an identity of integers (not residue classes) it is enough to guarantee that the sum of the absolute values of all terms is less than $r$. This amounts to the inequality $2^{m} < r$. Given that $m r \sim k$ this is true if we can guarantee that

$$m \leq c_{1} \log k, \tag{12}$$

for some small enough constant $c_{1}$. Therefore, as long as $m$ satisfies the bound (12), we have that, for $\nu \in [0, k - N]$,

$$P(\nu) - m P(\nu + r) + \binom{m}{2} P(\nu + 2r) - \dots + P(\nu + m r) = 0. \tag{13}$$

Since the total weights of the positive and negative terms in (13) are the same, it follows that the $P(\nu + l r)$ terms corresponding to odd $l$ are also $1$.
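This weight-counting step can be verified by enumeration for small $m$. The sketch below, with a hypothetical function name, fixes the even-indexed values to $1$, lets the odd-indexed values range over $\{-1, 0, 1\}$, and lists every assignment for which the alternating sum in (13) is zero.

```python
from itertools import product
from math import comb


def forced_odd_values(m):
    """Assignments in {-1,0,1} at odd l that satisfy (13) when all even-l values are 1."""
    odd = list(range(1, m, 2))
    even_sum = sum(comb(m, l) for l in range(0, m + 1, 2))  # equals 2^(m-1)
    solutions = []
    for values in product((-1, 0, 1), repeat=len(odd)):
        odd_sum = sum(comb(m, l) * v for l, v in zip(odd, values))
        if even_sum - odd_sum == 0:
            solutions.append(values)
    return solutions


sol4 = forced_odd_values(4)  # [(1, 1)]
sol8 = forced_odd_values(8)  # [(1, 1, 1, 1)]
```

Because the binomial coefficients at odd indices also sum to $2^{m - 1}$ and each value is at most $1$, the only way to reach that total is to set every odd-indexed value to $1$.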

Each time we perform this operation we deduce that $P$ is $1$ on a collection of intervals $A_{m}$ which consists of $A_{m/2}$ and one interval of length $\epsilon k$ in the middle of the gap between any two successive intervals of $A_{m/2}$. So $A_{m}$ has $m + 1$ disjoint equispaced intervals of length $\epsilon k$. We apply this operation until we have $\epsilon m \sim 1$, which implies that we have covered the whole interval $[0, k - 1]$ with our set $A_{m}$. We need to make sure that (12) still holds then. Since $\epsilon m \sim 1$ this is achieved by setting $\epsilon = C / \log k$, for a large enough constant $C$. At the end of this process, there could still be some very small, possibly uncovered intervals of size $o(\epsilon k)$. However, since we have already shown that $P(l) = 1$ on a set of $k - o(\epsilon k)$ entries, we can use the fact that $P$ has degree at most $N - 1$ to obtain that $P(l) = 1$ on the whole interval $[0, k - 1]$.

This concludes the proof of Theorem 3.3, which implies:

Corollary 3.11. The class of symmetric $k$-juntas can be learned exactly under the uniform distribution with confidence $1 - \delta$ in time $n^{O(k / \log k)} \cdot \mathrm{poly}(2^{k}, n, \log(1 / \delta))$.

4 Discussion

The main open question is to obtain tight upper and lower bounds on the running time of the Fourier-based algorithm for symmetric juntas. It may even be that for large $k$, every symmetric function outside $\{0, 1, \oplus, \bar{\oplus}\}$ has a non-zero Fourier coefficient of constant order.

It should also be noted that in the case of balanced symmetric functions, i.e., symmetric functions with $\Pr[f(x) = 1] = 1/2$, a bound of $O(k^{0.548})$ follows from [?] (see [?]). Hence to improve our result, one may focus on finding new techniques for unbalanced functions.

Learning symmetric $k$ -juntas in time $n^{o (k)}$

Evangelos Markakis, Georgia Institute of Technology, Atlanta GA 30332, USA, E-mail: {vangelis, aranyak}@cc.gatech.edu

Aranyak Mehta †

April 2005