Learning symmetric
-juntas in time
Mihail N. KolountzakisSchool of Mathematics, Georgia Institute of Technology, Atlanta GA 30332, USA, and Department of Mathematics, Univ. of Crete, GR-71409 Iraklio, Greece. E-mail: kolount@gmail.com. Partially supported by European Commission IHP Network HARP (Harmonic Analysis and Related Problems), Contract Number: HPRN-CT-2001-00273 HARP
Evangelos MarkakisGeorgia Institute of Technology, Atlanta GA 30332, USA, E-mail: {vangelis, aranyak}@cc.gatech.edu
Aranyak Mehta
†
April 2005
Abstract
We give an algorithm for learning symmetric
-juntas (boolean functions of
boolean variables which depend only on an unknown set of
of these variables) in the PAC model under the uniform distribution, which runs in time
. Our bound is obtained by proving the following result: Every symmetric boolean function on
variables, except for the parity and the constant functions, has a non-zero Fourier coefficient of order at least 1 and at most
. This improves the previously best known bound of
[?] , and provides the first
time algorithm for learning symmetric juntas.
1 Introduction
We consider a fundamental problem in computational learning theory: learning in the presence of irrelevant information. One formalization of the problem is as follows: We want to learn an unknown boolean function of
variables, which depends only on
variables (typically
is
). We call such a function a
-junta. We are provided with a set of labelled examples
, where the
's are picked uniformly and independently at random from the domain
(this is the PAC model with uniform distribution). We wish to identify the
relevant variables and the truth table of the function.
The problem was first posed by Blum [?] and Blum and Langley [?] , and it is considered [?, ?] to be one of the most important open problems in the theory of uniform distribution learning. It has connections with learning DNF formulas and decision trees of super-constant size, see [?, ?, ?, ?, ?] for details. The general case is believed to be hard and has even been used to propose a cryptosystem [?] . A trivial algorithm runs in time roughly
by doing an exhaustive search over all possible sets of relevant variables. Two important classes of juntas are learnable in polynomial time: parity and monotone functions. Learning parity functions can be reduced to solving a system of linear equations over
[?] . Monotone functions have non-zero singleton Fourier coefficients (e.g., see [?] ).
For the general case, the first significant breakthrough was given in [?] learning with confidence
in time
. Note that we allow the running time to be polynomial in
, since this is the size of the truth-table which is output. In the typical setting of
, this becomes polynomial in
.
In this paper we consider the class of symmetric
-juntas, functions which are symmetric on their relevant variables. The only non-trivial algorithm known for this case is the standard Fourier based algorithm, described in Section 2 . The analysis of the running time of this algorithm reduces to the following question:
-
What is the smallest
such that every symmetric boolean function on
variables, which is not a constant or a parity function, has a non-zero Fourier coefficient of order at least
and at most
?
A bound of
implies a running time of roughly
. A bound of
was provided in [?] . This was improved to
in [?] . Here we show a bound of
(Theorem 3.3 ), giving the first algorithm for learning symmetric
-juntas in time
.
Techniques Our techniques involve a mix of number theory, combinatorics and probability. We start by reducing our problem to finding 0/1 solutions to a system of Diophantine equations involving binomial coefficients, as in [?] . We then take a departure from [?] by further reducing this to the problem of showing that a certain integer-valued polynomial
is constant over the set
. We manage to prove this in two steps: First, we show that
is constant over the union of two small intervals
. This is obtained by looking at
modulo carefully chosen prime numbers. To choose these prime numbers we use the Siegel-Walfisz theorem on the density of primes in arithmetic progressions with modulus of moderate growth. In the second step, we extend the constant nature of
to the whole interval
by repeated applications of Lucas' Theorem. One additional interesting aspect of our proof is the use of an equivalence between a) the vanishing of Fourier coefficients and b) the equality of moments of certain random variables under the uniform measure on the hypercube and under the measure defined by the function itself. This equivalence helps us eliminate a lot of case analysis.
2 Preliminaries
Symmetric Juntas Given a boolean function
on
variables
, we will say that
is a relevant variable for
if there exist
which differ only in the
-th coordinate and
. Variables that are not relevant are called irrelevant. We will call
a
-junta if
has at most
relevant variables.
We consider the class of symmetric juntas. A boolean function
on
variables is a symmetric function if for any permutation
,
.
Hence the value of
at
depends only on the weight of
, which is the number of variables that are set to
. A symmetric
-junta is a function on
variables which is symmetric on the
variables it depends on.
We will describe a symmetric boolean function on
variables by a
-bit string
, where
is the value of
on an input of weight
. The following four special symmetric functions on
variables will appear often: the two constant functions
and
the parity function
and its complement
.
Learning in the PAC model We consider the PAC learning model [?] , in which we wish to learn a Concept Class
where each
is a collection of boolean functions from
. In our case,
is the class of symmetric
-juntas on
variables. Let
be an accuracy parameter and
a confidence parameter.
A learning algorithm
for
has access to an oracle for
. A query to the oracle outputs a labeled example
where
is drawn from
according to some probability distribution
.
is said to be a learning algorithm for the class
under the distribution
if for all
, it outputs, with probability at least
, a hypothesis
such that
. We will be concerned only with the uniform distribution and we will obtain an algorithm with accuracy parameter
, i.e., we identify the exact function
.
Fourier Transform We will consider functions of the form:
. An orthonormal basis for the functions defined on the Boolean cube can be given by the characters of the group
. In particular, for every
, define the following function:
Any real-valued function on the Boolean cube can be expressed as a linear combination of the functions
. Given
, we have that
, where
is the Fourier coefficient of
at
and is equal to the inner product of
with
:
Fourier-based Learning Let
be a
-junta. It is known that we can exactly calculate the Fourier coefficients of
in the uniform distribution PAC model, with confidence
in time
, using standard Chernoff-Hoeffding bounds (see [?, ?] ). Observe further, that if
is an irrelevant variable for a
-junta
, then for any
containing
,
. Hence if
, for some
, then
contains only relevant variables.
This suggests the following algorithm: Starting with
, compute the Fourier coefficients of all subsets of
of size
. Collect the union of all relevant variables that correspond to subsets with non-zero Fourier coefficients. Stop as soon as you collect all
relevant variables.
Since the function is symmetric, for any two sets
of relevant variables such that
, we have
. Hence the first time that we will identify some relevant variables in the algorithm, we will actually be able to identify all the relevant variables. Once we find the relevant variables, finding the truth-table of the function can be done in time
.
The above algorithm would take time roughly
for
. However, these particular functions are well known to be learnable in time
. Hence the following is true:
Fact 2.1.
If every symmetric function
has a non-zero Fourier coefficient of order between 1 and
, then we can learn symmetric
-juntas in time
.
3 Main Section
3.1 An Equivalent Formulation
We state an equivalent condition for the existence of a non-zero Fourier coefficient of a boolean function
, as proved in [?] . Let
be a boolean function. For a vector
and a set
, let
be the projection of
on the indices of
. Let
Define the following probabilities:
Unless mentioned, all probabilities are over the uniform distribution on
. For
, call a boolean function
on
variables
-null, if for all sets
with
and for all
the probabilities
are all equal to each other. The following lemma reveals the connection with the Fourier coefficients of
.
Lemma 3.1.
[
?]
Let
be a boolean function on
variables. Then
is
-null for some
if and only if, for all
with cardinality at most
,
It is clear that if
and
is
-null then it is also
-null.
When we consider the case of symmetric functions,
just depends on
and the weight
of
. We denote this by
It is clear that:
|
(1)
|
where
is
if
or
, and
is
. It follows that
is
-null if for
,
are all equal. It is easy to see that the constant boolean functions
are
-null for all
with
. The parity functions
are also
-null for all
satisfying
. From Lemma 3.1 and Equation 1 we get:
Corollary 3.2.
All symmetric boolean functions
have a non-zero Fourier coefficient of order at most
(and at least
) iff
are the only solutions to
|
(2)
|
In the next section, we show that this is true for
for large enough
.
3.2 A bound of
.
The following is our main theorem.
Theorem 3.3.
There is an absolute constant
such that for large
, every symmetric boolean function
on
bits with
has a non-zero Fourier coefficient of order at most
and at least
.
The rest of this section is devoted to proving Theorem 3.3 . Suppose
is a boolean function on
, such that all its Fourier coefficients of order up to
are
. Then the values
of
satisfy 2 with
, which, changing parameters, can be rewritten as:
|
(3)
|
We want to show that if
, for some appropriately large constant
, then
is either constant or alternates between
and
. We prove this for all
sufficiently large.
Define
, for
, and observe that the sequence
satisfies the homogeneous version of 3 :
|
(4)
|
Remark. In 4 the number
can be replaced by any other integer
in the interval
. This follows since all the non-constant Fourier coefficients up to order
are
.
From 4 the sequence
may be defined for all
and
for all
. From the theory of recurrence relations we know then that the sequence
may be written as a linear combination of the following sequences:
The reason for this is that
is the only root of the characteristic polynomial of the recurrence,
. Therefore there is a polynomial
, of degree at most
, such that
Clearly
takes integer values on integers and in particular
for
. From the well known characterization of integer-valued polynomials it follows that we may write
|
(5)
|
If
is a prime, and since all the factors that appear in denominators in 5 are strictly less than
(hence invertible mod
), it follows that the sequence
,
, may be viewed as a polynomial with coefficients in
and therefore is a
-periodic sequence mod
, i.e.
|
(6)
|
If, in addition,
, when all
-values that appear in 6 are in
, it follows that we have the non-modular equality
|
(7)
|
We want to show that
. Since
it is enough to show that either
is identically
or that
or
. This is equivalent to showing that
is a constant polynomial, constantly equal to
or
.
Notation. 1. In what follows we repeatedly use the letter
to denote a positive constant which depends on no parameter (unless we say otherwise). As is customary, this constant
need not be the same in all its occurences.
2. We define
by the relation
and assume
, with
a large enough positive constant.
We shall need various primes in intervals from now on. The version of the prime number theorem that we will be using is the Siegel-Walfisz theorem (see [?,Theorem2] ). Define the logarithmic integral
The Euler function
denotes the number of moduli mod
which are coprime to
.
Theorem A (Siegel-Walfisz) Let
be the number of primes
which are equal to
and assume that
. Then if
,
a constant, we have
|
(8)
|
where
depends on
only (the constant in the
term is absolute).
For
, the number of primes up to
without any restriction, the prime number theorem says
, for some constant
.
These theorems guarantee that, for
, the interval
has the “expected” number of primes whenever
, whatever the constant
, even if we impose the condition that these primes are equal to
, as long as
, for any constant
.
We use the above theorems along with the
-periodicity of
to deduce that
is in fact
-periodic on the union of
small sub-intervals of
.
Lemma 3.4.
The polynomial
satisfies the 2-periodicity condition
whenever
.
-
Proof.
Assume
are two primes in
, where
. (The length of the interval
is large enough for the prime number theorem to guarantee the existence of many primes in it.) From 7 it follows that the finite sequences
are identical. Applying 7 again with
we get that the finite sequences
are identical. It follows that
|
(9)
|
We now assume that the difference
is the smallest difference between two primes in
. By the prime number theorem
. Hence, we can apply Theorem A . Since
in that case Theorem A guarantees that the number of primes equal to
in
is at least
whenever
. All that matters here is that this number is positive.
Let
be the smallest prime which is equal to
. By Theorem A , applied to
and
, its existence is guaranteed and furthermore that
. The same theorem guarantees that we can find a prime
such that
. Then
or
, for some nonnegative integer
. Therefore, for
we have
| |
| |
This
-periodicity
|
(10)
|
is transferred to all
by using 7 repeatedly for appropriate primes
.
Notice that in the sequence
, if one erases the
's then one sees an alternation of
and
(this follows from the fact that
). This property greatly reduces the number of allowed patterns in
and in fact it implies that
is constant in
.
Lemma 3.5.
The polynomial
is constant in
(defined in Lemma 3.4 ).
-
Proof.
From Lemma 3.4 the values of
in
must be a
-periodic sequence. The only essentially different non-constant
-periodic patterns for the values of
in
are
and
and they both violate the property that
must satisfy, namely that if one erases the
's then one must see an alternation of
and
. Therefore
is constant in each of the two intervals of
. From the
-periodicity 7 it follows that the constant is the same in both intervals.
We now extend the set on which
is constant to a superset of
that contains a small interval around
. We will make use of the following theorem which follows from Lucas' Theorem [?,Ch.3] .
Theorem 3.6.
If
is a prime which does not divide
then
. Also, if
then
.
Lemma 3.7.
Let
and
. Then
for
.
-
Proof.
We shall apply Theorem 3.6 with
and with a prime
such that
takes the minimal possible nonnegative value. It follows from the prime number theorem that
.
And it follows from the remark after 4 that
Taking residues mod
and using Theorem 3.6 for
we obtain
By our particular choice of
we have
whenever
.
It follows that
. Applying this for all
we get
for all
in the interval
. To get rid of the
terms in the interval above, just choose a slightly larger
and apply again for all
.
So far we have proved
on the set
which consists of three equispaced intervals of roughly equal size
. We consider
cases for
.
The first is when
is
on
and the second is when
is
or
.
In the case that
is
on
, we shall need the following theorem, which already gives a lot of significant information about the function
. It should be thought of as analogous to the fact that the moments of a (vector) random variable can be read off the Fourier Transform of its distribution (the characteristic function) by looking at derivatives at
.
Theorem 3.8.
Suppose
is nonnegative (and not identically
) and has all its Fourier coefficients of order at most
(and at least 1) equal to
. Let
denote the uniform probability measure on the cube
and
denote the probability measure on
defined by
Let also
denote the coordinate functions on
, which we view as random variables. Then for all
,
, we have
-
Proof.
Let
. We assume for simplicity that
. Then, writing
and
, we have
| |
| |
| |
| |
| |
| |
Remarks. 1. For functions
, the above theorem follows directly from the definition of
-nullity in Section 3.1 . However, as we shall see in the proof of Lemma 3.10 we need to apply this theorem for functions whose range is not
.
2. If the nonnegative function
is symmetric then the identity of moments up to order
with those of the uniform distribution (
-wise independence) and the vanishing of the non-constant Fourier coefficiens of weight up to
are equivalent. This can be proved by induction on
. We do not use this here.
Corollary 3.9.
Under the assumptions and definitions of Theorem 3.8 the random variable
has the same power moments under the probability measures
and
, up to order
.
-
Proof.
The power
,
, can be written as a sum of terms of the type
, for
.
One uses the fact that
.
Lemma 3.10.
If
is
on
, then
.
-
Proof.
Suppose the polynomial
is constantly equal to
on the set
and that
. The sequence
is constant in each of the three intervals of
. By possibly considering
(whose Fourier coefficients vanish exactly where those of
do), we may assume that
on the middle interval
. Define the nonnegative function
by
and observe that the Fourier coefficients of
of weight at most
vanish. Let
be the distribution of the random variable
under the measure induced by
on
(each vertex
has probability proportional to
). Note that this is a well defined probability distribution since we assumed that
and
are not the
function. Clearly
is symmetric about
and has no mass in
, since both
and
are
when
. The
-th moment with respect to the measure
of the variable
in Corollary 3.9 is the expression
where again
. By Corollary 3.9 this must equal the
-th moment with respect to the binomial measure
, which is the quantity
But the variance of
under
is
since under
the random variables
are independent, while the variance of
under
is
as half the mass of
sits to the left of
and half to the right of
. These orders of magnitude are different whenever
, which is true in our case as
. This contradiction proves that
cannot equal
on
.
Extending
to
.
The rest of the proof goes as follows. By Lemma 3.10 , we may assume that
or
for
. Without loss of generality, assume
is
on
. We apply Theorem 3.6 for
successively and each time we choose a prime
such that
is minimized. Theorem 3.6 gives for all
|
(11)
|
When
the numbers
for even
in 11 are in the set
and therefore the corresponding
values are all
, by induction on
. In order to deduce that 11 holds as an identity of integers (not residue classes) it is enough to guarantee that the sum of the absolute values of all terms is less than
. This amounts to the inequality
. Given that
this is true if we can guarantee that
for some small enough constant
. Therefore, as long as
satisfies the bound 12 , we have that, for
,
|
(13)
|
Since the total weights of the positive and negative terms in 13 are the same, it follows that the
terms corresponding to odd
are also
.
Each time we perform this operation we deduce that
is
on a collection of intervals
which consists of
and one interval of length
in the middle of the gap between any two succesive intervals of
. So
has
disjoint equispaced intervals of length
. We apply this operation until we have
, which implies that we have covered the whole interval
with our set
. We need to make sure that 12 still holds then. Since
this is achieved by setting
, for a large enough constant
. At the end of this process, there could still be some very small possibly uncovered intervals of size
. However since we have already shown that
on a set of
entries, we can use the fact that
has degree at most
to obtain that
on the whole interval
.
This concludes the proof of the Theorem 3.3 , which implies:
Corollary 3.11.
The class of symmetric
-juntas can be learned exactly under the uniform distribution with confidence
in time
.
4 Discussion
The main open question is to obtain tight upper and lower bounds on the running time of the Fourier-based algorithm for symmetric juntas. It may even be that for large
, every symmetric function has a non-zero Fourier coefficient of constant order.
It should also be noted that in the case of balanced symmetric functions, i.e., symmetric functions with
, a bound of
follows from [?] (see [?] ). Hence to improve our result, one may focus on finding new techniques for unbalanced functions.