On Approximate Pattern Matching for a Class of Gibbs Random Fields
J.-R. Chazottes
*
*
CPhT, CNRS-Ecole polytechnique, 91128 Palaiseau Cedex, France, jeanrene@cpht.polytechnique.fr F. Redig
†
†
Faculteit Wiskunde en Informatica, Technische Universiteit Eindhoven, Postbus 513, 5600 MB Eindhoven, The Netherlands, f.h.j.redig@tue.nl E. Verbitskiy
‡
‡
Philips Research, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands, evgeny.verbitskiy@philips.com
November 27, 2006
-
Abstract: We prove an exponential approximation for the law of approximate occurrence of typical patterns for a class of Gibbsian sources on the lattice
,
. From this result, we deduce a law of large numbers and a large deviation result for the the waiting time of distorted patterns.
Key-words: Gibbs measures, approximate matching, exponential law, lossy data compression, law of large numbers, large deviations.
1 Introduction
In recent years there has been growing interest in a detailed probabilistic analysis of pattern matching and approximate pattern matching. For example, in information theory, motivation comes from studying performance of idealized Lempel-Ziv coding schemes. In mathematical biology, one likes to have accurate estimates for the probability that two (e.g. DNA) sequences agree in a large interval with some error-percentage.
There is also considerable interest in the analysis of occurrence of patterns in the multi-dimensional setting, e.g., in the context of video-image compression [2] , and more generally, lossy data compression [5, 6] .
In this paper we study the following problem. Fix a pattern
in a cubic box of size
. Given a configuration
of a Gibbs random field, what is the size of the ”observation window” in which we do not necessary see exactly this pattern for the first time, but any pattern obtained by distorsion of the fixed pattern
? By this we mean a pattern which contains a fixed fraction
of spins different from those of
. We are interested in the behavior of the volume of this observation window, that we call ”approximate hitting-time”, when
grows.
Our main result (Theorem 1 ) can be phrased as follows. The distribution of the approximate hitting-time, when properly normalized, gets closer and closer to an exponential law. The normalization is the product of a certain parameter
and the probability of the set of distorted patterns
. In fact, we get a precise control of the error term which allows us to derive two corollaries for the approximate waiting-time: given a configuration
randomly chosen from an ergodic Gibbs random field, we increase the obervation window in the random configuration
drawn from the given Gibbs random field until we see approximately the pattern
. The first corollary implies a law of large numbers allowing to get the rate distorsion function almost-surely from this approximate waiting-time. The second corollary is related to large deviations bounds. While the law of large numbers for approximate waiting-times appears in [6] , the large deviation result is new. Moreover, the convergence in distribution to the exponential law of the rescaled approximate hitting-time is also new.
We briefly indicate the key ingredients needed to prove this exponential approximation.
First, we assume that the Gibbs random field satisfies a certain strong mixing condition (non-uniformly
-mixing condition). For instance, this property holds for all Markov random fields which satisfy the Dobrushin uniqueness condition. The second key ingredient is a result by Chi [4] allowing one to obtain the rate distorsion function ”à la Shannon-McMillan-Breiman”. We take advantage of our previous work [1] in which we deal with ”exact” hitting-times. The proof of the main result of the present work readily follows a large part of the proof in [1] but there is a crucial step which is different (”second moment estimate ”). Moreover, one has to restrict to ”good” patterns: if a pattern has ”too much overlap” with its translates by vectors of size of order
, then one cannot hope to obtain an exponential distribution. These good patterns are shown to be typical in the sense that the measure of good patterns approaches one exponentially fast as
diverges (Proposition 2 ). When we have a random field distributed according to a Bernoulli measure, the goodness assumption on patternscan be removed. In this case, we prove (Theorem 2) that for any pattern Theorem 1 applies. Surprisingly, our proof involves the strong invariance principle for simple random walks. We have no idea to provide a simpler proof.
Outline of the paper. In the next section, we set notations and definitions, and state our main theorems. In section 3, we apply the exponential approximation of the previous section to approximate waiting times for which we obtain a.s. strong approximation and large deviations results. In section 4, we give all proofs.
Acknowledgment. We thank Z. Chi for providing us his preprint [4] .
2 Set-up and main results
For the sake of simplicity, we consider a
-valued random field on the lattice
,
. The results hold for arbitrary finite alphabets as well. Configurations are denoted
and collected in the set
.
is provided with the Borel
-field, and for
,
denotes the
-algebra generated by
.
For a finite subset
and configurations
we denote by
|
(2.1)
|
the number of mismatches between
and
in the volume
, i.e., the Hamming distance between
and
times the cardinality of
.
We denote by
the cube
. A
-pattern is a map
.
It is naturally associated to its cylinder
. For a pattern
and
we denote by
the pattern supported on
defined by
(
).
For a pattern
we denote by
the set of configurations which
-match with
:
|
(2.2)
|
The set of configurations
can also be viewed as a set of
-patterns, and we will (with a slight abuse of notation) use the same symbol for the set of configurations and the set of
-patterns which are restrictions of configurations in
to
.
DEFINITION 1.
The approximate hitting time of
in a configuration
is defined as
|
(2.3)
|
For
(exact matching time or occurrence time of a pattern), we obtained in [1] an exponential approximation for the law of
under the hypotheses of non-uniform
-mixing and Gibbsianness of the random field. We recall here what this mixing assumption is. For
define
|
(2.4)
|
where the supremum is taken over all finite subsets
of
, with
(
) and
, with
. Note that this
differs from the usual
-mixing function since we divide by the size of the dependence set of the event
.
DEFINITION 2.
A random field is non-uniformly exponentially
-mixing if there exist constants
such that
|
(2.5)
|
A typical example of a Gibbs field satisfying this assumption is the 2d-Ising model above critical temperature. In general, it is satisfied in the so-called high-temperature regime of Dobrushin uniqueness. We refer the reader to [8, 9] for more details on this and on Gibbs measures in general.
An important property of Gibbs measures is the so-called “finite energy” property.
This means that there is a continuous version of the conditional probability
such that
|
(2.6)
|
where
is independent of
. This immediately implies the existence of
such that for all
, and all
|
(2.7)
|
We will use the following estimate:
LEMMA 1.
Under the assumption that
is a Gibbs measure, there exist
and
such that for any pattern
and any
-
Proof.
This is an immediate consequence of the estimate ( 2.7 ) and the estimate
with
if
.
Contrarily to the situation for exact matching, we will need an assumption on the patterns in order to obtain an exponential law. This can be compared with the condition of not being ”badly self-repeating” needed to obtain the exponential law for the return times in [1] . As we shall see, being a ”good” pattern is a typical property.
DEFINITION 3.
Given
, we say that a
-pattern
is called
-good if the set
is empty for all
such that
. The set of all
-good patterns is denoted by
. By abuse of notation, we use the same symbol for the set of configurations
such that
is
-good.
For
and
,
coincides with the set of non badly self-repeating patterns in [1] , definition 5.1.
We shall need a result by Chi [4] on the rate distorsion function. We recall briefly the definition of the rate distorsion function and refer the reader to [3] for moreinformation and background and to [6] for a discussion on lossy data compression.
Given a stationary and ergodic measure
and a stationary and ergodic Gibbs measure
, the rate distortion function
is defined as follows:
|
(2.8)
|
|
(2.9)
|
where the infimum taken over all joint distributions
on
such that the
-marginal of
is
and
is the relative entropy between
and
.
We have the following key result which follows from [4] and [6,Theorem25] .
PROPOSITION 1.
Let
be a stationary and ergodic measure and
be a stationary and ergodic Gibbs measure. Then
|
(2.10)
|
Moreover,
is a convex (and hence continuous) function of
and is non-zero in some interval
.
The property 2.10 is called the generalized asymptotic equipartition property in [6] . Throughout we will simply write
instead of
.
We can now state our main result.
THEOREM 1.
Suppose that
is a non-uniformly exponentially
-mixing Gibbs measure and
is a stationary and ergodic Gibbs measure. Assume that the rate distortion function ( 2.8 ) is strictly positive in
. Then for all
and
small enough: namely,
there exist
, such that and for every
,
, and
-almost all
with
, the following estimate holds:
|
(2.11)
|
where
is such that
Dependence of the parameters in Theorem 1 on
and
will be discussed after the proof, see Remark 1 .
Let us briefly comment on the difference between Theorem 1 and the one obtained in [1] for exact matching, that is the case corresponding to
. First of all, we need to restrict ourselves to special patterns, i.e.,
-good patterns, whereas in [1] result applies to all patterns. Secondly, the error term that we obtain in [1] is of the form
, where
. Of course, the factor
is uniformly exponentially small for Gibbs measures. This is no longer true for
if
is too large. This is precisely why we need Lemma 1 . Thirdly, a crucial step in the proof of Theorem 1 , which differs slightly from that in [1] for the case
, involves Proposition 1 . This explains why we need to restrict to typical configurations in the sense of this result.
Let us close this set of remarks by noticing that
has to be a stationary and ergodic measure, but not necessarily Gibbsian. But for later use of Theorem 1 we shall also need the latter assumption, so we already impose it to state the theorem.
The following proposition shows that ”
”, i.e. that a pattern being
-good, is a typical property.
PROPOSITION 2.
Let
be a stationary Gibbs measure. Then, if
and
is small enough, there exists
such that for all
|
(2.13)
|
It turns out that if the random field has a non-trivial dependence structure, then the restriction to
-good patterns is unavoidable. However, in the case of a random field distributed according to a Bernoulli measure, the exponential law 2.11 holds for all approximate patterns. This is expressed by the following theorem.
THEOREM 2.
If
is the Bernoulli measure with
, then ( 2.11 ) holds without the restriction that
is
-good.
3 Approximate waiting-time fluctuations
The purpose of this section is to derive two consequences of Theorem 1 and Proposition 2 . The first one implies a strong law of large numbers for the approximate waiting-time. It was previously derived in [6] directly using the mixing property 2.5 .
The second one concerns large deviations of the approximate waiting-time and it is a new result. Given two configurations
, the approximate waiting time is
.
PROPOSITION 3.
Under the assumptions of Theorem 1 and Proposition 2 , there exists
such that for all
:
|
(3.1)
|
-eventually almost surely. In particular
With Proposition 3 we recover the results of Theorems 26 and 27 in [6] . However, there is a substantial difference in conditions on random fields. We have to restrict ourselves to measures
which are stationary and ergodic Gibbs measures, while in [6]
is only assumed to be stationary and ergodic. On the other hand, we permit Gibbs
, while in [6]
must be Bernoulli. The reason for our assumptions on
is that Proposition 2 is valid for Gibbs measures. We do not know if it can be extended to more general situations.
Let us also remark that by a basic result in Probability Theory, this strong approximation implies that if a central limit theorem holds for
, then it holds also for
. Unfortunately, the former seems to be a difficult issue, except in the iid case. We refer the reader to [6] for some results in that direction.
We have the following (partial) large deviation results. We first need the following lemma showing that we can define the generalized conditional
-order Rényi entropy for Gibbs random fields. This was first done in [11] for (
-mixing) stochastic processes (
) with the difference that here we need to condition on
-good patterns and use the Gibbs property instead of mixing.
LEMMA 2.
Let
be stationary Gibbs measures and assume that
and
. Then, for all
, the following function is well-defined:
|
(3.2)
|
(
denotes the measure
conditioned on the set of good patterns.)
The generalized
-order Rényi entropy should be defined as
.
We now have the following theorem. By
we mean that
is bounded from above.
THEOREM 3.
Let
be a non-uniformly exponentially
-mixing Gibbs measure and
a stationary and ergodic Gibbs measure. If
is small enough, then for any
, we have
|
(3.3)
|
and
|
(3.4)
|
In particular,
|
(3.5)
|
It follows from this theorem that Theorem 4.5.20 in [7] applies to large deviations of
. We do not know under which conditions the function
is, for example, continuously differentiable for
small enough. It it were the case, we would get a large deviation principle for
with a rate function given by the Legendre transform of
for
.
4 Proofs
4.1 Proof of Theorem 1
The proof of Theorem 1 is quite similar to the proof of exponential law in [1] . We describe briefly the common approach and indicate the differences. We also provide the necessary modifications of the proof.
It is well known that a random variable
has an exponential distribution if and only if
or, equivalently,
The basic ingredient of the proof in [1] was Lemma 4.4 (”Iteration Lemma”). This result establishes that for a pattern
and any finite number of cubes
,
, with equal volumes
we have
|
(4.1)
|
In [1] we also observed that the Iteration Lemma remains valid if a pattern
is replaced by the event
, with
does not occur in volume
if any pattern
does not occur in volume
.
Another important ingredient of the proof is the control of the parameter of the exponential distribution. Lemma 4.3 (”The parameter”) in [1] concerns non-triviality of the parameter
, that is, the fact that it is neither null nor infinite. To prove Lemma 4.3, we established a uniform second moment estimate for the number of occurrences of a pattern
in a configuration
restricted to a box that has later to be taken of size
. It is the proof of this second moment estimate that we have to modify completely. In remark 4.1 in [1] , we noticed that if
are events such that
for some
, and such that
|
(4.2)
|
then this implies, together with the mixing property 2.5 and the Gibbs property 2.7 , that the desired uniform second moment estimate holds. In turn, this implies the non-triviality of the parameter 2.12 (Lemma 4.3 in [1] ). Thus, we turn to prove 4.2 when the event
is
, where
is a good and typical pattern. We assume that patterns
are such that
, with
chosen from the
-measure one set of Proposition 1 and such that
is good in the sense of Definition 3 .
We have to show for patterns
with
, there exists a finite
such that for all
|
(4.3)
|
First of all, since
(see Definition 3 ), the terms corresponding to
with
are equal to
. Therefore we have to estimate the sum
|
(4.4)
|
Note that for
with
, the intersection
is not very large:
Note also that
denotes the number of differences between
and
in the volume
, see 2.1 . Then we can write
|
(4.5)
|
where by
we mean
,
. For the sake of convenience, we simply write
for
and
for
in the course of this proof. We also introduce the short-hand notations
| |
| |
| |
|
(4.6)
|
With these notations what we have to estimate
|
(4.7)
|
The following estimate is a corollary of [4] and a basic property of a Gibbs measures:
for any configuration
:
|
(4.8)
|
Indeed, the unconditioned statement is proved in [4] , and conditioning can at most introduce a term of order
.
We proceed as follows
| |
| |
| |
| |
Therefore
Taking into account that
, and hence
, we conclude that
as
, and hence
is finite. This finishes the proof.
REMARK 1.
The parameters of Theorem 1 depend on the choice of
and
. The most interesting is the dependence of
and
. TheParameter Lemma of [
1]
in fact shows that a uniform choice
suffices. A more interesting question is whether we can give a uniform bound on
for a large set of
and
. The present modification of the second moment estimate together with the rest of Lemma 4.3 in [
1]
, which remains unchanged, gives that for some
, dependent on
alone, the following choice of
will suffice
The rate distortion function
is a monotonically decreasing function.
Hence, for a fixed
,
is a monotonically increasing function of
, and finally,
is monotonically decreasing in
. Therefore, if
, then for all
Therefore, for a fixed
we obtain a uniform (in
) bound on the parameter
.
4.2 Proof of Proposition 2
For
we know that most patterns are
-good for any
. Indeed, it is proved in [1] (lemma 5.3) that
, for some
.
Let us now argue that for small
this is still the case. Suppose
, that is, we are going to consider vectors
such that
. An element
of
satisfies
|
(4.9)
|
(Recall that
and
. This implies that there exists a set
and a disjoint translate
such that
such that
matches with error fraction
with
, this can be made as small as
, for some
, for
sufficiently small uniformly in
by Lemma 1 . Therefore we obtain that
|
(4.10)
|
for all
and
small enough.
4.3 Proof of Theorem 2
We consider the case
only, because the case
is completely analogous.
Start with the particular pattern
that we simply denote by
. The difficulty with this ”bad pattern” comes from the fact that the second moment estimate does not apply, because ( 4.2 ) fails. Therefore, we have to prove by other means that there exists
such that for all
,
|
(4.11)
|
which would imply the non-triviality of the parameter
. We will first show that there exists a sequence
such that
|
(4.12)
|
It will then follow easily from the Bernoulli character of
that
does not depend on the choice of the pattern, i.e., ( 4.12 ) holds with the same
for any pattern
. Then we can apply Theorem 1 for good patterns, and obtain
.
We have the following identities:
| |
| |
|
(4.13)
|
where
is the position of a simple random walk on
(with
) after
steps.
By Theorem 7.23 in [12] , together with the strong invariance principle [12] p. 53, we have
|
(4.14)
|
where
is a random variable with a Gumbel distribution. Therefore, if we choose
such that
|
(4.15)
|
then ( 4.12 ) holds.
If we now choose any other pattern
, then under
,
is again distributed as a simple random walk, so we find the same
, which completes the proof of the theorem.
4.4 Proof of Proposition 3
By using Theorem 1 we immediately get
Now we choose
with
such that
. This makes the first term in the rhs summable in
. The last one equals the
-measure of the complement of
, which is less than
by Proposition 2 . We thus get the upper bound in 3.1 by an application of the Borel-Cantelli Lemma. Now we turn to prove the lower bound in 3.1 . Proceeding as before, we get
We have used Theorem 1 and Proposition 2 . We now choose
, with
, to get a summable upper bound in
for the above probability. An application of Borel-Cantelli Lemma gives the desired result and the proof of the proposition is complete.
4.5 Proof of Lemma 2
We only consider the case
leaving the (very similar) proof for the case
to the reader. Let
be the system of all rectangular boxes of the form
Before proceeding, we have to extend definition 3 somewhat. We will denote by
the set of good patterns supported on
. We shall need Proposition 2 which remains valid if one replaces
with
and
by
in 2.13 .
We are going to prove that the function
defined as
satisfies
for all
such that
and
, where
is a constant (depends on
), and where
denotes the boundary of
. Of course,
is a surface order correction. If such a property holds (together with
, for all
,
which is obvious by stationarity of the measure), then a generalised sub-additive lemma, obtained as a combination of a lemma found in [8] and another one given in [7] , will guarantee that
exists, as we wish. For all
,
such that
and
, we have the following:
where
are constants. The first inequality follows from the Gibbs property and the second one is a simple consequence of the Hamming distance property. To complete the proof we again use the Gibbs property to get
where
is a constant. The second inequality is the consequence of Proposition 2 if
is large enough. The lemma is proved.
4.6 Proof of Theorem 3
Since the proof of this theorem is very similar to that of Theorem 2.7 in [1] , we only sketch it to indicate the little differences between them.
The starting point is of course to write
Then we can mimic the proof of Theorem 2.7 in [1] by using Theorem 1 and the analog of lemma 4.3 in [1] , which holds true when
is replaced by
, provided that
be a
-good pattern (see the beginning of the proof of Theorem 1 ), and
be
-typical in the sense of Proposition 1 . Notice that we integrate with respect to the conditional measure
which takes care of these two properties.
References
-
M. Abadi, J.-R. Chazottes F. Redig and E. Verbitskiy, Exponential distribution for the occurrence of rare patterns in Gibsian random fields, Commun. Math. Phys. 246 no. 2 (2004), 269–294.
-
M. Alzina, W. Szpankowski and A. Grama, 2d-pattern matching, image and video-compression: theory, algorithms and experiments, IEEE Transactions on Image Processing 11 no. 3, (2002), 318-331.
-
T. Berger, Rate Distorsion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.
-
Z. Chi, Conditional large deviation principle for finite state Gibbs random fields. Preprint (2002).
-
A. Dembo, I. Kontoyiannis, The asymptotics of waiting times between stationary processes, allowing distortion, Ann. Appl. Probab. 9 (1999), no. 2, 413–429.
-
A. Dembo, I. Kontoyiannis, Source coding, large deviations, and approximate matching, IEEE Trans. Inform. Th. 48 no. 6 (2002), 1590–1615.
-
A. Dembo, O. Zeitouni, Large deviations techniques and applications. Second edition. Applications of Mathematics (New York) 38. Springer-Verlag, New York, 1998.
-
H.-O. Georgii. Gibbs Measures and Phase Transitions. Walter de Gruyter & Co., Berlin, 1988.
-
X. Guyon. Random Fields on a Network. Modeling, Statistics and Applications, Springer Verlag, New York, Berlin, 1995.
-
I. Kontoyiannis, Pattern matching and lossy data compression on random fields, IEEE Trans. Inform. Th. 49 no. 4 (2003), 1047–1051.
-
T. Luczak, W. Szpankowski, A suboptimal lossy data compression based on approximate pattern matching, IEEE Trans. Inform. Th. 43 no. 5 (1997), 1439–1451.
-
P. Révész, Random Walks in Random and Non-Random Environments, World Scientific, Singapore, New Jersey, London, Hong Kong, (1990).