A stochastic approximation algorithm with multiplicative step size adaptation
A
l
e
x
a
n
d
e
r
P
l
a
k
h
o
v
p
l
a
k
h
o
v
@
m
a
t
.
u
a
.
p
t
P
e
d
r
o
C
r
u
z
j
p
e
d
r
o
@
m
a
t
.
u
a
.
p
t
Department of Mathematics University of Aveiro — Portugal
Abstract
An algorithm of searching a zero of an unknown function
is considered,
,
, where
is the value of
measured at
with some error,
is this error. The step sizes
are random positive values and are calculated according to the rule:
if
, and
, otherwise. Here
,
. The function
may have one or more zeros; the random values
are independent and identically distributed, with zero mean and finite variance. Under some additional assumptions on
,
, and
, the conditions on
and
guaranteeing a.s. convergence of the sequence
, as well as the conditions on
,
guaranteeing a.s. divergence, are determined.
In particular, if
and
for any
, it is established that for
, convergence takes place, and for
, divergence. Due to the multiplicative rule of updating of
, it is natural to expect that
converges rapidly: like a geometric progression (if convergence takes place), but the limit value may not coincide with, but instead, approximates one of zeros of
. By adjusting the parameters
and
, one can reach necessary precision of approximation; higher precision is obtained at the expense of lower convergence rate.
Key words: stochastic approximation, accelerated convergence algorithms, step size adaptation.
AMS subject classification: 62L20 (Stochastic approximation), 90C15 (Stochastic programming), 93B30 (System identification)
1 Introduction
Consider the problem of finding a zero of a function
. If there are several zeros, it is required to find at least one of them. It is supposed that the function can be measured at any point, with some random error. The standard algorithm of stochastic approximation consists in calculating successive approximations of the required value,
,
,
, according to the rule
|
(1)
|
where
|
(2)
|
is the value of
measured at
,
is the measurement error;
,
,
is the sequence of step sizes of the algorithm. Usually it is assumed that the step sizes are positive real numbers satisfying the relations
,
. Then, under some additional assumptions on
and
, the algorithm a.s. converges to a zero point of
(see, e.g., [1, 2] ). In practice, however, the convergence rate of this algorithm may prove to be unsatisfactory, therefore, when solving practical tasks, various modifications of the algorithm are used. There are widely utilized heuristical algorithms using random, rather than deterministic, step size, which is corrected in the course of the algorithm, according to the current data [3, 6, 9, 11] . In particular, there is used the idea that prescribes to decrease the step size if the sequence of increments
changes the sign often enough, indicating that the current value
is close to the set of zeros of
, and hence, the measurement error
of the function is big enough with respect to the function itself
. Alternatively, one should increase the step size, or leave it unchanged. So, Kesten in the theoretical work [7] considered an algorithm using ( 1 ), ( 2 ), and the rule of modification of
:
|
(3)
|
where
,
;
,
,
is a sequence of positive numbers satisfying the relations
,
. Thus, the step size cannot increase in the course of algorithm; it can only decrease or remain unchanged. It is supposed that there is a unique zero of
. Kesten proved that
a.s. converges to this zero point. A multidimensional version of this algorithm is considered in [8] .
There are also heuristical procedures (in particular, in artificial neural networks), where at each moment
the step size is multiplied by a positive constant less than 1, if the measurement data indicate that
is close enough to the zero set of
, and by a constant more than 1, elsewhere [4, 5, 9, 10] . This kind of rules ensure sufficiently high convergence rate, however the step size converges like a geometric progression, therefore
, which means that the limit of
need not be a zero point of
, but instead, the sequence may ”get stuck” on its way to the set of zeros of
. Nevertheless, such a procedure may be justified if it gives a value close enough to one of the zeros of
. In the present paper, a stochastic approximation algorithm utilizing this rule of step size modification is considered. Namely, the rule ( 1 ), ( 2 ), jointly with the following rule
|
(4)
|
is used. Here
,
,
,
is a positive constant.
Let us point out the main differences between ( 4 ) and Kesten's rule ( 3 ). First, according to ( 4 ),
can both decrease and increase. Second, in Kesten's algorithm one always has
. On the other hand, it looks likely that in the case of convergence of the algorithm ( 1 ), ( 2 ), ( 4 ),
converges like a geometric progression (this conjecture will be justified in the section 3), therefore the limit of algorithm may not be a zero point of
.
Suppose that
is a sequence of i.i.d.r.v. with zero mean, besides
. Under some additional assumptions on
,
, and
, stated below, the process defined by ( 1 ), ( 2 ), ( 4 ) a.s. diverges if
, and converges if
, moreover the limit of
belongs to
. Here
,
, is a monotone decreasing family of sets of real numbers, besides every set
contains the set
of zeros of
, and
as
. (Here by definition
for any two sets of real numbers
and
.) This statement is a consequence of the main theorem, which will be stated in section 2 and proved in section 3. Thus, by adjusting the parameters
and
(for example, fixing
and letting
), one can reach necessary precision of the algorithm; higher precision is obtained at the expense of lower convergence rate.
2 Definition of the algorithm and statement of the main result
Consider the algorithm given by ( 1 ), ( 2 ), ( 4 ). The rule ( 4 ) means that at each instant
, step size is multiplied by
or by
, if the result of multiplication is less than
; otherwise, step size is set to be
. Thus, the maximal possible value of step size equals
.
The rule ( 4 ) can be written in the form
|
(5)
|
Let us take the following assumptions:
-
A1
Denote
,
the
-algebra generated by
,
, and
,
; then
does not depend on
.
-
A2
The values
are identically distributed, with zero mean and finite variance:
,
.
-
A3
(a) There exists
such that for any interval
,
; (b)
.
-
A4
and
.
-
A5
.
-
A6
There exists
such that
-
(a)
as
, and
-
(b)
.
Remark 1
From A4 and A6 (a) it follows that the set
is non-empty and is contained in
.
Remark 2
Note that assumptions A4–A6 guarantee convergence of the deterministic counterpart of algorithm ( 1 ), ( 2 ), ( 4 ) (that is, of the algorithm with
).
Moreover, under these conditions, any deterministic algorithm
converges, whatever the sequence
satisfying
.
Introduce the functions:
|
(6)
|
|
(7)
|
one has
,
,
. Further, define the sets of real numbers
|
(8)
|
obviously,
,
for any
.
Note that
is open. Indeed, let
, then there exists
such that
Then for
close enough to
one has
, hence
This implies that
, hence
.
Denote also
|
(9)
|
Denote by
the set of zeros of
, i.e.,
. Suppose that
,
, and
, where
is a small positive number. Then, with a probability close to 1,
also belongs to a small (possibly larger) neighborhood of
contained in
, and taking into account ( 6 ) and ( 8 ), one gets
| |
| |
Then, using ( 5 ) and ( 9 ), one obtains
| |
| |
| |
Thus, in a sense, the set
can be regarded to be a domain of decrease of step size: if several consecutive values of
belong to
and are close enough to each other, and if the first term of the sequence of corresponding step sizes
is small enough, then the sequence of their mean values
decreases.
Now, suppose that
,
, and that
. Analogously, for
small enough, one has
and then, using again ( 5 ) and ( 9 ) and taking into account that for
,
, one obtains
| |
| |
| |
Thus, the set
can be regarded as a domain of increase of step size: if several consecutive values of
belong to
and are close enough to each other, and if the first of the corresponding values of
is small enough, then the sequence of their mean values
increases.
Note that if
then, by virtue of ( 8 ),
, that is, all the zeros of
belong to the region of decrease of step size. On the other hand, if
then
, which means that the region of increase of step size coincides with
.
It seems likely that in the first case the algorithm can converge, and in the second one, cannot. This conjecture is confirmed by the following theorem, which is the main result of the paper.
Theorem Let the assumptions A1–A6 be satisfied; consider the process
defined by ( 1 ), ( 2 ), ( 4 ). Recall that
. Then (a) If
then
a.s. converges to a point from
.
(b) If
then
a.s. diverges.
Suppose that
for any real
and that
.
Then the function
coincides with
, is continuous, and is given by
is the unique minimum of
, and
. After a simple algebra, one can rewrite the hypotheses of theorem in the form (a)
, (b)
. Denote
;
,
is a monotone decreasing family of sets containing
and tending to
as
.
Thus, one comes to Corollary Let, in addition to assumptions A1–A6,
for any
, and
. Consider the process defined by ( 1 ), ( 2 ), ( 4 ). Then there exists a monotone decreasing family of sets
,
such that
,
as
, and (a) if
then
a.s. converges to a point from
; (b) if
then
a.s. diverges.
Remark 3
Theorem does not give any information about behavior of the algorithm for the values
,
such that
In particular, under the hypotheses of corollary, the case
remains unexplored. These issues will be addressed elsewhere.
3 Proof of theorem
First we prove 10 auxiliary lemmas, and then, basing on them, we prove theorem.
Here all statements about random variables are supposed to be true almost surely.
In the sequel, we shall mainly designate random values by Greek letters, and real numbers and functions from
to
, by Latin ones; the letters
,
,
,
will denote integer non-negative numbers. The function
and the random values
,
are exceptions; also, traditional notation
,
for small positive numbers will be used.
Lemma 1
If
then the sequence
converges.
Proof. Note that without loss of generality one can assume that
is bounded. Indeed, replacing
by
changes the process only with probability
. By taking
large enough, one can make this probability arbitrarily small.
Let
; define the stopping time
and introduce the new process
,
by
| |
| |
First, let us prove that the sequence
is bounded. Designate
; from A4 it follows that
. One has
|
(10)
|
Using that
and
, one obtains
|
(11)
|
If
, an even more precise estimate for
can be obtained. We shall distinguish between two cases: (i)
and (ii)
.
In case (i), designating
, one has
|
(12)
|
In the case (ii) one has
hence
|
(13)
|
Thus, in both cases (i) and (ii), from ( 10 ), ( 12 ), and ( 13 ) one gets
|
(14)
|
The overall number of values of
such that
is less than
; therefore, using ( 11 ) and ( 14 ), one concludes that
|
(15)
|
Denote
and
; using that
one gets
|
(16)
|
Using that
, one obtains that the martingale
is bounded; the value
is also bounded, so, by ( 16 ), one concludes that the sequence
is bounded.
Now, let us show that
converges. From the definition of
and
it follows that
Using that the sequence
is bounded and that
, one gets that the series
converges. Further, one has
hence the martingale
converges. This implies that
also converges.
Define the events
and
. One has
. If
then
for any
; this means that
for any
and
. The sequence
converges, therefore the sequence
also converges, and passing to the limit
one obtains that
converges. This means exactly that if
then
converges.
Lemma 2
If
then
.
Proof. Note that, using A3 (a), it is easy to show that there exists
such that
, whatever
.
Next, for any
there exist
and
such that the following holds: for any two random variables
and
satisfying the relations
,
one has
Choose a countable set of intervals
covering the set
, and denote
. Fix
and
, and define the auxiliary process
,
by formulas:
if
then
, and if
then
|
(17)
|
|
(18)
|
|
(19)
|
So, as
,
is forced to be contained in
.
For
, using that
,
,
, one obtains that
and
hence
Consider variables
and
providing a solution of the (deterministic) minimization problem:
subject to
| |
| |
and denote
,
,
. One has (i)
; (ii)
are identically distributed, and
; (iii) the set of random variables
as well as the set
, are mutually independent.
From (ii)–(iii) it follows that almost surely
, and from (i) it follows that
so, by virtue of ( 19 ),
does not go to zero.
Thus, there exists a random value
such that for infinitely many values of
,
.
Define a sequence of stopping times
,
,
inductively, letting
and
for
. The events
happen with probability more that
(recall the remark done in the beginning of proof ), and every event
,
does not depend on the set of events
. Therefore, for infinitely many values of
,
, takes place, i.e.,
, and hence, taking into account that
and
, for these values of
one has
. Thus, one concludes that
|
(20)
|
Suppose that
converges to a point from
, then for some
and
one has
as
, hence the process
,
coincides with
,
, and therefore
as
. The last relation contradicts ( 20 ), thus Lemma 2 is proved.
Lemma 3
Let
. Then for any open set
containing
there exists a positive constant
such that either (i) for some
,
, or (ii) for some
,
and
.
Proof. Designate by
the primitive of
such that
. Define the stopping time
The value of
will be specified below.
Consider the sequence
. Introducing shorthand notation
,
,
, and using that
, one gets
|
(21)
|
Next, we utilize the Taylor decomposition
being some point between
and
. Substituting
and recalling that
and
, one obtains
|
(22)
|
Using ( 21 ) and ( 22 ) and taking into account that each of the values
,
,
is mutually independent with
(see A1), one gets
|
(23)
|
If
then either (i)
and
, or (ii)
.
In the case (i) one has
|
(24)
|
where
; obviously,
. Let us fix a
such that
.
In the case (ii), designating
, one has
|
(25)
|
Using A6, one gets that
.
Denote
. The relations ( 24 ) and ( 25 ) imply that if
then
, hence, by virtue of ( 23 ),
|
(26)
|
Summing up both sides of ( 26 ) over
and denoting
, one obtains
One has
, and
is bounded, hence
. Thus, for arbitrary
This implies that a.s. either
, or
. Lemma 3 is proved.
Denote
. Recall that
is the primitive of
such that
; the assumption A6 implies that
. Denote
. Denote also
,
,
,
, and
. Obviously,
and
.
Fix an open set
containing
. Let
,
. We shall say that a (finite or infinite) deterministic sequence
is
-admissible if
and there exist deterministic sequences
such that 1)
; 2) if
then
,
; 3)
,
.
Proposition 1
There exists constants
and
such that any
-admissible sequence
has non-empty intersection with
.
Proof. Let
. Designate
;
takes values from
. We shall use shorthand notation
,
. One has
|
(27)
|
where
is a point between
and
.
Next, one has
|
(28)
|
where
is a point between
and
.
We are going to prove by induction that
|
(29)
|
For
, ( 29 ) follows from the condition
and the definition of
.
Now, let
; suppose that formula ( 29 ) is true for
and prove it for
. For
, one has
,
, therefore
; hence, by virtue of 2),
for
. One has
,
, and
, hence
,
, and so,
,
, thus
also belongs to
. This implies that
. Then, combining ( 27 ) and ( 28 ) and using that
and
, one obtains
|
(30)
|
One has
, hence
. Using also that
,
, and
, one gets from ( 30 ) that
and using the induction hypothesis, one concludes that
Formula ( 29 ) is proved.
Let
; here
stands for the integral part of
.
Then, taking into account that
, from ( 29 ) one concludes that
, thus Proposition 1 is proved.
.
Proposition 2
If
,
,
,
and
belong to
, then
.
Proof. Using notation
, one gets
where
is a point between
and
. Therefore,
Using that
,
,
,
, one obtains
, hence
. Further, using that
,
,
,
, one gets
This implies that
.
Lemma 4
For any open set
, containing
, and any
there exists
such that
Proof. Without loss of generality suppose that
. Define the event
where
and
are the same as in the proof of Proposition 1:
,
.
Denote
by virtue of A3 (a),
. Let us show that for any elementary event
, the sequence
is
-admissible.
One has
. Further, one has
, with
,
, and using that
and
, one gets
. Thus, conditions 1) and 3) are verified.
Now, let
,
. Let
be the minimal value such that
. If
then
. If
then
. If
then
; otherwise, using that
,
,
and
belong to
, and applying Proposition 2, one would conclude that
, which contradicts the definition of
.
Thus,
, and therefore,
. So, the condition 2) is also verified.
Now, applying Proposition 1 to the
-admissible sequence
, one concludes that there exists a non-negative
such that
.
This implies that
Lemma 5
If
then for any open set
containing
there exists
such that
.
Proof. Let us fix an open set
, and denote
. Combining Lemma 3 and Lemma 4, one concludes that for any
there exists
such that whatever the initial conditions
,
,
,
Then one can choose a measurable integer-valued function
defined on
such that for
one will have
Designate
the supremum being taken over all the initial conditions
,
,
. Fix
,
,
, then
|
(31)
|
Taking supremum of the left hand side of ( 31 ) over all
, one obtains
, hence
. Lemma 5 is proved.
.
Denote
.
Lemma 6
For any open bounded sets
,
such that
and for any
there exists
such that
Proof. Denote
. Denote also
where
for arbitrary sets of real numbers
,
. Using assumption A3 (a), one obtains that there exists
such that for any
and for any integer
,
This implies that if
then
Denoting
, one concludes that the following statements (i) and (ii) hold with probability at least
:
(i) dist
dist
, hence
; (ii) as
, one has
, hence
, therefore
.
Lemma 6 is proved.
Lemma 7
If
,
is an open set containing
, and
then for some
,
and
.
Proof. Without loss of generality, suppose that
is bounded and
.
Choose an open set
such that
,
; applying Lemmas 5 and 6, one gets that for
and for arbitrary initial conditions,
Repeating the argument of Lemma 5, one concludes that there exists
such that
and
.
From now on we suppose that
. Choose
such that
; using A3 (b), one obtains that for some
,
. Denote
and
.
Without loss of generality, suppose that
is bounded.
Lemma 8
Suppose that
, then there exist a constant
and a monotone decreasing function
such that
and
Proof. Define the sequences
and
by
| |
| |
Using ( 5 ) and definition of
, one obtains that for all
,
. The variables
are identically distributed, take the values
and
, and
| |
| |
| |
Moreover, the variables in the set
, as well as the variables in the set
, are independent.
Denote
. One has
where
,
the sum
(
) is taken over the even (odd) values of
. Both
and
are sums of i.i.d.r.v. with zero mean, hence both
and
tend to zero as
. Lemma 8 is proved.
Define the stopping times
. Recall that
is the primitive of
such that
. Fix an open set
such that
and
, and denote
.
Lemma 9
Let
,
, and
, then
here
is a positive constant, and
satisfies the statement of lemma 8.
Proof. We shall use shorthand notation of Lemma 3:
and
. According to ( 22 ), one has
This implies that
, with
Using Lemma 8, one gets
where
According to the Chebyshev inequality,
where
Using that the values
,
,
, and
are
-measurable, and using assumptions A1 and A2, one obtains that for
,
, and for
,
Therefore,
Similarly,
Taking
one gets that
. Lemma 9 is proved.
Lemma 10
If
then
.
Proof. From the definition of
one easily sees that if
for some
, then
. This implies that for any
|
(32)
|
Further, by virtue of Lemma 9, if
and
then
|
(33)
|
Combining ( 32 ) and ( 33 ), one gets that for any
|
(34)
|
Define the event
, then by virtue of ( 34 ),
|
(35)
|
Denote by
the complementary event,
.
By virtue of Lemma 7,
|
(36)
|
Using ( 35 ) and ( 36 ), one gets
| |
Taking into account that
can be chosen arbitrarily small and that
as
, one concludes that
.
Now, we are in a position to prove the theorem. Suppose that
, then
, and by Lemma 2,
diverges. So, the statement (b) of Theorem is proved.
On the other hand, according to Lemma 10, if
then
, and by Lemmas 1 and 2, the sequence
converges to a point from
.
Thus, the statement (a) of theorem is also established.
Acknowledgements This work was partially supported by the R&D Unit CEOC (Center for Research in Optimization and Control). The second author (PC) also gratefully acknowledges the financial support by the Portuguese program PRODEP `Medida 5 Acção 5.3 Formação Avançada de Docentes do Ensino Superior Concurso nr. 2/5.3/PRODEP/2001'.
References
-
Harold J. Kushner and G.George Yin, Stochastic approximation algorithms and applications., Applications of Mathematics. 35. (1997) Berlin: Springer.
-
M. Nevel'son and R. Has'minskii, Stochastic approximation and recursive estimation. Translated from the Russian by Israel Program for Scientific Translations. Translation edited by B. Silver., Translations of Mathematical Monographs. Vol. 47. (1976) Providence, R.I.: American Mathematical Society.
-
Y. Fang and T. J. Sejnowski, Faster learning for dynamic recurrent backpropagation, Neural Computation, 2 (1990), pp. 270–273.
-
Fernando M. Silva and Luis B. Almeida, Speeding up backpropagation, Advanced Neural Computers, (1990), R. Eckmiller (Ed.), Elsevier Science Publishers, Amsterdam pp. 151-158.
-
L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov, Parameter adaptation in stochastic optimization, Online Learning in Neural Networks, Saad, D. (Ed.), (1998), pp. 111–134.
-
P. J. Werbos, Neurocontrol and supervised learning: an overview and evaluation, in Handbook of Intelligent Control (Neural, Fuzzy, and Adaptive approaches), D. A. White and D. A. Sofge, eds., Van Nostrand Reinhold, New York, (1992), pp. 65–89.
-
H. Kesten, Accelerated stochastic approximation., Ann. Math. Stat., 29 (1958), pp. 41–59.
-
B. Delyon and A. Juditsky, Accelerated stochastic approximation., SIAM J. Optim., 3 (1993), pp. 868–881.
-
R. Salomon and van J. L. Hemmen, Accelerating Backpropagation through Dynamic Self-Adaptation, Neural Networks, 9:4, (1996) Elsevier Science Publishers, pp 589–601.
-
Roberto Battiti, Accelerated Backpropagation Learning: Two Optimization Methods, Complex Systems, Inc., (1989) pp. 331-342.
-
Marcus Frean, A “thermal” perceptron learning rule, Neural Computation, 4:6 (1992), The MIT Press, Cambridge–Massachusetts, pp 946–957.