A stochastic approximation algorithm with multiplicative step size adaptation

A l e x a n d e r P l a k h o v p l a k h o v @ m a t . u a . p t P e d r o C r u z j p e d r o @ m a t . u a . p t Department of Mathematics University of Aveiro — Portugal

Abstract
An algorithm of searching a zero of an unknown function φ : R R   is considered, x t = x t 1 γ t 1 y t   , t = 1 , 2 ,   , where y t = φ ( x t 1 ) + ξ t   is the value of φ   measured at x t 1   with some error, ξ t   is this error. The step sizes γ t > 0   are random positive values and are calculated according to the rule: γ t = min { u γ t 1 , g ¯ }   if y t 1 y t > 0   , and γ t = d γ t 1   , otherwise. Here 0 < d < 1 < u   , g ¯ > 0   . The function φ   may have one or more zeros; the random values ξ t   are independent and identically distributed, with zero mean and finite variance. Under some additional assumptions on φ   , ξ t   , and g ¯   , the conditions on u   and d   guaranteeing a.s. convergence of the sequence { x t }   , as well as the conditions on u   , d   guaranteeing a.s. divergence, are determined.
In particular, if P ( ξ 1 > 0 ) = P ( ξ 1 < 0 ) = 1 / 2   and P ( ξ 1 = x ) = 0   for any x R   , it is established that for u d < 1   , convergence takes place, and for u d > 1   , divergence. Due to the multiplicative rule of updating of γ t   , it is natural to expect that { x t }   converges rapidly: like a geometric progression (if convergence takes place), but the limit value may not coincide with, but instead, approximates one of zeros of φ   . By adjusting the parameters u   and d   , one can reach necessary precision of approximation; higher precision is obtained at the expense of lower convergence rate.
Key words: stochastic approximation, accelerated convergence algorithms, step size adaptation.
AMS subject classification: 62L20 (Stochastic approximation), 90C15 (Stochastic programming), 93B30 (System identification)

1 Introduction

Consider the problem of finding a zero of a function φ : R R   . If there are several zeros, it is required to find at least one of them. It is supposed that the function can be measured at any point, with some random error. The standard algorithm of stochastic approximation consists in calculating successive approximations of the required value, x 0   , x 1   , x 2 ,   , according to the rule
x t = x t 1 γ t 1 y t , t = 1 , 2 , , (1)
where
y t = φ ( x t 1 ) + ξ t (2)
is the value of φ   measured at x t 1   , ξ t   is the measurement error; γ 0   , γ 1   , γ 2 ,   is the sequence of step sizes of the algorithm. Usually it is assumed that the step sizes are positive real numbers satisfying the relations γ t =   , γ t 2 <   . Then, under some additional assumptions on φ   and ξ t   , the algorithm a.s. converges to a zero point of φ   (see, e.g., [1, 2). In practice, however, the convergence rate of this algorithm may prove to be unsatisfactory, therefore, when solving practical tasks, various modifications of the algorithm are used. There are widely utilized heuristical algorithms using random, rather than deterministic, step size, which is corrected in the course of the algorithm, according to the current data [3, 6, 9, 11. In particular, there is used the idea that prescribes to decrease the step size if the sequence of increments x t x t 1   changes the sign often enough, indicating that the current value x t   is close to the set of zeros of φ   , and hence, the measurement error ξ t   of the function is big enough with respect to the function itself φ ( x t 1 )   . Alternatively, one should increase the step size, or leave it unchanged. So, Kesten in the theoretical work [7considered an algorithm using ( 1 ), ( 2 ), and the rule of modification of γ t   :
γ t = γ ( s t ) , s t = { s t 1 if y t 1 y t > 0 s t 1 + 1 if y t 1 y t 0 , t = 2 , 3 , . (3)
where s 0 = 0   , s 1 = 1   ; γ ( 0 )   , γ ( 1 )   , γ ( 2 ) ,   is a sequence of positive numbers satisfying the relations γ ( m ) =   , γ 2 ( m ) <   . Thus, the step size cannot increase in the course of algorithm; it can only decrease or remain unchanged. It is supposed that there is a unique zero of φ   . Kesten proved that x t   a.s. converges to this zero point. A multidimensional version of this algorithm is considered in [8.
There are also heuristical procedures (in particular, in artificial neural networks), where at each moment t   the step size is multiplied by a positive constant less than 1, if the measurement data indicate that x t   is close enough to the zero set of φ   , and by a constant more than 1, elsewhere [4, 5, 9, 10. This kind of rules ensure sufficiently high convergence rate, however the step size converges like a geometric progression, therefore γ t <   , which means that the limit of { x t }   need not be a zero point of φ   , but instead, the sequence may ”get stuck” on its way to the set of zeros of φ   . Nevertheless, such a procedure may be justified if it gives a value close enough to one of the zeros of φ   . In the present paper, a stochastic approximation algorithm utilizing this rule of step size modification is considered. Namely, the rule ( 1 ), ( 2 ), jointly with the following rule
γ t = { min { u γ t 1 , g ¯ } if y t 1 y t > 0 , d γ t 1 if y t 1 y t 0 , t = 2 , 3 , . (4)
is used. Here 0 < d < 1 < u   , 0 < γ 0   , γ 1 g ¯   , g ¯   is a positive constant.
Let us point out the main differences between ( 4 ) and Kesten's rule ( 3 ). First, according to ( 4 ), γ t   can both decrease and increase. Second, in Kesten's algorithm one always has γ t =   . On the other hand, it looks likely that in the case of convergence of the algorithm ( 1 ), ( 2 ), ( 4 ), γ t   converges like a geometric progression (this conjecture will be justified in the section 3), therefore the limit of algorithm may not be a zero point of φ   .
Suppose that { ξ t }   is a sequence of i.i.d.r.v. with zero mean, besides P ( ξ t > 0 ) = P ( ξ t < 0 )   . Under some additional assumptions on φ   , ξ t   , and g ¯   , stated below, the process defined by ( 1 ), ( 2 ), ( 4 ) a.s. diverges if u d > 1   , and converges if u d < 1   , moreover the limit of { x t }   belongs to U ( ln u ln d )   . Here U ( λ )   , 0 < λ < 1   , is a monotone decreasing family of sets of real numbers, besides every set U ( λ )   contains the set Z   of zeros of φ   , and ( U ( λ ) , Z ) 0   as λ 1   . (Here by definition ( A , B ) = sup x A inf y B | x y |   for any two sets of real numbers A   and B   .) This statement is a consequence of the main theorem, which will be stated in section 2 and proved in section 3. Thus, by adjusting the parameters u   and d   (for example, fixing u   and letting d 1 / u 0   ), one can reach necessary precision of the algorithm; higher precision is obtained at the expense of lower convergence rate.

2 Definition of the algorithm and statement of the main result

Consider the algorithm given by ( 1 ), ( 2 ), ( 4 ). The rule ( 4 ) means that at each instant t   , step size is multiplied by u   or by d   , if the result of multiplication is less than g ¯   ; otherwise, step size is set to be g ¯   . Thus, the maximal possible value of step size equals g ¯   .
The rule ( 4 ) can be written in the form
ln γ ~ t = ln γ t 1 + ln u I ( y t 1 y t > 0 ) + ln d I ( y t 1 y t 0 ) , ln γ t = min { ln γ ~ t , ln g ¯ } . (5)
Let us take the following assumptions:
  • A1 Denote t   , t = 0 , 1 , 2 ,   the σ   -algebra generated by x i   , γ i   , and ξ i   , 0 i t   ; then ξ t + 1   does not depend on t   .
  • A2 The values ξ t   are identically distributed, with zero mean and finite variance: E ξ t = 0   , V a r ξ t = : S < +   .
  • A3 (a) There exists L > 0   such that for any interval I [ L , L ]   , P ( ξ 1 I ) > 0   ; (b) P ( ξ 1 = 0 ) = 0   .
  • A4 φ C 1 ( R )   and sup x | φ ( x ) | = : M <   .
  • A5 g ¯ < 2 / M   .
  • A6 There exists R > 0   such that
    • (a) x φ ( x ) > 0   as | x | R   , and
    • (b) inf | x | R φ 2 ( x ) > g ¯ M S 2 g ¯ M   .
Remark 1 From A4 and A6 (a) it follows that the set Z   is non-empty and is contained in ( R , R )   .
Remark 2 Note that assumptions A4–A6 guarantee convergence of the deterministic counterpart of algorithm ( 1 ), ( 2 ), ( 4 ) (that is, of the algorithm with ξ t 0   ).
Moreover, under these conditions, any deterministic algorithm x t = x t 1 γ t 1 φ ( x t 1 )   converges, whatever the sequence { γ t }   satisfying γ t g ¯   .
Introduce the functions:
k + ( z ) : = lim ε 0 + sup { P ( ( φ 1 + ξ 1 ) ( φ 2 + ξ 2 ) > 0 ) , | φ 1 z | < ε , | φ 2 z | < ε } , (6)
k ( z ) : = lim ε 0 + inf { P ( ( φ 1 + ξ 1 ) ( φ 2 + ξ 2 ) > 0 ) , | φ 1 z | < ε , | φ 2 z | < ε } ; (7)
one has k + ( z ) 1 / 2   , 0 k ± ( z ) 1   , lim z k ± ( z ) = 1   . Further, define the sets of real numbers
V ± ( a ) : = { x : k ± ( φ ( x ) ) < a } , V ± [ a ] : = { x : k ± ( φ ( x ) ) a } ; (8)
obviously, V + ( a ) V ( a )   , V ± ( a ) V ± [ a ]   for any a   .
Note that V + ( a )   is open. Indeed, let x V + ( a )   , then there exists ε > 0   such that sup { P ( ( φ 1 + ξ 1 ) ( φ 2 + ξ 2 ) > 0 ) , | φ 1 φ ( x ) | < ε , | φ 2 φ ( x ) | < ε } = : c < a .   Then for x   close enough to x   one has | φ ( x ) φ ( x ) | < ɛ / 2   , hence sup { P ( ( φ 1 + ξ 1 ) ( φ 2 + ξ 2 ) > 0 ) , | φ 1 φ ( x ) | < ε / 2 , | φ 2 φ ( x ) | < ε / 2 } c < a .   This implies that k + ( φ ( x ) ) < a   , hence x V + ( a )   .
Denote also
k : = ln ( 1 / d ) ln ( u / d ) . (9)
Denote by Z   the set of zeros of φ   , i.e., Z : = { x : φ ( x ) = 0 }   . Suppose that x V + ( k )   , x t 2 ( x ε , x + ε ) V + ( k )   , and γ t 2 < ε   , where ε   is a small positive number. Then, with a probability close to 1, x t 1   also belongs to a small (possibly larger) neighborhood of x   contained in V + ( k )   , and taking into account ( 6 ) and ( 8 ), one gets
P ( y t 1 y t > 0 | | x t 2 x | < ε , γ t 2 < ε ) =
= P ( ( φ ( x t 2 ) + ξ t 1 ) ( φ ( x t 1 ) + ξ t ) > 0 | | x t 2 x | < ε , γ t 2 < ε ) < k .
Then, using ( 5 ) and ( 9 ), one obtains
E [ ln γ t ln γ t 1 | | x t 2 x | < ε , γ t 2 < ε ]
ln u P ( y t 1 y t > 0 | | x t 2 x | < ε , γ t 2 < ε ) + ln d P ( y t 1 y t 0 | | x t 2 x | < ε , γ t 2 < ε )
< ln u k + ln d ( 1 k ) = 0 .
Thus, in a sense, the set V + ( k )   can be regarded to be a domain of decrease of step size: if several consecutive values of x t   belong to V + ( k )   and are close enough to each other, and if the first term of the sequence of corresponding step sizes γ t   is small enough, then the sequence of their mean values E γ t   decreases.
Now, suppose that x R \ V [ k ]   , x t 2 ( x ε , x + ε ) R \ V [ k ]   , and that γ t 2 < ε   . Analogously, for ε   small enough, one has P ( y t 1 y t > 0 | | x t 2 x | < ε , γ t 2 < ε ) > k ,   and then, using again ( 5 ) and ( 9 ) and taking into account that for ε < g ¯ / u 2   , γ ~ t = γ t   , one obtains
E [ ln γ t ln γ t 1 | | x t 2 x | < ε , γ t 2 < ε ] =
ln u P ( y t 1 y t > 0 | | x t 2 x | < ε , γ t 2 < ε ] ) + ln d P ( y t 1 y t 0 | | x t 2 x | < ε , γ t 2 < ε ] )
> ln u k + ln d ( 1 k ) = 0 .
Thus, the set R \ V [ k ]   can be regarded as a domain of increase of step size: if several consecutive values of x t   belong to R \ V [ k ]   and are close enough to each other, and if the first of the corresponding values of γ t   is small enough, then the sequence of their mean values E γ t   increases.
Note that if k > k + ( 0 )   then, by virtue of ( 8 ), Z V + ( k )   , that is, all the zeros of φ   belong to the region of decrease of step size. On the other hand, if k < inf z k ( z )   then V [ k ] =   , which means that the region of increase of step size coincides with R   .
It seems likely that in the first case the algorithm can converge, and in the second one, cannot. This conjecture is confirmed by the following theorem, which is the main result of the paper.
Theorem Let the assumptions A1–A6 be satisfied; consider the process { x t , γ t }   defined by ( 1 ), ( 2 ), ( 4 ). Recall that k = ln ( 1 / d ) ln ( u / d )   . Then (a) If k > k + ( 0 )   then { x t }   a.s. converges to a point from V [ k ]   .
(b) If k < inf z k ( z )   then { x t }   a.s. diverges.
Suppose that P ( ξ 1 = x ) = 0   for any real x   and that P ( ξ 1 > 0 ) = P ( ξ 1 < 0 )   .
Then the function k ( ) : = k + ( )   coincides with k ( )   , is continuous, and is given by k ( z ) = P ( ( z + ξ 1 ) ( z + ξ 2 ) > 0 ) ;   z = 0   is the unique minimum of k ( )   , and k ( 0 ) = inf z k ( z ) = 1 / 2   . After a simple algebra, one can rewrite the hypotheses of theorem in the form (a) u d < 1   , (b) u d > 1   . Denote U ( λ ) : = V [ 1 1 + λ ] = { x : k ( φ ( x ) ) 1 1 + λ }   ; U ( λ )   , 1 < λ < 1   is a monotone decreasing family of sets containing Z   and tending to Z   as λ 1   .
Thus, one comes to Corollary Let, in addition to assumptions A1–A6, P ( ξ 1 = x ) = 0   for any x R   , and P ( ξ 1 > 0 ) = P ( ξ 1 < 0 ) = 1 / 2   . Consider the process defined by ( 1 ), ( 2 ), ( 4 ). Then there exists a monotone decreasing family of sets U ( λ )   , 0 < λ < 1   such that U ( λ ) Z   , ( U ( λ ) , Z ) 0   as λ 1   , and (a) if u d < 1   then { x t }   a.s. converges to a point from U ( ln u ln d )   ; (b) if u d > 1   then { x t }   a.s. diverges.
Remark 3 Theorem does not give any information about behavior of the algorithm for the values u   , d   such that inf z k ( z ) ln ( 1 / d ) ln ( u / d ) k + ( 0 ) .   In particular, under the hypotheses of corollary, the case u d = 1   remains unexplored. These issues will be addressed elsewhere.

3 Proof of theorem

First we prove 10 auxiliary lemmas, and then, basing on them, we prove theorem.
Here all statements about random variables are supposed to be true almost surely.
In the sequel, we shall mainly designate random values by Greek letters, and real numbers and functions from R   to R   , by Latin ones; the letters t   , i   , j   , s   will denote integer non-negative numbers. The function φ   and the random values x t   , y t   are exceptions; also, traditional notation ε   , δ   for small positive numbers will be used.
Lemma 1 If t γ t <   then the sequence { x t }   converges.
Proof. Note that without loss of generality one can assume that x 0   is bounded. Indeed, replacing x 0   by x ~ 0 = x 0 I ( | x 0 | < X )   changes the process only with probability P ( | x 0 | > X )   . By taking X   large enough, one can make this probability arbitrarily small.
Let C > 0   ; define the stopping time τ C = inf { t : i = 0 t γ i > C }   and introduce the new process x t C   , γ t C   by
x t C = x t , γ t C = γ t as t < τ c , and
x t C = x τ C , γ t C = 0 as t τ c .
First, let us prove that the sequence { x t C }   is bounded. Designate M R : = sup | x | R φ ( x ) x   ; from A4 it follows that M R <   . One has
| x t C | | x t 1 C γ t 1 C φ ( x t 1 C ) | + γ t 1 C | ξ t | . (10)
Using that γ t 1 C C   and | φ ( x t 1 ) C | | φ ( 0 ) | + M | x t 1 C |   , one obtains
| x t C | | x t 1 C | ( 1 + C M ) + γ t 1 C ( | φ ( 0 ) | + | ξ t | ) . (11)
If γ t 1 C 2 / M R   , an even more precise estimate for x t C   can be obtained. We shall distinguish between two cases: (i) | x t 1 | R   and (ii) | x t 1 C | > R   .
In case (i), designating b ¯ : = sup | x | R | φ ( x ) |   , one has
| x t 1 C γ t 1 C φ ( x t 1 C ) | | x t 1 C | + γ t 1 C b ¯ . (12)
In the case (ii) one has 0 γ t 1 C φ ( x t 1 C ) x t 1 C 2 M R M R = 2 ,   hence
| x t 1 C γ t 1 C φ ( x t 1 C ) | | x t 1 C | . (13)
Thus, in both cases (i) and (ii), from ( 10 ), ( 12 ), and ( 13 ) one gets
| x t C | | x t 1 C | + γ t 1 C ( b ¯ + | ξ t | ) . (14)
The overall number of values of t   such that γ t 1 C 2 / M R   is less than C M R / 2   ; therefore, using ( 11 ) and ( 14 ), one concludes that
| x t C | ( | x 0 | + i = 1 t γ i 1 C ( b ¯ + | φ ( 0 ) | + | ξ i | ) ) ( 1 + C M ) C M R / 2 . (15)
Denote c 0 : = b ¯ + | φ ( 0 ) | + E | ξ 1 |   and ζ t : = | ξ t | E | ξ t |   ; using that 1 γ i 1 C C   one gets
| x t C | ( | x 0 | + C c 0 + i = 1 t γ i 1 C ζ i ) ( 1 + C M ) C M R / 2 . (16)
Using that 1 E ( γ t 1 C ζ t ) 2 = E ζ 1 2 1 E ( γ t 1 C ) 2 <   , one obtains that the martingale 1 t γ i 1 C ζ i   is bounded; the value x 0   is also bounded, so, by ( 16 ), one concludes that the sequence { x t C }   is bounded.
Now, let us show that { x t C }   converges. From the definition of x t C   and γ t C   it follows that x t C = x 0 1 t γ i 1 C φ ( x i 1 C ) 1 t γ i 1 C ξ i .   Using that the sequence { φ ( x i 1 C ) }   is bounded and that 1 γ i 1 C C   , one gets that the series 1 γ i 1 C φ ( x i 1 C )   converges. Further, one has 1 E ( γ t 1 C ξ t ) 2 = S 1 E ( γ t 1 C ) 2 < ,   hence the martingale 1 t γ i 1 C ξ i   converges. This implies that { x t C }   also converges.
Define the events A C = { t γ t C }   and A = { t γ t < }   . One has A = C A C   . If t γ t C   then x t C = x t   for any t   ; this means that I ( A C ) ( x t C x t ) = 0   for any t   and C   . The sequence { I ( A C ) x t C }   converges, therefore the sequence { I ( A C ) x t }   also converges, and passing to the limit C   one obtains that { I ( A ) x t }   converges. This means exactly that if t γ t <   then { x t }   converges.  
Lemma 2 If lim t x t = x   then x V [ k ]   .
Proof. Note that, using A3 (a), it is easy to show that there exists δ 0 > 0   such that P ( ξ 1 [ x L / 2 , x + L / 2 ] ) > δ 0   , whatever x R   .
Next, for any x V ( [ k ] )   there exist w ( x ) > 0   and 0 < ε ( x ) < L / 4   such that the following holds: for any two random variables φ 1   and φ 2   satisfying the relations | φ l φ ( x ) | ε ( x )   , l = 1 , 2   one has P ( ( φ 1 + ξ 1 ) ( φ 2 + ξ 2 ) > 0 ) > ln ( 1 / d ) + w ( x ) ln u + ln ( 1 / d ) .   Choose a countable set of intervals U i = ( φ ( x i ) ε ( x i ) , φ ( x i ) + ε ( x i ) )   covering the set φ ( R \ V [ k ] )   , and denote w i : = w ( x i )   . Fix i   and s { 0 , 1 , 2 , }   , and define the auxiliary process x t ( i s )   , γ t ( i s )   by formulas:
if t < s   then x t ( i s ) = x t   , and if t s   then
x t ( i s ) = { x t 1 ( i s ) γ t 1 ( i s ) y t ( i s ) if φ ( x t 1 ( i s ) γ t 1 ( i s ) y t ( i s ) ) U i , x i elsewhere ; (17)
y t ( i s ) = φ ( x t 1 ( i s ) ) + ξ t , (18)
γ t ( i s ) = { min { u γ t 1 ( i s ) , g ¯ } if y t 1 ( i s ) y t ( i s ) > 0 , d γ t 1 ( i s ) if y t 1 ( i s ) y t ( i s ) 0 . (19)
So, as t s   , φ ( x t ( i s ) )   is forced to be contained in U i   .
For t s + 2   , using that y t 1 ( i s ) = φ ( x t 2 ( i s ) ) + ξ t 1   , y t ( i s ) = φ ( x t 1 ( i s ) ) + ξ t   , φ ( x t 2 ( i s ) ) U i   , one obtains that P ( y t 1 ( i s ) y t ( i s ) > 0 ) > ln ( 1 / d ) + w i ln u + ln ( 1 / d )   and P ( y t 1 ( i s ) y t ( i s ) 0 ) < ln u w i ln u + ln ( 1 / d ) ,   hence E [ ln u I ( y t 1 ( i s ) y t ( i s ) > 0 ) + ln d I ( y t 1 ( i s ) y t ( i s ) 0 ) ] >   > ln u ln ( 1 / d ) + w i ln u + ln ( 1 / d ) + ln d ln u w i ln u + ln ( 1 / d ) = w i .   Consider variables φ 1 = f 1 ( ξ 1 , ξ 2 )   and φ 2 = f 2 ( ξ 1 , ξ 2 )   providing a solution of the (deterministic) minimization problem: ( φ 1 + ξ 1 ) ( φ 2 + ξ 2 ) min ,   subject to
| φ 1 φ ( x i ) | ε ( x i )
| φ 2 φ ( x i ) | ε ( x i ) ,
and denote Y t 1 1 = f 1 ( ξ t 1 , ξ t ) + ξ t 1   , Y t 2 = f 2 ( ξ t 1 , ξ t ) + ξ t   , η t = ln u I ( Y t 1 1 Y t 1 2 > 0 ) + ln d I ( Y t 1 1 Y t 1 2 0 )   . One has (i) η t ln u I ( y t 1 ( i s ) y t ( i s ) > 0 ) + ln d I ( y t 1 ( i s ) y t ( i s ) 0 )   ; (ii) η t   are identically distributed, and E η t w i   ; (iii) the set of random variables { η t , t even , t s + 2 }   as well as the set { η t , t odd , t s + 2 }   , are mutually independent.
From (ii)–(iii) it follows that almost surely t η t = +   , and from (i) it follows that t [ ln u I ( y t 1 ( i s ) y t ( i s ) > 0 ) + ln d I ( y t 1 ( i s ) y t ( i s ) 0 ) ] = + ,   so, by virtue of ( 19 ), γ ( i s )   does not go to zero.
Thus, there exists a random value χ > 0   such that for infinitely many values of t   , γ t ( i s ) χ   .
Define a sequence of stopping times τ 0   , τ 1   , τ 2 ,   inductively, letting τ 0 = 0   and τ j = inf { t > τ j 1 : γ t ( i s ) χ }   for j 1   . The events B j = { | ξ τ j + 1 + φ ( x i ) | > L / 2 }   happen with probability more that δ 0   (recall the remark done in the beginning of proof ), and every event B j   , j 2   does not depend on the set of events { B 1 , , B j 1 }   . Therefore, for infinitely many values of j   , B j   , takes place, i.e., | ξ τ j + 1 + φ ( x i ) | > L / 2   , and hence, taking into account that | y τ j + 1 | | ξ τ j + 1 + φ ( x i ) | | φ ( x τ j ) φ ( x i ) |   and | φ ( x τ j ) φ ( x i ) | < ε ( x i ) < L / 4   , for these values of j   one has | y τ j + 1 | L / 4   . Thus, one concludes that
for infinitely many values of j , | γ τ j y τ j + 1 | χ L / 4 . (20)
Suppose that x t   converges to a point from R \ V [ k ]   , then for some i   and s   one has x t U i   as t s   , hence the process x t ( i s )   , γ t ( i s )   coincides with x t   , γ t   , and therefore γ t y t + 1 0   as t   . The last relation contradicts ( 20 ), thus Lemma 2 is proved.  
Lemma 3 Let t γ t =   . Then for any open set O   containing Z   there exists a positive constant g = g ( O )   such that either (i) for some t   , x t O   , or (ii) for some t   , | x t | < R   and γ t > g   .
Proof. Designate by f   the primitive of φ   such that inf x f ( x ) = 0   . Define the stopping time τ = τ ( O , g ) : = inf { t : either (i) x t O , or (ii) | x t | < R and γ t g } .   The value of g ( 0 , g ¯ )   will be specified below.
Consider the sequence E t = E [ f ( x t ) I ( t < τ ) ]   . Introducing shorthand notation f ( x t ) = : f t   , I ( t < τ ) = : I t   , f ( x t ) = : f t = φ t   , and using that I t I t 1   , one gets
E t E t 1 = E [ f t I t f t 1 I t 1 ] E [ ( f t f t 1 ) I t 1 ] . (21)
Next, we utilize the Taylor decomposition f t = f ( x t 1 γ t 1 y t ) = f t 1 f t 1 γ t 1 y t + 1 2 f ( x ) γ t 1 2 y t 2 ,   x   being some point between x t 1   and x t   . Substituting y t = φ t 1 + ξ t   and recalling that f t 1 = φ t 1   and f ( x ) = φ ( x ) M   , one obtains
f t f t 1 γ t 1 φ t 1 ( φ t 1 + ξ t ) + M 2 γ t 1 2 ( φ t 1 + ξ t ) 2 . (22)
Using ( 21 ) and ( 22 ) and taking into account that each of the values γ t 1   , φ t 1   , I t 1   is mutually independent with ξ t   (see A1), one gets
E t E t 1 E [ ( γ t 1 φ t 1 2 γ t 1 φ t 1 ξ t + M 2 γ t 1 2 φ t 1 2 + M γ t 1 2 φ t 1 ξ t + M 2 γ t 1 2 ξ t 2 ) I t 1 ] = = E [ ( φ t 1 2 + M 2 γ t 1 φ t 1 2 + M 2 γ t 1 S ) γ t 1 I t 1 ] = = E [ ( φ t 1 2 ( 1 M γ t 1 / 2 ) + M γ t 1 S / 2 ) γ t 1 I t 1 ] . (23)
If I t 1 = 1   then either (i) x t 1 [ R , R ] \ O   and γ t 1 < g   , or (ii) | x t 1 | R   .
In the case (i) one has
φ t 1 2 ( 1 M γ t 1 / 2 ) + M γ t 1 S / 2 c 0 ( 1 M g / 2 ) + M g S / 2 = : c g , (24)
where c 0 : = inf { | φ ( x ) | : x [ R , R ] \ O }   ; obviously, c 0 > 0   . Let us fix a g ( 0 , g ¯ )   such that c g > 0   .
In the case (ii), designating b 0 : = inf | x | R φ 2 ( x )   , one has
φ t 1 2 ( 1 M γ t 1 / 2 ) + M γ t 1 S / 2 b 0 ( 1 M g ¯ / 2 ) + M g ¯ S / 2 = : c . (25)
Using A6, one gets that c > 0   .
Denote c = min { c g , c }   . The relations ( 24 ) and ( 25 ) imply that if I t 1 = 1   then φ t 1 2 ( 1 M γ t 1 / 2 ) + M γ t 1 S / 2 c < 0   , hence, by virtue of ( 23 ),
E t E t 1 c E [ γ t 1 I t 1 ] . (26)
Summing up both sides of ( 26 ) over t = 1 , , s   and denoting I = I ( τ = ) = min t I t   , one obtains E s E 0 c E [ i = 0 s 1 γ i I ] .   One has E s 0   , and x 0   is bounded, hence E 0 <   . Thus, for arbitrary s   E [ i = 0 s 1 γ i I ] E 0 c < .   This implies that a.s. either 0 γ i <   , or τ =   . Lemma 3 is proved.   Denote c 1 : = 1 M g ¯ / 2   . Recall that f   is the primitive of φ   such that inf x f ( x ) = 0   ; the assumption A6 implies that lim x ± f ( x ) = +   . Denote H : = sup | x | R f ( x )   . Denote also c 3 : = g ¯ sup { | φ ( x ) | : f ( x ) H } + 1   , z l : = inf { x : f ( x ) H } c 3   , z r : = sup { x : f ( x ) H } + c 3   , c 2 : = inf { | φ ( x ) | : x [ z l , z r ] \ O }   , and K : = sup { | φ ( x ) | : x [ z l , z r ] }   . Obviously, c 1 > 0   and K c 2 > 0   .
Fix an open set O   containing Z   . Let g > 0   , 0 < w < 1   . We shall say that a (finite or infinite) deterministic sequence { z 0 , z 1 , z 2 , }   is ( g , w )   -admissible if | z 0 | R   and there exist deterministic sequences { q t } ,   { h t }   such that 1) | h t | w   ; 2) if { z 0 , z 1 , , z t } [ z l , z r ] \ O   then g d 2 q s g ¯   , s = 0 , 1 , , t   ; 3) z t = z t 1 q t 1 φ ( z t 1 ) h t   , t = 1 , 2 ,   .
Proposition 1 There exists constants t 0   and w   such that any ( g , w )   -admissible sequence { z t , t = 0 , 1 , , t 0 }   has non-empty intersection with O   .
Proof. Let w : = min { 1 , g d 2 c 2 2 c 1 / ( 2 K ) }   . Designate t ~ = inf { t : z t O }   ; t ~   takes values from { 0 , 1 , , t 0 , + }   . We shall use shorthand notation f t : = f ( z t )   , f t = φ t : = φ ( z t )   . One has
f t = f ( z t 1 q t 1 φ t 1 h t ) = f ( z t 1 q t 1 φ t 1 ) f ( z ~ ) . h t , (27)
where z ~   is a point between z t 1 q t 1 φ t 1   and z t 1 q t 1 φ t 1 h t   .
Next, one has
f ( z t 1 q t 1 φ t 1 ) = f t 1 f t 1 q t 1 φ t 1 + 1 2 f ( z ^ ) q t 1 2 φ t 1 2 , (28)
where z ^   is a point between z t 1   and z t 1 q t 1 φ t 1   .
We are going to prove by induction that
if 0 s t ~ then f s H s g d 2 c 2 2 c 1 / 2 . (29)
For s = 0   , ( 29 ) follows from the condition | z 0 | R   and the definition of H   .
Now, let 1 t t ~   ; suppose that formula ( 29 ) is true for 0 s t 1   and prove it for s = t   . For 0 s t 1   , one has f ( z s ) H   , z s O   , therefore z s [ z l , z r ] \ O   ; hence, by virtue of 2), g d 2 q s g ¯   for 0 s t 1   . One has f ( z t 1 ) H   , | q t 1 φ t 1 | g ¯ sup { | φ ( x ) | : f ( x ) H }   , and | h t | w 1   , hence | q t 1 φ t 1 | c 3   , | q t 1 φ t 1 + h t | c 3   , and so, z t 1 q t 1 φ t 1 [ z l , z r ]   , z t 1 q t 1 φ t 1 h t [ z l , z r ]   , thus z ~   also belongs to [ z l , z r ]   . This implies that | φ ( z ~ ) | = | f ( z ~ ) | K   . Then, combining ( 27 ) and ( 28 ) and using that | h t | w   and | f ( z ^ ) | = | φ ( z ^ ) | M   , one obtains
f t f t 1 q t 1 φ t 1 2 ( 1 1 2 q t 1 M ) + w K . (30)
One has z t 1 [ z l , z r ] \ O   , hence | φ ( z t 1 ) | = | φ t 1 | c 2   . Using also that q t 1 g d 2   , 1 1 2 q t 1 M c 1   , and w K g d 2 c 2 2 c 1 / 2   , one gets from ( 30 ) that f t f t 1 g d 2 c 2 2 c 1 / 2 ,   and using the induction hypothesis, one concludes that f t H t g d 2 c 2 2 c 1 / 2 .   Formula ( 29 ) is proved.
Let t 0 : = 2 H / ( g d 2 c 2 2 c 1 ) + 1   ; here z   stands for the integral part of z   .
Then, taking into account that f s 0   , from ( 29 ) one concludes that t ~ < t 0   , thus Proposition 1 is proved.   .
Proposition 2 If γ t 1 < 1 / ( 3 M )   , | ξ t | < c 2   , | ξ t + 1 | < c 2   , x t 1   and x t   belong to [ z l , z r ] \ O   , then γ t + 1 γ t   .
Proof. Using notation φ t : = φ ( x t )   , one gets φ t = φ ( x t 1 γ t 1 ( φ t 1 + ξ t ) ) = φ t 1 φ ( x ~ ) γ t 1 ( φ t 1 + ξ t ) ,   where x ~   is a point between x t 1   and x t   . Therefore, φ t 1 φ t = φ t 1 2 [ 1 φ ( x ~ ) γ t 1 ( 1 + ξ t / φ t 1 ) ] .   Using that | φ ( x ~ ) | M   , γ t 1 < 1 / ( 3 M )   , | ξ t | < c 2   , | φ t 1 | c 2   , one obtains 1 φ ( x ~ ) γ t 1 ( 1 + ξ t / φ t 1 ) 1 / 3   , hence φ t 1 φ t > 0   . Further, using that | ξ t | < c 2   , | ξ t + 1 | < c 2   , | φ t 1 | c 2   , | φ t | c 2   , one gets y t y t + 1 = φ t 1 φ t ( 1 + ξ t / φ t 1 ) ( 1 + ξ t + 1 / φ t ) > 0 .   This implies that γ t + 1 = min { u γ t , g ¯ } γ t   .  
Lemma 4 For any open set O   , containing Z   , and any g > 0   there exists δ = δ ( O , g ) > 0   such that if | x 0 | R , γ 0 g then P ( for some t , x t O ) δ .  
Proof. Without loss of generality suppose that g < 1 / ( 3 M )   . Define the event A : = { | ξ i | < min { c 2 , w / g ¯ } , i = 1 , 2 , , t 0 } ,   where w   and t 0   are the same as in the proof of Proposition 1: w = min { 1 , g d 2 c 2 2 c 1 / ( 2 K ) }   , t 0 = 2 H / ( g d 2 c 2 2 c 1 ) + 1   .
Denote δ : = P ( A ) = ( P ( | ξ 1 | < min { c 2 , w / g ¯ } ) ) t 0 ;   by virtue of A3 (a), δ > 0   . Let us show that for any elementary event ω A   , the sequence { z t = x t ( ω ) , t = 0 , 1 , , t 0 }   is ( g , w )   -admissible.
One has | z 0 | = | x 0 ( ω ) | < R   . Further, one has z t = z t 1 q t 1 φ ( z t 1 ) h t   , with q t 1 = γ t 1 ( ω )   , h t = γ t 1 ( ω ) ξ t ( ω )   , and using that γ t 1 ( ω ) g ¯   and | ξ t ( ω ) | < ω / g ¯   , one gets | h t | w   . Thus, conditions 1) and 3) are verified.
Now, let { z 0 , z 1 , , z t } [ z l , z r ] \ O   , t t 0   . Let s 0 { 0 , 1 , 2 , , t }   be the minimal value such that q s 0 = min { q 0 , q 1 , , q t }   . If s 0 = 0   then min { q 0 , q 1 , , q t } = q 0 = γ 0 ( ω ) g g d 2   . If s 0 = 1   then min { q 0 , q 1 , , q t } = q 1 = γ 1 ( ω ) g d g d 2   . If s 0 2   then γ s 0 2 ( ω ) 1 / ( 3 M )   ; otherwise, using that | ξ s 0 1 | < c 2   , | ξ s 0 | < c 2   , x s 0 2 ( ω )   and x s 0 1 ( ω )   belong to [ z l , z r ] \ O   , and applying Proposition 2, one would conclude that γ s 0 ( ω ) γ s 0 1 ( ω )   , which contradicts the definition of s 0   .
Thus, γ s 0 ( ω ) 1 / ( 3 M ) d 2 g d 2   , and therefore, min { q 0 , q 1 , , q t } = γ s 0 ( ω ) g d 2   . So, the condition 2) is also verified.
Now, applying Proposition 1 to the ( g , w )   -admissible sequence { z t }   , one concludes that there exists a non-negative τ t 0   such that z τ = x τ ( ω ) O   .
This implies that P ( for some t , x t O ) P ( A ) = δ .    
Lemma 5 If t γ t =   then for any open set O   containing Z   there exists t   such that x t O   .
Proof. Let us fix an open set O Z   , and denote δ = δ ( O , g ( O ) )   . Combining Lemma 3 and Lemma 4, one concludes that for any O Z   there exists δ > 0   such that whatever the initial conditions x 0   , γ 0   , γ 1   , P ( for some t , x t O | t γ t = ) > δ .   Then one can choose a measurable integer-valued function n ( , , )   defined on R × ( 0 , g ¯ ] × ( 0 , g ¯ ]   such that for ν = n ( x 0 , γ 0 , γ 1 )   one will have P ( for some t ν , x t O | t γ t = ) > δ / 2   Designate p ¯ = sup P ( for all t , x t O | t γ t = ) ,   the supremum being taken over all the initial conditions x 0   , γ 0   , γ 1   . Fix x 0   , γ 0   , γ 1   , then
P ( for all t , x t O | t γ t = ) = = P ( for all t > ν , x t O | for all t ν , x t O and t γ t = ) P ( for all t ν , x t O | t γ t = ) p ¯ ( 1 δ / 2 ) . (31)
Taking supremum of the left hand side of ( 31 ) over all ( x 0 , γ 0 , γ 1 ) R × ( 0 , g ¯ ] × ( 0 , g ¯ ]   , one obtains p ¯ p ¯ ( 1 δ / 2 )   , hence p ¯ = 0   . Lemma 5 is proved.   .
Denote O * = { x : | φ ( x ) | < L / 2 }   .
Lemma 6 For any open bounded sets O   , O 1   such that O ¯ O 1 O *   and for any w > 0   there exists δ = δ ( O , O 1 , w ) > 0   such that if x 0 O then P ( for some n , x n O 1 and γ n < w ) δ .  
Proof. Denote n = ln g ¯ ln w ln ( 1 / d ) + 2   . Denote also ɛ = min { L 2 , ( O , R \ O 1 ) n g ¯ } ,   where ( A , B ) : = sup x A inf y B | x y |   for arbitrary sets of real numbers A   , B   . Using assumption A3 (a), one obtains that there exists δ 1 > 0   such that for any x O 1   and for any integer t   , P ( ( 1 ) t 1 φ ( x ) < ( 1 ) t ξ 1 < ( 1 ) t 1 φ ( x ) + ɛ ) δ 1 .   This implies that if x 0 O   then P ( 0 < ( 1 ) t y t < ɛ , dist ( x t 1 , O ) < ( t 1 ) g ¯ ɛ , t = 1 , 2 , , n + 1 ) δ 1 n + 1 .   Denoting δ = δ 1 n + 1   , one concludes that the following statements (i) and (ii) hold with probability at least δ   :
(i) dist ( x n , O ) < n g ¯ ɛ   dist ( O , R \ O 1 )   , hence x n O 1   ; (ii) as t = 2 , 3 , , n + 1   , one has y t 1 y t < 0   , hence γ t = d γ t 1   , therefore γ n = d n 1 γ 1 d n 1 g ¯ < w   .
Lemma 6 is proved.  
Lemma 7 If t γ t =   , O   is an open set containing Z   , and w > 0   then for some t   , x t 1 O   and γ t < w   .
Proof. Without loss of generality, suppose that O   is bounded and O O *   .
Choose an open set O 1   such that Z O 1   , O ¯ 1 O   ; applying Lemmas 5 and 6, one gets that for δ = δ ( O 1 , O , w )   and for arbitrary initial conditions, P ( for some t , x t O and γ t < w ) > δ .   Repeating the argument of Lemma 5, one concludes that there exists t   such that x t O   and γ t < w   .   From now on we suppose that k > k + ( 0 )   . Choose k   such that k + ( 0 ) < k < k   ; using A3 (b), one obtains that for some ɛ 0 > 0   , P ( ξ 1 ξ 2 > 0 , or | ξ 1 | < ɛ 0 , or | ξ 2 | < ɛ 0 ) k   . Denote O 0 = { x : | φ ( x ) | < ɛ 0 }   and τ = inf { t : x t O 0 }   .
Without loss of generality, suppose that O 0   is bounded.
Lemma 8 Suppose that k > k + ( 0 )   , then there exist a constant b > 0   and a monotone decreasing function p ( )   such that lim a + p ( a ) = 0   and if γ 0 < w then P ( ln γ t < ln v b t for all t < τ ) > 1 p ( v / w ) .  
Proof. Define the sequences { ρ t }   and { σ t }   by
ρ t = ln u I ( ξ t 1 ξ t > 0 , or | ξ t 1 | < ɛ 0 , or | ξ t | < ɛ 0 ) +
+ ln d I ( ξ t 1 ξ t 0 & | ξ t 1 | ɛ 0 & | ξ t | ɛ 0 ) ,
σ t = ln w + i = 1 t ρ i .   Using ( 5 ) and definition of τ   , one obtains that for all t < τ   , γ t σ t   . The variables ρ t   are identically distributed, take the values ln u   and ln d   , and
E ρ t = ln u P ( ξ t 1 ξ t > 0 , or | ξ t 1 | < ɛ 0 , or | ξ t | < ɛ 0 ) +
+ ln d P ( ξ t 1 ξ t 0 & | ξ t 1 | ɛ 0 & | ξ t | ɛ 0 )
ln u k + ln d ( 1 k ) < ln u k + ln d ( 1 k ) = 0 .
Moreover, the variables in the set { ρ t , t even }   , as well as the variables in the set { ρ t , t odd }   , are independent.
Denote b = E ρ t / 2   . One has P ( ln γ t < ln v b t for all t < τ ) P ( σ t < ln v b t for all t ) =   = P ( i = 1 t ( ρ i + 2 b ) < ln v ln w + b t for all t ) 1 p ( v / w ) ,   where p ( a ) = p 1 ( a ) + p 2 ( a )   , p 1 ( a ) = P ( 1 i t ( ρ i + 2 b ) ln a 2 + b 2 t for all t ) ,   p 2 ( a ) = P ( 1 i t ( ρ i + 2 b ) ln a 2 + b 2 t for all t ) ;   the sum   (   ) is taken over the even (odd) values of i   . Both   and   are sums of i.i.d.r.v. with zero mean, hence both p 1 ( a )   and p 2 ( a )   tend to zero as a +   . Lemma 8 is proved.   Define the stopping times τ v = inf { t : x t O 0 or ln γ t ln v b t }   . Recall that f   is the primitive of φ   such that inf x f ( x ) = 0   . Fix an open set O   such that Z O O 0   and sup x O f ( x ) < inf x O 0 f ( x )   , and denote δ = inf x O 0 f ( x ) sup x O f ( x )   .
Lemma 9 Let k > k + ( 0 )   , x 0 O   , and γ 0 < w   , then P ( τ v < ) K v 2 + p ( v / w ) ;   here K   is a positive constant, and p ( )   satisfies the statement of lemma 8.
Proof. We shall use shorthand notation of Lemma 3: f t : = f ( x t )   and φ t : = φ ( x t )   . According to ( 22 ), one has f t f t 1 γ t 1 φ t 1 ( φ t 1 + ξ t ) + M 2 γ t 1 2 ( φ t 1 + ξ t ) 2   γ t 1 φ t 1 ξ t + M γ t 1 2 ( φ t 1 2 + ξ t 2 ) .   This implies that f t f 1 Q t + Q t   , with Q t = | i = 2 t γ i 1 φ i 1 ξ i | , Q t = M i = 2 t γ i 1 2 ( φ i 1 2 + ξ i 2 ) .   Using Lemma 8, one gets P ( τ v < ) p ( v / w ) + P + P ,   where P = P ( Q τ v δ / 2 ) and P = P ( Q τ v δ / 2 ) .   According to the Chebyshev inequality, P 4 δ 2 E Q τ v 2 = 4 δ 2 i , j = 1 E i j ,   where E i j = E [ γ i 1 φ i 1 ξ i I ( i 1 < τ v ) γ j 1 φ j 1 ξ j I ( j 1 < τ v ) ] .   Using that the values γ i   , φ i   , ξ i   , and I ( i < τ v )   are i   -measurable, and using assumptions A1 and A2, one obtains that for i j   , E i j = 0   , and for i = j   , E i i = E [ γ i 1 2 φ i 1 2 I ( i 1 < τ v ) ξ i 2 ] v 2 e 2 b i sup x O 0 φ 2 ( x ) S .   Therefore, P 4 δ 2 i = 2 E i i 4 v 2 S δ 2 e 4 b 1 e 2 b sup x O 0 φ 2 ( x ) .   Similarly, P 2 δ E Q τ v = 2 M δ i = 2 E [ γ i 1 2 ( φ i 1 2 + ξ i 2 ) I ( i 1 < τ v ) ]   2 M v 2 δ i = 2 e 2 b i ( sup x O 0 φ 2 ( x ) + S ) = 2 M v 2 δ e 4 b 1 e 2 b ( sup x O 0 φ 2 ( x ) + S ) .   Taking K = [ 4 S δ 2 sup x O 0 φ 2 ( x ) + 2 M δ ( sup x O 0 φ 2 ( x ) + S ) ] e 4 b 1 e 2 b ,   one gets that P + P K v 2   . Lemma 9 is proved.  
Lemma 10 If k > k + ( 0 )   then t γ t <   .
Proof. From the definition of τ v   one easily sees that if τ v =   for some v > 0   , then t γ t <   . This implies that for any v > 0  
P ( γ t = ) P ( τ v = ) . (32)
Further, by virtue of Lemma 9, if x 0 O   and γ 0 < w   then
P ( τ w < ) K w + p ( 1 / w ) . (33)
Combining ( 32 ) and ( 33 ), one gets that for any w > 0  
P ( γ t = | x 0 O and γ 0 < w ) K w + p ( 1 / w ) . (34)
Define the event A w = { for some t , x t O and γ t < w }   , then by virtue of ( 34 ),
P ( γ t = | A w ) K w + p ( 1 / w ) . (35)
Denote by A ¯ w   the complementary event, A ¯ w = { for any t , x t O or γ t w }   .
By virtue of Lemma 7,
P ( γ t = & A ¯ w ) = 0 . (36)
Using ( 35 ) and ( 36 ), one gets
P ( γ t = ) = P ( γ t = & A w ) + P ( γ t = & A ¯ w )
( K w + p ( 1 / w ) ) P ( A w ) .   Taking into account that w   can be chosen arbitrarily small and that K w + p ( 1 / w ) 0   as w 0 +   , one concludes that P ( t γ t = ) = 0   .   Now, we are in a position to prove the theorem. Suppose that k < inf z k ( z )   , then V [ k ] =   , and by Lemma 2, { x t }   diverges. So, the statement (b) of Theorem is proved.
On the other hand, according to Lemma 10, if k > k + ( 0 )   then t γ t <   , and by Lemmas 1 and 2, the sequence { x t }   converges to a point from V [ k ]   .
Thus, the statement (a) of theorem is also established.
Acknowledgements This work was partially supported by the R&D Unit CEOC (Center for Research in Optimization and Control). The second author (PC) also gratefully acknowledges the financial support by the Portuguese program PRODEP `Medida 5 Acção 5.3 Formação Avançada de Docentes do Ensino Superior Concurso nr. 2/5.3/PRODEP/2001'.
References

  1. Harold J. Kushner and G.George Yin, Stochastic approximation algorithms and applications., Applications of Mathematics. 35. (1997) Berlin: Springer.
  2. M. Nevel'son and R. Has'minskii, Stochastic approximation and recursive estimation. Translated from the Russian by Israel Program for Scientific Translations. Translation edited by B. Silver., Translations of Mathematical Monographs. Vol. 47. (1976) Providence, R.I.: American Mathematical Society.
  3. Y. Fang and T. J. Sejnowski, Faster learning for dynamic recurrent backpropagation, Neural Computation, 2 (1990), pp. 270–273.
  4. Fernando M. Silva and Luis B. Almeida, Speeding up backpropagation, Advanced Neural Computers, (1990), R. Eckmiller (Ed.), Elsevier Science Publishers, Amsterdam pp. 151-158.
  5. L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov, Parameter adaptation in stochastic optimization, Online Learning in Neural Networks, Saad, D. (Ed.), (1998), pp. 111–134.
  6. P. J. Werbos, Neurocontrol and supervised learning: an overview and evaluation, in Handbook of Intelligent Control (Neural, Fuzzy, and Adaptive approaches), D. A. White and D. A. Sofge, eds., Van Nostrand Reinhold, New York, (1992), pp. 65–89.
  7. H. Kesten, Accelerated stochastic approximation., Ann. Math. Stat., 29 (1958), pp. 41–59.
  8. B. Delyon and A. Juditsky, Accelerated stochastic approximation., SIAM J. Optim., 3 (1993), pp. 868–881.
  9. R. Salomon and van J. L. Hemmen, Accelerating Backpropagation through Dynamic Self-Adaptation, Neural Networks, 9:4, (1996) Elsevier Science Publishers, pp 589–601.
  10. Roberto Battiti, Accelerated Backpropagation Learning: Two Optimization Methods, Complex Systems, Inc., (1989) pp. 331-342.
  11. Marcus Frean, A “thermal” perceptron learning rule, Neural Computation, 4:6 (1992), The MIT Press, Cambridge–Massachusetts, pp 946–957.