Stochastic Mechanics
Random Media
Signal Processing
and Image Synthesis
Mathematical Economics and Finance
Stochastic Optimization
Stochastic Control
Applications of
Mathematics
Stochastic Modelling
and Applied Probability
34
Edited by I. Karatzas
M. Yor
Advisory Board P. Brémaud
E. Carlen
W. Fleming
D. Geman
G. Grimmett
G. Papanicolaou
J. Scheinkman
Springer-Verlag Berlin Heidelberg GmbH
Applications of Mathematics
1. Fleming/Rishel, Deterministic and Stochastic Optimal Control (1975)
2. Marchuk, Methods of Numerical Mathematics, Second Edition (1982)
3. Balakrishnan, Applied Functional Analysis, Second Edition (1981)
4. Borovkov, Stochastic Processes in Queueing Theory (1976)
5. Liptser/Shiryayev, Statistics of Random Processes I: General Theory (1977)
6. Liptser/Shiryayev, Statistics of Random Processes II: Applications (1978)
7. Vorob'ev, Game Theory: Lectures for Economists and Systems Scientists (1977)
8. Shiryayev, Optimal Stopping Rules (1978)
9. Ibragimov/Rozanov, Gaussian Random Processes (1978)
10. Wonham, Linear Multivariable Control: A Geometric Approach, Third Edition (1985)
11. Hida, Brownian Motion (1980)
12. Hestenes, Conjugate Direction Methods in Optimization (1980)
13. Kallianpur, Stochastic Filtering Theory (1980)
14. Krylov, Controlled Diffusion Processes (1980)
15. Prabhu, Stochastic Storage Processes: Queues, Insurance Risk, and Dams (1980)
16. Ibragimov/Has'minskii, Statistical Estimation: Asymptotic Theory (1981)
17. Cesari, Optimization: Theory and Applications (1982)
18. Elliott, Stochastic Calculus and Applications (1982)
19. Marchuk/Shaidourov, Difference Methods and Their Extrapolations (1983)
20. Hijab, Stabilization of Control Systems (1986)
21. Protter, Stochastic Integration and Differential Equations (1990)
22. Benveniste/Métivier/Priouret, Adaptive Algorithms and Stochastic Approximations (1990)
23. Kloeden/Platen, Numerical Solution of Stochastic Differential Equations (1992)
24. Kushner/Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time (1992)
25. Fleming/Soner, Controlled Markov Processes and Viscosity Solutions (1993)
26. Baccelli/Brémaud, Elements of Queueing Theory (1994)
27. Winkler, Image Analysis, Random Fields and Dynamic Monte Carlo Methods (1995)
28. Kalpazidou, Cycle Representations of Markov Processes (1995)
29. Elliott/Aggoun/Moore, Hidden Markov Models: Estimation and Control (1995)
30. Hernández-Lerma/Lasserre, Discrete-Time Markov Control Processes (1995)
31. Devroye/Györfi/Lugosi, A Probabilistic Theory of Pattern Recognition (1996)
32. Maitra/Sudderth, Discrete Gambling and Stochastic Games (1996)
33. Embrechts/Klüppelberg/Mikosch, Modelling Extremal Events (1997)
34. Duflo, Random Iterative Models (1997)
Marie Duflo
Random Iterative
Models
Translated by Stephen S. Wilson
Springer
Marie Duflo
Université de Marne-la-Vallée
Équipe d'Analyse et de Mathématiques Appliquées
2, rue de la Butte Verte
93166 Noisy-le-Grand Cedex, France
Managing Editors
I. Karatzas
Departments of Mathematics and Statistics, Columbia University
New York, NY 10027, USA
M. Yor
Laboratoire de Probabilités, Université Pierre et Marie Curie
4 Place Jussieu, Tour 56, F-75230 Paris Cedex, France
Title of the French original edition: Méthodes récursives aléatoires
Published by Masson, Paris 1990
Cover picture: From a report on Prediction of Electricity Consumption drawn
up in 1993 for E.D.F. by Misiti M., Misiti Y., Oppenheim G. and Poggi J.M.
Library of Congress Cataloging-in-Publication Data
Duflo, Marie.
[Méthodes récursives aléatoires. English]
Random iterative models / Marie Duflo ; translated by Stephen S. Wilson.
p. cm. (Applications of mathematics ; 34)
Includes bibliographical references (p. ) and index.
ISBN 978-3-642-08175-0
ISBN 978-3-662-12880-0 (eBook)
DOI 10.1007/978-3-662-12880-0
1. Iterative methods (Mathematics) 2. Stochastic processes.
3. Adaptive control systems-Mathematical models.
I. Title.
II. Series.
QA297.8.D8413 1997
003'.76'015114-dc21
96-45470
CIP
Mathematics Subject Classification (1991): 60F05/15, 60G42, 60J10/20,
62J02/05, 62L20, 62M05/20, 93E12/15/20/25
ISSN 0172-4568
ISBN 978-3-642-08175-0
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication
of this publication or parts thereof is permitted only under the provisions of the German Copyright
Law of September 9, 1965, in its current version, and permission for use must always be obtained
from Springer-Verlag Berlin Heidelberg GmbH.
Violations are liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1997
Originally published by Springer-Verlag Berlin Heidelberg New York in 1997
Softcover reprint of the hardcover 1st edition 1997
Typeset from the translator's LaTeX files with Springer-TeX style files
SPIN: 10070196
41/3143 - 5 4 3 2 1 0 - Printed on acid-free paper
Preface
Be they random or nonrandom, iterative methods have progressively gained sway
with the development of computer science and automatic control theory.
Thus, being easy to conceive and simulate, stochastic processes defined by an
iterative formula (linear or functional) have been the subject of many studies. The
iterative structure often leads to simpler and more explicit arguments than certain
classical theories of processes.
On the other hand, when it comes to choosing step-by-step decision algorithms
(sequential statistics, control, learning, ... ) recursive decision methods are
indispensable. They lend themselves naturally to problems of the identification and
control of iterative stochastic processes. In recent years, know-how in this area
has advanced greatly; this is reflected in the corresponding theoretical problems,
many of which remain open.
At Whom Is This Book Aimed?
I thought it useful to present the basic ideas and tools relating to random iterative
models in a form accessible to a mathematician familiar with the classical concepts
of probability and statistics but lacking experience in automatic control theory.
Thus, the first aim of this book is to show young research workers that work
in this area is varied and interesting and to facilitate their initiation period. The
second aim is to present more seasoned probabilists with a number of recent
original techniques and arguments relating to iterative methods in a fairly classical
environment.
Very diverse problems (prediction of electricity consumption, production
control, satellite communication networks, industrial chemistry, neurons, ... ) lead
engineers to become interested in stochastic algorithms which can be used to
stabilize, identify or control increasingly complex models. Their experience and
the diversity of their techniques go far beyond our aims here. But the third aim
of the book is to provide them with a toolbox containing a quite varied range of
basic tools.
Lastly, it seems to me that many lectures on stochastic processes could be
centred around a particular chapter. The division into self-contained parts described
below is intended to make it easy for undergraduate or postgraduate students and
their teachers to access selected and relevant material.
Contents
The overall general foundations are laid in Part I. The other three parts can be read
independently of each other (apart from a number of easily locatable references
and optional examples). This facilitates partial use of this text as research material
or as teaching material on stochastic models or the statistics of processes.
Part I. Sources of Recursive Methods
Chapter 1 presents the first mathematical ideas about sequential statistics and
about stochastic algorithms (Robbins-Monro). An outline sketch of the theory
of martingales is given together with certain complementary information about
recursive methods.
Chapter 2 summarizes the theory of convergence in distribution and that of
the central limit theorem for martingales, which is then applied to the Robbins-Monro algorithm. The AR(1) vectorial autoregressive model of order 1 is studied
in detail; this model will provide the essential link between the following three
parts.
Despite its abstract style, the development of this book has been heavily
influenced by dialogues with other research workers interested in highly specific
industrial problems. Chapter 3 gives an all-too-brief glimpse of such examples.
Part II. Linear Models
The mathematical foundations of automatic control theory, which were primed in
Chapter 2 based on the AR(1) model, are developed here.
Chapter 4 discusses the concepts of causality and excitation for ARMAX
models. The importance of transferring the excitation of the noise onto that of
the system is emphasized and algebraic criteria guaranteeing such a transfer are
established.
Identification and tracking problems are considered in Chapter 5, using
classical (gradient and least squares) or more recent (weighted least squares)
estimators.
Part III. Nonlinear Models
The first part of Chapter 6 describes the concept of 'stability' of an iterative Markov
Fellerian model. Simple criteria ensuring the almost sure weak convergence
of empirical distributions to a unique stationary distribution are obtained. This
concept of stability seems to me, pedagogically and practically, much more
manageable than the classical notion of recurrence; moreover, many models
(fractals, automatic control theory) can be stable without being recurrent. A number
of properties of rates of convergence in distribution and almost sure convergence
complete this chapter.
The identification and tracking problems resolved in Chapter 5 for the linear
case are much more difficult for functional regression models. Some partial
solutions are given in Chapter 7, largely using the recursive kernel estimator.
Part IV. Markov Models
Paradoxically, Part IV of this book is the most classical. It involves a brief
presentation of probabilistic topics described in greater detail elsewhere, placing
them in the context of the preceding chapters.
The general theory of the recurrence of Markov chains is finally given in
Chapter 8. Readers will note that, in many cases, it provides a useful complement
to the stability theory of Chapter 6, but at the cost of much heavier techniques
(and stronger assumptions about the noise).
On the subject of learning, Chapter 9 outlines the theory of controlled Markov
chains and onaverage optimal controls. The chapter ends with a number of results
from the theory of stochastic approximation introduced in Chapter 1: the ordinary
differential equation method, Markovian perturbation, traps, applications to visual
neurons and principal components analysis.
What You Will Not Find
Since the main aim was to present recursive methods which are useful in adaptive
control theory, it was natural to emphasize the almost sure properties (laws of large
numbers, laws of the iterated logarithm, optimality of a strategy for the average
asymptotic cost, ... ). Convergence in distribution is thus only discussed in outline
and the principles of large deviations are not touched upon.
Iterative Markov models on finite spaces, the simulation of a particular model
with a given stationary distribution and simulated annealing are currently in vogue,
particularly in image processing circles. Although they come under the umbrella
of 'random iterative models', they are not dealt with here.
These gaps have been partially filled in my recent book 'Algorithmes
Stochastiques', 1996, Springer-Verlag.
History
The history of this book dates back to the end of the 1980s. It was developed
at that time within the statistical research team of the Universite ParisSud, in
particular, by the automatic control team. Its contents have been enriched by
numerous exchanges with the research workers of this team and its composition
has been smoothed by several years of postgraduate courses. The first French
edition of this book was published by Masson in 1990.
When Springer-Verlag decided to commission an English translation in 1992,
I felt it was appropriate to present a reworked text, taking into account the rapid
evolution of some of the subjects treated. This book is a translation of that
adaptation, which was carried out at the beginning of 1993 (with a number of
additions and alterations to the Bibliography).
Acknowledgments
It is impossible to thank all those research workers and students at the Université
Paris-Sud and at the Université de Marne-la-Vallée where I have worked since
1993, who have contributed to this book through their dialogue. Their contributions
will be acknowledged in the Bibliography.
Three research workers who have read and critically reviewed previous drafts
deserve special mention: Bernard Bercu, Rachid Senoussi and Abderhamen Touati.
Lastly, Dr Stephen Wilson has been responsible for the English translation. He
deserves hearty thanks for the intelligent and most useful critical way in which he
achieved it.
Notation
Numbering System
• Within a chapter, a continuous numbering system is used for the Exercises on
the one hand and for the Theorems, Propositions, Corollaries, Lemmas and
Definitions on the other hand. The references indicate the chapter, section and
number: Theorem 1.3.10 (or Exercise 1.3.10) occurs in Section 3 of Chapter
1 and is the tenth of that chapter.
• □ marks the end of a Proof; ◊ marks the end of the statement of an Exercise
or a Remark.
Standard Mathematical Symbols
• Abbreviations. Constant is abbreviated to const. and ln(ln x) to LL.
• Sets. ℕ = integers ≥ 0; ℤ = relative integers; ℚ = rational numbers; ℝ = real
numbers; ℂ = complex numbers.
1_A is the characteristic function for A:
1_A(x) = 1 if x ∈ A, 0 if x ∉ A.
• Sequences. If (u_n) is a real monotonic sequence, u_∞ is its limit, either finite
or infinite.
If (u_n) and (v_n) are two positive sequences, u_n = O(v_n) (resp. o(v_n)) means
that (u_n/v_n) is bounded (resp. tends to 0).
• Vectors. u, ᵗu, *u, (u, v), ||u|| – see Section 4.2.1.
• Matrices d × d. A = (A_ij), I or I_d identity, ᵗA, *A, Tr A, ||A||, det A – see
Section 4.2.1; ρ(A) – see Section 2.3.1.
• Positive Hermitian Matrices. λ_min(C), λ_max(C), √C, C⁻¹, C₁ ≤ C₂ – see Section
4.2.1; C₁ ∘ C₂ – see Section 6.3.2.
Norm of a rectangular matrix B, ||B|| – see Section 4.2.1.
• Excitation of a Sequence of Vectors Y = (Y_n). e_n(Y) = Σ_{k=0}^n Y_k ᵗY_k. We also
set (see Section 4.2) s_n(Y) = Σ_{k=0}^n ||Y_k||²,
f_n(Y) = ᵗY_n (e_n(Y))⁻¹ Y_n and g_n(Y) = ᵗY_n (e_{n-1}(Y))⁻¹ Y_n.
• Functions. If φ is differentiable from ℝᵖ to ℝ^q, we denote its Jacobian matrix
by Dφ. When q = 1, ∇φ = ᵗDφ is its gradient.
• Lipschitz function. Li(r, s) – Section 6.3.2.
• ODE – Section 9.2.
Standard Probabilistic Symbols
• Measure.
(Ω, A, P) probability space; 𝔽 = (F_n) filtration – see Section 1.1.5;
(Eⁿ, ℰ^⊗n) = (E, ℰ)ⁿ; B_E Borel σ-field for E.
For f measurable from (Ω, A) to (E, ℰ) and Γ ∈ ℰ, we denote {f ∈ Γ} =
{ω; f(ω) ∈ Γ}.
For two sequences of positive random variables (α_n) and (β_n), we denote
{α_n = O(β_n)} = {ω; α_n(ω) = O(β_n(ω))}
{α_n = o(β_n)} = {ω; α_n(ω) = o(β_n(ω))}
a.s. = almost surely
⟨M⟩ = increasing process, hook of a martingale – see Sections 1.3.1, 2.1.3
and 4.3.2.
• Convergence.
→_{a.s.} = converges almost surely
→_P = converges in probability
→_D = converges in distribution.
Symbols for Linear Models
• Models. ARMAX, ARMA, ARX, AR, MA – see Section 4.1.1; RMA – see
Section 5.4.1
• Estimators. LS, RLS – Section 5.2.1; SG – Section 5.3.1; WLS – Section
5.3.2; ELS, AML – Section 5.4.1
• R for the delay operator – Section 4.1.1
Symbols for Nonlinear Models
ARF – Section 6.2.3; ARXF – Section 6.2.4; ARCH – Section 6.3.3
Table of Contents
Preface
Notation
Part I. Sources of Recursive Methods
1. Traditional Problems
1.1 Themes
1.1.1 Dosage: Robbins-Monro Procedure
1.1.2 Search for a Maximum: Kiefer-Wolfowitz Procedure
1.1.3 The Two-armed Bandit
1.1.4 Tracking
1.1.5 Decisions Adapted to a Sequence of Observations
1.1.6 Recursive Estimation
1.2 Deterministic Recursive Approximation
1.2.1 Search for a Zero of a Continuous Function
1.2.2 Search for Extrema
1.3 Random Sequences and Series
1.3.1 Martingales
1.3.2 Stopping and Inequalities
1.3.3 Robbins-Siegmund Theorem
1.3.4 Laws of Large Numbers
1.3.5 Laws of Large Numbers and Limits of Maxima
1.3.6 Noise and Regressive Series
1.4 Stochastic Recursive Approximation
1.4.1 Robbins-Monro Algorithm
1.4.2 Control
1.4.3 The Two-armed Bandit
1.4.4 Tracking
1.4.5 Recursive Estimation
2. Rate of Convergence
2.1 Convergence in Distribution
2.1.1 Weak Convergence on a Metric Space
2.1.2 Convergence in Distribution of Random Vectors
2.1.3 Central Limit Theorem for Martingales
2.1.4 Lindeberg's Condition
2.1.5 Applications
2.2 Rate of Convergence of the Robbins-Monro Algorithm
2.2.1 Convergence in Distribution of the Robbins-Monro Algorithm
2.2.2 Rate of Convergence of Newton's Estimator
2.3 Autoregressive Models
2.3.1 Spectral Radius
2.3.2 Stability
2.3.3 Random Geometric Series
2.3.4 Explosive Autoregressive Model
2.3.5 Jordan Decomposition
3. Current Problems
3.1 Linear Regression
3.1.1 Multiple Regression
3.1.2 Time Series
3.1.3 Tuning
3.2 Nonlinear Regression
3.2.1 A Tank
3.2.2 Prediction of Electricity Consumption
3.3 Satellite Communication: Markov Models
3.4 Neurons: Learning
3.4.1 From Neurobiology to Learning Algorithms
3.4.2 Artificial Neural Networks
Part II. Linear Models
4. Causality and Excitation
4.1 ARMAX Models
4.1.1 Definitions and Notation
4.1.2 Companion Matrix
4.1.3 Causality and Stability
4.2 Excitation
4.2.1 Positive Hermitian Matrices
4.2.2 Sequences and Series of Positive Hermitian Matrices
4.2.3 Excitation of a Sequence of Vectors
4.2.4 Inversion of the Excitation
4.3 Laws of Large Numbers
4.3.1 Preliminary Calculations
4.3.2 Vector Martingales
4.3.3 Multidimensional Regressive Series
4.3.4 Counterexamples
4.4 Transfers of Excitation
4.4.1 Transfers of the Noise Excitation
4.4.2 Irreducibility
4.4.3 Transfers of Excitation to an ARMAX Model
4.4.4 Excitation of an AR(1) Model
4.4.5 Excitation of an ARMA Model
5. Linear Identification and Tracking
5.1 Predict, Estimate, Track
5.1.1 Estimators and Predictors
5.1.2 Tracking
5.2 Identification of Regression Models
5.2.1 Least-squares Estimator
5.2.2 Identification of an AR(p) Model
5.3 Tracking by Regression Models
5.3.1 Gradient Estimator
5.3.2 Weighted Least-squares Estimator
5.3.3 Tracking
5.4 ARMAX Model: Estimators and Predictors
5.4.1 Description of Estimators and Predictors
5.4.2 Fourier Series: Passivity
5.4.3 Consistency and Prediction Errors
5.5 ARMAX Model: Identification and Tracking
5.5.1 Identification of an ARMA Model
5.5.2 Identification of an ARMAX Model
5.5.3 Tracking: Optimality
5.5.4 Tracking: Optimality and Identification
Part III. Nonlinear Models
6. Stability
6.1 Stability and Recurrence
6.1.1 Stability and Recurrence of a System
6.1.2 Markov Chain
6.1.3 Stationary Distribution
6.2 Lyapunov's Method
6.2.1 Stabilization Criteria
6.2.2 Stability Criteria
6.2.3 Stable ARF(p) Models
6.2.4 Markov Representations
6.3 Lipschitz Mixing
6.3.1 Stability of Iterative Lipschitz Models
6.3.2 Lipschitz Mixing
6.3.3 Stable Iterative Model of Order p
6.4 Rates
6.4.1 Law of the Iterated Logarithm
6.4.2 Rates for a Stable AR(p) Model
6.4.3 Convergence in Distribution of Continuous Processes
6.4.4 Functional Central Limit Theorem
6.4.5 Uniform Laws of Large Numbers
7. Nonlinear Identification and Control
7.1 Estimation of the Stationary Distribution of a Stable Model
7.1.1 Empirical Estimators
7.1.2 Regularized Empirical Estimators
7.1.3 Estimation of the Density of the Stationary Distribution
7.2 Estimation of a Regression Function
7.2.1 Empirical Estimators of a Regression Function
7.2.2 Regression with a Stable Explicative Variable
7.2.3 Identification of a Stable ARF(p) Model
7.2.4 Regression with a Stabilized Explicative Variable
7.2.5 Prediction Errors
7.3 Controlled Markov Chain
7.3.1 Modelling and Examples
7.3.2 Likelihood of a Controlled Markov Chain
7.3.3 Stabilization
7.3.4 Optimal Control
7.3.5 Optimal Quadratic Cost of an ARX(1,1) Model
Part IV. Markov Models
8. Recurrence
8.1 Markov Chain
8.1.1 Data and Definitions
8.1.2 Markov Properties
8.1.3 Return Times
8.1.4 Coupling
8.2 Recurrence and Transience
8.2.1 Concept of Recurrence
8.2.2 Atomic Markov Chains
8.2.3 Random Walks on ℝ
8.2.4 From Atoms to Small Sets
8.3 Rate of Convergence to the Stationary Distribution
8.3.1 Convergence of Transition Probabilities to the Stationary
Distribution
8.3.2 Central Limit Theorem
8.3.3 Orey's Theorem
8.3.4 Riemannian and Geometric Recurrence
9. Learning
9.1 Controlled Markov Chains
9.1.1 Markov Property
9.1.2 Return Times and Recurrence
9.1.3 Average Costs
9.2 Stochastic Algorithms
9.2.1 Regions of Attraction of a Differential Equation
9.2.2 The Ordinary Differential Equation Method
9.2.3 Markovian Perturbations
9.3 Search for a Strongly Attractive Point
9.3.1 Gradient Estimator for a Linear Model
9.3.2 Strongly Attractive Point
9.3.3 Visual Neurons
9.3.4 Search for the Minimum of a Potential
9.4 How Algorithms Avoid Traps
9.4.1 Negligible Sets for Regressive Series
9.4.2 Recursive Principal Components Analysis
Bibliography
Index
Part I
Sources of Recursive Methods
1. Traditional Problems
We begin by introducing a number of very classical examples to illustrate the main themes
of this book. We then recall, without proofs, the basic notions of analysis and probability
theory which figure in numerous reference works; we also describe the first tools specific
to recursive methods which we apply to the examples given at the beginning of the chapter.
1.1 Themes
1.1.1 Dosage: Robbins-Monro Procedure
1.1.2 Search for a Maximum: Kiefer-Wolfowitz Procedure
1.1.3 The Two-armed Bandit
1.1.4 Tracking
1.1.5 Decisions Adapted to a Sequence of Observations
1.1.6 Recursive Estimation
1.2 Deterministic Recursive Approximation
1.2.1 Search for a Zero of a Continuous Function
1.2.2 Search for Extrema
1.3 Random Sequences and Series
1.3.1 Martingales
1.3.2 Stopping and Inequalities
1.3.3 Robbins-Siegmund Theorem
1.3.4 Laws of Large Numbers
1.3.5 Laws of Large Numbers and Limits of Maxima
1.3.6 Noise and Regressive Series
1.4 Stochastic Recursive Approximation
1.4.1 Robbins-Monro Algorithm
1.4.2 Control
1.4.3 The Two-armed Bandit
1.4.4 Tracking
1.4.5 Recursive Estimation
1.1 Themes
1.1.1 Dosage: Robbins-Monro Procedure

A dose u of a chemical product creates a random effect measured by f(u, ε),
where ε is a random variable and f is an unknown function from ℝ² to ℝ. The
mean effect is assumed to be increasing; in other words, f(u, ε) is assumed to be a
random variable with mean φ(u) = E[f(u, ε)], where φ is increasing (unknown).
We seek to determine the dose u* which creates a mean effect of a given level
α; in other words, to solve the equation φ(u*) = α.
The Robbins-Monro procedure is as follows. U_0 is chosen arbitrarily and
administered to a subject which reacts with the effect X_1 = f(U_0, ε_1). The
procedure continues recursively. At time n, a dose U_n is chosen, based on previous
observations, and administered to a subject independent of those already treated:
the effect is X_{n+1} = f(U_n, ε_{n+1}). The subjects treated are assumed to be of the
same type and mutually independent, which translates into the fact that (ε_n) is a
sequence of independent random variables with the same law as ε. Thus, since
U_n depends only upon the previous observations,

E[X_{n+1} | U_0, X_1, ..., X_n] = φ(U_n).

If X_{n+1} is greater than α, it is sensible to reduce the dose; if X_{n+1} is less than
α, it is sensible to increase the dose. The Robbins-Monro algorithm for choosing
the U_n is of the form:

U_{n+1} = U_n − γ_n (X_{n+1} − α),   where γ_n ≥ 0.
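As a concrete illustration (not taken from the book), the algorithm can be simulated in a few lines. The mean-effect function φ(u) = u, the uniform noise and the step sizes γ_n = 1/(n+1) below are all illustrative assumptions:

```python
import random

def robbins_monro(observe, alpha, u0, n_steps=10000):
    """Iterate U_{n+1} = U_n - gamma_n * (X_{n+1} - alpha), where
    X_{n+1} = observe(U_n) is a noisy measurement with mean phi(U_n)."""
    u = u0
    for n in range(n_steps):
        gamma_n = 1.0 / (n + 1)          # illustrative step sizes
        x = observe(u)                   # noisy effect of dose u
        u -= gamma_n * (x - alpha)
    return u

# Illustrative setup: phi(u) = u, noise uniform on [-1, 1], target
# level alpha = 2.0, so the solution of phi(u*) = alpha is u* = 2.0.
random.seed(0)
u_star = robbins_monro(lambda u: u + random.uniform(-1.0, 1.0),
                       alpha=2.0, u0=0.0)
```

The classical conditions Σγ_n = ∞ and Σγ_n² < ∞, satisfied by γ_n = 1/(n+1), are what make the iterates forget the starting point while averaging out the noise.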
1.1.2 Search for a Maximum: Kiefer-Wolfowitz Procedure

We return to the above problem of dosage, but now our aim is to maximize the
mean effect. The function φ is assumed to have a unique maximum at u*:

u* = arg max φ(u).

This reduces to finding u* such that F(u*) = 0, where

F(u) = lim_{c→0, c>0} (1/(2c)) [φ(u + c) − φ(u − c)].

Since φ is increasing for u ≤ u* and decreasing for u ≥ u*, we have F(u) ≥ 0 if
u ≤ u* and F(u) ≤ 0 if u ≥ u*.
The Kiefer-Wolfowitz procedure is analogous to the Robbins-Monro procedure.
Let (c_n) be a suitable sequence which decreases towards 0. To begin with, we
choose an arbitrary dose U₀ and treat two identical independent subjects, one with
dose U₀ + c₀, the other with dose U₀ − c₀: the effects are X₁¹ and X₁². We then
proceed recursively. At time n, we choose a dose U_n, taking into account
previous observations, and treat two subjects identical to the previous ones which
are mutually independent and independent of the subjects previously observed.
One of these subjects is treated with dose U_n + c_n, the other with dose U_n − c_n.
The results are:

X¹_{n+1} = f(U_n + c_n, ε¹_{n+1})   and   X²_{n+1} = f(U_n − c_n, ε²_{n+1}).

All the random variables ε^i_n, n ≥ 1 and i = 1, 2, are mutually independent with the
same law as ε. From the above, it is natural to increase the dose if X¹_{n+1} − X²_{n+1} ≥ 0
and reduce it otherwise. Thus, we have the Kiefer-Wolfowitz algorithm for
choosing doses:

U_{n+1} = U_n + (γ_n/c_n)(X¹_{n+1} − X²_{n+1}),   γ_n ≥ 0.
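As a sketch (with an invented mean effect F(u) = −(u − 2)², maximal at u* = 2, unit Gaussian noise, and the common choices γ_n = 1/n, c_n = n^{−1/3}, all assumptions of this illustration rather than the book's):

```python
import random

def kiefer_wolfowitz(observe, u0=0.0, n_steps=20000):
    """Kiefer-Wolfowitz recursion:
    U_{n+1} = U_n + (gamma_n / c_n) (X^1_{n+1} - X^2_{n+1})."""
    u = u0
    for n in range(1, n_steps + 1):
        gamma, c = 1.0 / n, n ** (-1.0 / 3.0)
        x1 = observe(u + c)   # subject treated at dose U_n + c_n
        x2 = observe(u - c)   # independent subject at dose U_n - c_n
        u += (gamma / c) * (x1 - x2)
    return u

# Hypothetical mean effect F(u) = -(u - 2)^2 with N(0, 1) noise.
random.seed(1)
u_star = kiefer_wolfowitz(lambda u: -(u - 2.0) ** 2 + random.gauss(0, 1))
```

The finite-difference (X¹ − X²)/2c_n mimics φ(U_n), so the iterates climb towards the maximizer.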
1.1.3 The Two-armed Bandit

The two-armed bandit is a slot machine with two levers A and B. For lever A
(resp. B), the gain is 1 with probability θ^A and 0 with probability 1 − θ^A (resp. 1
with probability θ^B and 0 with probability 1 − θ^B).
We suppose that 0 < θ^A < 1 and 0 < θ^B < 1, θ^A ≠ θ^B. We denote:

θ^A ∨ θ^B = sup(θ^A, θ^B),
A ∨ B = A if θ^A > θ^B, B otherwise,
A ∧ B = B if θ^A > θ^B, A otherwise.

At time n, the player chooses the lever U_n, U_n = A or U_n = B. This choice
is determined by the previous experiences (U₀ is arbitrary). The player obtains a
gain X_{n+1}. Conditionally on the choice of the lever, X_{n+1} is a random variable,
independent of the past, with a Bernoulli law with parameter θ^A if U_n = A and
parameter θ^B if U_n = B.
Let G_n = X₁ + ⋯ + X_n. According to the law of large numbers, if lever A is
always used, G_n/n → θ^A, while if B is always used, G_n/n → θ^B.
The player wishes to optimize his average gain G_n/n. If he knows θ^A and
θ^B, he will always use the lever A ∨ B. But, if he does not know
these parameters, he will have to estimate them; in other words, at each time n,
he will take two decisions based on the previous observations:
• choice of an 'estimator' θ̂_n = (θ̂_n^A, θ̂_n^B) of θ = (θ^A, θ^B);
• choice of the lever U_n to use for the next try.
Thus, there are two problems:
• to achieve 'strong consistency' of the estimator: θ̂_n → θ;
• to achieve 'optimality on average' of the gain: G_n/n → θ^A ∨ θ^B, in other
words an asymptotic gain which is as good as if θ were known.
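One adapted strategy aiming at both requirements can be sketched as follows: a 'play-the-leader' rule with forced plays at the square times, so that each empirical frequency keeps being refreshed. The particular values θ^A = 0.3, θ^B = 0.6 and the forcing schedule are illustrative assumptions, not the book's.

```python
import random

def play_the_leader(theta, n_steps=20000, seed=2):
    """At times n = k^2 the levers are forced alternately (keeping both
    estimators consistent); otherwise pull the lever with the larger
    current empirical frequency of gains."""
    rng = random.Random(seed)
    pulls = {"A": 0, "B": 0}
    wins = {"A": 0, "B": 0}
    forced = {k * k: ("A" if k % 2 else "B") for k in range(1, 150)}
    gain = 0
    for n in range(1, n_steps + 1):
        est = {a: (wins[a] / pulls[a] if pulls[a] else 0.0) for a in "AB"}
        u = forced.get(n) or ("A" if est["A"] >= est["B"] else "B")
        x = 1 if rng.random() < theta[u] else 0   # Bernoulli(theta^U) gain
        pulls[u] += 1
        wins[u] += x
        gain += x
    est = {a: wins[a] / pulls[a] for a in "AB"}
    return est, gain / n_steps

est, avg_gain = play_the_leader({"A": 0.3, "B": 0.6})
```

The forced plays become rare (of the order of √n among n trials), so the average gain still approaches θ^A ∨ θ^B while both estimates stay consistent.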
1.1.4 Tracking

We assume we have a deterministic sequence (z_n) which we seek to track using
a device the result of which at any moment in time depends upon an adjustment
carried out at the previous moment and upon a 'noise'.
More precisely, we suppose that ε = (ε_n) is a sequence of independent random
variables, identically distributed with mean zero and variance σ²; this is the
sequence of errors called the 'noise'.
At the start, we choose a real adjustment U₀; the observation is

X₁ = θU₀ + ε₁,

where θ is an unknown real parameter.
At time n, we choose a real adjustment U_n as a function of the previous
observations; the observation is then

X_{n+1} = θU_n + ε_{n+1}.

It is now a matter of tracking the trajectory (z_n). One natural quadratic criterion
involves seeking to make the cost until time n,

C_n = Σ_{k=1}^n (X_k − z_k)²,

small. If all the terms of the sequence (z_n) are equal to some z, the problem
reduces to a problem of adaptation to a level (or to a target).
If θ is known, θ ≠ 0, it is natural to choose U_n = z_{n+1}/θ, so that X_{n+1} is a
random variable with mean z_{n+1}. In this case, C_n/n → σ².
If θ is unknown, θ ≠ 0, at any moment in time, the problem reduces to taking
two decisions based on the previous observations:
• choice of an estimator θ̂_n of θ;
• choice of an adjustment U_n, for example, U_n = z_{n+1}/θ̂_n.
As in Section 1.1.3, there are two problems:
• to achieve 'strong consistency' of the estimator: θ̂_n → θ;
• to achieve optimal tuning: C_n/n → σ², as in the case when θ is known.
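A certainty-equivalence sketch of this scheme (with a least-squares estimator of θ, an illustrative choice rather than the book's own estimators): at each step take U_n = z_{n+1}/θ̂_n, then update θ̂ from the new observation.

```python
import random

def track(theta, targets, seed=3):
    """Tracking for X_{n+1} = theta * U_n + eps_{n+1}: estimate theta by
    least squares, then take the adjustment U_n = z_{n+1} / theta_hat_n."""
    rng = random.Random(seed)
    theta_hat, sux, suu, cost = 1.0, 0.0, 0.0, 0.0
    for z in targets:                    # z plays the role of z_{n+1}
        u = z / theta_hat                # certainty-equivalence adjustment
        x = theta * u + rng.gauss(0, 1)  # observation, noise variance sigma^2 = 1
        cost += (x - z) ** 2             # running quadratic cost C_n
        sux += u * x
        suu += u * u
        theta_hat = sux / suu            # least-squares estimate of theta
    return theta_hat, cost / len(targets)

# Adaptation to the constant level z = 5 with true theta = 2.
theta_hat, avg_cost = track(theta=2.0, targets=[5.0] * 5000)
```

Both requirements show up numerically: θ̂_n approaches θ and the average cost C_n/n approaches σ² = 1.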
1.1.5 Decisions Adapted to a Sequence of Observations

All the problems which we shall meet in this book involve carrying out a sequence
of observations and taking an optimal decision in a step-by-step manner based on
these observations according to various criteria. The decision at time n (estimation,
adjustment, dosage, ...) is based solely on the observations carried out up to that
time. A common presentation of these problems is facilitated by the following
model.

Definition 1.1.1
1. A sequence of observations is defined by
• a probability space (Ω, A, P),
• a filtration 𝔽 = (F_n)_{n∈ℕ}, where for all n ∈ ℕ, F_n is a σ-field with F_n ⊂ F_{n+1} ⊂ A.
The σ-field F_n is the σ-field of events prior to n.
2. A sequence (X_n)_{n∈ℕ} of measurable functions from (Ω, A) to another
measurable space (E, ℰ) is said to be adapted to 𝔽 if, for all n, X_n is F_n-measurable. It is predictable for 𝔽 if, for all n, X_n is F_{n−1}-measurable.
Example
Most often, the observation at time n is a measurable function Z_n
from (Ω, A, P) to a state space (E, ℰ). For example, taking E = ℝ^d with the
Borel σ-field B_{ℝ^d}, Z_n is a random vector of dimension d. The σ-field F_n is
the σ-field generated by (Z₀, ..., Z_n); an F_n-measurable random variable has the
form F(Z₀, ..., Z_n) where F is a Borel function on (E, ℰ)^{n+1} = (E^{n+1}, ℰ^{⊗(n+1)}),
ℰ^{⊗n} being the product of n σ-fields equal to ℰ. Thus, since it only takes observations
carried out up to time n into account, the decision taken at time n must be F_n-measurable.
A sequence of decisions adapted to 𝔽 is then taken step by step. Most
often, the σ-fields ℰ or ℰ^{⊗n} will be understood, particularly in the case of Borel
σ-fields of ℝ^d. □
1.1.6 Recursive Estimation

We consider a sequence of observations with the notation of Definition 1.1.1; we
assume that it depends upon an unknown parameter θ ∈ ℝ^δ to be estimated.

Definition 1.1.2 An estimator of the parameter θ ∈ ℝ^δ adapted to the sequence
of observations of Definition 1.1.1 is a sequence (θ̂_n) of random vectors of dimension
δ, adapted to 𝔽; θ̂_n is the estimator of θ at time n.
This estimator is strongly consistent (resp. weakly consistent) if the sequence
(θ̂_n) converges almost surely (resp. in probability) to θ.
It is recursive if θ̂_n can be calculated recursively using a simple function of
θ̂_{n−1} and the result of the observation at time n.
Example 1 Given a probability distribution F on ℝ^d, a sample from the
distribution F is a sequence (X_n) of independent observations with distribution F.
At time n, the n-sample (X₁, ..., X_n) has been observed; F_n is the σ-field generated
by this n-sample and X_k also denotes the associated column matrix.
If F has mean μ and covariance Γ, the following empirical estimators of μ
and Γ at time n are often used:

Empirical mean: X̄_n = (1/n)(X₁ + ⋯ + X_n),
Empirical covariance: Σ̄_n = (1/n) Σ_{k=1}^n (X_k − X̄_n)(X_k − X̄_n)^t.

The empirical mean is a strongly consistent estimator of μ: this result is
the classical law of large numbers. It is also a recursive estimator, since

X̄_n = ((n−1)/n) X̄_{n−1} + (1/n) X_n.

In addition, it is known that √n(X̄_n − μ) converges in distribution to a Gaussian
random vector with mean zero and covariance Γ; this is the central limit theorem,
to which we shall return in Chapter 2. □
Exercise 1.1.1
Check that the empirical covariance Σ̄_n is a recursive, strongly
consistent estimator of Γ. □
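In the scalar case both empirical estimators can be updated in one pass, the mean by the recursion above and the variance by a companion recursion (a Welford-style update; the Gaussian sample below is only for illustration):

```python
import random

def recursive_moments(xs):
    """One-pass recursions: the empirical mean via
    mean_n = ((n-1)/n) mean_{n-1} + (1/n) x_n, and the empirical variance
    (normalized by 1/n, as in the empirical covariance) updated alongside it."""
    mean, var = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean                       # x_n - mean_{n-1}
        mean += delta / n                      # recursive empirical mean
        var += (delta * (x - mean) - var) / n  # recursive empirical variance
    return mean, var

random.seed(4)
mean, var = recursive_moments([random.gauss(3.0, 2.0) for _ in range(200000)])
```

Each step uses only the previous estimates and the new observation, which is exactly the recursiveness asked for in Exercise 1.1.1 (in dimension one).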
Example 2 We consider a distribution F on ℝ with density f with respect to
the Lebesgue measure and a sample (Y_n) from this distribution. We observe a
translation model (X_n), where X_n = Y_n + θ, for some real, unknown parameter
θ. In other words, the observation is a sample from the distribution with density

x ↦ f(x − θ).

Without knowing the function f, we may assume that F is integrable with zero
mean; then θ is the mean of X_n and we may use the empirical mean as estimator.
We may also assume that f is an even, continuous and strictly positive function.
We then consider the function

P(x) = P(X_n ≤ x) − 1/2 = P(Y_n ≤ x − θ) − 1/2 = ∫_{−∞}^{x−θ} f(t) dt − 1/2.

If we assume that f is even, then P(Y_n ≤ 0) = 1/2 and P(θ) = 0. Thus, we must
approximate the point at which P has value 0. We suppose that we have chosen
an estimator θ̂_{n−1} at time n − 1, and we observe X_n.
Since P is increasing, we propose (based on an idea analogous to that of dosage)
a recursive estimator

θ̂_n = θ̂_{n−1} − γ_n (1_{(X_n ≤ θ̂_{n−1})} − 1/2)

for some suitable positive sequence (γ_n). □
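A numerical sketch of this estimator, with Y_n taken standard Gaussian (an even, continuous, strictly positive density), θ = 1 and γ_n = 2/n; these choices are illustrative assumptions:

```python
import random

def recursive_translation_estimate(xs, gamma0=2.0):
    """Recursive estimator
    theta_n = theta_{n-1} - gamma_n (1_{(x_n <= theta_{n-1})} - 1/2),
    with gamma_n = gamma0 / n."""
    theta = 0.0
    for n, x in enumerate(xs, start=1):
        indicator = 1.0 if x <= theta else 0.0
        theta -= (gamma0 / n) * (indicator - 0.5)
    return theta

# Translation model X_n = Y_n + theta with Y_n ~ N(0, 1) and theta = 1.
random.seed(5)
theta_hat = recursive_translation_estimate(
    [random.gauss(0, 1) + 1.0 for _ in range(100000)])
```

This is the Robbins-Monro idea again: each observation contributes only its sign relative to the current estimate, yet the estimates converge to the point where P vanishes.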
Example 3 Still in the framework of the sample described in Example 1, we
suppose that F = F(θ, ·), where F(θ, ·) is the distribution with density f(θ, ·) with
respect to a given measure λ on ℝ^d (most often, λ is the Lebesgue measure or,
when d = 1, the measure which gives weight 1 to each integer). The function f is
known, but it remains to estimate the parameter θ taken in a subset of ℝ^δ. Based
on an n-sample of F(θ, ·), we define a classical statistic, the likelihood V_n(θ):

V_n(θ) = ∏_{k=1}^n f(θ, X_k),

and its logarithm, the log-likelihood:

v_n(θ) = Σ_{k=1}^n ln f(θ, X_k).

Then a maximum likelihood estimator at time n satisfies the equation

θ̂_n = argmax V_n(θ) = argmax v_n(θ).

If the function f is sufficiently regular (the model is then said to be regular), we
have the following properties:

a) ∫ dF(θ, x) D_i ln f(θ, x) = 0.

b) I_{ij}(θ) = ∫ dF(θ, x) D_i ln f(θ, x) D_j ln f(θ, x) = − ∫ dF(θ, x) D_i D_j ln f(θ, x),

where we have denoted D_i = ∂/∂θ^i. The matrix I(θ) = (I_{ij}(θ)) is the Fisher
information matrix.

c) If I(θ) is invertible, the maximum likelihood estimator is strongly consistent
and, in addition, √n(θ̂_n − θ) converges in distribution towards a Gaussian
distribution with mean zero and covariance I(θ)^{−1}.

Thus, this maximum likelihood estimator has good asymptotic properties, at
least in the case of regular models. It is sometimes easy to calculate; for example,
when F is the Gaussian distribution on ℝ with mean μ and variance σ²,

f(θ, x) = (2πσ²)^{−1/2} exp(−(x − μ)²/2σ²)

and the maximum likelihood estimator is θ̂_n = (X̄_n, Σ̄_n).
However, when F is more complicated, at each moment in time we must find
the point at which a new function v_n attains its maximum. Such a calculation is
possible, but expensive; it is useful to obtain equally good recursive estimators. □
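In the Gaussian case the maximizer is explicit and can be checked directly; the sketch below is illustrative (the perturbation comparisons at the end are a sanity check, not a proof of global maximality):

```python
import math
import random

def gaussian_mle(xs):
    """Closed-form maximum likelihood estimator for N(mu, sigma^2):
    the empirical mean and the empirical variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def log_likelihood(mu, var, xs):
    """Log-likelihood v_n(theta) for theta = (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

random.seed(6)
xs = [random.gauss(2.0, 3.0) for _ in range(50000)]
mu_hat, var_hat = gaussian_mle(xs)
```

Both components of θ̂_n are recursive (the mean by the recursion of Example 1, the variance by Exercise 1.1.1), which is what makes this case cheap.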
Special Cases. In the following two examples d = δ = 1 and we observe a
translation model, f(θ, x) = f(x − θ), where f is a real function of class C¹.
Here, the Fisher information is independent of θ:

I = ∫_{−∞}^{+∞} ((f′)²/f)(x) dx.
Logistic distribution

f(x) = 1/(2(1 + cosh x)) = e^x/(1 + e^x)²,
ln f(x) = x − 2 ln(1 + e^x),
v_n(θ) = Σ_{k=1}^n (X_k − θ) − 2 ln(1 + exp(X_k − θ)).

Calculation of the maximum likelihood estimator is not easy. □
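While no closed form exists, v_n is concave in θ here, so a Newton iteration finds the maximizer quickly; in this sketch the inverse-CDF sampler and the value θ = 1.5 are illustrative assumptions:

```python
import math
import random

def logistic_mle(xs, theta0=0.0, iters=30):
    """Newton iteration on the concave log-likelihood of the logistic
    translation model; v'(theta) = sum_k (2 sigmoid(x_k - theta) - 1)."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    theta = theta0
    for _ in range(iters):
        s = [sigmoid(x - theta) for x in xs]
        grad = sum(2.0 * si - 1.0 for si in s)
        hess = -2.0 * sum(si * (1.0 - si) for si in s)
        theta -= grad / hess
    return theta

def logistic_sample(n, theta, rng):
    """Inverse-CDF sampling: X = theta + log(U / (1 - U))."""
    return [theta + math.log(u / (1.0 - u))
            for u in (rng.random() for _ in range(n))]

rng = random.Random(7)
theta_hat = logistic_mle(logistic_sample(20000, 1.5, rng))
```

Each full Newton pass revisits all n observations, which illustrates the cost objection above: the computation is possible but not recursive.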
Exercise 1.1.2 We consider the translation model of the logistic distribution.

a) Check that the Fisher information has value I = 1/3.
b) Check that θ is the mean of the distribution of X_n; thus, the empirical mean is a strongly
consistent recursive estimator. Check that √n(X̄_n − θ) converges in distribution
to a Gaussian distribution with mean zero and variance π²/3 > 3 = I^{−1}. In this sense, the
empirical mean is a worse estimator than the maximum likelihood estimator. □
Cauchy distribution

f(x) = 1/(π(1 + x²)),
v_n(θ) = −n ln π − Σ_{k=1}^n ln(1 + (X_k − θ)²).

Here again we have a regular model; however, calculation of the maximum
likelihood estimator is not easy. □
Exercise 1.1.3 We consider the translation model of the Cauchy distribution.

a) Check that the Fisher information is I = 1/2.
b) Check that the Cauchy distribution is not integrable.
c) Assuming that if Y has density f, its characteristic function is E[exp(itY)] =
exp(−|t|), prove that X̄_n − θ converges in distribution to a Cauchy distribution,
whence the empirical mean is not a strongly consistent estimator of θ. □
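The failure of the law of large numbers in part c) can be seen numerically: whatever n, X̄_n − θ is again standard Cauchy, so the proportion of sample means falling within 1 of θ stays near P(|C| ≤ 1) = 1/2 instead of tending to 1. (The inverse-CDF sampler below is an illustrative choice.)

```python
import math
import random

def cauchy_sample_mean(n, theta, rng):
    """Mean of n Cauchy observations X_k = theta + tan(pi (U_k - 1/2))."""
    return sum(theta + math.tan(math.pi * (rng.random() - 0.5))
               for _ in range(n)) / n

rng = random.Random(8)
# Repeat the experiment: with n = 1000 the sample mean is no closer to
# theta = 0 than a single observation would be.
hits = sum(abs(cauchy_sample_mean(1000, 0.0, rng)) <= 1.0 for _ in range(2000))
frac_within_one = hits / 2000
```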
Conclusion of Section 1.1 This section has raised various types of problem.
• Problems associated with dosage or the search for an extremum
(Sections 1.1.1 and 1.1.2). This is control.
• The search for strongly consistent recursive estimators (Section 1.1.6).
• A mixed problem involving choice of both the control and the estimation
(Sections 1.1.3 and 1.1.4). This is adaptive control.
For each of these problems, we look for recursive algorithms, namely iterations
where the decision at time n is a simple function of the decision at time n − 1
and the observation at time n.
Sources

The mathematical statements of decisions taken step by step are comparatively
recent.
The prehistory is difficult to date. Two old sources were brought to my attention
by B. Bru. In La théorie des jugements, Condorcet (1785) described a scheme
used to reach the truth by a sequence of suitably graduated imperfect judgements.
After the war of 1870, the French artillery was modernized; in particular, it drew
up a theory of the dispersion of fire, which suggests a procedure of progressive
adjustment analogous to that of Robbins-Monro.
The pioneers of recursive statistics were H.F. Dodge and H.G. Romig, who
in 1929 proposed a quality control technique involving double sampling ((1929):
A method of sampling inspection. Bell Syst. Tech. J., 8, 613-631). Sequential
statistics was initiated later by A. Wald in the 1940s ((1947): Sequential Analysis.
Wiley, London). See Siegmund ((1985) [GI]) for further steps.
An article by H. Hotelling ((1941): Experimental determination of the
maximum of a function. Ann. Math. Stat., 12, 20-46) for the case of polynomials
of degree two is a first study of the dosage problem.
The choice between two types of experiment (two-armed bandit) appears in
biometry in papers by W.R. Thompson ((1935): On the theory of apportionment.
Am. J. Math., 57, 450-457) and P.C. Mahalanobis ((1940): A sample survey of
the acreage under jute in Bengal. Sankhya, 4, 511-531).
But stochastic algorithms were born in 1951, with the publication of a
fundamental seven-page article (Robbins and Monro 1951 [R]) followed by two
four- and five-page articles (Wolfowitz 1952 [R], Kiefer and Wolfowitz 1952
[R]). Current solutions and references to the themes described above are given in
Section 1.4.
Papers which study the asymptotic statistics of samples and, notably, those
which include assumptions guaranteeing the regularity of the model, are legion.
These include, for example, Cox and Hinkley (1974 [GI]), Ferguson (1967 [GI]),
Ibragimov and Has'minskii (1983 [GI]), Dacunha-Castelle and Duflo (1986a, b
[GI]), Genon-Catalot and Picard (1991 [GI]) and, for more generality, Le Cam
(1986 [GI]).
1.2 Deterministic Recursive Approximation

Computer-based calculation has made iterative methods fundamental in analysis.
Here, we shall merely present a number of results which are simple to generalize
to stochastic problems. Other, slightly more sophisticated deterministic methods
are given in Section 9.2.

1.2.1 Search for a Zero of a Continuous Function

The problem of deterministic dosage involves finding a point at which a monotonic
continuous function f crosses a given level.
Proposition 1.2.3 Suppose that f is a continuous real function such that f(x*) = a
and such that, for all x ≠ x*,

(f(x) − a)(x − x*) < 0,   |f(x)| ≤ K(1 + |x|)

for some constant K. Suppose that (γ_n) is a positive sequence decreasing towards
0 such that Σ γ_n diverges, and that (ε_n) is another sequence such that Σ γ_n ε_{n+1}
converges. Then the sequence (x_n) defined by

x_{n+1} = x_n + γ_n (f(x_n) − a + ε_{n+1})

converges to x*, for all initial values x₀.
Special Case. For a deterministic study, with the above assumptions on f, we
naturally take ε_{n+1} = 0.
Proof. Replacing f by x ↦ f(x + x*) − a, we may assume that a and x* are zero.
We have x_{n+1} = x_n + γ_n(f(x_n) + ε_{n+1}) and xf(x) < 0 if x ≠ 0. When γ_n K < 1
we have: